Voice Typing Anywhere with whisper.cpp
Set up GPU-accelerated voice typing on Arch Linux with whisper.cpp
Press a key, speak, press again - transcription appears wherever your cursor is. Fully offline, GPU-accelerated.
Prerequisites
- Arch Linux with NVIDIA GPU (I used an RTX 3050 Laptop GPU, compute capability 8.6)
- Niri or another Wayland compositor (adapt the keybind for other compositors)
- A microphone
sudo pacman -S cuda wtype sdl2
- cuda: NVIDIA CUDA toolkit
- wtype: Wayland keyboard input simulator (like xdotool but for Wayland)
- sdl2: needed for whisper-stream and whisper-command (optional; the toggle script uses pw-record instead)
The cuda package installs under /opt/cuda, with nvcc in /opt/cuda/bin. Add that directory to your PATH, or source /etc/profile.d/cuda.sh.
Building
GPU (CUDA)
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
# Match -DCMAKE_CUDA_ARCHITECTURES to your GPU
# Find yours: nvidia-smi --query-gpu=compute_cap --format=csv,noheader
cmake -B build \
-DGGML_CUDA=1 \
-DCMAKE_CUDA_ARCHITECTURES="86" \
-DWHISPER_SDL2=ON
cmake --build build -j6 --config Release
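The value passed to -DCMAKE_CUDA_ARCHITECTURES is the compute capability with the dot removed, so 8.6 becomes 86. As a small sketch, the conversion can be scripted; cap_to_arch is a helper name made up for this example, not part of whisper.cpp or CMake.

```shell
# Convert a compute capability such as "8.6" into the form CMake expects ("86").
# cap_to_arch is a hypothetical helper name used only in this sketch.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# On a machine with the NVIDIA driver installed, feed it the first GPU's value:
#   cap_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
cap_to_arch 8.6   # prints 86
```

The result can then be passed as -DCMAKE_CUDA_ARCHITECTURES="$(cap_to_arch 8.6)".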
If your system GCC is too new (16+), nvcc will fail parsing C++23 headers. Point nvcc at an older GCC when configuring, then rerun the cmake --build step:
cmake -B build \
-DGGML_CUDA=1 \
-DCMAKE_CUDA_ARCHITECTURES="86" \
-DWHISPER_SDL2=ON \
-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-15
CPU only
cmake -B build -DWHISPER_SDL2=ON
cmake --build build -j --config Release
CPU performance depends on your hardware. On a modern chip with AVX2, small.en can transcribe an 11-second clip in 3–8 seconds. On older hardware or a Raspberry Pi, use tiny.en and expect longer times. The CUDA path is roughly 10–30× faster.
RAM usage during build
nvcc is memory-heavy. On 14 GB RAM with 12 threads, -j6 is safe. With 8 GB, use -j2.
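A rough way to pick the job count from installed RAM is sketched below. Only the two data points above (14 GB with -j6, 8 GB with -j2) come from this guide; the middle tier and the pick_jobs name are assumptions for illustration, so treat the output as a starting point.

```shell
# Suggest a -j value from total RAM in GiB.
# Only 8 GB -> 2 and 14 GB -> 6 come from the text; the middle tier is a guess,
# and machines with more than 14 GB simply get the same conservative 6.
pick_jobs() {
    mem_gb=$1
    if [ "$mem_gb" -ge 14 ]; then
        echo 6
    elif [ "$mem_gb" -gt 8 ]; then
        echo 4
    else
        echo 2
    fi
}

# Total RAM in GiB, rounded down, can be read from /proc/meminfo on Linux:
#   mem_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
pick_jobs 14   # prints 6
pick_jobs 8    # prints 2
```

It could be used as cmake --build build -j"$(pick_jobs "$mem_gb")" --config Release.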
Verify GPU is active
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --no-timestamps
Look for:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3770 MiB):
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
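If you would rather not read the log by eye, the check can be scripted by searching for the ggml_cuda_init line shown above. This is a minimal sketch; cuda_detected is a made-up helper name, and the pattern assumes the log wording stays as printed here.

```shell
# Succeeds if the log on stdin reports at least one CUDA device.
# cuda_detected is a hypothetical helper; the pattern mirrors the line above.
cuda_detected() {
    grep -q 'ggml_cuda_init: found [1-9][0-9]* CUDA devices'
}

# Real use (2>&1 so the init log is captured whichever stream it is written to):
#   ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav 2>&1 \
#       | cuda_detected && echo "GPU active"
echo 'ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3770 MiB):' \
    | cuda_detected && echo "GPU active"
```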
Download a model
| Model | Disk | VRAM | Inference (GPU, 11s clip) | CPU (Ryzen 5 5600H) |
|---|---|---|---|---|
| tiny.en | 75 MB | ~273 MB | ~400ms | ~2s |
| base.en | 142 MB | ~388 MB | ~650ms | ~5s |
| small.en | 466 MB | ~852 MB | ~776ms | ~8s |
| medium.en | 1.5 GB | ~2.1 GB | ~1240ms | ~20s+ |
bash ./models/download-ggml-model.sh medium.en
Models are saved to models/ggml-<name>.bin.
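Because the file name follows a fixed pattern, a script can derive the path from the model name and download only when the file is missing. The helper names model_path and ensure_model below are invented for this sketch; the download script and path pattern are the ones shown above.

```shell
# Map a model name to the path the download script writes to,
# relative to the whisper.cpp checkout. (Hypothetical helper name.)
model_path() {
    printf 'models/ggml-%s.bin\n' "$1"
}

# Download only when the file is missing; run from the whisper.cpp directory.
ensure_model() {
    [ -f "$(model_path "$1")" ] || bash ./models/download-ggml-model.sh "$1"
}

model_path medium.en   # prints models/ggml-medium.en.bin
```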
The toggle script
Save to ~/.config/scripts/whisper-transcribe.sh:
#!/bin/bash
set -euo pipefail
export WAYLAND_DISPLAY="${WAYLAND_DISPLAY:-wayland-1}"
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/1000}"
MODEL="${WHISPER_MODEL:-$HOME/random/whisper.cpp/models/ggml-medium.en.bin}"
TMPFILE="/tmp/whisper-record.wav"
PIDFILE="/tmp/whisper-recording.pid"
WHISPER="$HOME/random/whisper.cpp/build/bin/whisper-cli"
# --- stop ---
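if [ -f "$PIDFILE" ]; then
    pid=$(cat "$PIDFILE")
    kill "$pid" 2>/dev/null || true
    # pw-record was started by an earlier invocation, so it is not a child of
    # this shell and `wait` cannot block on it. Poll (for up to ~2 s) until it
    # has exited, so the WAV file is complete before transcribing.
    for _ in $(seq 1 40); do
        kill -0 "$pid" 2>/dev/null || break
        sleep 0.05
    done
    rm -f "$PIDFILE"
    notify-send -t 2000 "Whisper" "Transcribing..." || true
    text=$("$WHISPER" -m "$MODEL" -f "$TMPFILE" --no-timestamps -l en \
        --suppress-nst -nth 0.6 2>/dev/null || true)
    # Join lines, trim both ends, and collapse runs of whitespace.
    text=$(printf '%s' "$text" | tr '\n' ' ' | \
        sed 's/^[[:space:]]*//;s/[[:space:]]*$//;s/[[:space:]]\+/ /g')
    rm -f "$TMPFILE"
    if [ -n "$text" ]; then
        wtype "$text" || true
        notify-send -t 2000 "Whisper" "Done" || true
    fi
    exit 0
fi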
# --- start ---
pw-record --rate=16000 --channels=1 "$TMPFILE" >/dev/null 2>&1 &
echo $! > "$PIDFILE"
disown
notify-send -t 2000 "Whisper" "Active" || true
chmod +x ~/.config/scripts/whisper-transcribe.sh
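The clean-up step in the stop branch (join lines, trim, collapse whitespace) can be tried on its own, which helps when the typed text comes out with odd spacing. The sketch below wraps it in a function named normalize, a name chosen here for illustration; it uses sed -E with +, which should behave the same as the \+ form in the script.

```shell
# Same clean-up as the script: newlines become spaces, leading and trailing
# whitespace is trimmed, and runs of whitespace collapse to one space.
# normalize is a hypothetical name used only for this example.
normalize() {
    printf '%s' "$1" | tr '\n' ' ' |
        sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]+/ /g'
}

normalize ' Hello there.
 How are you? '
# prints: Hello there. How are you?
```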
How it works
The script uses /tmp/whisper-recording.pid to track state.
- First press: no PID file exists, so pw-record starts capturing 16 kHz mono audio to a temporary WAV file, and its PID is written to the PID file.
- Second press: the PID file is found, so pw-record is killed, whisper-cli transcribes the WAV, and wtype types the result into the focused window.
Notifications are sent at each stage via notify-send.
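The PID-file toggle can be tried without a microphone or whisper at all. In the sketch below, sleep stands in for pw-record and the PID file lives in a throwaway directory; the toggle function, STATE_DIR, and the file name are all made up for the demonstration.

```shell
# Stand-in for the real script: `sleep` plays the role of pw-record,
# and the PID file lives in a temporary directory (hypothetical names).
STATE_DIR=$(mktemp -d)
PIDFILE="$STATE_DIR/demo-recording.pid"

toggle() {
    if [ -f "$PIDFILE" ]; then
        # Second press: stop the recorder and clear the state.
        pid=$(cat "$PIDFILE")
        kill "$pid" 2>/dev/null || true
        rm -f "$PIDFILE"
        echo "stopped"
    else
        # First press: start the recorder in the background and remember its PID.
        sleep 300 >/dev/null 2>&1 &
        echo $! > "$PIDFILE"
        echo "started"
    fi
}

toggle   # prints started
toggle   # prints stopped
rm -rf "$STATE_DIR"
```

The existence of the file is the only state, which is why two separate invocations of the script can cooperate without a long-running daemon.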
The whisper flags
- --no-timestamps: output only the transcribed text
- --suppress-nst: suppress non-speech tokens ([BLANK_AUDIO], [MUSIC], etc.)
- -nth 0.6: no-speech threshold. If whisper's confidence that the audio is silence exceeds this, it outputs nothing instead of hallucinating
Without --suppress-nst and -nth 0.6, whisper will transcribe background noise into random symbols and numbers.
Niri keybind
In ~/.config/niri/config.kdl, inside the binds { } block:
Mod+Space { spawn "sh" "/home/radhey/.config/scripts/whisper-transcribe.sh"; }
Niri hot-reloads config automatically.
Why toggle, not hold-to-talk
Niri doesn’t support key-release binds (as of v26.04). If your compositor does (Hyprland has bindr), you can adapt the script to start on press and stop+transcribe on release.
Customizing
To switch models, set an environment variable or edit the script directly:
export WHISPER_MODEL="$HOME/random/whisper.cpp/models/ggml-small.en.bin"
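The override works through the ${WHISPER_MODEL:-...} expansion on the MODEL line of the script: the variable wins when it is set and non-empty, otherwise the default path is used. A minimal sketch of that behaviour follows; resolve_model and DEFAULT_MODEL are names invented here.

```shell
# Default mirrors the path used in the script; resolve_model and
# DEFAULT_MODEL are hypothetical names for this sketch.
DEFAULT_MODEL="$HOME/random/whisper.cpp/models/ggml-medium.en.bin"

resolve_model() {
    printf '%s\n' "${WHISPER_MODEL:-$DEFAULT_MODEL}"
}

unset WHISPER_MODEL
resolve_model    # prints the medium.en default

WHISPER_MODEL="$HOME/random/whisper.cpp/models/ggml-small.en.bin"
resolve_model    # prints the small.en path
```

Note that the script inherits its environment from whatever launches it. When it is started from a compositor keybind, a variable exported only in an interactive terminal will most likely not be visible, so editing the script may be the simpler route.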
Known issues
Discord (Electron) types symbols instead of letters
wtype uses the Wayland zwp_virtual_keyboard_v1 protocol. Electron apps sometimes misinterpret these key events — letters come out as =, +, -, or digits.
Fix: Use ydotool, which injects events at the kernel level via uinput, so applications receive them as ordinary keyboard input:
sudo pacman -S ydotool
sudo usermod -aG input $USER # log out and back in
# Start the daemon (add to spawn-at-startup in niri config)
ydotoold &
# Replace the wtype line in the script with:
ydotool type --key-delay 10 "$text"
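If the same script should work on machines with and without ydotool, one option is to choose the backend at run time. This is only a sketch: pick_typer and type_text are invented names, and the check looks at whether the binary is installed, not whether ydotoold is actually running.

```shell
# Pick a typing backend: prefer ydotool when installed, else wtype.
# Prints the tool name, or fails when neither is available.
# (Hypothetical helper; it does not verify that ydotoold is running.)
pick_typer() {
    if command -v ydotool >/dev/null 2>&1; then
        echo ydotool
    elif command -v wtype >/dev/null 2>&1; then
        echo wtype
    else
        return 1
    fi
}

# Type the given text with whichever backend was found.
type_text() {
    case "$(pick_typer)" in
        ydotool) ydotool type --key-delay 10 "$1" ;;
        wtype)   wtype "$1" ;;
        *)       echo "no typing tool found" >&2; return 1 ;;
    esac
}
```

In the toggle script, the wtype "$text" line would then become type_text "$text" || true.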
Build causes OOM
Reduce the -j count. On 14 GB RAM use -j6, on 8 GB use -j2 or -j1.
CMake can’t find CUDA
export PATH="/opt/cuda/bin:$PATH"
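If this line goes into a shell startup file, it will prepend the directory again on every nested shell. A guarded version, sketched below with a made-up helper name prepend_path, adds the directory only when it is missing and then reports whether nvcc can be found.

```shell
# Prepend a directory to PATH only if it is not already there.
# prepend_path is a hypothetical helper name.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path /opt/cuda/bin
export PATH

# Warn if nvcc is still missing, e.g. because the cuda package is not installed.
command -v nvcc >/dev/null 2>&1 || echo "nvcc not found on PATH" >&2
```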