Voice Typing Anywhere with whisper.cpp

Press a key, speak, press again, and the transcription appears wherever your cursor is. Fully offline, GPU-accelerated.

Prerequisites

Arch Linux with an NVIDIA GPU (I used an RTX 3050 Laptop GPU, compute capability 8.6)
Niri or another Wayland compositor (adapt the keybind for other compositors)
A microphone

sudo pacman -S cuda wtype sdl2

cuda — NVIDIA CUDA toolkit
wtype — Wayland keyboard input simulator (like xdotool but for Wayland)
sdl2 — needed for whisper-stream and whisper-command (optional, the toggle script uses pw-record instead)

The cuda package installs to /opt/cuda, with nvcc in /opt/cuda/bin. Add that directory to your PATH, or source /etc/profile.d/cuda.sh (a fresh login shell picks it up on its own).

Building

GPU (CUDA)

git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp

# Match -DCMAKE_CUDA_ARCHITECTURES to your GPU
# Find yours: nvidia-smi --query-gpu=compute_cap --format=csv,noheader
cmake -B build \
  -DGGML_CUDA=1 \
  -DCMAKE_CUDA_ARCHITECTURES="86" \
  -DWHISPER_SDL2=ON

cmake --build build -j6 --config Release
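
If you would rather not type the architecture by hand, the compute capability reported by nvidia-smi can be turned into the value CMake wants. This is a small sketch; cap_to_arch is a helper name I made up, not part of whisper.cpp or CUDA.

```shell
# Convert a compute capability such as "8.6" into the "86" form
# expected by -DCMAKE_CUDA_ARCHITECTURES.
cap_to_arch() {
    # Drop the dot and any stray whitespace from the nvidia-smi output.
    printf '%s' "$1" | tr -d '.[:space:]'
    echo
}

# Example (needs an NVIDIA driver, so it is left commented out):
# arch=$(cap_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)")
# cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="$arch" -DWHISPER_SDL2=ON
```

With more than one GPU, nvidia-smi prints one line per device; the example only looks at the first.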

If your system GCC is too new (16+), nvcc will fail parsing C++23 headers. Set the host compiler to an older GCC:

cmake -B build \
  -DGGML_CUDA=1 \
  -DCMAKE_CUDA_ARCHITECTURES="86" \
  -DWHISPER_SDL2=ON \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-15
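
A quick way to decide whether you need the host-compiler override is to look at the GCC major version. The sketch below uses the 16+ threshold from the text above; needs_older_host_gcc is a hypothetical helper name.

```shell
# Succeeds (exit status 0) when the given GCC version is 16 or newer,
# i.e. when nvcc probably needs an older host compiler.
needs_older_host_gcc() {
    major="${1%%.*}"        # "16.1.0" -> "16"
    [ "$major" -ge 16 ]
}

# Example:
# if needs_older_host_gcc "$(gcc -dumpfullversion)"; then
#     echo "pass -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-15 to cmake"
# fi
```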

CPU only

cmake -B build -DWHISPER_SDL2=ON
cmake --build build -j --config Release

CPU performance depends heavily on your hardware. On a modern chip with AVX2, small.en transcribes an 11-second clip in roughly 3–8 seconds. On older hardware or a Raspberry Pi, use tiny.en and expect longer times. Going by the measurements in the model table below, the CUDA path is roughly 5–16× faster than the CPU.

RAM usage during build

nvcc is memory-heavy. On 14 GB RAM with 12 threads, -j6 is safe. With 8 GB, use -j2.
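
The two data points above can be folded into a rough rule of thumb. This is only a heuristic sketch: pick_jobs is a made-up helper, and everything outside the 14 GB and 8 GB cases (half the threads on larger machines, a serial build below 8 GB) is my assumption, not something I measured.

```shell
# $1: total RAM in GB, $2: number of CPU threads. Prints a value for -j.
pick_jobs() {
    mem_gb=$1
    threads=$2
    if [ "$mem_gb" -ge 14 ]; then
        n=$((threads / 2))      # 12 threads -> 6, as in the text
    elif [ "$mem_gb" -ge 8 ]; then
        n=2
    else
        n=1                     # assumption: below 8 GB, build serially
    fi
    # Keep the result between 1 and the number of threads.
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$threads" ]; then n=$threads; fi
    echo "$n"
}

# Example:
# mem_gb=$(awk '/^MemTotal/ {print int($2 / 1048576 + 0.5)}' /proc/meminfo)
# cmake --build build -j"$(pick_jobs "$mem_gb" "$(nproc)")" --config Release
```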

Verify GPU is active

./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --no-timestamps

Look for:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3770 MiB):
  Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
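
To check this without reading the whole log, you can grep for the init line. A minimal sketch, assuming the log format shown above and assuming the init messages go to stderr (the toggle script's 2>/dev/null suggests they do); cuda_device_found is a hypothetical helper.

```shell
# Reads a whisper-cli log on stdin and succeeds if ggml reported
# at least one CUDA device.
cuda_device_found() {
    grep -Eq 'ggml_cuda_init: found [1-9][0-9]* CUDA devices'
}

# Example (2>&1 merges the log into the pipe):
# ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --no-timestamps 2>&1 \
#     | cuda_device_found && echo "GPU active"
```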

Download a model

Model	Disk	VRAM	GPU time (11 s clip)	CPU time (Ryzen 5 5600H)
`tiny.en`	75 MB	~273 MB	~400 ms	~2 s
`base.en`	142 MB	~388 MB	~650 ms	~5 s
`small.en`	466 MB	~852 MB	~776 ms	~8 s
`medium.en`	1.5 GB	~2.1 GB	~1240 ms	~20 s+

bash ./models/download-ggml-model.sh medium.en

Models are saved to models/ggml-<name>.bin.
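
If you are unsure which model to download, the VRAM column can be used as a cutoff. The sketch below hardcodes the approximate figures from the table (in MB) and leaves no headroom for anything else using the GPU, so treat the thresholds as assumptions; pick_model is a made-up helper.

```shell
# $1: VRAM budget in MB. Prints the largest model from the table that fits,
# or fails if even tiny.en does not fit.
pick_model() {
    budget=$1
    if   [ "$budget" -ge 2100 ]; then echo medium.en   # ~2.1 GB
    elif [ "$budget" -ge 852 ];  then echo small.en
    elif [ "$budget" -ge 388 ];  then echo base.en
    elif [ "$budget" -ge 273 ];  then echo tiny.en
    else return 1
    fi
}

# Example:
# bash ./models/download-ggml-model.sh "$(pick_model 3770)"
```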

The toggle script

Save to ~/.config/scripts/whisper-transcribe.sh:

#!/bin/bash
set -euo pipefail

export WAYLAND_DISPLAY="${WAYLAND_DISPLAY:-wayland-1}"
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}"

MODEL="${WHISPER_MODEL:-$HOME/random/whisper.cpp/models/ggml-medium.en.bin}"
TMPFILE="/tmp/whisper-record.wav"
PIDFILE="/tmp/whisper-recording.pid"
WHISPER="$HOME/random/whisper.cpp/build/bin/whisper-cli"

# --- stop ---
if [ -f "$PIDFILE" ]; then
    pid=$(cat "$PIDFILE")
    kill "$pid" 2>/dev/null || true
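    # pw-record was started by an earlier run of this script, so it is not a
    # child of this shell and `wait` returns immediately. Poll (for up to ~2 s)
    # until it has exited and flushed the WAV.
    for _ in $(seq 1 40); do
        kill -0 "$pid" 2>/dev/null || break
        sleep 0.05
    done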
    rm -f "$PIDFILE"

    notify-send -t 2000 "Whisper" "Transcribing..." || true

    text=$("$WHISPER" -m "$MODEL" -f "$TMPFILE" --no-timestamps -l en \
           --suppress-nst -nth 0.6 2>/dev/null || true)
    text=$(printf '%s' "$text" | tr '\n' ' ' | \
           sed 's/^[[:space:]]*//;s/[[:space:]]*$//;s/[[:space:]]\+/ /g')
    rm -f "$TMPFILE"

    if [ -n "$text" ]; then
        # "--" stops option parsing, so text starting with "-" is typed as text
        wtype -- "$text" || true
        notify-send -t 2000 "Whisper" "Done" || true
    fi
    exit 0
fi

# --- start ---
pw-record --rate=16000 --channels=1 "$TMPFILE" >/dev/null 2>&1 &
echo $! > "$PIDFILE"
disown

notify-send -t 2000 "Whisper" "Active" || true

chmod +x ~/.config/scripts/whisper-transcribe.sh
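
The whitespace cleanup in the stop branch is easy to try on its own. The sketch below wraps the same tr and sed steps in a function (clean_text is a name I made up) and reads from stdin; it uses [[:space:]][[:space:]]* in place of the GNU-only \+ so it also runs with non-GNU sed.

```shell
# Join lines, trim leading/trailing whitespace, and collapse runs of
# whitespace into single spaces.
clean_text() {
    tr '\n' ' ' | \
        sed 's/^[[:space:]]*//;s/[[:space:]]*$//;s/[[:space:]][[:space:]]*/ /g'
}

# Example:
# printf ' And so my fellow Americans,\n ask not...  \n' | clean_text
```

whisper-cli tends to emit a leading space and one line per segment, which is why the script flattens everything before handing it to wtype.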

How it works

The script uses /tmp/whisper-recording.pid to track state.

First press — no PID file exists, so pw-record starts capturing 16kHz mono audio to a temp WAV, and the PID is written to the lock file.
Second press — the PID file is found, so pw-record is killed, whisper-cli transcribes the WAV, and wtype types the result into the focused window.

Notifications are sent at each stage via notify-send.
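
One weak spot of tracking state with a PID file: if pw-record dies or the machine reboots uncleanly, a stale file can be left behind and the next press is treated as a stop. A possible guard, sketched under the assumption that checking for a live process is enough (it does not confirm that the PID still belongs to pw-record); recording_active is a hypothetical helper, not part of the script above.

```shell
# $1: path to the PID file. Succeeds only if the file exists and
# names a process that is still running.
recording_active() {
    [ -f "$1" ] || return 1
    pid=$(cat "$1")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;     # empty or not a number
    esac
    kill -0 "$pid" 2>/dev/null       # signal 0 only tests for existence
}

# Example:
# if recording_active /tmp/whisper-recording.pid; then
#     echo "stop and transcribe"
# else
#     rm -f /tmp/whisper-recording.pid   # clear a stale file, then start
# fi
```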

The whisper flags

--no-timestamps — output only the transcribed text
--suppress-nst — suppress non-speech tokens ([BLANK_AUDIO], [MUSIC], etc.)
-nth 0.6 — no-speech threshold. If whisper’s confidence that the audio is silence exceeds this, it outputs nothing instead of hallucinating

Without --suppress-nst and -nth 0.6, whisper will transcribe background noise into random symbols and numbers.

Niri keybind

In ~/.config/niri/config.kdl, inside the binds { } block (replace /home/radhey with your own home directory):

Mod+Space { spawn "bash" "/home/radhey/.config/scripts/whisper-transcribe.sh"; }

Niri hot-reloads config automatically.

Why toggle, not hold-to-talk

Niri doesn’t support key-release binds (as of v26.04). If your compositor does (Hyprland has bindr), you can adapt the script to start on press and stop+transcribe on release.
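
For a hold-to-talk variant, the script would need to know which event fired instead of toggling. A sketch of just that decision, assuming the script is passed "press" or "release" as its first argument (that argument and the helper name hold_to_talk_action are my invention; the script above takes no arguments):

```shell
# $1: event ("press" or "release").
# $2: "yes" if a recording is already running, "no" otherwise.
# Prints the action to take: start, stop, or ignore.
hold_to_talk_action() {
    case "$1:$2" in
        press:no)    echo start ;;
        release:yes) echo stop ;;
        *)           echo ignore ;;  # press while recording, or release with nothing recording
    esac
}

# Hypothetical Hyprland binds for such a script:
#   bind  = SUPER, space, exec, ~/.config/scripts/whisper-transcribe.sh press
#   bindr = SUPER, space, exec, ~/.config/scripts/whisper-transcribe.sh release
```

The start and stop branches themselves would stay the same as in the toggle script.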

Customizing

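To switch models, edit the MODEL default in the script, or set WHISPER_MODEL in the environment niri is started from (an export in an interactive shell does not reach scripts spawned by the compositor):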

export WHISPER_MODEL="$HOME/random/whisper.cpp/models/ggml-small.en.bin"

Known issues

Discord (Electron) types symbols instead of letters

wtype uses the Wayland zwp_virtual_keyboard_v1 protocol. Electron apps sometimes misinterpret these key events — letters come out as =, +, -, or digits.

Fix: Use ydotool, which injects events at the kernel level via uinput, so applications see them as coming from a real keyboard:

sudo pacman -S ydotool
sudo usermod -aG input "$USER"   # log out and back in

# Start the daemon (add to spawn-at-startup in niri config)
ydotoold &

# Replace the wtype line in the script with:
ydotool type --key-delay 10 "$text"
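
If you want one script that works on machines with either tool installed, you could pick the typing command at runtime. A minimal sketch; first_available is a made-up helper, and it only checks that the binary exists, not that ydotoold is actually running.

```shell
# Prints the first of the given commands that is installed; fails if none is.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

# Example, preferring ydotool over wtype:
# case "$(first_available ydotool wtype)" in
#     ydotool) ydotool type --key-delay 10 "$text" ;;
#     wtype)   wtype "$text" ;;
# esac
```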

Build causes OOM

Reduce the -j count. On 14 GB RAM use -j6, on 8 GB use -j2 or -j1.

CMake can’t find CUDA

export PATH="/opt/cuda/bin:$PATH"
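
If that line goes into a shell rc file that gets sourced more than once, PATH picks up duplicate entries. A small sketch that avoids this; prepend_once is a hypothetical helper name.

```shell
# $1: directory, $2: current PATH value.
# Prints PATH with the directory in front, unless it is already listed.
prepend_once() {
    case ":$2:" in
        *":$1:"*) echo "$2" ;;
        *)        echo "$1:$2" ;;
    esac
}

# Example:
# export PATH="$(prepend_once /opt/cuda/bin "$PATH")"
```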