Speech Engine Comparison

whisper.cpp vs Whisper vs VOSK vs Remote API on Linux

If you are choosing a Linux speech-to-text engine, this page gives a practical side-by-side comparison focused on latency, hardware support, install footprint, privacy boundary, and real desktop usage.

whisper.cpp

Speed: Fastest startup + low latency
Hardware: CPU + AMD/Intel/NVIDIA GPU
Accuracy: High (best overall balance)
Footprint: Small models available (~74MB tiny)
Best for: Most users who want strong speed + quality

Whisper (OpenAI)

Speed: Slower install and startup
Hardware: CPU or NVIDIA CUDA
Accuracy: High
Footprint: Large dependency footprint (~2.3GB)
Best for: Users already standardized on PyTorch stack

VOSK

Speed: Very fast realtime on low-end systems
Hardware: CPU
Accuracy: Good for lightweight use
Footprint: Very lightweight (~40MB model)
Best for: Older hardware and minimal-resource environments

Remote API

Speed: Depends on server + network latency
Hardware: Client CPU + remote Whisper server
Accuracy: Depends on remote model
Footprint: No local model required
Best for: Powerful LAN servers or shared transcription backends
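
To make "depends on server + network latency" concrete, here is a rough back-of-the-envelope sketch. It assumes the clip is sent as uncompressed 16 kHz mono 16-bit PCM (the sample rate Whisper models work with); the function name, parameters, and example numbers are illustrative and not measurements of any real server.

```python
def estimate_remote_latency(clip_seconds, link_mbit_per_s, round_trip_ms,
                            server_seconds, sample_rate=16_000, bytes_per_sample=2):
    """Rough end-to-end latency in seconds for one clip sent to a remote server.

    All inputs are assumptions supplied by the caller; this is arithmetic,
    not a benchmark.
    """
    # Size of the uncompressed mono PCM payload.
    payload_bytes = clip_seconds * sample_rate * bytes_per_sample
    # Time to push the payload over the link (bits / bits-per-second).
    upload_seconds = payload_bytes * 8 / (link_mbit_per_s * 1_000_000)
    # Upload + one network round trip + server-side transcription time.
    return upload_seconds + round_trip_ms / 1000 + server_seconds


# Hypothetical example: 5 s clip, 100 Mbit/s LAN, 2 ms round trip,
# and an assumed 0.5 s of server-side inference.
total = estimate_remote_latency(5, 100, 2, 0.5)
print(f"{total:.4f} s")
```

With these assumed inputs the upload itself is about 13 ms, so on a LAN the server's inference time dominates; over a slow or distant link the network terms grow instead.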

Switching Between Engines

You can switch between whisper.cpp, Whisper, VOSK, and Remote API from Settings. Since v0.10.1, recognition is stopped before the engine is switched, which prevents crashes. v0.12.0 adds Remote API configuration under Advanced settings for compatible transcription servers.
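
The stop-before-switch behaviour can be sketched as follows. This is a minimal illustration of the pattern only: `EngineSwitcher` and its methods are hypothetical names, not Vocalinux's actual classes, and the real implementation may differ.

```python
import threading


class EngineSwitcher:
    """Hypothetical manager that never swaps engines while recognition runs."""

    def __init__(self, engine_name):
        self._lock = threading.Lock()
        self.engine_name = engine_name
        self.recognizing = False
        self.events = []  # simple log so the ordering is visible

    def start(self):
        with self._lock:
            self.recognizing = True
            self.events.append(f"start:{self.engine_name}")

    def _stop_locked(self):
        # Caller must already hold self._lock.
        if self.recognizing:
            self.recognizing = False
            self.events.append(f"stop:{self.engine_name}")

    def stop(self):
        with self._lock:
            self._stop_locked()

    def switch(self, new_engine):
        """Stop recognition first, then change the engine, under one lock."""
        with self._lock:
            was_recognizing = self.recognizing
            self._stop_locked()
            self.engine_name = new_engine
            self.events.append(f"switch:{new_engine}")
        return was_recognizing


switcher = EngineSwitcher("whisper_cpp")
switcher.start()
switcher.switch("vosk")
print(switcher.events)
```

Holding a single lock across the stop and the swap means no audio callback can reach a half-replaced engine, which is the kind of race that stop-before-switch is meant to avoid.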

When to pick whisper.cpp

Choose whisper.cpp when you want the best speed-to-accuracy ratio and broad hardware support (CPU plus AMD, Intel, and NVIDIA GPUs). It is the default engine in Vocalinux.

When to pick Whisper

Choose OpenAI Whisper if your environment already depends on PyTorch/CUDA workflows and you can accept the larger dependency footprint (~2.3GB) and slower install and startup.

When to pick VOSK

Choose VOSK on older laptops, low-RAM systems, or lightweight VMs where small model size and minimal overhead matter most.

When to pick Remote API

Choose Remote API when a trusted server has stronger hardware, larger models, or a shared Whisper backend. Use local engines when your voice data must stay entirely on-device.
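
The guidance in the "When to pick" sections can be summarized as a small decision helper. This is only a sketch of the rules of thumb above: the function, its parameters, and the returned identifiers are hypothetical and do not correspond to Vocalinux settings or APIs.

```python
def pick_engine(*, must_stay_on_device=True, has_trusted_server=False,
                low_resource=False, uses_pytorch_stack=False):
    """Return a suggested engine name based on the rules of thumb above."""
    # Remote API only makes sense when audio may leave the machine
    # and a trusted server is actually available.
    if has_trusted_server and not must_stay_on_device:
        return "remote_api"
    # Older laptops, low-RAM systems, lightweight VMs.
    if low_resource:
        return "vosk"
    # Environments already built around PyTorch/CUDA.
    if uses_pytorch_stack:
        return "whisper"
    # Default: best speed-to-accuracy balance for most users.
    return "whisper_cpp"


print(pick_engine())                                   # whisper_cpp
print(pick_engine(low_resource=True))                  # vosk
print(pick_engine(must_stay_on_device=False,
                  has_trusted_server=True))            # remote_api
```

The privacy check comes first on purpose: if voice data must stay on-device, a remote server is never suggested, whatever hardware it has.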

Next steps

Install by distro: Ubuntu, Fedora, Arch Linux.
Use interactive install to detect your hardware and pick the best engine defaults.
After install, tune model size, VAD sensitivity, or Remote API settings for your preferred latency and accuracy level.