How It Works
VoxBar Whisper uses Faster-Whisper — a highly optimised implementation of OpenAI's Whisper model, rewritten in CTranslate2 for dramatically faster inference. Whisper is the most widely tested and proven speech recognition model in the world, powering transcription for millions of users. VoxBar Whisper takes that foundation and adds aggressive anti-hallucination tuning to eliminate the phantom text that plagues default Whisper deployments.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 3 seconds of audio into an in-memory buffer (configurable via quality presets)
- Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
- Feeds the raw audio array directly to Faster-Whisper — no temp WAV file needed (Faster-Whisper accepts numpy arrays)
- Silero VAD pre-filters the audio — Voice Activity Detection strips out silence before the model even sees it, eliminating the #1 source of Whisper hallucinations
- The model transcribes with full anti-hallucination settings:
  - temperature=0.0 — deterministic output, no "creative" text generation
  - condition_on_previous_text=False — prevents hallucination cascading
  - compression_ratio_threshold=2.4 — rejects garbled/repeated output
  - log_prob_threshold=-1.0 — rejects low-confidence segments
  - no_speech_threshold=0.6 — strong silence detection
- Post-transcription hallucination filter catches known patterns ("thank you", "subscribe", "[music]", repeated words)
- Clean text is appended to your textbox
- Repeats forever — each chunk is completely independent
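The per-chunk pipeline above can be sketched in Python. This is a minimal illustration, not VoxBar's actual code: the threshold values and decoding settings come straight from the list above, while the function names (`is_silence`, `transcribe_chunk`) are assumptions, and `vad_filter=True` is Faster-Whisper's flag for the Silero VAD pre-filter.

```python
# Sketch of the per-chunk pipeline (illustrative names, not VoxBar internals).
# Assumes 16 kHz mono float32 audio captured via sounddevice.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 3            # Balanced preset
RMS_SILENCE_THRESHOLD = 0.01

# Decoding settings described above, passed straight to model.transcribe().
ANTI_HALLUCINATION = dict(
    temperature=0.0,                   # deterministic decoding
    condition_on_previous_text=False,  # no hallucination cascading
    compression_ratio_threshold=2.4,   # reject garbled/repeated output
    log_prob_threshold=-1.0,           # reject low-confidence segments
    no_speech_threshold=0.6,           # strong silence detection
    vad_filter=True,                   # Silero VAD pre-filter
)

def is_silence(chunk: np.ndarray, threshold: float = RMS_SILENCE_THRESHOLD) -> bool:
    """Skip chunks whose RMS energy falls below the silence threshold."""
    return float(np.sqrt(np.mean(chunk ** 2))) < threshold

def transcribe_chunk(model, chunk: np.ndarray) -> str:
    """Feed the raw numpy array to Faster-Whisper: no temp WAV file needed."""
    if is_silence(chunk):
        return ""
    segments, _info = model.transcribe(chunk, **ANTI_HALLUCINATION)
    return " ".join(seg.text.strip() for seg in segments)
```

In practice `model` would be a `faster_whisper.WhisperModel`, and the returned text would still pass through the post-transcription hallucination filter before being committed.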
The Anti-Hallucination Stack
Default Whisper is notorious for generating phantom text during silence — "Thank you for watching", "Subscribe", "[Music]", or simply repeating the same phrase over and over. VoxBar Whisper solves this with a three-layer defence:
- Silero VAD (pre-filter) — strips silence before the model processes anything
- Whisper's built-in thresholds (during inference) — the compression_ratio, log_prob, and no_speech thresholds reject bad output
- Pattern matching (post-filter) — catches known hallucination phrases and repetition patterns
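The third layer, pattern matching, can be sketched as a small filter. The phrase list below is a representative subset assumed for illustration; VoxBar's actual pattern list is not shown in this document, and `is_hallucination` is a hypothetical name.

```python
import re

# Representative subset of known hallucination phrases (illustrative only).
HALLUCINATION_PATTERNS = [
    r"^\s*thank(s| you)( for watching)?\.?\s*$",
    r"^\s*(please )?(like and )?subscribe\.?\s*$",
    r"^\s*\[\s*(music|applause|silence)\s*\]\s*$",
]

def is_hallucination(text: str) -> bool:
    """Return True for known phantom phrases or one word repeated over and over."""
    lowered = text.strip().lower()
    if not lowered:
        return True
    if any(re.match(p, lowered) for p in HALLUCINATION_PATTERNS):
        return True
    words = lowered.split()
    # Repetition check: a long run of one identical token is suspect.
    if len(words) >= 4 and len(set(words)) == 1:
        return True
    return False
```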
This means VoxBar Whisper produces significantly cleaner output than any off-the-shelf Whisper implementation.
Recording Limits
VoxBar Whisper has no recording limit. Like VoxBar AI and Ultra, it runs natively with no Docker, no server, and no network connections. Because each 3-second chunk is processed independently, resource usage stays fixed no matter how long you record.
Flush-on-Stop
When you press Stop, VoxBar Whisper transcribes any remaining audio still in the buffer. This means you never lose the last few words of a sentence — even if you stop mid-speech, the final chunk is processed and committed before shutdown.
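Flush-on-stop can be sketched with a small buffer class. `ChunkBuffer`, its method names, and the block sizes are hypothetical, chosen for illustration; the behaviour (drain whatever partial chunk remains when Stop is pressed) matches the description above.

```python
import numpy as np

class ChunkBuffer:
    """Minimal sketch of flush-on-stop (illustrative, not VoxBar's class).
    Audio blocks accumulate until a full chunk is ready; stop() drains
    whatever partial audio remains so trailing words are never lost."""

    def __init__(self, chunk_samples: int = 48_000):  # 3 s at 16 kHz
        self.chunk_samples = chunk_samples
        self._blocks: list[np.ndarray] = []
        self._count = 0

    def push(self, block: np.ndarray):
        """Add one capture block; return a full chunk when one is ready."""
        self._blocks.append(block)
        self._count += len(block)
        if self._count >= self.chunk_samples:
            return self._drain()
        return None

    def stop(self):
        """Flush-on-stop: return the partial final chunk (may be short)."""
        return self._drain() if self._count else None

    def _drain(self) -> np.ndarray:
        chunk = np.concatenate(self._blocks)
        self._blocks, self._count = [], 0
        return chunk
```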
Auto-Stop Behaviour
- Silence timeout: 60 seconds of no detected speech
- Check interval: Every 5 seconds
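The timeout logic above amounts to comparing a last-speech timestamp against the 60-second limit on each 5-second check. A minimal sketch, with `AutoStop` as a hypothetical class name and an injectable clock for testability:

```python
import time

class AutoStop:
    """Sketch of the auto-stop timer (illustrative names). A periodic
    check (every 5 s) calls should_stop(); recording halts once no
    speech has been detected for `timeout` seconds."""

    def __init__(self, timeout: float = 60.0, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._last_speech = clock()

    def speech_detected(self):
        """Called whenever a chunk passes the silence/VAD checks."""
        self._last_speech = self._clock()

    def should_stop(self) -> bool:
        return self._clock() - self._last_speech >= self.timeout
```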
Memory & Resource Footprint
| Resource | Usage (Base model) | Usage (Small model) | Behaviour Over Time |
|---|---|---|---|
| GPU VRAM | ~1GB (CUDA) | ~2GB (CUDA) | ✅ Fixed — never grows |
| CPU mode | Moderate CPU usage | Higher CPU usage | ✅ Works without any GPU |
| RAM | ~300MB | ~500MB | ✅ Stable |
| Disk | Zero temp files | Zero temp files | ✅ Audio is processed from memory — no disk I/O |
| Network | None | None | ✅ Completely offline |
Note: Faster-Whisper accepts raw numpy arrays directly — unlike VoxBar AI and Ultra which write temp WAV files for NeMo, VoxBar Whisper does zero disk I/O during transcription.
Quality Presets
VoxBar Whisper is the only model in the suite with user-selectable quality presets, letting users trade accuracy for speed:
| Preset | Model Size | Chunk Duration | Best For |
|---|---|---|---|
| ⚡ Speed | tiny (39M params) | 2 seconds | Quick notes, brainstorming, low-power machines |
| ⚖ Balanced | base (74M params) | 3 seconds | General use (default) |
| 🎯 Accuracy | small (244M params) | 4 seconds | Important documents, meetings |
Switching presets tears down and rebuilds the engine — the model is swapped out live, with no restart needed. Users can switch mid-session from the Settings menu.
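The teardown-and-rebuild can be sketched as follows. The preset values mirror the table above; the `Engine` class and the `device="auto"`/`compute_type="int8"` arguments are illustrative assumptions (the document mentions int8 quantisation for CPU mode).

```python
# Sketch of preset switching by rebuilding the engine (illustrative names).
# Preset values mirror the Quality Presets table above.
PRESETS = {
    "speed":    {"model_size": "tiny",  "chunk_seconds": 2},
    "balanced": {"model_size": "base",  "chunk_seconds": 3},
    "accuracy": {"model_size": "small", "chunk_seconds": 4},
}

class Engine:
    def __init__(self, preset: str = "balanced"):
        self.model = None
        self.apply_preset(preset)

    def apply_preset(self, preset: str):
        """Tear down the current model and load the preset's model size."""
        cfg = PRESETS[preset]
        self.chunk_seconds = cfg["chunk_seconds"]
        self.model = None                    # drop the old model first
        self.model = self._load(cfg["model_size"])

    def _load(self, size: str):
        # Deferred import so the sketch is importable without faster-whisper.
        from faster_whisper import WhisperModel
        return WhisperModel(size, device="auto", compute_type="int8")
```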
Architecture Advantage
What makes VoxBar Whisper special: It's built on the most battle-tested speech model in existence. OpenAI's Whisper was trained on 680,000 hours of multilingual audio — more training data than any other ASR model. Combined with CTranslate2's optimised inference engine, it delivers:
- 99 language support — the widest language coverage of any VoxBar model
- Cross-platform hardware support — NVIDIA, AMD, Intel, or pure CPU
- Multiple model sizes — users choose their own accuracy/speed trade-off
- Mature ecosystem — extensive community testing, known behaviour, predictable results
What users DON'T have to worry about:
- ❌ No GPU required — works on pure CPU (int8 quantised)
- ❌ No Docker — runs natively
- ❌ No internet connection — completely offline
- ❌ No temp files — processes audio directly from memory
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every 2-4 seconds depending on preset)
- ⚠️ Accuracy varies by model size — tiny is fast but rough; small is slower but much better
- ⚠️ Still needs real-world tuning — hallucination filter has been implemented but needs more testing against diverse audio inputs
- ⚠️ First launch downloads model files (~75MB for base, ~500MB for small — cached after that)
Accuracy & Speed
| Metric | Speed Preset | Balanced Preset | Accuracy Preset |
|---|---|---|---|
| Model | tiny (39M) | base (74M) | small (244M) |
| Chunk | 2 seconds | 3 seconds | 4 seconds |
| WER | ~12-15% | ~8-10% | ~5-7% |
| VRAM (GPU) | ~0.5GB | ~1GB | ~2GB |
| CPU viable | ✅ Fast | ✅ Usable | ⚠️ Slow |
| Languages | 99 | 99 | 99 |
| Punctuation | Basic | Moderate | Good |
99 Languages
VoxBar Whisper is the only model in the suite that supports 99 languages out of the box. While VoxBar Pro (Voxtral) supports multiple languages, and VoxBar AI (Canary) supports a handful, Whisper's language coverage is unmatched. For non-English users, this may be the best option regardless of hardware.
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | ❌ Not required | Any GPU for acceleration |
| GPU (NVIDIA) | ✅ Supported (CUDA, float16) | Any NVIDIA GPU |
| GPU (AMD) | ✅ Supported | Any AMD GPU |
| GPU (Intel) | ✅ Supported | Intel integrated |
| CPU-only | ✅ Supported (int8 quantised) | Modern multi-core CPU |
| RAM | 4GB | 8GB+ |
| Disk | ~75MB (base) to ~500MB (small) | SSD recommended |
| OS | Windows 10/11 | Windows 10/11 |
| Software | Python 3.10+ | pip install faster-whisper |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | OpenAI Whisper (via faster-whisper / CTranslate2) |
| Creator | OpenAI (model), Guillaume Klein (faster-whisper) |
| License | MIT (fully commercial, no restrictions) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially with zero restrictions |
VoxBar Whisper has the most permissive license in the entire suite. MIT license means zero attribution requirements, zero restrictions on commercial use, and zero legal concerns.
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| GPU Required | Yes | Yes | Yes | No | No |
| Languages | Multi | Multi | English | English | 99 languages |
| CPU-only | ❌ | ❌ | ❌ | ✅ | ✅ |
| AMD support | Docker | ❌ | ❌ | ✅ | ✅ |
| Quality presets | ❌ | ❌ | ❌ | ❌ | ✅ (3 levels) |
| Anti-hallucination | N/A | Basic filter | None | Needs work | 3-layer defence |
| License | Apache 2.0 | CC-BY-4.0 | CC-BY-4.0 | Apache 2.0 | MIT (most permissive) |
| Best for | Premium live | Long sessions | Fast English | Any hardware | Multilingual, universal |
Bottom line: VoxBar Whisper is the Swiss Army knife of the suite. It may not be the fastest or most accurate at any single thing, but it covers more ground than any other model — 99 languages, any hardware, configurable quality, the most permissive license, and the most battle-tested model on earth. For multilingual users or those who want maximum flexibility, it's the smart choice.