How It Works
VoxBar Whisper uses Faster-Whisper — a highly optimised implementation of OpenAI's Whisper model, rewritten in CTranslate2 for dramatically faster inference. Whisper is the most widely tested and proven speech recognition model in the world, powering transcription for millions of users. VoxBar Whisper takes that foundation and adds aggressive anti-hallucination tuning to eliminate the phantom text that plagues default Whisper deployments.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 3 seconds of audio into an in-memory buffer (configurable via quality presets)
- Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
- Feeds the raw audio array directly to Faster-Whisper — no temp WAV file needed (Faster-Whisper accepts numpy arrays)
- Silero VAD pre-filters the audio — Voice Activity Detection strips out silence before the model even sees it, eliminating the #1 source of Whisper hallucinations
- The model transcribes with full anti-hallucination settings:
  - temperature=0.0 — deterministic output, no "creative" text generation
  - condition_on_previous_text=False — prevents hallucination cascading
  - compression_ratio_threshold=2.4 — rejects garbled/repeated output
  - log_prob_threshold=-1.0 — rejects low-confidence segments
  - no_speech_threshold=0.6 — strong silence detection
- Post-transcription hallucination filter catches known patterns ("thank you", "subscribe", "[music]", repeated words)
- Clean text is appended to your textbox
- Repeats forever — each chunk is completely independent
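The per-chunk pipeline above can be sketched in Python. This is a minimal illustration, not VoxBar's actual code: the threshold values and decoding settings come straight from the list above, while the function names (`is_silence`, `transcribe_chunk`) are assumptions, and `vad_filter=True` is Faster-Whisper's flag for the Silero VAD pre-filter.

```python
# Sketch of the per-chunk pipeline (illustrative names, not VoxBar internals).
# Assumes 16 kHz mono float32 audio captured via sounddevice.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 3            # Balanced preset
RMS_SILENCE_THRESHOLD = 0.01

# Decoding settings described above, passed straight to model.transcribe().
ANTI_HALLUCINATION = dict(
    temperature=0.0,                   # deterministic decoding
    condition_on_previous_text=False,  # no hallucination cascading
    compression_ratio_threshold=2.4,   # reject garbled/repeated output
    log_prob_threshold=-1.0,           # reject low-confidence segments
    no_speech_threshold=0.6,           # strong silence detection
    vad_filter=True,                   # Silero VAD pre-filter
)

def is_silence(chunk: np.ndarray, threshold: float = RMS_SILENCE_THRESHOLD) -> bool:
    """Skip chunks whose RMS energy falls below the silence threshold."""
    return float(np.sqrt(np.mean(chunk ** 2))) < threshold

def transcribe_chunk(model, chunk: np.ndarray) -> str:
    """Feed the raw numpy array to Faster-Whisper: no temp WAV file needed."""
    if is_silence(chunk):
        return ""
    segments, _info = model.transcribe(chunk, **ANTI_HALLUCINATION)
    return " ".join(seg.text.strip() for seg in segments)
```

In practice `model` would be a `faster_whisper.WhisperModel`, and the returned text would still pass through the post-transcription hallucination filter before being committed.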
The Anti-Hallucination Stack
Default Whisper is notorious for generating phantom text during silence — "Thank you for watching", "Subscribe", "[Music]", or simply repeating the same phrase over and over. VoxBar Whisper solves this with a three-layer defence:
- Silero VAD (pre-filter) — strips silence before the model processes anything
- Whisper's built-in thresholds (during inference) — the compression_ratio, log_prob, and no_speech thresholds reject bad output
- Pattern matching (post-filter) — catches known hallucination phrases and repetition patterns
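The third layer, pattern matching, can be sketched as a small filter. The phrase list below is a representative subset assumed for illustration; VoxBar's actual pattern list is not shown in this document, and `is_hallucination` is a hypothetical name.

```python
import re

# Representative subset of known hallucination phrases (illustrative only).
HALLUCINATION_PATTERNS = [
    r"^\s*thank(s| you)( for watching)?\.?\s*$",
    r"^\s*(please )?(like and )?subscribe\.?\s*$",
    r"^\s*\[\s*(music|applause|silence)\s*\]\s*$",
]

def is_hallucination(text: str) -> bool:
    """Return True for known phantom phrases or one word repeated over and over."""
    lowered = text.strip().lower()
    if not lowered:
        return True
    if any(re.match(p, lowered) for p in HALLUCINATION_PATTERNS):
        return True
    words = lowered.split()
    # Repetition check: a long run of one identical token is suspect.
    if len(words) >= 4 and len(set(words)) == 1:
        return True
    return False
```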
This means VoxBar Whisper produces significantly cleaner output than any off-the-shelf Whisper implementation.
Recording Limits
VoxBar Whisper has no recording limit. Like VoxBar AI and Ultra, it runs natively with no Docker, no server, and no network connections. Because each 3-second chunk is processed independently, resource usage stays fixed no matter how long you record.
Flush-on-Stop
When you press Stop, VoxBar Whisper transcribes any remaining audio still in the buffer. This means you never lose the last few words of a sentence — even if you stop mid-speech, the final chunk is processed and committed before shutdown.
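Flush-on-stop can be sketched with a small buffer class. `ChunkBuffer`, its method names, and the block sizes are hypothetical, chosen for illustration; the behaviour (drain whatever partial chunk remains when Stop is pressed) matches the description above.

```python
import numpy as np

class ChunkBuffer:
    """Minimal sketch of flush-on-stop (illustrative, not VoxBar's class).
    Audio blocks accumulate until a full chunk is ready; stop() drains
    whatever partial audio remains so trailing words are never lost."""

    def __init__(self, chunk_samples: int = 48_000):  # 3 s at 16 kHz
        self.chunk_samples = chunk_samples
        self._blocks: list[np.ndarray] = []
        self._count = 0

    def push(self, block: np.ndarray):
        """Add one capture block; return a full chunk when one is ready."""
        self._blocks.append(block)
        self._count += len(block)
        if self._count >= self.chunk_samples:
            return self._drain()
        return None

    def stop(self):
        """Flush-on-stop: return the partial final chunk (may be short)."""
        return self._drain() if self._count else None

    def _drain(self) -> np.ndarray:
        chunk = np.concatenate(self._blocks)
        self._blocks, self._count = [], 0
        return chunk
```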
Auto-Stop Behaviour
- Silence timeout: 60 seconds of no detected speech
- Check interval: Every 5 seconds
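The timeout logic above amounts to comparing a last-speech timestamp against the 60-second limit on each 5-second check. A minimal sketch, with `AutoStop` as a hypothetical class name and an injectable clock for testability:

```python
import time

class AutoStop:
    """Sketch of the auto-stop timer (illustrative names). A periodic
    check (every 5 s) calls should_stop(); recording halts once no
    speech has been detected for `timeout` seconds."""

    def __init__(self, timeout: float = 60.0, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._last_speech = clock()

    def speech_detected(self):
        """Called whenever a chunk passes the silence/VAD checks."""
        self._last_speech = self._clock()

    def should_stop(self) -> bool:
        return self._clock() - self._last_speech >= self.timeout
```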
Memory & Resource Footprint
| Resource | Usage (Base model) | Usage (Small model) | Behaviour Over Time |
|---|---|---|---|
| GPU VRAM | ~1GB (CUDA) | ~2GB (CUDA) | ✅ Fixed — never grows |
| CPU mode | Moderate CPU usage | Higher CPU usage | ✅ Works without any GPU |
| RAM | ~300MB | ~500MB | ✅ Stable |
| Disk | Zero temp files | Zero temp files | ✅ Audio is processed from memory — no disk I/O |
| Network | None | None | ✅ Completely offline |
Note: Faster-Whisper accepts raw numpy arrays directly — unlike VoxBar AI and Ultra which write temp WAV files for NeMo, VoxBar Whisper does zero disk I/O during transcription.
Quality Presets
VoxBar Whisper is the only model in the suite with user-selectable quality presets, letting users trade accuracy for speed:
| Preset | Model Size | Chunk Duration | Best For |
|---|---|---|---|
| ⚡ Speed | tiny (39M params) | 2 seconds | Quick notes, brainstorming, low-power machines |
| ⚖ Balanced | base (74M params) | 3 seconds | General use (default) |
| 🎯 Accuracy | small (244M params) | 4 seconds | Important documents, meetings |
Switching presets tears down and rebuilds the engine — the model is swapped out live, with no restart needed. Users can switch mid-session from the Settings menu.
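The teardown-and-rebuild can be sketched as follows. The preset values mirror the table above; the `Engine` class and the `device="auto"`/`compute_type="int8"` arguments are illustrative assumptions (the document mentions int8 quantisation for CPU mode).

```python
# Sketch of preset switching by rebuilding the engine (illustrative names).
# Preset values mirror the Quality Presets table above.
PRESETS = {
    "speed":    {"model_size": "tiny",  "chunk_seconds": 2},
    "balanced": {"model_size": "base",  "chunk_seconds": 3},
    "accuracy": {"model_size": "small", "chunk_seconds": 4},
}

class Engine:
    def __init__(self, preset: str = "balanced"):
        self.model = None
        self.apply_preset(preset)

    def apply_preset(self, preset: str):
        """Tear down the current model and load the preset's model size."""
        cfg = PRESETS[preset]
        self.chunk_seconds = cfg["chunk_seconds"]
        self.model = None                    # drop the old model first
        self.model = self._load(cfg["model_size"])

    def _load(self, size: str):
        # Deferred import so the sketch is importable without faster-whisper.
        from faster_whisper import WhisperModel
        return WhisperModel(size, device="auto", compute_type="int8")
```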
Architecture Advantage
What makes VoxBar Whisper special: It's built on the most battle-tested speech model in existence. OpenAI's Whisper was trained on 680,000 hours of multilingual audio — more training data than any other ASR model. Combined with CTranslate2's optimised inference engine, it delivers:
- 99 language support — the widest language coverage of any VoxBar model
- Cross-platform hardware support — NVIDIA, AMD, Intel, or pure CPU
- Multiple model sizes — users choose their own accuracy/speed trade-off
- Mature ecosystem — extensive community testing, known behaviour, predictable results
What users DON'T have to worry about:
- ❌ No GPU required — works on pure CPU (int8 quantised)
- ❌ No Docker — runs natively
- ❌ No internet connection — completely offline
- ❌ No temp files — processes audio directly from memory
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every 2-4 seconds depending on preset)
- ⚠️ Accuracy varies by model size — tiny is fast but rough; small is slower but much better
- ⚠️ Still needs real-world tuning — hallucination filter has been implemented but needs more testing against diverse audio inputs
- ⚠️ First launch downloads model files (~75MB for base, ~500MB for small — cached after that)
Accuracy & Speed
| Metric | Speed Preset | Balanced Preset | Accuracy Preset |
|---|---|---|---|
| Model | tiny (39M) | base (74M) | small (244M) |
| Chunk | 2 seconds | 3 seconds | 4 seconds |
| WER | ~12-15% | ~8-10% | ~5-7% |
| VRAM (GPU) | ~0.5GB | ~1GB | ~2GB |
| CPU viable | ✅ Fast | ✅ Usable | ⚠️ Slow |
| Languages | 99 | 99 | 99 |
| Punctuation | Basic | Moderate | Good |
99 Languages
VoxBar Whisper is the only model in the suite that supports 99 languages out of the box. While VoxBar Pro (Voxtral) supports multiple languages, and VoxBar AI (Canary) supports a handful, Whisper's language coverage is unmatched. For non-English users, this may be the best option regardless of hardware.
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | ❌ Not required | Any GPU for acceleration |
| GPU (NVIDIA) | ✅ Supported (CUDA, float16) | Any NVIDIA GPU |
| GPU (AMD) | ✅ Supported | Any AMD GPU |
| GPU (Intel) | ✅ Supported | Intel integrated |
| CPU-only | ✅ Supported (int8 quantised) | Modern multi-core CPU |
| RAM | 4GB | 8GB+ |
| Disk | ~75MB (base) to ~500MB (small) | SSD recommended |
| OS | Windows 10/11 | Windows 10/11 |
| Software | Python 3.10+ | pip install faster-whisper |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | OpenAI Whisper (via faster-whisper / CTranslate2) |
| Creator | OpenAI (model), Guillaume Klein (faster-whisper) |
| License | MIT (fully commercial, no restrictions) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially with zero restrictions |
VoxBar Whisper has the most permissive license in the entire suite. MIT license means zero attribution requirements, zero restrictions on commercial use, and zero legal concerns.
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| GPU Required | Yes | Yes | Yes | No | No |
| Languages | Multi | Multi | English | English | 99 languages |
| CPU-only | ❌ | ❌ | ❌ | ✅ | ✅ |
| AMD support | Docker | ❌ | ❌ | ✅ | ✅ |
| Quality presets | ❌ | ❌ | ❌ | ❌ | ✅ (3 levels) |
| Anti-hallucination | N/A | Basic filter | None | Needs work | 3-layer defence |
| License | Apache 2.0 | CC-BY-4.0 | CC-BY-4.0 | Apache 2.0 | MIT (most permissive) |
| Best for | Premium live | Long sessions | Fast English | Any hardware | Multilingual, universal |
Bottom line: VoxBar Whisper is the Swiss Army knife of the suite. It may not be the fastest or most accurate at any single thing, but it covers more ground than any other model — 99 languages, any hardware, configurable quality, the most permissive license, and the most battle-tested model on earth. For multilingual users or those who want maximum flexibility, it's the smart choice.