VoxBar Whisper

The world's most proven speech model. 99 languages. Runs on anything.

Powered by Faster-Whisper (OpenAI Whisper, CTranslate2 Optimised)

How It Works

VoxBar Whisper uses Faster-Whisper — a highly optimised reimplementation of OpenAI's Whisper model, built on CTranslate2 for dramatically faster inference. Whisper is the most widely tested and proven speech recognition model in the world, powering transcription for millions of users. VoxBar Whisper takes that foundation and adds aggressive anti-hallucination tuning to eliminate the phantom text that plagues default Whisper deployments.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice — captures audio at 16kHz in 1024-sample blocks
  2. Buffers 3 seconds of audio in memory (chunk length configurable via quality presets)
  3. Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
  4. Feeds the raw audio array directly to Faster-Whisper — no temp WAV file needed (Faster-Whisper accepts numpy arrays)
  5. Silero VAD pre-filters the audio — Voice Activity Detection strips out silence before the model even sees it, eliminating the #1 source of Whisper hallucinations
  6. The model transcribes with full anti-hallucination settings:
     • temperature=0.0 — deterministic output, no "creative" text generation
     • condition_on_previous_text=False — prevents hallucination cascading
     • compression_ratio_threshold=2.4 — rejects garbled/repeated output
     • log_prob_threshold=-1.0 — rejects low-confidence segments
     • no_speech_threshold=0.6 — strong silence detection
  7. A post-transcription hallucination filter catches known patterns ("thank you", "subscribe", "[music]", repeated words)
  8. Clean text is appended to your textbox
  9. Repeats forever — each chunk is completely independent
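The per-chunk pipeline above can be sketched using faster-whisper's actual `transcribe` parameters. This is a minimal illustration, not VoxBar's source code; the capture loop is shown as comments because it needs a live microphone and a downloaded model:

```python
import numpy as np

SAMPLE_RATE = 16_000       # step 1: 16kHz capture
SILENCE_RMS = 0.01         # step 3: RMS energy gate

def is_silence(chunk: np.ndarray, threshold: float = SILENCE_RMS) -> bool:
    """Step 3: skip chunks whose RMS energy falls below the threshold."""
    return float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2))) < threshold

def transcribe_chunk(model, chunk: np.ndarray) -> str:
    """Steps 4-6: the raw float32 array goes straight to Faster-Whisper."""
    segments, _info = model.transcribe(
        chunk,
        vad_filter=True,                   # step 5: Silero VAD pre-filter
        temperature=0.0,                   # deterministic decoding
        condition_on_previous_text=False,  # no hallucination cascading
        compression_ratio_threshold=2.4,   # reject garbled/repeated output
        log_prob_threshold=-1.0,           # reject low-confidence segments
        no_speech_threshold=0.6,           # strong silence detection
    )
    return " ".join(seg.text.strip() for seg in segments)

# Driver (requires a microphone and the faster-whisper package):
# from faster_whisper import WhisperModel
# import sounddevice as sd
# model = WhisperModel("base", device="auto", compute_type="int8")
# ...record 3 seconds of audio into `chunk`, then:
# if not is_silence(chunk):
#     print(transcribe_chunk(model, chunk))
```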

The Anti-Hallucination Stack

Default Whisper is notorious for generating phantom text during silence — "Thank you for watching", "Subscribe", "[Music]", or simply repeating the same phrase over and over. VoxBar Whisper solves this with a three-layer defence:

  1. Silero VAD (pre-filter) — strips silence before the model processes anything
  2. Whisper's built-in thresholds (during inference) — compression_ratio, log_prob, and no_speech thresholds reject bad output
  3. Pattern matching (post-filter) — catches known hallucination phrases and repetition patterns
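Layer 3 can be illustrated with a simple pattern filter. The phrase list and repetition heuristic below are hypothetical stand-ins for VoxBar's actual rules:

```python
import re

# Illustrative hallucination phrases — not VoxBar's actual list.
HALLUCINATION_PATTERNS = [
    r"^\s*thank you[.!]?\s*$",
    r"^\s*thanks? for watching[.!]?\s*$",
    r"subscribe",
    r"\[\s*music\s*\]",
]

def is_hallucination(text: str) -> bool:
    """Return True if a transcribed segment matches a known phantom pattern."""
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in HALLUCINATION_PATTERNS):
        return True
    # Repetition check: one word repeated for the whole segment.
    words = lowered.split()
    return len(words) >= 4 and len(set(words)) == 1
```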

This means VoxBar Whisper produces significantly cleaner output than any off-the-shelf Whisper implementation.

Recording Limits

VoxBar Whisper Has No Recording Limit

Each 3-second chunk is completely independent, so a session can run indefinitely with a fixed memory footprint. Like VoxBar AI and Ultra, VoxBar Whisper runs natively — no Docker, no server, no network connections.

Flush-on-Stop

When you press Stop, VoxBar Whisper transcribes any remaining audio still in the buffer. This means you never lose the last few words of a sentence — even if you stop mid-speech, the final chunk is processed and committed before shutdown.

Auto-Stop Behaviour

  • Silence timeout: 60 seconds of no detected speech
  • Check interval: Every 5 seconds

Memory & Resource Footprint

| Resource | Usage (Base model) | Usage (Small model) | Behaviour over time |
|---|---|---|---|
| GPU VRAM | ~1GB (CUDA) | ~2GB (CUDA) | ✅ Fixed — never grows |
| CPU mode | Moderate CPU usage | Higher CPU usage | ✅ Works without any GPU |
| RAM | ~300MB | ~500MB | ✅ Stable |
| Disk | Zero temp files | Zero temp files | ✅ Audio is processed from memory — no disk I/O |
| Network | None | None | ✅ Completely offline |

Note: Faster-Whisper accepts raw numpy arrays directly — unlike VoxBar AI and Ultra which write temp WAV files for NeMo, VoxBar Whisper does zero disk I/O during transcription.

Quality Presets

VoxBar Whisper is the only model in the suite with user-selectable quality presets, letting users trade accuracy for speed:

| Preset | Model size | Chunk duration | Best for |
|---|---|---|---|
| Speed | tiny (39M params) | 2 seconds | Quick notes, brainstorming, low-power machines |
| Balanced | base (74M params) | 3 seconds | General use (default) |
| 🎯 Accuracy | small (244M params) | 4 seconds | Important documents, meetings |

Switching presets tears down and rebuilds the engine — the model is swapped in real time, with no restart needed. Users can switch mid-session from the Settings menu.
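A sketch of how such a hot-swap might look — the preset table mirrors the documented tiny/base/small mapping, while the function and its `loader` parameter are hypothetical:

```python
# Preset mapping from the table above.
PRESETS = {
    "speed":    {"model": "tiny",  "chunk_seconds": 2},
    "balanced": {"model": "base",  "chunk_seconds": 3},
    "accuracy": {"model": "small", "chunk_seconds": 4},
}

def rebuild_engine(preset: str, loader=None):
    """Drop the old model and load the new preset's model in its place."""
    cfg = PRESETS[preset]
    if loader is None:
        # Real path: load via faster-whisper (downloads weights on first use).
        from faster_whisper import WhisperModel
        loader = lambda name: WhisperModel(name, device="auto", compute_type="int8")
    return loader(cfg["model"]), cfg["chunk_seconds"]
```

The `loader` hook keeps the model construction injectable, so the swap logic can be exercised without downloading weights.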

Architecture Advantage

What makes VoxBar Whisper special: It's built on the most battle-tested speech model in existence. OpenAI's Whisper was trained on 680,000 hours of multilingual audio — more training data than any other ASR model. Combined with CTranslate2's optimised inference engine, it delivers:

  • 99 language support — the widest language coverage of any VoxBar model
  • Cross-platform hardware support — NVIDIA, AMD, Intel, or pure CPU
  • Multiple model sizes — users choose their own accuracy/speed trade-off
  • Mature ecosystem — extensive community testing, known behaviour, predictable results

What users DON'T have to worry about:
- ❌ No GPU required — works on pure CPU (int8 quantised)
- ❌ No Docker — runs natively
- ❌ No internet connection — completely offline
- ❌ No temp files — processes audio directly from memory
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever

What users DO need to know:
- ⚠️ Text arrives in chunks (every 2-4 seconds depending on preset)
- ⚠️ Accuracy varies by model size — tiny is fast but rough; small is slower but much better
- ⚠️ Still needs real-world tuning — hallucination filter has been implemented but needs more testing against diverse audio inputs
- ⚠️ First launch downloads model files (~75MB for base, ~500MB for small — cached after that)

Accuracy & Speed

| Metric | Speed preset | Balanced preset | Accuracy preset |
|---|---|---|---|
| Model | tiny (39M) | base (74M) | small (244M) |
| Chunk | 2 seconds | 3 seconds | 4 seconds |
| WER | ~12-15% | ~8-10% | ~5-7% |
| VRAM (GPU) | ~0.5GB | ~1GB | ~2GB |
| CPU viable | ✅ Fast | ✅ Usable | ⚠️ Slow |
| Languages | 99 | 99 | 99 |
| Punctuation | Basic | Moderate | Good |

99 Languages

VoxBar Whisper is the only model in the suite that supports 99 languages out of the box. While VoxBar Pro (Voxtral) supports multiple languages, and VoxBar AI (Canary) supports a handful, Whisper's language coverage is unmatched. For non-English users, this may be the best option regardless of hardware.

Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | ❌ Not required | Any GPU for acceleration |
| GPU (NVIDIA) | ✅ Supported (CUDA, float16) | Any NVIDIA GPU |
| GPU (AMD) | ✅ Supported | Any AMD GPU |
| GPU (Intel) | ✅ Supported | Intel integrated |
| CPU-only | ✅ Supported (int8 quantised) | Modern multi-core CPU |
| RAM | 4GB | 8GB+ |
| Disk | ~75MB (base) to ~500MB (small) | SSD recommended |
| OS | Windows 10/11 | Windows 10/11 |
| Software | Python 3.10+ | pip install faster-whisper |
| Docker | ❌ Not required | |

License & Attribution

| Detail | Value |
|---|---|
| Model | OpenAI Whisper (via faster-whisper / CTranslate2) |
| Creator | OpenAI (model), Guillaume Klein (faster-whisper) |
| License | MIT (fully commercial, no restrictions) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially with zero restrictions |

VoxBar Whisper has the most permissive license in the entire suite. MIT license means zero attribution requirements, zero restrictions on commercial use, and zero legal concerns.

Where It Fits in the Suite

| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| GPU required | Yes | Yes | Yes | No | No |
| Languages | Multi | Multi | English | English | 99 languages |
| CPU-only | No | No | No | Yes | Yes |
| AMD support | | | | | ✅ |
| Docker | | No | No | | No |
| Quality presets | | | | | ✅ (3 levels) |
| Anti-hallucination | N/A | Basic filter | None | Needs work | 3-layer defence |
| License | Apache 2.0 | CC-BY-4.0 | CC-BY-4.0 | Apache 2.0 | MIT (most permissive) |
| Best for | Premium live | Long sessions | Fast English | Any hardware | Multilingual, universal |

Bottom line: VoxBar Whisper is the Swiss Army knife of the suite. It may not be the fastest or most accurate at any single thing, but it covers more ground than any other model — 99 languages, any hardware, configurable quality, the most permissive license, and the most battle-tested model on earth. For multilingual users or those who want maximum flexibility, it's the smart choice.