How It Works
VoxBar AI uses NVIDIA's Speech-Augmented Language Model (SALM) — a 2.5 billion parameter model that combines a speech encoder with a full Qwen large language model. This means it doesn't just hear sounds and guess words — it understands context, producing transcription with natural punctuation and intelligent word choices.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 1.5 seconds of audio into a small in-memory buffer (~96KB)
- Checks for silence — if the RMS energy is below 0.03, the chunk is discarded (no wasted processing)
- Writes a tiny temp WAV file (~48KB) to your system temp folder
- Feeds the WAV to the SALM model with a structured chat prompt: "Transcribe this audio to English"
- The model generates text using its LLM backbone — producing accurate, contextual transcription
- Token output is decoded via the model's tokenizer into readable text
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is completely independent
Every chunk is a self-contained operation. There's no growing context window, no accumulating state, no connection to maintain. The GPU just processes one tiny WAV file at a time, indefinitely.
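The per-chunk loop above can be sketched as follows. This is a minimal illustration, not VoxBar's actual source: `transcribe_wav` is a hypothetical stand-in for the SALM call, and the constants mirror the numbers quoted in the steps.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz capture
CHUNK_SECONDS = 1.5    # one chunk per inference call
SILENCE_RMS = 0.03     # chunks quieter than this are dropped

def is_silence(chunk: np.ndarray) -> bool:
    """True if the chunk's RMS energy is below the silence threshold."""
    return float(np.sqrt(np.mean(chunk ** 2))) < SILENCE_RMS

def chunk_to_temp_wav(chunk: np.ndarray) -> str:
    """Write a float32 [-1, 1] chunk as a 16-bit mono WAV and return its path."""
    pcm16 = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm16.tobytes())
    return path

def process_chunk(chunk: np.ndarray, transcribe_wav) -> str:
    """One self-contained iteration: gate on silence, transcribe, clean up."""
    if is_silence(chunk):
        return ""                    # no wasted GPU time on quiet chunks
    path = chunk_to_temp_wav(chunk)
    try:
        return transcribe_wav(path)  # hypothetical SALM wrapper
    finally:
        os.remove(path)              # nothing accumulates on disk
```

In the real pipeline the chunk would come from a `sounddevice` input-stream callback; a 1.5-second float32 buffer is 24,000 samples (~96 KB in memory), and the 16-bit WAV written from it is ~48 KB, matching the figures above.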
Recording Limits
VoxBar AI Has No Recording Limit
VoxBar AI can record continuously for as long as you want — hours, all day if you need it.
Unlike VoxBar Pro, which runs inside a Docker container with WebSocket connections that can drop, VoxBar AI runs natively on your machine. There is no server process, no container, no network connection involved. It's just your Python process, the model in GPU memory, and your microphone.
Why It Runs Forever
- Each 1.5-second chunk is completely independent — no state carries over
- GPU memory is fixed — the same model processes the same size input every time
- No WebSocket, no Docker, no server to crash or restart
- No context window that fills up or degrades
Auto-Stop Behaviour
- Silence timeout: 15 minutes (900 seconds) of no detected speech
- Check interval: Every 10 seconds
- This means you can pause for a long coffee break, step away from your desk, or sit in a quiet meeting — VoxBar AI will keep waiting patiently for up to 15 minutes before auto-stopping
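The auto-stop behaviour reduces to a small watchdog check. Here is a sketch of the decision logic, using the 900-second timeout and 10-second interval quoted above; the function name and structure are illustrative, not VoxBar's actual code.

```python
import time
from typing import Optional

SILENCE_TIMEOUT = 900.0   # 15 minutes of no detected speech
CHECK_INTERVAL = 10.0     # how often the watchdog looks

def should_auto_stop(last_speech_time: float, now: Optional[float] = None) -> bool:
    """True once no speech has been detected for SILENCE_TIMEOUT seconds."""
    if now is None:
        now = time.monotonic()
    return (now - last_speech_time) >= SILENCE_TIMEOUT
```

A watchdog would call `should_auto_stop` every `CHECK_INTERVAL` seconds and reset `last_speech_time` whenever a chunk passes the silence gate.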
Real-World Testing (2026-02-17)
During live testing, VoxBar AI ran continuously for over 50 minutes of natural dictation with zero interruptions, zero restarts, and zero degradation. The text quality at minute 50 was identical to minute 1.
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~6-8GB fixed | ✅ Never grows — same model, same chunk size, forever |
| RAM | ~500MB (Python process + model overhead) | ✅ Stable — only the text string grows (negligible) |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline — no internet, no localhost, no sockets |
If you left VoxBar AI running for a 3-hour meeting, it would just keep transcribing every 1.5 seconds without ever stopping, restarting, or degrading. That's a genuine advantage over container-based solutions.
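That claim is easy to sanity-check with arithmetic; each figure below follows directly from the chunk size and WAV size quoted earlier.

```python
SESSION_HOURS = 3
CHUNK_SECONDS = 1.5
WAV_BYTES_PER_CHUNK = 48_000   # 1.5 s * 16,000 Hz * 2 bytes (16-bit mono)

chunks = int(SESSION_HOURS * 3600 / CHUNK_SECONDS)   # 7,200 independent chunks
peak_disk = WAV_BYTES_PER_CHUNK                      # at most one temp WAV alive at a time
total_written = chunks * WAV_BYTES_PER_CHUNK         # ~345 MB written overall, never held

print(f"{chunks} chunks, peak disk {peak_disk} B, total written {total_written} B")
```

Because each WAV is deleted before the next chunk arrives, peak disk usage stays at one file (~48 KB) no matter how long the session runs.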
Architecture Advantage
What makes VoxBar AI special: It combines a FastConformer speech encoder with a Qwen LLM backbone in a single 2.5-billion-parameter model. This is fundamentally different from traditional ASR models that simply map sounds to words. The LLM backbone means:
- Context-aware transcription — it understands what you're saying, not just what sounds you make
- Natural punctuation — periods, commas, and question marks appear where they should
- Intelligent word choices — homophones and ambiguous sounds are resolved using language understanding
- No hallucination on silence — the silence filter (0.03 RMS threshold) prevents phantom text
What users DON'T have to worry about:
- ❌ No Docker required — runs natively, no containers
- ❌ No internet connection — completely offline
- ❌ No WebSocket connections — nothing to drop or reconnect
- ❌ No session limits — record for hours without interruption
- ❌ No cloud processing — your voice never leaves your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~1.5 seconds), not word-by-word like VoxBar Pro
- ⚠️ NVIDIA GPU required — the SALM model needs CUDA
- ⚠️ 6-8GB VRAM — needs a mid-range or better NVIDIA GPU
- ⚠️ First launch downloads ~5GB model files (cached after that)
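Given the hard GPU requirement, a rough pre-flight check can fail fast before the ~5GB download. This heuristic only probes for the `nvidia-smi` binary and is an illustrative assumption, not VoxBar's actual startup code; the authoritative check once PyTorch is installed is `torch.cuda.is_available()`.

```python
import shutil

def nvidia_gpu_likely_present() -> bool:
    """Heuristic: the NVIDIA driver ships nvidia-smi, so finding it on PATH
    suggests a CUDA-capable GPU is installed. Not authoritative; use
    torch.cuda.is_available() for the real test."""
    return shutil.which("nvidia-smi") is not None
```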
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~1.5 seconds |
| Latency | ~1.5-2 seconds from speech to text (chunk processing time) |
| Word Error Rate | ~5.6% (benchmark) — real-world accuracy matches VoxBar Pro |
| Inference Speed | 418x real-time |
| Punctuation | Yes — context-aware, natural placement |
| Capitalisation | Automatic, intelligent |
| Hallucination Filter | Built-in — discards filler words (um, uh) and noise patterns |
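The hallucination-filter row implies a post-processing pass over the decoded text. Here is a minimal sketch of what filler-word stripping could look like; the word list and regex are illustrative assumptions, not the actual filter.

```python
import re

# Illustrative filler tokens; the real filter's list is not documented here.
FILLERS = {"um", "uh", "erm", "hmm"}

def strip_fillers(text: str) -> str:
    """Remove standalone filler words (plus a trailing comma or period)
    and collapse the leftover whitespace."""
    pattern = r"\b(?:" + "|".join(map(re.escape, sorted(FILLERS))) + r")\b[,.]?"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Word boundaries (`\b`) keep the filter from touching words that merely contain a filler, so "humble" and "uhh" pass through untouched.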
Real-World Accuracy
During live testing, VoxBar AI captured natural dictation with zero editing required. Every word was captured accurately, including technical terms, proper nouns, and conversational speech. The accuracy is functionally identical to VoxBar Pro (Voxtral) — the only difference is delivery speed.
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 6GB VRAM | NVIDIA with 8GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 16GB | 16GB+ |
| Disk | ~5GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, PyTorch 2.6+, NeMo (main branch) | Included in venv |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | nvidia/canary-qwen-2.5b |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |
Compared to VoxBar Pro
| Feature | VoxBar Pro | VoxBar AI |
|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ (identical in practice) |
| Text delivery | Real-time (word by word) | Chunked (every 1.5 seconds) |
| Docker required | Yes | No |
| Session stability | May need reconnection after 15-45 min | Runs indefinitely — tested 50+ minutes |
| VRAM | ~8-10GB | ~6-8GB |
| Silence timeout | 5 minutes | 15 minutes |
| Setup complexity | Docker + container management | Single Python environment |
| Best for | Live presentations, watching text appear | Long dictation, meetings, "set and forget" |
Bottom line: VoxBar AI delivers the same accuracy as the flagship, with none of the Docker complexity, and it never stops. For users who don't need word-by-word real-time display, VoxBar AI is arguably the more practical choice.