#4 — Budget NVIDIA

VoxBar Flash

NVIDIA accuracy on a budget. Half the VRAM. All the intelligence.

Powered by NVIDIA Canary 1B Flash (FastConformer ASR)

How It Works

VoxBar Flash uses NVIDIA's Canary 1B Flash — a compact, fast ASR model from the same Canary family as VoxBar AI's 2.5B powerhouse, but at a fraction of the resource cost. While VoxBar AI uses the SALM (Speech-Augmented Language Model) architecture with a full LLM backbone, Canary 1B Flash uses the standard NeMo ASR pipeline — the same proven architecture as VoxBar Ultra's Parakeet TDT.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
  2. Buffers 2 seconds of audio into a small in-memory buffer
  3. Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
  4. Writes a tiny temp WAV file to your system temp folder
  5. Feeds the WAV to Canary 1B Flash via NeMo's standard model.transcribe() API
  6. Single-pass transcription — the model processes the full chunk and returns complete text
  7. Temp file is immediately deleted — nothing accumulates on disk
  8. Text is appended to your textbox
  9. Repeats forever — each chunk is completely independent
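The loop above can be sketched in Python. This is a minimal illustration, not VoxBar's actual source: the 16kHz rate, 2-second chunks, and 0.01 RMS gate come from the steps above, while the helper names (`is_silent`, `capture_loop`) are hypothetical, and the 1024-sample block buffering is collapsed into one blocking `sounddevice.rec()` call for brevity.

```python
import math
import os
import tempfile

SAMPLE_RATE = 16_000      # 16 kHz capture, per the pipeline description
CHUNK_SECONDS = 2         # each chunk is independent
SILENCE_RMS = 0.01        # chunks below this RMS energy are skipped

def is_silent(samples):
    """RMS energy gate: True when the chunk holds no detectable speech."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS

def capture_loop(model, emit):
    """Record -> gate -> temp WAV -> transcribe -> delete, forever.
    Heavy deps are imported here so is_silent() stays dependency-free."""
    import sounddevice as sd
    import soundfile as sf

    while True:
        # 1-2. capture one 2-second chunk at 16 kHz
        chunk = sd.rec(SAMPLE_RATE * CHUNK_SECONDS,
                       samplerate=SAMPLE_RATE, channels=1,
                       dtype="float32", blocking=True).flatten()
        # 3. skip silent chunks
        if is_silent(chunk.tolist()):
            continue
        # 4. write a throwaway WAV in the system temp folder
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            sf.write(path, chunk, SAMPLE_RATE)
            # 5-6. single-pass transcription via NeMo's standard API
            # (some NeMo releases return strings, others hypothesis objects)
            result = model.transcribe([path])[0]
            emit(result.text if hasattr(result, "text") else str(result))
        finally:
            os.remove(path)              # 7. nothing accumulates on disk
```

Each pass through the loop is self-contained, which is why chunks stay completely independent (step 9).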

The "Flash" Advantage

The name says it all — Canary 1B Flash is designed for speed. At just 1 billion parameters (vs 2.5B for VoxBar AI), it loads faster, processes faster, and uses roughly half the VRAM. It's the sweet spot between VoxBar Ultra's ultra-lightweight 0.6B Parakeet and VoxBar AI's heavyweight 2.5B SALM.

Recording Limits

VoxBar Flash Has No Recording Limit

Like all non-Docker VoxBar models, Flash runs natively on your machine with no container, no server, and no WebSocket. Each 2-second chunk is completely independent — the GPU processes the same model with the same input size every time.

Flush-on-Stop

When you press Stop, VoxBar Flash transcribes any remaining audio still in the buffer before shutting down. You never lose the last words of a sentence — even if you stop mid-speech.

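The flush-on-stop idea boils down to a buffer that can hand back its partial chunk. A sketch under assumed names (`ChunkBuffer`, `push`, `flush` are illustrative, not VoxBar's actual code):

```python
class ChunkBuffer:
    """Accumulates incoming audio blocks; drains whatever remains on Stop."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self._samples = []

    def push(self, block):
        """Add a block; return a full chunk when one is ready, else None."""
        self._samples.extend(block)
        if len(self._samples) >= self.chunk_size:
            chunk = self._samples[:self.chunk_size]
            self._samples = self._samples[self.chunk_size:]
            return chunk
        return None

    def flush(self):
        """On Stop: hand back the partial chunk so its words aren't lost."""
        leftover, self._samples = self._samples, []
        return leftover
```

On Stop, the app would pass `flush()`'s output through the same transcribe path before shutting down, which is why you never lose a trailing half-sentence.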

Auto-Stop Behaviour

  • Silence timeout: 60 seconds of no detected speech
  • Check interval: Every 5 seconds
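The timeout logic above can be sketched as a small watchdog. The 60-second timeout and 5-second poll come from the list above; the class name and injectable clock are illustrative choices so the logic reads clearly without real waiting:

```python
SILENCE_TIMEOUT = 60.0   # seconds of no detected speech before auto-stop
CHECK_INTERVAL = 5.0     # how often the recording loop polls should_stop()

class SilenceWatchdog:
    """Auto-stop timer: trips after SILENCE_TIMEOUT seconds with no speech."""

    def __init__(self, clock):
        self._clock = clock              # e.g. time.monotonic in real use
        self._last_speech = clock()

    def speech_detected(self):
        """Call whenever a non-silent chunk arrives; resets the timer."""
        self._last_speech = self._clock()

    def should_stop(self):
        """Polled every CHECK_INTERVAL seconds by the recording loop."""
        return self._clock() - self._last_speech >= SILENCE_TIMEOUT
```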

Memory & Resource Footprint

| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~3-4GB fixed | ✅ Never grows — same model, same chunk size, forever |
| RAM | ~400MB | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |
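The "zero accumulation" Disk row relies on deleting each temp WAV even when transcription fails. A sketch with the standard library only; `transcribe_fn` stands in for NeMo's `model.transcribe([path])`, and the function name is hypothetical:

```python
import os
import tempfile
import wave

def transcribe_via_temp_wav(pcm16_bytes, sample_rate, transcribe_fn):
    """Round one chunk through a temp WAV and guarantee its deletion."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)          # mono
            wav.setsampwidth(2)          # 16-bit PCM
            wav.setframerate(sample_rate)
            wav.writeframes(pcm16_bytes)
        return transcribe_fn(path)
    finally:
        os.remove(path)                  # deleted even if transcribe_fn raises
```

Because deletion sits in a `finally` block, disk usage stays flat no matter how long a session runs.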

The VRAM Sweet Spot

VoxBar Flash sits perfectly in the VRAM gap between the other models:

| Model | VRAM |
|---|---|
| VoxBar Pro (Voxtral 4B) | ~8-10GB |
| VoxBar AI (Canary Qwen 2.5B) | ~6-8GB |
| VoxBar Flash (Canary 1B Flash) | ~3-4GB |
| VoxBar Ultra (Parakeet TDT 0.6B) | ~2GB |

This makes VoxBar Flash ideal for users with mid-range NVIDIA GPUs — the RTX 2060 (6GB), the GTX 1660 (6GB), or even the RTX 4060 (8GB) where you want to leave headroom for other applications.
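The table's budgets can be turned into a toy model picker. The figures come from the table above; the function name and selection rule are illustrative, not part of VoxBar:

```python
# Upper end of each model's VRAM budget from the table above (GB),
# ordered from lightest to heaviest.
VRAM_BUDGETS = [
    ("VoxBar Ultra (Parakeet TDT 0.6B)", 2),
    ("VoxBar Flash (Canary 1B Flash)", 4),
    ("VoxBar AI (Canary Qwen 2.5B)", 8),
    ("VoxBar Pro (Voxtral 4B)", 10),
]

def heaviest_model_that_fits(free_vram_gb):
    """Pick the most capable model whose VRAM budget fits the free VRAM."""
    fitting = [name for name, gb in VRAM_BUDGETS if gb <= free_vram_gb]
    return fitting[-1] if fitting else None
```

With 6GB free this picks VoxBar Flash, matching the mid-range guidance above; in a real setup the free-VRAM figure could come from `torch.cuda.mem_get_info()` when PyTorch is installed.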

Architecture Advantage

What makes VoxBar Flash special: It brings NVIDIA's Canary-family accuracy to a lower VRAM budget without requiring the cutting-edge NeMo trunk, PyTorch 2.6+, or SALM architecture that VoxBar AI demands.

Technical difference from VoxBar AI:

| Aspect | VoxBar AI (2.5B) | VoxBar Flash (1B) |
|---|---|---|
| Architecture | SALM (Speech + LLM) | FastConformer ASR |
| API | SALM.generate() | ASRModel.transcribe() |
| NeMo version | Trunk (bleeding edge) | Stable (≥2.0) |
| PyTorch | 2.6+ required | 2.1+ works |
| Model loading | speechlm2.models.SALM | asr.models.ASRModel |
| Inference | LLM token generation | Single-pass ASR |

This means VoxBar Flash is easier to install, more stable, and more compatible with existing NeMo setups. It doesn't need the experimental SALM codebase.
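The stable-API column above boils down to two calls: a generic `ASRModel.from_pretrained()` load and a `transcribe()` call. A sketch assuming those documented NeMo entry points; the helper names are hypothetical, and loading is kept behind a function since it needs `nemo_toolkit[asr]` and a CUDA GPU:

```python
def first_transcript(model, wav_path):
    """Run the single-pass ASR call and return plain text.

    transcribe() has returned plain strings in some NeMo releases and
    hypothesis objects with a .text attribute in others, so both shapes
    are handled.
    """
    result = model.transcribe([wav_path])[0]
    return result.text if hasattr(result, "text") else str(result)

def load_flash_model():
    """Stable-NeMo loading path (contrast with VoxBar AI's SALM import)."""
    from nemo.collections.asr.models import ASRModel
    return ASRModel.from_pretrained("nvidia/canary-1b-flash")
```

No `speechlm2` import, no LLM token generation loop: one load, one call per chunk.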

What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No bleeding-edge dependencies — works with stable NeMo
- ❌ No special virtual environment — standard pip install
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever

What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ 3-4GB VRAM — needs at least a mid-range NVIDIA GPU
- ⚠️ First launch downloads ~2GB model files (cached after that)

Accuracy & Speed

| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~1 second processing time per chunk |
| Word Error Rate | ~5-7% (between Parakeet's 1.69% and Whisper base's 8-10%) |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), multilingual potential |

Where It Sits on Accuracy

In real-world use, Canary 1B Flash delivers clearly better accuracy than Whisper base while approaching Parakeet, and it uses more VRAM than Parakeet but less than VoxBar AI. It's the balanced middle ground.

Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 3GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | |
| GPU (Apple) | ❌ Not supported | |
| RAM | 8GB | 16GB |
| Disk | ~2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo ≥2.0 | pip install "nemo_toolkit[asr]" |
| Docker | ❌ Not required | |
| PyTorch | 2.1+ (standard, not bleeding edge) | Standard CUDA install |

License & Attribution

| Detail | Value |
|---|---|
| Model | nvidia/canary-1b-flash |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |

Where It Fits in the Suite

| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Flash | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★☆ |
| VRAM | ~8-10GB | ~6-8GB | ~2GB | ~3-4GB | 0GB | 0-2GB |
| Docker | Yes | No | No | No | No | No |
| NeMo | N/A | Trunk | Stable | Stable | N/A | N/A |
| Languages | Multi | Multi | English | English+ | English | 99 |
| Model family | Mistral | NVIDIA Canary | NVIDIA Parakeet | NVIDIA Canary | Useful Sensors | OpenAI |
| Best for | Live streaming | Long sessions | Minimal VRAM | Budget NVIDIA | Any hardware | Multilingual |

Bottom line: VoxBar Flash is the budget NVIDIA option — it brings Canary-family intelligence to users who can't spare 6-8GB of VRAM for the full 2.5B model but still want NVIDIA's accuracy advantage over Whisper and Moonshine. It's easier to install than VoxBar AI (no trunk, no special venv), more accurate than Whisper, and sits in the sweet spot for users with 4-6GB GPUs.