#4 — Budget NVIDIA

VoxBar Flash

NVIDIA accuracy on a budget. Half the VRAM. All the intelligence.

Powered by NVIDIA Canary 1B Flash (FastConformer ASR)

How It Works

VoxBar Flash uses NVIDIA's Canary 1B Flash — a compact, fast ASR model from the same Canary family as VoxBar AI's 2.5B powerhouse, but at a fraction of the resource cost. While VoxBar AI uses the SALM (Speech-Augmented Language Model) architecture with a full LLM backbone, Canary 1B Flash uses the standard NeMo ASR pipeline — the same proven architecture as VoxBar Ultra's Parakeet TDT.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
  2. Buffers 2 seconds of audio into a small in-memory buffer
  3. Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
  4. Writes a tiny temp WAV file to your system temp folder
  5. Feeds the WAV to Canary 1B Flash via NeMo's standard model.transcribe() API
  6. Single-pass transcription — the model processes the full chunk and returns complete text
  7. Temp file is immediately deleted — nothing accumulates on disk
  8. Text is appended to your textbox
  9. Repeats forever — each chunk is completely independent
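The loop above can be sketched in Python. This is a minimal illustration, not VoxBar's actual source: the 16kHz rate, 2-second chunks, and 0.01 RMS gate come from the steps above, while the helper names (`is_silent`, `capture_loop`) are hypothetical, and the 1024-sample block buffering is collapsed into one blocking `sounddevice.rec()` call for brevity.

```python
import math
import os
import tempfile

SAMPLE_RATE = 16_000      # 16 kHz capture, per the pipeline description
CHUNK_SECONDS = 2         # each chunk is independent
SILENCE_RMS = 0.01        # chunks below this RMS energy are skipped

def is_silent(samples):
    """RMS energy gate: True when the chunk holds no detectable speech."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS

def capture_loop(model, emit):
    """Record -> gate -> temp WAV -> transcribe -> delete, forever.
    Heavy deps are imported here so is_silent() stays dependency-free."""
    import sounddevice as sd
    import soundfile as sf

    while True:
        # 1-2. capture one 2-second chunk at 16 kHz
        chunk = sd.rec(SAMPLE_RATE * CHUNK_SECONDS,
                       samplerate=SAMPLE_RATE, channels=1,
                       dtype="float32", blocking=True).flatten()
        # 3. skip silent chunks
        if is_silent(chunk.tolist()):
            continue
        # 4. write a throwaway WAV in the system temp folder
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            sf.write(path, chunk, SAMPLE_RATE)
            # 5-6. single-pass transcription via NeMo's standard API
            # (some NeMo releases return strings, others hypothesis objects)
            result = model.transcribe([path])[0]
            emit(result.text if hasattr(result, "text") else str(result))
        finally:
            os.remove(path)              # 7. nothing accumulates on disk
```

Each pass through the loop is self-contained, which is why chunks stay completely independent (step 9).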

The "Flash" Advantage

The name says it all — Canary 1B Flash is designed for speed. At just 1 billion parameters (vs 2.5B for VoxBar AI), it loads faster, processes faster, and uses roughly half the VRAM. It's the sweet spot between VoxBar Ultra's ultra-lightweight 0.6B Parakeet and VoxBar AI's heavyweight 2.5B SALM.

Recording Limits

VoxBar Flash Has No Recording Limit

Like all non-Docker VoxBar models, Flash runs natively on your machine with no container, no server, and no WebSocket. Each 2-second chunk is completely independent — the GPU processes the same model with the same input size every time.

Flush-on-Stop

When you press Stop, VoxBar Flash transcribes any remaining audio still in the buffer before shutting down. You never lose the last words of a sentence — even if you stop mid-speech.

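The flush-on-stop idea boils down to a buffer that can hand back its partial chunk. A sketch under assumed names (`ChunkBuffer`, `push`, `flush` are illustrative, not VoxBar's actual code):

```python
class ChunkBuffer:
    """Accumulates incoming audio blocks; drains whatever remains on Stop."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self._samples = []

    def push(self, block):
        """Add a block; return a full chunk when one is ready, else None."""
        self._samples.extend(block)
        if len(self._samples) >= self.chunk_size:
            chunk = self._samples[:self.chunk_size]
            self._samples = self._samples[self.chunk_size:]
            return chunk
        return None

    def flush(self):
        """On Stop: hand back the partial chunk so its words aren't lost."""
        leftover, self._samples = self._samples, []
        return leftover
```

On Stop, the app would pass `flush()`'s output through the same transcribe path before shutting down, which is why you never lose a trailing half-sentence.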

Auto-Stop Behaviour

  • Silence timeout: 60 seconds of no detected speech
  • Check interval: Every 5 seconds
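The timeout logic above can be sketched as a small watchdog. The 60-second timeout and 5-second poll come from the list above; the class name and injectable clock are illustrative choices so the logic reads clearly without real waiting:

```python
SILENCE_TIMEOUT = 60.0   # seconds of no detected speech before auto-stop
CHECK_INTERVAL = 5.0     # how often the recording loop polls should_stop()

class SilenceWatchdog:
    """Auto-stop timer: trips after SILENCE_TIMEOUT seconds with no speech."""

    def __init__(self, clock):
        self._clock = clock              # e.g. time.monotonic in real use
        self._last_speech = clock()

    def speech_detected(self):
        """Call whenever a non-silent chunk arrives; resets the timer."""
        self._last_speech = self._clock()

    def should_stop(self):
        """Polled every CHECK_INTERVAL seconds by the recording loop."""
        return self._clock() - self._last_speech >= SILENCE_TIMEOUT
```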

Memory & Resource Footprint

| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~3-4GB fixed | ✅ Never grows — same model, same chunk size, forever |
| RAM | ~400MB | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |
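The "zero accumulation" Disk row relies on deleting each temp WAV even when transcription fails. A sketch with the standard library only; `transcribe_fn` stands in for NeMo's `model.transcribe([path])`, and the function name is hypothetical:

```python
import os
import tempfile
import wave

def transcribe_via_temp_wav(pcm16_bytes, sample_rate, transcribe_fn):
    """Round one chunk through a temp WAV and guarantee its deletion."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)          # mono
            wav.setsampwidth(2)          # 16-bit PCM
            wav.setframerate(sample_rate)
            wav.writeframes(pcm16_bytes)
        return transcribe_fn(path)
    finally:
        os.remove(path)                  # deleted even if transcribe_fn raises
```

Because deletion sits in a `finally` block, disk usage stays flat no matter how long a session runs.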

The VRAM Sweet Spot

VoxBar Flash sits perfectly in the VRAM gap between the other models:

| Model | VRAM |
|---|---|
| VoxBar Pro (Voxtral 4B) | ~8-10GB |
| VoxBar AI (Canary Qwen 2.5B) | ~6-8GB |
| VoxBar Flash (Canary 1B Flash) | ~3-4GB |
| VoxBar Ultra (Parakeet TDT 0.6B) | ~2GB |

This makes VoxBar Flash ideal for users with mid-range NVIDIA GPUs — the RTX 2060 (6GB), the GTX 1660 (6GB), or even the RTX 4060 (8GB) where you want to leave headroom for other applications.
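The table's budgets can be turned into a toy model picker. The figures come from the table above; the function name and selection rule are illustrative, not part of VoxBar:

```python
# Upper end of each model's VRAM budget from the table above (GB),
# ordered from lightest to heaviest.
VRAM_BUDGETS = [
    ("VoxBar Ultra (Parakeet TDT 0.6B)", 2),
    ("VoxBar Flash (Canary 1B Flash)", 4),
    ("VoxBar AI (Canary Qwen 2.5B)", 8),
    ("VoxBar Pro (Voxtral 4B)", 10),
]

def heaviest_model_that_fits(free_vram_gb):
    """Pick the most capable model whose VRAM budget fits the free VRAM."""
    fitting = [name for name, gb in VRAM_BUDGETS if gb <= free_vram_gb]
    return fitting[-1] if fitting else None
```

With 6GB free this picks VoxBar Flash, matching the mid-range guidance above; in a real setup the free-VRAM figure could come from `torch.cuda.mem_get_info()` when PyTorch is installed.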

Architecture Advantage

What makes VoxBar Flash special: It brings NVIDIA's Canary-family accuracy to a lower VRAM budget without requiring the cutting-edge NeMo trunk, PyTorch 2.6+, or SALM architecture that VoxBar AI demands.

Technical difference from VoxBar AI:

| Aspect | VoxBar AI (2.5B) | VoxBar Flash (1B) |
|---|---|---|
| Architecture | SALM (Speech + LLM) | FastConformer ASR |
| API | SALM.generate() | ASRModel.transcribe() |
| NeMo version | Trunk (bleeding edge) | Stable (≥2.0) |
| PyTorch | 2.6+ required | 2.1+ works |
| Model loading | speechlm2.models.SALM | asr.models.ASRModel |
| Inference | LLM token generation | Single-pass ASR |

This means VoxBar Flash is easier to install, more stable, and more compatible with existing NeMo setups. It doesn't need the experimental SALM codebase.
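The stable-API column above boils down to two calls: a generic `ASRModel.from_pretrained()` load and a `transcribe()` call. A sketch assuming those documented NeMo entry points; the helper names are hypothetical, and loading is kept behind a function since it needs `nemo_toolkit[asr]` and a CUDA GPU:

```python
def first_transcript(model, wav_path):
    """Run the single-pass ASR call and return plain text.

    transcribe() has returned plain strings in some NeMo releases and
    hypothesis objects with a .text attribute in others, so both shapes
    are handled.
    """
    result = model.transcribe([wav_path])[0]
    return result.text if hasattr(result, "text") else str(result)

def load_flash_model():
    """Stable-NeMo loading path (contrast with VoxBar AI's SALM import)."""
    from nemo.collections.asr.models import ASRModel
    return ASRModel.from_pretrained("nvidia/canary-1b-flash")
```

No `speechlm2` import, no LLM token generation loop: one load, one call per chunk.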

What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No bleeding-edge dependencies — works with stable NeMo
- ❌ No special virtual environment — standard pip install
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever

What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ 3-4GB VRAM — needs at least a mid-range NVIDIA GPU
- ⚠️ First launch downloads ~2GB model files (cached after that)

Accuracy & Speed

| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~1 second processing time per chunk |
| Word Error Rate | ~5-7% (between Parakeet's 1.69% and Whisper base's 8-10%) |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), multilingual potential |

Where It Sits on Accuracy

In real-world use, Canary 1B Flash delivers clearly better accuracy than Whisper base while approaching Parakeet, and it uses more VRAM than Parakeet but less than VoxBar AI. It's the balanced middle ground.

Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 3GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | |
| GPU (Apple) | ❌ Not supported | |
| RAM | 8GB | 16GB |
| Disk | ~2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo ≥2.0 | pip install "nemo_toolkit[asr]" |
| Docker | ❌ Not required | |
| PyTorch | 2.1+ (standard, not bleeding edge) | Standard CUDA install |

License & Attribution

| Detail | Value |
|---|---|
| Model | nvidia/canary-1b-flash |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |

Where It Fits in the Suite

| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Flash | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★☆ |
| VRAM | ~8-10GB | ~6-8GB | ~2GB | ~3-4GB | 0GB | 0-2GB |
| Docker | Yes | No | No | No | No | No |
| NeMo | N/A | Trunk | Stable | Stable | N/A | N/A |
| Languages | Multi | Multi | English | English+ | English | 99 |
| Model family | Mistral | NVIDIA Canary | NVIDIA Parakeet | NVIDIA Canary | Useful Sensors | OpenAI |
| Best for | Live streaming | Long sessions | Minimal VRAM | Budget NVIDIA | Any hardware | Multilingual |

Bottom line: VoxBar Flash is the budget NVIDIA option — it brings Canary-family intelligence to users who can't spare 6-8GB of VRAM for the full 2.5B model but still want NVIDIA's accuracy advantage over Whisper and Moonshine. It's easier to install than VoxBar AI (no trunk, no special venv), more accurate than Whisper, and sits in the sweet spot for users with 4-6GB GPUs.