🥉 #3 — Mid-Range

VoxBar Ultra

The fastest, most efficient speech-to-text engine on the planet. Just 2GB VRAM.

Powered by NVIDIA Parakeet TDT 0.6B v2 (FastConformer + Token-and-Duration Transducer)

How It Works

VoxBar Ultra uses NVIDIA's Parakeet TDT 0.6B v2 — a compact but devastatingly fast ASR model built on the FastConformer architecture with a Token-and-Duration Transducer. Unlike an LLM, it doesn't generate text through a large autoregressive decoder: the encoder processes the entire audio chunk in a single forward pass, and a lightweight transducer head predicts tokens and their durations together, skipping frames as it goes and emitting the full transcription almost instantly.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
  2. Buffers 2 seconds of audio into a small in-memory buffer
  3. Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
  4. Writes a tiny temp WAV file to your system temp folder
  5. Feeds the WAV to Parakeet TDT via NeMo's model.transcribe() API
  6. The model processes the entire chunk in a single forward pass — no LLM-style token-by-token generation, no beam search delays
  7. Complete transcription is returned instantly — with punctuation and capitalisation included
  8. Temp file is immediately deleted — nothing accumulates on disk
  9. Text is appended to your textbox
  10. Repeats forever — each chunk is completely independent
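The chunk loop above can be sketched with the standard library alone. This is an illustrative sketch, not VoxBar's actual source: the 0.01 RMS threshold, 16kHz rate, and 2-second chunks come from the steps above, while `process_chunk` and the injected `transcribe` callable are hypothetical names, and the microphone capture (sounddevice) and NeMo call appear only as comments.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000   # step 1: 16kHz capture
CHUNK_SECONDS = 2      # step 2: 2-second buffers
SILENCE_RMS = 0.01     # step 3: RMS threshold for skipping silence

def rms_energy(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def write_temp_wav(samples, sample_rate=SAMPLE_RATE):
    """Step 4: dump one chunk to a 16-bit mono WAV in the temp folder."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        pcm = struct.pack(
            f"<{len(samples)}h",
            *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
        )
        wav.writeframes(pcm)
    return path

def process_chunk(samples, transcribe):
    """Steps 3-9 for one independent 2-second chunk."""
    if rms_energy(samples) < SILENCE_RMS:
        return None                  # silent chunk: skipped entirely
    path = write_temp_wav(samples)
    try:
        # Step 5 in the real app is the NeMo call, roughly:
        #   text = model.transcribe([path])[0].text
        return transcribe(path)
    finally:
        os.remove(path)              # step 8: nothing accumulates on disk
```

In the real app, `samples` arrives from a sounddevice input stream and `transcribe` wraps NeMo; injecting it here keeps the per-chunk logic self-contained, which also illustrates why chunks are completely independent.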

The key difference from VoxBar AI: Parakeet is a dedicated ASR model, not an LLM. It doesn't "understand" language — it just maps audio to text with extraordinary precision and speed. This makes it significantly faster per chunk, at the cost of less contextual intelligence.

Recording Limits

VoxBar Ultra Has No Recording Limit

Like VoxBar AI, VoxBar Ultra runs natively on your machine with no Docker, no WebSocket, and no server process. Each 2-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.

Why It Runs Forever

  • Each chunk is self-contained — no state carries between chunks
  • GPU memory is fixed at ~2GB — the smallest footprint of any GPU-accelerated VoxBar model
  • No network connections, no Docker, no server processes
  • The 0.6B parameter model is tiny — it never stresses your GPU

Auto-Stop Behaviour

  • Silence timeout: 60 seconds of no detected speech
  • Check interval: Every 5 seconds
  • Designed for active dictation sessions rather than passive meeting recording
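The auto-stop rule above is a small amount of bookkeeping. A minimal sketch, assuming a session loop that polls it every 5 seconds; the `AutoStop` name and the injectable clock are illustrative choices for testability, not VoxBar's actual code:

```python
import time

SILENCE_TIMEOUT = 60.0   # stop after 60 seconds with no detected speech
CHECK_INTERVAL = 5.0     # the session loop polls every 5 seconds

class AutoStop:
    """Tracks the last time a chunk passed the silence gate and decides
    when the dictation session should shut itself down."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_speech = clock()

    def on_speech_chunk(self):
        """Call whenever a chunk's RMS energy clears the silence threshold."""
        self.last_speech = self.clock()

    def should_stop(self):
        """Checked every CHECK_INTERVAL seconds by the session loop."""
        return self.clock() - self.last_speech >= SILENCE_TIMEOUT
```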

Memory & Resource Footprint

| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~2GB fixed | ✅ Never grows — smallest GPU footprint in the suite |
| RAM | ~400MB (Python process + NeMo) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |

VoxBar Ultra is the most resource-efficient GPU model in the entire suite. At just 2GB VRAM, it runs comfortably on entry-level NVIDIA GPUs (GTX 1650, RTX 3050, etc.) that can't fit the larger models. You can run VoxBar Ultra alongside games, video editing, or other GPU-intensive tasks without worrying about VRAM pressure.

Architecture Advantage

What makes VoxBar Ultra special: It holds the #1 accuracy benchmark on LibriSpeech at just 1.69% Word Error Rate — better than models 10x its size. The FastConformer + TDT architecture is purpose-built for speech recognition:

  • Single-pass inference — no LLM-style autoregressive generation, no beam search. One forward pass = complete transcription
  • 3,386x real-time speed — it transcribes audio 3,386 times faster than you can speak it
  • Built-in punctuation and capitalisation — no post-processing needed
  • Word-level timestamps — every word is tagged with its exact position in time
  • Token-and-Duration Transducer — predicts both the text tokens AND their durations simultaneously, making it more accurate than standard CTC models
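For a rough idea of what the model call looks like, here is a hedged sketch of transcribing with word-level timestamps through NeMo. It assumes `nemo_toolkit[asr]` is installed and a CUDA GPU is available; the `timestamps=True` flag and the `hyp.timestamp["word"]` layout follow recent NeMo releases and the Parakeet model card, but field names have shifted between versions, and `format_words` is just an illustrative helper:

```python
def transcribe_with_timestamps(wav_path):
    """Sketch of calling Parakeet TDT 0.6B v2 via NeMo (assumption:
    nemo_toolkit[asr] installed, CUDA GPU present)."""
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )
    hyp = model.transcribe([wav_path], timestamps=True)[0]
    # hyp.text is the punctuated, capitalised transcription;
    # hyp.timestamp["word"] is a list of per-word dicts with
    # "word", "start", and "end" (seconds) in recent releases.
    return hyp.text, hyp.timestamp["word"]

def format_words(word_entries):
    """Render word-level timestamp dicts as '[start-end] word' lines."""
    return [
        f"[{w['start']:.2f}-{w['end']:.2f}] {w['word']}"
        for w in word_entries
    ]
```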

What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No large VRAM requirements — just 2GB
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever

What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ English-focused — Parakeet TDT is optimised for English; multilingual support is limited
- ⚠️ First launch downloads ~1.2GB model files (cached after that)

Accuracy & Speed

| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~0.5-1 second processing time per chunk (extremely fast) |
| Word Error Rate | 1.69% (LibriSpeech benchmark — best in class) |
| Inference Speed | 3,386x real-time |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), limited multilingual |
| Timestamps | Word-level timestamps available |

The Speed Advantage

Parakeet TDT processes each audio chunk in a fraction of a second. At 3,386x real-time throughput, the raw model compute for a 2-second clip is under a millisecond (2 s ÷ 3,386 ≈ 0.6 ms); the ~0.5-1 second per-chunk latency quoted above is almost entirely file I/O and framework overhead rather than model time. Either way, the bottleneck isn't the model — it's how fast audio arrives. VoxBar Ultra feels snappier than VoxBar AI because there's almost zero processing delay once a chunk is ready.
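The numbers above follow from simple arithmetic on the published 3,386x real-time figure:

```python
RTFX = 3386            # reported inference speed: 3,386x real-time
CHUNK_SECONDS = 2.0    # one VoxBar Ultra audio chunk

# Raw model compute per chunk at that throughput:
compute_ms = CHUNK_SECONDS / RTFX * 1000
print(f"{compute_ms:.2f} ms")  # → 0.59 ms of model time per 2s chunk

# Audio transcribable per wall-clock second of compute:
hours_per_second = RTFX / 3600
print(f"{hours_per_second:.2f} hours")  # → 0.94 hours of audio per second
```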

Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 2GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | |
| GPU (Apple) | ❌ Not supported | |
| RAM | 8GB | 16GB |
| Disk | ~1.2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo toolkit | `pip install -U "nemo_toolkit[asr]"` |
| Docker | ❌ Not required | |

License & Attribution

| Detail | Value |
|---|---|
| Model | `nvidia/parakeet-tdt-0.6b-v2` |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |

Where It Fits in the Suite

| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ (1.69% WER — best benchmark) |
| Text delivery | Real-time | Every 1.5s | Every 2s |
| Processing speed | Streaming | 418x real-time | 3,386x real-time |
| VRAM | ~8-10GB | ~6-8GB | ~2GB |
| Docker | Yes | No | No |
| Languages | Multi | Multi | English-focused |
| Best for | Live presentations | Long dictation | Fast English transcription on any NVIDIA GPU |

Bottom line: VoxBar Ultra is the speed and efficiency king. If you have an NVIDIA GPU — even a modest one — and you primarily work in English, VoxBar Ultra gives you benchmark-leading accuracy at a fraction of the resource cost. It's the model that punches way above its weight.