🥇 9.6 Arena – Flagship

VoxBar Pro Docker

Real-time voice transcription. Words appear as you speak.

Powered by Voxtral Real-Time 4B (Mistral AI)

$59 one-time · 🔥 LAUNCH: $29.50 with code EARLYBIRD

How It Works

VoxBar Pro uses a fundamentally different architecture from every other VoxBar model. Instead of recording audio chunks and batch-processing them, it streams your voice in real-time to a local AI server running on your own machine.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice, capturing audio at 16kHz in 4096-sample blocks
  2. Converts each audio block from Float32 to Int16 PCM (8KB of raw PCM per block), then Base64-encodes it for transport
  3. Streams the encoded audio over a WebSocket connection to a local vLLM server running the Voxtral-Mini-4B-Realtime model
  4. The AI server processes audio continuously rather than waiting for a chunk to finish: it listens to your voice in real-time and generates transcription tokens as it goes
  5. Transcription deltas arrive back over the same WebSocket, as individual words or word fragments, as fast as the model can produce them
  6. Each delta is immediately appended to your textbox, so you see words appearing as you speak them with almost no visible delay
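Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration of the encoding stage only, assuming standard Float32-to-Int16 scaling; the `audio.chunk` message shape is an illustrative placeholder, not VoxBar's actual wire protocol.

```python
import base64
import json

import numpy as np


def encode_block(block: np.ndarray) -> str:
    """Convert a Float32 audio block (values in -1.0..1.0) to Int16 PCM,
    then Base64-encode the raw bytes for transport over the WebSocket."""
    pcm16 = (np.clip(block, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("ascii")


# One 4096-sample block at 16kHz (~256ms of audio), shaped like what
# sounddevice delivers from a microphone callback.
block = np.zeros(4096, dtype=np.float32)

# Hypothetical message envelope for the WebSocket send.
message = json.dumps({"type": "audio.chunk", "data": encode_block(block)})
```

Note that 4096 Int16 samples is exactly 8KB of raw PCM; Base64 expands that to roughly 11KB on the wire.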

This is true real-time streaming: not chunked, not batched. The model hears you and types simultaneously, like a live stenographer.
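The delta-handling side (steps 5 and 6) reduces to appending each fragment the moment it arrives. A minimal sketch, using a plain list of strings to stand in for the incoming WebSocket messages, since the real message format is not documented here:

```python
def apply_deltas(deltas):
    """Append each transcription delta to the textbox buffer as it
    arrives, mirroring how text builds up on screen word-by-word."""
    textbox = ""
    for delta in deltas:
        textbox += delta  # in the app, this is an immediate UI update
    return textbox


# Fragments as the model might emit them: partial words arrive and are
# simply concatenated in order, no post-processing needed.
deltas = ["Hel", "lo", " wor", "ld", "."]
print(apply_deltas(deltas))  # prints "Hello world."
```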

Recording Limits

Session Behaviour

VoxBar Pro Docker streams audio over a local WebSocket connection to a vLLM server running inside Docker. This architecture delivers the lowest-latency real-time experience in the suite, but it means the session depends on a persistent connection between your microphone client and the local inference server.

Practical Session Length

  • 15–45 minutes of continuous dictation is the sweet spot – the transcription quality stays perfect throughout
  • For very long sessions (1+ hours), the Docker container may occasionally need a reconnection – simply stop and start recording to refresh
  • Start-stop usage is ideal – hit record when you need to capture, stop when you're thinking or reading. This keeps the KV cache fresh
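When a long session does drop, the stop/start refresh amounts to reconnecting with a short backoff. A hedged sketch of that retry pattern; the `connect_fn` callable and delay values are illustrative, not VoxBar's actual internals:

```python
import time


def reconnect(connect_fn, attempts=3, base_delay=0.5):
    """Try to re-establish the session, doubling the wait between
    attempts. Returns the new connection or re-raises the last error."""
    last_err = None
    for i in range(attempts):
        try:
            return connect_fn()
        except ConnectionError as err:
            last_err = err
            time.sleep(base_delay * (2 ** i))  # 0.5s, 1s, 2s, ...
    raise last_err
```

In practice `connect_fn` would open a fresh WebSocket to the local vLLM server; the exponential backoff just avoids hammering a container that is still restarting.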

Auto-Stop Behaviour

  • Silence timeout: 5 minutes of no detected speech triggers auto-stop
  • This prevents wasted GPU resources if you walk away
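The silence timeout can be understood as a simple energy check against a running timer. A minimal sketch: the 5-minute window matches the behaviour above, while the RMS threshold of 0.01 is an assumed value, not VoxBar's actual setting.

```python
import numpy as np

SILENCE_TIMEOUT_S = 5 * 60   # auto-stop after 5 minutes without speech
RMS_THRESHOLD = 0.01         # assumed level separating speech from silence


def is_speech(block: np.ndarray) -> bool:
    """True if the audio block's RMS energy rises above the threshold."""
    return float(np.sqrt(np.mean(block ** 2))) > RMS_THRESHOLD


def should_auto_stop(last_speech_time: float, now: float) -> bool:
    """Trigger auto-stop once no speech has been heard for the timeout."""
    return (now - last_speech_time) >= SILENCE_TIMEOUT_S
```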

For Extended Sessions

If you need hours of uninterrupted transcription, VoxBar Pro Native (F16) and VoxBar Pro Kyutai 2.6B offer the same S-tier accuracy with even greater session stability – they run natively without Docker or WebSocket dependencies.

Memory & Resource Footprint

| Resource | Usage | Behaviour Over Time |
|----------|-------|---------------------|
| GPU VRAM | ~14GB (Voxtral 4B model + vLLM server + KV cache) | Stable during normal usage |
| RAM | ~2–4GB (Docker container + vLLM server) | Stable |
| Disk | Zero temp files | No accumulation – audio is streamed, never written to disk |
| Network | WebSocket (localhost only) | All traffic stays on your machine – zero internet |

What makes VoxBar Pro Docker unique: It delivers word-by-word real-time transcription at the highest accuracy in the suite (9.6 Arena). While VoxBar Pro Native and Kyutai 1B also offer real-time streaming, Pro Docker achieves this with the full Voxtral 4B model running via a dedicated vLLM inference server – the experience feels like having a live stenographer.

What users DON'T have to worry about:
- โŒ No internet connection required โ€” the AI server runs entirely on your machine
- โŒ No cloud processing โ€” your voice never leaves your hardware
- โŒ No API keys โ€” the model runs locally via Docker
- โŒ No usage limits โ€” unlimited transcription, forever

What users DO need to know:
- โš ๏ธ Requires Docker Desktop โ€” the vLLM server runs inside a Docker container
- โš ๏ธ Higher VRAM usage โ€” 4B parameter model + vLLM server needs ~14GB GPU memory

Accuracy & Speed

| Metric | Value |
|--------|-------|
| Arena Score | 9.6 combined – S-tier (highest rated) |
| Delivery | True real-time – words appear as you speak |
| Latency | <500ms from speech to text on screen |
| Multilingual | Yes – 13 languages supported |
| Punctuation | Context-aware, appears naturally |
| Capitalisation | Automatic, intelligent |

Hardware Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| GPU | NVIDIA with 16GB VRAM | NVIDIA with 16GB+ VRAM |
| RAM | 16GB | 32GB |
| Disk | ~14GB for model + Docker | SSD recommended |
| OS | Windows 10/11, Linux | Windows 11 |
| Software | Docker Desktop installed | Docker Desktop running |

License & Attribution

| Detail | Value |
|--------|-------|
| Model | Voxtral-Mini-4B-Realtime-2602 |
| Creator | Mistral AI |
| License | Apache 2.0 (fully commercial) |
| Attribution | Not required (but appreciated) |
| Distribution | Can be bundled and sold commercially |

Pro Docker vs Pro Native

Both run the same Voxtral 4B model. The difference is how it's deployed:

| Feature | Pro Docker | Pro Native |
|---------|------------|------------|
| Arena Score | 9.6 combined | 9.5 combined |
| Docker required | Yes | No |
| VRAM usage | ~14GB | ~8.5GB |
| Install | Docker Desktop + pull image | One-click |
| Session stability | May need reconnection | Rock solid |
| AMD GPU | Not supported | Not supported |
| macOS | See Mac Models – native Apple Metal build available separately | |
| Best for | Power users who want the highest arena score and real-time streaming | Windows users who want simplicity and lower VRAM |

VoxBar Pro Docker is for power users who want the absolute highest arena score (9.6) with true real-time streaming. VoxBar Pro Native is for Windows users who want the same S-tier accuracy with a simpler setup and 40% less VRAM.