How It Works
VoxBar Pro uses a fundamentally different architecture from every other VoxBar model. Instead of recording audio chunks and batch-processing them, it streams your voice in real time to a local AI server running on your own machine.
Here's what happens, step by step:
- Opens your microphone via sounddevice – captures audio at 16kHz in 4096-sample blocks
- Converts each audio block from Float32 to Int16 PCM (8KB per block), then Base64-encodes it for transport
- Streams the encoded audio over a WebSocket connection to a local vLLM server running the Voxtral-Mini-4B-Realtime model
- The AI server processes audio continuously – it doesn't wait for a chunk to finish. It listens to your voice in real time and generates transcription tokens as it goes
- Transcription deltas arrive back over the same WebSocket – individual words or word fragments, as fast as the model can produce them
- Each delta is immediately appended to your textbox – you see words appearing as you speak, with almost no visible delay
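The capture-and-encode steps above can be sketched as follows. This is a minimal illustration, not the actual client code: `encode_block` is a hypothetical helper name, and the WebSocket message framing used by the real server is not shown. sounddevice delivers Float32 NumPy arrays in [-1.0, 1.0], which get clipped, scaled to Int16, and Base64-encoded.

```python
import base64
import numpy as np

SAMPLE_RATE = 16_000   # 16kHz capture, as described above
BLOCK_SAMPLES = 4_096  # samples per audio block

def encode_block(block: np.ndarray) -> str:
    """Convert one Float32 audio block to Int16 PCM, then Base64 text."""
    # Clip to the valid range, scale to the Int16 range, and serialise
    pcm = (np.clip(block, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(pcm.tobytes()).decode("ascii")

# One block of silence: 4096 Int16 samples = 8192 bytes of raw PCM,
# which Base64 expands to 10,924 characters of text on the wire.
silence = np.zeros(BLOCK_SAMPLES, dtype=np.float32)
payload = encode_block(silence)
```

Note that Base64 inflates the payload by roughly a third, so each 8KB PCM block becomes about 10.7KB of text over the WebSocket.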
This is true real-time streaming – not chunked, not batched. The model hears you and types simultaneously, like a live stenographer.
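On the receiving side, appending deltas to the textbox can be sketched like this. The JSON shape (`{"type": "delta", "text": ...}`) and the `Textbox` class are illustrative assumptions; the real server protocol and UI widget may differ.

```python
import json

class Textbox:
    """Minimal stand-in for the UI textbox that deltas are appended to."""
    def __init__(self) -> None:
        self.text = ""

    def append(self, fragment: str) -> None:
        self.text += fragment  # deltas arrive pre-spaced word fragments

def handle_message(raw: str, textbox: Textbox) -> None:
    """Append one transcription delta from a WebSocket message."""
    msg = json.loads(raw)
    if msg.get("type") == "delta":   # ignore non-delta control messages
        textbox.append(msg["text"])

box = Textbox()
for raw in ('{"type": "delta", "text": "Hello"}',
            '{"type": "delta", "text": " world"}'):
    handle_message(raw, box)
# box.text is now "Hello world"
```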
Recording Limits
Session Behaviour
VoxBar Pro Docker streams audio over a local WebSocket connection to a vLLM server running inside Docker. This architecture delivers the lowest-latency real-time experience in the suite – but it means the session depends on a persistent connection between your microphone client and the local inference server.
Practical Session Length
- 15–45 minutes of continuous dictation is the sweet spot – transcription quality stays consistent throughout
- For very long sessions (1+ hours), the Docker container may occasionally need a reconnection – simply stop and start recording to refresh
- Start-stop usage is ideal – hit record when you need to capture, stop when you're thinking or reading. This keeps the KV cache fresh
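The stop-and-restart refresh described above amounts to a simple reconnect loop. A sketch under stated assumptions: `stream_session` is a hypothetical stand-in for the real client's recording call, assumed to return normally on a clean stop and raise `ConnectionError` if the WebSocket to the local vLLM server drops.

```python
import time

def run_with_reconnect(stream_session, max_retries: int = 3,
                       backoff_s: float = 1.0) -> bool:
    """Run a streaming session, reconnecting if the connection drops.

    Returns True on a clean user stop, False after exhausting retries.
    """
    for attempt in range(max_retries + 1):
        try:
            stream_session()
            return True                 # user stopped recording cleanly
        except ConnectionError:
            if attempt == max_retries:
                return False            # give up after repeated drops
            time.sleep(backoff_s * (attempt + 1))  # brief linear backoff
    return False
```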
Auto-Stop Behaviour
- Silence timeout: 5 minutes of no detected speech triggers auto-stop
- This prevents wasted GPU resources if you walk away
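A silence-based auto-stop like the one described above can be sketched as follows. The 5-minute timeout matches the documented behaviour; the RMS threshold value and the `SilenceWatchdog` class name are illustrative assumptions, not the actual implementation.

```python
import numpy as np

SILENCE_TIMEOUT_S = 5 * 60         # 5 minutes, per the documented auto-stop
SILENCE_RMS_THRESHOLD = 0.01       # assumed energy level counted as "no speech"
BLOCK_DURATION_S = 4096 / 16_000   # seconds of audio per 4096-sample block

class SilenceWatchdog:
    """Tracks consecutive silent blocks and signals when to auto-stop."""

    def __init__(self) -> None:
        self.silent_seconds = 0.0

    def feed(self, block: np.ndarray) -> bool:
        """Feed one audio block; returns True when recording should stop."""
        rms = float(np.sqrt(np.mean(block.astype(np.float64) ** 2)))
        if rms < SILENCE_RMS_THRESHOLD:
            self.silent_seconds += BLOCK_DURATION_S
        else:
            self.silent_seconds = 0.0  # any detected speech resets the timer
        return self.silent_seconds >= SILENCE_TIMEOUT_S
```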
For Extended Sessions
If you need hours of uninterrupted transcription, VoxBar Pro Native (F16) and VoxBar Pro Kyutai 2.6B offer the same S-tier accuracy with even greater session stability – they run natively without Docker or WebSocket dependencies.
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~14GB (Voxtral 4B model + vLLM server + KV cache) | Stable during normal usage |
| RAM | ~2-4GB (Docker container + vLLM server) | Stable |
| Disk | Zero temp files | No accumulation – audio is streamed, never written to disk |
| Network | WebSocket (localhost only) | All traffic stays on your machine – zero internet |
What makes VoxBar Pro Docker unique: It delivers word-by-word real-time transcription at the highest accuracy in the suite (9.6 Arena). While VoxBar Pro Native and Kyutai 1B also offer real-time streaming, Pro Docker achieves this with the full Voxtral 4B model running via a dedicated vLLM inference server – the experience feels like having a live stenographer.
What users DON'T have to worry about:
- ✅ No internet connection required – the AI server runs entirely on your machine
- ✅ No cloud processing – your voice never leaves your hardware
- ✅ No API keys – the model runs locally via Docker
- ✅ No usage limits – unlimited transcription, forever
What users DO need to know:
- ⚠️ Requires Docker Desktop – the vLLM server runs inside a Docker container
- ⚠️ Higher VRAM usage – 4B parameter model + vLLM server needs ~14GB GPU memory
Accuracy & Speed
| Metric | Value |
|---|---|
| Arena Score | 9.6 combined – S-tier (highest rated) |
| Delivery | True real-time – words appear as you speak |
| Latency | <500ms from speech to text on screen |
| Multilingual | Yes – 13 languages supported |
| Punctuation | Context-aware, appears naturally |
| Capitalisation | Automatic, intelligent |
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA with 16GB+ VRAM |
| RAM | 16GB | 32GB |
| Disk | ~14GB for model + Docker | SSD recommended |
| OS | Windows 10/11, Linux | Windows 11 |
| Software | Docker Desktop installed | Docker Desktop running |
License & Attribution
| Detail | Value |
|---|---|
| Model | Voxtral-Mini-4B-Realtime-2602 |
| Creator | Mistral AI |
| License | Apache 2.0 (fully commercial) |
| Attribution | Not required (but appreciated) |
| Distribution | Can be bundled and sold commercially |
Pro Docker vs Pro Native
Both run the same Voxtral 4B model. The difference is how it's deployed:
| Feature | Pro Docker | Pro Native |
|---|---|---|
| Arena Score | 9.6 combined | 9.5 combined |
| Docker required | Yes | No |
| VRAM usage | ~14GB | ~8.5GB |
| Install | Docker Desktop + pull image | One-click |
| Session stability | May need reconnection | Rock solid |
| AMD GPU | Not supported | Not supported |
| macOS | See Mac Models – native Apple Metal build available separately | See Mac Models |
| Best for | Power users who want the highest arena score and real-time streaming | Windows users who want simplicity and lower VRAM |
VoxBar Pro Docker is for power users who want the absolute highest arena score (9.6) with true real-time streaming. VoxBar Pro Native is for Windows users who want the same S-tier accuracy with a simpler setup and 40% less VRAM.