How It Works
VoxBar Pro uses a fundamentally different architecture from every other VoxBar model. Instead of recording audio chunks and batch-processing them, it streams your voice in real time to a local AI server running on your own machine.
Here's what happens, step by step:
- Opens your microphone via sounddevice – captures audio at 16kHz in 4096-sample blocks
- Converts each audio block from Float32 to Int16 PCM (8KB per block), then Base64-encodes it for transport
- Streams the encoded audio over a WebSocket connection to a local vLLM server running the Voxtral-Mini-4B-Realtime model
- The AI server processes audio continuously – it doesn't wait for a chunk to finish. It listens to your voice in real time and generates transcription tokens as it goes
- Transcription deltas arrive back over the same WebSocket – individual words or word fragments, as fast as the model can produce them
- Each delta is immediately appended to your textbox – you see words appearing as you speak, with almost no visible delay
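The capture-and-encode steps above can be sketched as follows. This is a minimal illustration, not the actual client code: `encode_block` is a hypothetical helper name, and the WebSocket message framing used by the real server is not shown. sounddevice delivers Float32 NumPy arrays in [-1.0, 1.0], which get clipped, scaled to Int16, and Base64-encoded.

```python
import base64
import numpy as np

SAMPLE_RATE = 16_000   # 16kHz capture, as described above
BLOCK_SAMPLES = 4_096  # samples per audio block

def encode_block(block: np.ndarray) -> str:
    """Convert one Float32 audio block to Int16 PCM, then Base64 text."""
    # Clip to the valid range, scale to the Int16 range, and serialise
    pcm = (np.clip(block, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(pcm.tobytes()).decode("ascii")

# One block of silence: 4096 Int16 samples = 8192 bytes of raw PCM,
# which Base64 expands to 10,924 characters of text on the wire.
silence = np.zeros(BLOCK_SAMPLES, dtype=np.float32)
payload = encode_block(silence)
```

Note that Base64 inflates the payload by roughly a third, so each 8KB PCM block becomes about 10.7KB of text over the WebSocket.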
This is true real-time streaming – not chunked, not batched. The model hears you and types simultaneously, like a live stenographer.
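On the receiving side, appending deltas to the textbox can be sketched like this. The JSON shape (`{"type": "delta", "text": ...}`) and the `Textbox` class are illustrative assumptions; the real server protocol and UI widget may differ.

```python
import json

class Textbox:
    """Minimal stand-in for the UI textbox that deltas are appended to."""
    def __init__(self) -> None:
        self.text = ""

    def append(self, fragment: str) -> None:
        self.text += fragment  # deltas arrive pre-spaced word fragments

def handle_message(raw: str, textbox: Textbox) -> None:
    """Append one transcription delta from a WebSocket message."""
    msg = json.loads(raw)
    if msg.get("type") == "delta":   # ignore non-delta control messages
        textbox.append(msg["text"])

box = Textbox()
for raw in ('{"type": "delta", "text": "Hello"}',
            '{"type": "delta", "text": " world"}'):
    handle_message(raw, box)
# box.text is now "Hello world"
```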
Recording Limits
Session Behaviour
VoxBar Pro Docker streams audio over a local WebSocket connection to a vLLM server running inside Docker. This architecture delivers the lowest-latency real-time experience in the suite – but it means the session depends on a persistent connection between your microphone client and the local inference server.
Practical Session Length
- 15–45 minutes of continuous dictation is the sweet spot – transcription quality stays consistent throughout
- For very long sessions (1+ hours), the Docker container may occasionally need a reconnection – simply stop and start recording to refresh
- Start-stop usage is ideal – hit record when you need to capture, stop when you're thinking or reading. This keeps the KV cache fresh
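The stop-and-restart refresh described above amounts to a simple reconnect loop. A sketch under stated assumptions: `stream_session` is a hypothetical stand-in for the real client's recording call, assumed to return normally on a clean stop and raise `ConnectionError` if the WebSocket to the local vLLM server drops.

```python
import time

def run_with_reconnect(stream_session, max_retries: int = 3,
                       backoff_s: float = 1.0) -> bool:
    """Run a streaming session, reconnecting if the connection drops.

    Returns True on a clean user stop, False after exhausting retries.
    """
    for attempt in range(max_retries + 1):
        try:
            stream_session()
            return True                 # user stopped recording cleanly
        except ConnectionError:
            if attempt == max_retries:
                return False            # give up after repeated drops
            time.sleep(backoff_s * (attempt + 1))  # brief linear backoff
    return False
```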
Auto-Stop Behaviour
- Silence timeout: 5 minutes of no detected speech triggers auto-stop
- This prevents wasted GPU resources if you walk away
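A silence-based auto-stop like the one described above can be sketched as follows. The 5-minute timeout matches the documented behaviour; the RMS threshold value and the `SilenceWatchdog` class name are illustrative assumptions, not the actual implementation.

```python
import numpy as np

SILENCE_TIMEOUT_S = 5 * 60         # 5 minutes, per the documented auto-stop
SILENCE_RMS_THRESHOLD = 0.01       # assumed energy level counted as "no speech"
BLOCK_DURATION_S = 4096 / 16_000   # seconds of audio per 4096-sample block

class SilenceWatchdog:
    """Tracks consecutive silent blocks and signals when to auto-stop."""

    def __init__(self) -> None:
        self.silent_seconds = 0.0

    def feed(self, block: np.ndarray) -> bool:
        """Feed one audio block; returns True when recording should stop."""
        rms = float(np.sqrt(np.mean(block.astype(np.float64) ** 2)))
        if rms < SILENCE_RMS_THRESHOLD:
            self.silent_seconds += BLOCK_DURATION_S
        else:
            self.silent_seconds = 0.0  # any detected speech resets the timer
        return self.silent_seconds >= SILENCE_TIMEOUT_S
```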
For Extended Sessions
If you need hours of uninterrupted transcription, VoxBar Pro Native (F16) and VoxBar Pro Kyutai 2.6B offer the same S-tier accuracy with even greater session stability – they run natively without Docker or WebSocket dependencies.
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~14GB (Voxtral 4B model + vLLM server + KV cache) | Stable during normal usage |
| RAM | ~2-4GB (Docker container + vLLM server) | Stable |
| Disk | Zero temp files | No accumulation – audio is streamed, never written to disk |
| Network | WebSocket (localhost only) | All traffic stays on your machine – zero internet |
What makes VoxBar Pro Docker unique: It delivers word-by-word real-time transcription at the highest accuracy in the suite (9.6 Arena). While VoxBar Pro Native and Kyutai 1B also offer real-time streaming, Pro Docker achieves this with the full Voxtral 4B model running via a dedicated vLLM inference server – the experience feels like having a live stenographer.
What users DON'T have to worry about:
- ✅ No internet connection required – the AI server runs entirely on your machine
- ✅ No cloud processing – your voice never leaves your hardware
- ✅ No API keys – the model runs locally via Docker
- ✅ No usage limits – unlimited transcription, forever
What users DO need to know:
- ⚠️ Requires Docker Desktop – the vLLM server runs inside a Docker container
- ⚠️ Higher VRAM usage – 4B parameter model + vLLM server needs ~14GB GPU memory
Accuracy & Speed
| Metric | Value |
|---|---|
| Arena Score | 9.6 combined – S-tier (highest rated) |
| Delivery | True real-time – words appear as you speak |
| Latency | <500ms from speech to text on screen |
| Multilingual | Yes – 13 languages supported |
| Punctuation | Context-aware, appears naturally |
| Capitalisation | Automatic, intelligent |
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16GB VRAM | NVIDIA with 16GB+ VRAM |
| RAM | 16GB | 32GB |
| Disk | ~14GB for model + Docker | SSD recommended |
| OS | Windows 10/11, Linux | Windows 11 |
| Software | Docker Desktop installed | Docker Desktop running |
License & Attribution
| Detail | Value |
|---|---|
| Model | Voxtral-Mini-4B-Realtime-2602 |
| Creator | Mistral AI |
| License | Apache 2.0 (fully commercial) |
| Attribution | Not required (but appreciated) |
| Distribution | Can be bundled and sold commercially |
Pro Docker vs Pro Native
Both run the same Voxtral 4B model. The difference is how it's deployed:
| Feature | Pro Docker | Pro Native |
|---|---|---|
| Arena Score | 9.6 combined | 9.5 combined |
| Docker required | Yes | No |
| VRAM usage | ~14GB | ~8.5GB |
| Install | Docker Desktop + pull image | One-click |
| Session stability | May need reconnection | Rock solid |
| AMD GPU | Not supported | Not supported |
| macOS | See Mac Models – native Apple Metal build available separately | See Mac Models |
| Best for | Power users who want the highest arena score and real-time streaming | Windows users who want simplicity and lower VRAM |
VoxBar Pro Docker is for power users who want the absolute highest arena score (9.6) with true real-time streaming. VoxBar Pro Native is for Windows users who want the same S-tier accuracy with a simpler setup and 40% less VRAM.