Millisecond-speed voice transcription. Words appear as you speak.
Powered by Kyutai STT 1B, a 1-billion-parameter decoder-only transformer paired with the Mimi neural codec for true frame-by-frame streaming. Supports English and French.
$39 one-time · 🔥 LAUNCH: $19.50 with code EARLYBIRD
🎤 Transcribe your mic — or 🔊 listen to your system audio (meetings, podcasts, videos). All 100% local. Nothing leaves your machine.
True frame-by-frame streaming: 12.5 frames per second (one every 80ms), processed as you speak.
Audio is captured at 24kHz in tiny 80ms frames from your microphone — or switch to system audio mode to capture anything playing on your PC (meetings, videos, podcasts). No virtual cables needed.
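The framing arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than VoxBar's capture code: the `split_into_frames` helper is hypothetical, and it assumes mono samples are already available as a plain list.

```python
SAMPLE_RATE_HZ = 24_000  # capture rate stated above
FRAME_MS = 80            # frame duration stated above
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples in one 80 ms frame


def split_into_frames(samples):
    """Yield consecutive full frames; a trailing partial frame is held back."""
    full = len(samples) - len(samples) % FRAME_SAMPLES
    for start in range(0, full, FRAME_SAMPLES):
        yield samples[start:start + FRAME_SAMPLES]


one_second = [0.0] * SAMPLE_RATE_HZ
frames = list(split_into_frames(one_second))
print(FRAME_SAMPLES, len(frames))  # 1920 12
```

One second of audio yields 12 full frames, and the remaining half frame would wait for the next 960 samples.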
Each 80ms audio frame is encoded by Kyutai's Mimi codec into 32 parallel codebook streams — capturing both the meaning of speech and its acoustic characteristics. Mimi operates at 12.5 Hz with causal streaming, producing tokens the instant audio arrives.
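As a quick check on those figures, the rates follow directly from the frame length and the codebook count. The constant names below are illustrative only.

```python
FRAME_MS = 80    # one Mimi frame, as stated above
CODEBOOKS = 32   # parallel codebook streams per frame, as stated above

frames_per_second = 1000 / FRAME_MS                      # Mimi frame rate in Hz
audio_tokens_per_second = frames_per_second * CODEBOOKS  # tokens across all streams

print(frames_per_second, audio_tokens_per_second)  # 12.5 400.0
```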
A 1-billion parameter autoregressive model converts Mimi's audio tokens into text using greedy decoding — no sampling randomness, just the most confident prediction every frame. With zero-delay alignment between audio and text streams, text tokens emerge with minimal latency and built-in punctuation and capitalisation.
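Greedy decoding itself is simple to illustrate: at each frame, take the highest-scoring vocabulary entry. The sketch below uses made-up scores over a toy 4-entry vocabulary; `greedy_pick` is a hypothetical helper, not part of any Kyutai API.

```python
def greedy_pick(logits):
    """Return the index of the largest score (ties go to the lowest index)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical per-frame scores over a 4-entry toy vocabulary.
frame_logits = [
    [0.1, 2.3, -1.0, 0.4],
    [1.5, 0.2, 0.3, 1.4],
]
picked = [greedy_pick(row) for row in frame_logits]
print(picked)  # [1, 0]
```

Because nothing is sampled, the same audio always produces the same text.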
Each text token is decoded and displayed immediately — you see words forming milliseconds after you speak them. During natural pauses, the model outputs padding tokens (silence markers) until speech resumes, keeping the pipeline alive without producing phantom text.
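The display step can be pictured as a filter over the token stream. This is a sketch under assumptions: the `PAD` marker and the string pieces are invented for illustration, and a real tokenizer would work with token ids rather than strings.

```python
PAD = "<pad>"  # hypothetical marker; the real padding token is model-specific


def visible_text(token_stream):
    """Yield only the pieces that should reach the screen."""
    for piece in token_stream:
        if piece != PAD:
            yield piece


stream = ["Hello", ",", PAD, PAD, " world", PAD, "."]
shown = "".join(visible_text(stream))
print(shown)  # Hello, world.
```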
The model's attention cache is a fixed-capacity ring buffer — pre-allocated at startup, never growing. Following Kyutai's official inference design, GPU memory stays locked at ~2.7GB indefinitely. No memory leaks, no slowdowns, no matter how long you run it.
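The idea behind a fixed-capacity ring buffer can be shown with a small sketch. This is an illustration of the data structure only, not Kyutai's implementation: the `RingBuffer` class is hypothetical and stores plain Python objects instead of GPU tensors.

```python
class RingBuffer:
    """Fixed-capacity buffer: storage is allocated once, old entries are overwritten."""

    def __init__(self, capacity):
        self._slots = [None] * capacity  # pre-allocated, never resized
        self._capacity = capacity
        self._written = 0                # total number of appends so far

    def append(self, item):
        self._slots[self._written % self._capacity] = item
        self._written += 1

    def contents(self):
        """Return the retained items, oldest first."""
        count = min(self._written, self._capacity)
        start = self._written - count
        return [self._slots[i % self._capacity] for i in range(start, self._written)]


cache = RingBuffer(capacity=4)
for frame_index in range(10):
    cache.append(frame_index)
print(cache.contents())  # [6, 7, 8, 9]
```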
| Metric | Value |
|---|---|
| Arena Score | 8.1 combined — Performance tier |
| WER (Word Error Rate) | Higher than 2.6B — optimised for speed over accuracy |
| Delivery | Frame-by-frame — words appear as you speak (<80ms) |
| Language | English + French |
| Punctuation | Context-aware, generated by the model |
| Capitalisation | Automatic, intelligent |

| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~2.7GB (Kyutai STT 1B) | Stable — fixed-capacity ring buffer, never grows |
| RAM | ~1-2GB (Python process) | Stable |
| Disk | Zero temp files | Audio processed in memory, never written to disk |
| Network | None | Fully offline; no internet required |

VoxBar Kyutai 1B uses a fixed-capacity ring buffer for its attention cache, so GPU memory stays locked at ~2.7GB indefinitely, the smallest footprint in the lineup. Record for hours without interruption.
Silence timeout: 5 minutes of no detected speech triggers auto-stop.
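One way to picture the auto-stop rule is a small watchdog that tracks when speech was last heard. This is a sketch under assumptions: the `SilenceWatchdog` class is hypothetical, timestamps are passed in as seconds, and how VoxBar actually detects speech is not described here.

```python
SILENCE_TIMEOUT_S = 5 * 60  # five minutes, as stated above


class SilenceWatchdog:
    def __init__(self, timeout_s=SILENCE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_speech_at = None  # set on the first update

    def update(self, now_s, speech_detected):
        """Record one observation and return True when recording should stop."""
        if speech_detected or self.last_speech_at is None:
            self.last_speech_at = now_s
        return now_s - self.last_speech_at >= self.timeout_s


watchdog = SilenceWatchdog()
results = [
    watchdog.update(0.0, True),     # speech heard, timer reset
    watchdog.update(299.0, False),  # just under five minutes of silence
    watchdog.update(300.0, False),  # timeout reached
]
print(results)  # [False, False, True]
```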
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 3GB VRAM | NVIDIA with 4GB+ VRAM |
| RAM | 8GB | 16GB |
| Disk | ~3GB (model + app) | SSD |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.11+ | Included in installer |

VoxBar™ Kyutai 1B is powered by Kyutai STT 1B, created by Kyutai Labs (Paris) and licensed under CC-BY-4.0.
VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by Kyutai Labs.
| Feature | Kyutai 1B | Kyutai 2.6B |
|---|---|---|
| Arena Score | 8.1 combined | 9.4 combined |
| VRAM | ~2.7GB | ~5.8GB |
| Latency | <80ms frame-by-frame | ~2s chunk delay |
| Languages | English + French | English |
| Price | $39 | $59 |
| Best for | Budget GPUs (3-4GB), real-time speed | Higher accuracy, system audio capture |

One-time purchase. Lifetime license. 2 machines. Zero cloud.
Coming Soon · Secure checkout via Lemon Squeezy / Stripe