VoxBar Kyutai 1B — Real-Time Streaming Transcription

How It Works

True frame-by-frame streaming — 12.5 tokens per second, processed as you speak.

🎤

Captures your audio

Audio is captured at 24kHz in tiny 80ms frames from your microphone — or switch to system audio mode to capture anything playing on your PC (meetings, videos, podcasts). No virtual cables needed.
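As a rough illustration of the framing step: at 24kHz, an 80ms frame holds 1,920 samples. The sketch below buffers incoming audio chunks of any size and hands back complete frames. The `Framer` class and its `push` method are hypothetical names chosen for this example, not VoxBar's actual code.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_MS = 80
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 1920 samples per frame


class Framer:
    """Hypothetical helper: turns arbitrary-sized chunks into 80 ms frames."""

    def __init__(self):
        self._pending = []  # samples waiting to fill the next frame

    def push(self, chunk):
        """Add samples and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= FRAME_SAMPLES:
            frames.append(self._pending[:FRAME_SAMPLES])
            del self._pending[:FRAME_SAMPLES]
        return frames
```

Leftover samples stay in the buffer until the next chunk completes the frame, so no audio is dropped between callbacks.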

🔊

Mimi neural audio codec

Each 80ms audio frame is encoded by Kyutai's Mimi codec into 32 parallel codebook streams — capturing both the meaning of speech and its acoustic characteristics. Mimi operates at 12.5 Hz with causal streaming, producing tokens the instant audio arrives.
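The figures above fit together with simple arithmetic: a 12.5 Hz frame rate means one frame every 80ms, and 32 codebooks per frame gives 400 audio tokens per second. A small sketch, where the constant and function names are illustrative only:

```python
FRAME_RATE_HZ = 12.5   # Mimi frames per second
CODEBOOKS = 32         # parallel codebook streams per frame

frame_duration_ms = 1000 / FRAME_RATE_HZ          # 80.0 ms per frame
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS     # 400.0 audio tokens per second


def audio_tokens(seconds):
    """Audio tokens produced for a recording of the given length."""
    complete_frames = int(seconds * FRAME_RATE_HZ)
    return complete_frames * CODEBOOKS
```

One minute of audio, for example, works out to 750 frames and 24,000 audio tokens.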

🧠

1B decoder-only transformer

A 1-billion parameter autoregressive model converts Mimi's audio tokens into text using greedy decoding — no sampling randomness, just the most confident prediction every frame. With zero-delay alignment between audio and text streams, text tokens emerge with minimal latency and built-in punctuation and capitalisation.
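Greedy decoding is simple to picture: at every frame, take the token with the highest score and nothing else. A minimal sketch, assuming the model's output for one frame is available as a plain list of scores (`greedy_pick` is a hypothetical name):

```python
def greedy_pick(logits):
    """Return the index of the highest-scoring token (first one wins on ties)."""
    return max(range(len(logits)), key=logits.__getitem__)
```

Because there is no random sampling, the same audio always yields the same text.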

✍️

Words appear instantly

Each text token is decoded and displayed immediately — you see words forming milliseconds after you speak them. During natural pauses, the model outputs padding tokens (silence markers) until speech resumes, keeping the pipeline alive without producing phantom text.
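To show how padding tokens keep silence from turning into text, here is a toy sketch. The token IDs, the `PAD_ID` value, and the tiny vocabulary are all made up for illustration; the real model uses its own tokenizer.

```python
PAD_ID = 0  # hypothetical ID for the padding (silence) token
VOCAB = {1: "Hello", 2: ",", 3: " world"}  # toy vocabulary, not the real one


def render(token_ids):
    """Join the text pieces for every token that is not padding."""
    return "".join(VOCAB[t] for t in token_ids if t != PAD_ID)
```

A stream such as `[0, 1, 0, 0, 2, 3, 0]` still advances one token per frame, but only the three real tokens reach the screen.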

🔒

Fixed-capacity memory — runs forever

The model's attention cache is a fixed-capacity ring buffer — pre-allocated at startup, never growing. Following Kyutai's official inference design, GPU memory stays locked at ~2.7GB indefinitely. No memory leaks, no slowdowns, no matter how long you run it.
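The idea behind a fixed-capacity ring buffer can be sketched in a few lines: storage is allocated once, and each new entry overwrites the oldest. This is a simplified stand-in using a Python list, not Kyutai's GPU implementation, and `RingCache` is a hypothetical name.

```python
class RingCache:
    """Fixed-capacity buffer: allocated once, oldest entries are overwritten."""

    def __init__(self, capacity):
        self._slots = [None] * capacity  # pre-allocated, never resized
        self._written = 0                # total entries ever appended

    def append(self, entry):
        self._slots[self._written % len(self._slots)] = entry
        self._written += 1

    def window(self):
        """Return the retained entries, oldest first."""
        capacity = len(self._slots)
        if self._written <= capacity:
            return self._slots[:self._written]
        start = self._written % capacity
        return self._slots[start:] + self._slots[:start]
```

However many entries are appended, the buffer never holds more than `capacity` items, which is why memory use stays flat over long sessions.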

Accuracy & Speed

Metric	Value
Arena Score	8.1 combined — Performance tier
WER (Word Error Rate)	Higher (less accurate) than Kyutai 2.6B; optimised for speed over accuracy
Delivery	Frame-by-frame — words appear as you speak (<80ms)
Language	English + French
Punctuation	Context-aware, generated by the model
Capitalisation	Automatic, intelligent

Memory & Resource Footprint

Resource	Usage	Behaviour Over Time
GPU VRAM	~2.7GB (Kyutai STT 1B)	Stable — fixed-capacity ring buffer, never grows
RAM	~1-2GB (Python process)	Stable
Disk	Zero temp files	Audio processed in memory, never written to disk
Network	None	Fully offline — no internet required

Recording Limits

♾️

No Recording Limit

VoxBar Kyutai 1B uses a fixed-capacity ring buffer for its attention cache. GPU memory stays locked at ~2.7GB indefinitely — the smallest footprint in the lineup. Record for hours without interruption.

⏱️

Auto-Stop Behaviour

Silence timeout: 5 minutes of no detected speech triggers auto-stop.
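The auto-stop rule amounts to a timer that restarts whenever speech is detected. A minimal sketch, with timestamps passed in as seconds and `SilenceTimer` as a hypothetical name:

```python
SILENCE_TIMEOUT_S = 5 * 60  # 5 minutes, as stated above


class SilenceTimer:
    """Tracks time since the last detected speech."""

    def __init__(self, now):
        self._last_speech = now

    def on_speech(self, now):
        """Call whenever speech is detected; restarts the countdown."""
        self._last_speech = now

    def should_stop(self, now):
        """True once the silence has lasted the full timeout."""
        return now - self._last_speech >= SILENCE_TIMEOUT_S
```

In a real application `now` would typically come from a monotonic clock such as `time.monotonic()`.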

Why VoxBar Kyutai Is Different

What you DON'T need

✖ No internet connection: everything runs locally

✖ No cloud processing: your voice never leaves your machine

✖ No API keys: the model downloads once and runs offline forever

✖ No usage limits: unlimited transcription, forever

✖ No subscriptions: one-time purchase, lifetime license

What makes it unique

✔ True frame-by-frame streaming: not chunked, not batched

✔ System audio capture: transcribe meetings, YouTube, and podcasts directly from your PC's audio output

✔ Built-in punctuation: periods and commas are generated by the model itself

✔ Lightest GPU footprint (~2.7GB VRAM): a fixed-capacity memory system that never grows, keeping GPU usage flat indefinitely

✔ Semantic VAD: intelligently detects speech vs silence and auto-recovers after pauses

✔ Overlay Mode: a transparent overlay sits on top of any app, with adjustable transparency and font sizes

✔ Mid-text editing: click anywhere in your text to insert new speech at that position

✔ Voice commands: say "delete" to remove highlighted text, or use voice punctuation and formatting

✔ Self-recovering: a built-in staleness monitor auto-resets the pipeline if it stalls, so no manual restart is needed
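One way a staleness monitor like the one described above could work is as a watchdog: note the time of each processed frame, and trigger a reset if too long passes without one. The 2-second threshold, the `StalenessMonitor` class, and the `reset_pipeline` callback are assumptions for illustration; the page does not specify VoxBar's actual values or design.

```python
STALL_THRESHOLD_S = 2.0  # hypothetical threshold, not a documented value


class StalenessMonitor:
    """Watchdog sketch: resets the pipeline when no frame has been processed recently."""

    def __init__(self, reset_pipeline, now):
        self._reset_pipeline = reset_pipeline  # caller-supplied reset callback
        self._last_progress = now
        self.resets = 0

    def on_frame_processed(self, now):
        self._last_progress = now

    def check(self, now):
        """Reset and return True if the pipeline looks stalled, else return False."""
        if now - self._last_progress < STALL_THRESHOLD_S:
            return False
        self._reset_pipeline()
        self._last_progress = now  # give the fresh pipeline a full grace period
        self.resets += 1
        return True
```

Calling `check` on a regular schedule is enough; the monitor needs no thread of its own.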

Hardware Requirements

Requirement	Minimum	Recommended
GPU	NVIDIA with 3GB VRAM	NVIDIA with 4GB+ VRAM
RAM	8GB	16GB
Disk	~3GB (model + app)	SSD
OS	Windows 10/11	Windows 11
Software	Python 3.11+	Included in installer

License & Attribution

VoxBar™ Kyutai 1B is powered by Kyutai STT 1B, created by Kyutai Labs (Paris) and licensed under CC-BY-4.0.

VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by Kyutai Labs.

Kyutai 1B vs Kyutai 2.6B

Feature	Kyutai 1B	Kyutai 2.6B
Arena Score	8.1 combined	9.4 combined
VRAM	~2.7GB	~5.8GB
Latency	<80ms frame-by-frame	~2s chunk delay
Languages	English + French	English
Price	$39	$59
Best for	Budget GPUs (3-4GB), real-time speed	Higher accuracy, system audio capture

Ready for real-time streaming transcription?

One-time purchase. Lifetime license. 2 machines. Zero cloud.

Coming Soon

Secure checkout via Lemon Squeezy / Stripe