VoxBar Pro Kyutai 2.6B — Chunk-Based Transcription

How It Works

Chunk-based processing with the Mimi neural codec and a 2.6-billion-parameter model optimised for English accuracy.

🎤

Captures your audio

Audio is captured at 24kHz in tiny 80ms frames from your microphone — or switch to system audio mode to capture anything playing on your PC (meetings, videos, podcasts). No virtual cables needed.
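As a rough illustration of the numbers above, 24kHz audio cut into 80ms frames gives 1,920 samples per frame. The sketch below is not VoxBar's capture code; the function name and the choice to hold back a trailing partial frame are assumptions made for the example.

```python
SAMPLE_RATE_HZ = 24_000  # capture rate stated above
FRAME_MS = 80            # frame duration stated above
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 1920 samples


def split_into_frames(samples):
    """Yield consecutive full 80ms frames.

    A trailing partial frame is not yielded; a real capture loop would
    keep it and prepend it to the next buffer (assumption for this sketch).
    """
    usable = len(samples) - len(samples) % SAMPLES_PER_FRAME
    for start in range(0, usable, SAMPLES_PER_FRAME):
        yield samples[start:start + SAMPLES_PER_FRAME]


# One second of silence splits into 12 full frames plus half a frame left over.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = list(split_into_frames(one_second))
```

One second of audio is 12.5 frames, so the half frame left over here is what a streaming loop would carry into the next buffer.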

🔊

Mimi neural audio codec

Each 80ms audio frame is encoded by Kyutai's Mimi codec into 32 parallel codebook streams — capturing both the meaning of speech and its acoustic characteristics. Mimi operates at 12.5 Hz with causal streaming, producing tokens the instant audio arrives.
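The figures in this step fit together arithmetically: a 12.5 Hz frame rate means one frame every 80ms, and 32 codebooks per frame means 400 codec tokens per second of audio. The helper names below are hypothetical and only restate that arithmetic.

```python
FRAME_RATE_HZ = 12.5  # Mimi frame rate stated above
CODEBOOKS = 32        # parallel codebook streams stated above


def frame_period_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def tokens_per_second(frame_rate_hz=FRAME_RATE_HZ, codebooks=CODEBOOKS):
    """Codec tokens produced per second of audio, across all codebooks."""
    return frame_rate_hz * codebooks


def tokens_for_duration(seconds):
    """Codec tokens for a clip, counting only complete frames."""
    complete_frames = int(seconds * FRAME_RATE_HZ)
    return complete_frames * CODEBOOKS
```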

🧠

2.6B decoder-only transformer

A 2.6-billion-parameter autoregressive model converts Mimi's audio tokens into text using greedy decoding: no sampling randomness, just the most confident prediction at every frame. The larger parameter count is what gives the 2.6B model its English accuracy (6.4% WER), with punctuation and capitalisation generated by the model itself.
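Greedy decoding simply takes the highest-scoring token at each step. The sketch below shows the idea on plain Python lists; the vocabulary and scores are made up for illustration, and this is not the model's actual inference code.

```python
def greedy_pick(logits):
    """Return the index of the largest score (ties go to the lowest index)."""
    return max(range(len(logits)), key=logits.__getitem__)


def greedy_decode(step_logits, vocab):
    """Pick the most confident token at every step, with no sampling."""
    return [vocab[greedy_pick(logits)] for logits in step_logits]


# Hypothetical three-token vocabulary and two decoding steps.
vocab = ["<pad>", "hello", "world"]
step_logits = [
    [0.1, 2.0, 0.3],  # "hello" scores highest
    [0.0, 0.5, 1.5],  # "world" scores highest
]
decoded = greedy_decode(step_logits, vocab)
```

Because there is no random sampling, the same scores always produce the same tokens.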

✍️

Text arrives with a short delay

The 2.6B model processes audio in chunks, with text appearing about 2 seconds behind your speech. This short delay is the trade-off for the model's higher accuracy. During natural pauses, the model outputs padding tokens (silence markers) until speech resumes, keeping the pipeline alive without producing phantom text.
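A minimal sketch of the two ideas in this step, under stated assumptions: the `<pad>` marker is a hypothetical stand-in for the model's padding token, and the delay is taken as a flat 2 seconds. At 80ms per frame that delay corresponds to 25 frames.

```python
PAD = "<pad>"       # hypothetical stand-in for the model's padding token
FRAME_MS = 80       # frame duration used throughout this page
DELAY_MS = 2000     # approximate text delay stated above

# Number of audio frames the text output trails behind the microphone.
DELAY_FRAMES = DELAY_MS // FRAME_MS


def visible_text(tokens):
    """Join the model's per-frame output, dropping padding tokens."""
    return "".join(token for token in tokens if token != PAD)


# A pause in the middle of a phrase produces padding, not phantom text.
stream = ["Hello", PAD, PAD, " world", PAD]
text = visible_text(stream)
```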

🔒

Fixed-capacity memory — runs forever

The model's attention cache is a fixed-capacity ring buffer — pre-allocated at startup, never growing. Following Kyutai's official inference design, GPU memory stays locked at ~5.8GB indefinitely. No memory leaks, no slowdowns, no matter how long you run it.
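The fixed-capacity idea can be sketched with a small ring buffer: storage is allocated once, and each new entry overwrites the oldest one when the buffer is full. This is a generic illustration with hypothetical names, not Kyutai's cache implementation, which holds tensors on the GPU.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity  # allocated once, never resized
        self._next = 0                   # index of the next slot to write
        self._count = 0                  # number of valid entries

    @property
    def capacity(self):
        return len(self._slots)

    def append(self, item):
        self._slots[self._next] = item
        self._next = (self._next + 1) % self.capacity
        self._count = min(self._count + 1, self.capacity)

    def items(self):
        """Return the stored entries from oldest to newest."""
        start = (self._next - self._count) % self.capacity
        return [self._slots[(start + i) % self.capacity]
                for i in range(self._count)]


buf = RingBuffer(3)
for step in range(5):
    buf.append(step)
```

After five appends into three slots, only the three most recent entries remain and the storage size is unchanged. Python's standard `collections.deque(maxlen=...)` gives the same discard-oldest behaviour.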

Accuracy & Speed

Metric	Value
Arena Score	9.4 combined — Professional tier
WER (Word Error Rate)	6.4% — best-in-class for the model size
Delivery	Chunk-based — text arrives ~2 seconds behind speech
Language	English
Punctuation	Context-aware, generated by the model
Capitalisation	Automatic, intelligent
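For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A 6.4% WER therefore means roughly 64 word errors per 1,000 reference words. The function below is a standard textbook calculation, not the benchmark harness behind the figure in the table, and it applies no text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```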

Memory & Resource Footprint

Resource	Usage	Behaviour Over Time
GPU VRAM	~5.8GB (Kyutai STT 2.6B)	Stable — fixed-capacity ring buffer, never grows
RAM	~1-2GB (Python process)	Stable
Disk	Zero temp files	Audio processed in memory, never written to disk
Network	None	Fully offline — no internet required

Recording Limits

♾️

No Recording Limit

VoxBar Pro Kyutai 2.6B uses a fixed-capacity ring buffer for its attention cache. GPU memory stays locked at ~5.8GB indefinitely, with no memory leaks and no slowdowns. Record for hours without interruption.

⏱️

Auto-Stop Behaviour

Silence timeout: 5 minutes of no detected speech triggers auto-stop. The semantic VAD (voice activity detection) system distinguishes between actual silence and natural pauses in conversation.
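The timeout logic can be sketched as a timer that restarts whenever speech is detected. The class below is a hypothetical illustration: it takes the speech/no-speech decision as an input and does not model the semantic VAD itself. Timestamps are passed in explicitly so the behaviour is easy to test.

```python
class SilenceTimer:
    """Signals auto-stop after a fixed period with no detected speech."""

    def __init__(self, timeout_s=300.0):  # 5 minutes, as stated above
        self.timeout_s = timeout_s
        self.last_speech_s = None

    def update(self, now_s, speech_detected):
        """Record one observation; return True when recording should stop."""
        if speech_detected or self.last_speech_s is None:
            # Speech (or the very first observation) restarts the countdown.
            self.last_speech_s = now_s
            return False
        return now_s - self.last_speech_s >= self.timeout_s


timer = SilenceTimer()
timer.update(0.0, speech_detected=True)
still_running = not timer.update(299.0, speech_detected=False)
should_stop = timer.update(300.0, speech_detected=False)
```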

🔄

Self-Recovering Pipeline

A built-in staleness monitor watches the pipeline. If inference stalls for any reason, VoxBar automatically resets and reconnects — no manual restart needed. This makes long sessions completely hands-free.
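A staleness monitor of this kind is commonly built as a watchdog: the pipeline reports a heartbeat whenever it produces output, and a periodic check triggers a reset if the last heartbeat is too old. The sketch below is an assumption about the general pattern, not VoxBar's code; the 10-second threshold and all names are hypothetical.

```python
class StalenessMonitor:
    """Calls a reset callback when no heartbeat arrives within max_idle_s."""

    def __init__(self, max_idle_s, reset, start_s=0.0):
        self.max_idle_s = max_idle_s
        self.reset = reset            # callback that restarts the pipeline
        self.last_heartbeat_s = start_s
        self.reset_count = 0

    def heartbeat(self, now_s):
        """Called by the pipeline each time inference produces output."""
        self.last_heartbeat_s = now_s

    def check(self, now_s):
        """Called periodically; returns True if a reset was triggered."""
        if now_s - self.last_heartbeat_s < self.max_idle_s:
            return False
        self.reset()
        self.reset_count += 1
        self.last_heartbeat_s = now_s  # give the restarted pipeline a fresh window
        return True


events = []
monitor = StalenessMonitor(max_idle_s=10.0, reset=lambda: events.append("reset"))
monitor.heartbeat(5.0)
healthy = not monitor.check(12.0)   # 7s since last output: still fine
recovered = monitor.check(15.0)     # 10s since last output: reset fires
```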

Why VoxBar Kyutai Is Different

What you DON'T need

✖No internet connection — everything runs locally

✖No cloud processing — your voice never leaves your machine

✖No API keys — the model downloads once and runs offline forever

✖No usage limits — unlimited transcription, forever

✖No subscriptions — one-time purchase, lifetime license

What makes it unique

✔Frame-by-frame audio encoding: each 80ms frame is tokenised as it arrives, with text delivered in chunks about 2 seconds behind speech

✔System audio capture — transcribe meetings, YouTube, podcasts directly from your PC's audio output

✔Built-in punctuation — periods, commas generated by the model itself

✔Lightweight (~5.8GB VRAM) — uses a fixed-capacity memory system that never grows, keeping GPU usage flat indefinitely

✔Semantic VAD — intelligently detects speech vs silence, auto-recovers after pauses

✔Overlay Mode — transparent overlay sits on top of any app with adjustable transparency and font sizes

✔Mid-text editing — click anywhere in your text to insert new speech at that position

✔Voice commands — say "delete" to remove highlighted text, use voice punctuation and formatting

✔Self-recovering — built-in staleness monitor auto-resets the pipeline if it stalls, no manual restart needed

Hardware Requirements

Requirement	Minimum	Recommended
GPU	NVIDIA with 6GB VRAM	NVIDIA with 8GB+ VRAM
RAM	8GB	16GB
Disk	~5GB (model + app)	SSD
OS	Windows 10/11	Windows 11
Software	Python 3.11+	Included in installer

License & Attribution

VoxBar™ Pro Kyutai 2.6B is powered by Kyutai STT 2.6B, created by Kyutai Labs (Paris) and licensed under CC-BY-4.0.

VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by Kyutai Labs.

Kyutai 2.6B vs Pro Native

Feature	Kyutai 2.6B	Pro Native
Arena Score	9.4 combined	9.5 combined
VRAM	~5.8GB	~8.5GB
System Audio	Yes — capture meetings, videos	Microphone only
Languages	English	13 languages
Delivery	~2s delay	Sub-200ms real-time
Best for	Users with 6-8GB GPUs who want pro-grade English	Users with 10GB+ GPUs who want multilingual + speed

Ready for pro-grade English transcription?

One-time purchase. Lifetime license. 2 machines. Zero cloud.

Coming Soon

Secure checkout via Lemon Squeezy / Stripe