🏆 9.4 Arena · Professional Tier

VoxBar Pro Kyutai 2.6B

Pro-grade English transcription. The lowest VRAM in the Pro tier.

Powered by Kyutai STT 2.6B, a 2.6-billion-parameter decoder-only transformer optimised for English accuracy and paired with the Mimi neural audio codec.

$59 one-time · 🔥 LAUNCH: $29.50 with code EARLYBIRD

🎀 Transcribe your mic β€” or πŸ”Š listen to your system audio (meetings, podcasts, videos). All 100% local. Nothing leaves your machine.

9.4 Arena Score · 2.6B Parameters · 5.8GB VRAM · 100% Local & Private

How It Works

Chunk-based processing with the Mimi neural codec: 2.6 billion parameters optimised for English accuracy.

🎀

Captures your audio

Audio is captured from your microphone at 24kHz in 80ms frames (1,920 samples each). Switch to system audio mode to capture anything playing on your PC instead (meetings, videos, podcasts). No virtual cables needed.
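
As a rough illustration of the framing step (a sketch, not VoxBar's actual capture code), the 24kHz and 80ms figures above work out to 1,920 samples per frame. The function name `split_into_frames` and the silent test input are made up for this example.

```python
SAMPLE_RATE_HZ = 24_000   # capture rate stated above
FRAME_MS = 80             # frame duration stated above

# Samples per frame: 24,000 samples/s * 0.080 s = 1,920 samples
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000


def split_into_frames(samples):
    """Yield consecutive full frames; a trailing partial frame is left out."""
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        yield samples[start:start + FRAME_SAMPLES]


# One second of silence (hypothetical input) holds 12 full frames,
# with 960 samples left over to be joined with the audio that follows.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = list(split_into_frames(one_second))
leftover = len(one_second) - len(frames) * FRAME_SAMPLES
print(FRAME_SAMPLES, len(frames), leftover)
```

A real capture loop would carry the leftover samples into the next buffer rather than dropping them.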

🔊

Mimi neural audio codec

Each 80ms audio frame is encoded by Kyutai's Mimi codec into 32 parallel codebook streams, capturing both the meaning of speech and its acoustic characteristics. Mimi operates at 12.5 Hz (one frame every 80ms) with causal streaming, producing tokens as soon as each frame arrives.
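
The numbers in this step fit together with simple arithmetic. The sketch below only restates the figures given above (12.5 Hz, 32 codebooks); the helper `tokens_for` is a hypothetical name, and this is not a call into the Mimi codec itself.

```python
FRAME_RATE_HZ = 12.5   # Mimi frame rate stated above
NUM_CODEBOOKS = 32     # parallel codebook streams stated above

frame_period_ms = 1000 / FRAME_RATE_HZ             # 80.0 ms per frame
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS  # 400.0 audio tokens per second


def tokens_for(duration_s):
    """Audio tokens produced for a clip, counting only complete frames."""
    return int(duration_s * FRAME_RATE_HZ) * NUM_CODEBOOKS


print(frame_period_ms, tokens_per_second, tokens_for(60))
```

So one minute of speech becomes 750 frames, or 24,000 tokens across all 32 streams.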

🧠

2.6B decoder-only transformer

A 2.6-billion-parameter autoregressive model converts Mimi's audio tokens into text using greedy decoding: no sampling randomness, just the most confident prediction at every step. With 2.6 billion parameters, the model reaches 6.4% WER on English, with built-in punctuation and capitalisation.
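
Greedy decoding is easy to show in miniature. In this sketch, `toy_scores` is a hypothetical stand-in for the real model and the four-entry vocabulary is invented; the only point is that each step takes the top-scoring token and feeds it back as history, so the same input always gives the same output.

```python
def argmax(scores):
    """Index of the largest score; ties go to the lowest index."""
    return max(range(len(scores)), key=lambda i: scores[i])


def greedy_decode(score_fn, audio_frames):
    """Pick the top-scoring token at every step and feed it back as history."""
    history = []
    for frame in audio_frames:
        scores = score_fn(history, frame)  # one score per vocabulary entry
        history.append(argmax(scores))     # no sampling, no temperature
    return history


def toy_scores(history, frame):
    """Hypothetical stand-in for the model: favours the token equal to `frame`."""
    scores = [0.0, 0.0, 0.0, 0.0]
    scores[frame] = 1.0
    return scores


decoded = greedy_decode(toy_scores, [1, 2, 3])
print(decoded)
```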

✍️

Text arrives with minimal delay

The 2.6B model processes audio in chunks, with text appearing about 2 seconds behind your speech. This short delay is the trade-off for the model's accuracy. During natural pauses, the model outputs padding tokens (silence markers) until speech resumes, keeping the pipeline alive without producing phantom text.
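
A minimal sketch of how padding tokens can be dropped before text reaches the screen. The `<pad>` marker and the token strings are assumptions made for this example, not the model's real vocabulary.

```python
PAD = "<pad>"     # hypothetical silence marker
FRAME_MS = 80     # frame duration from the capture step
DELAY_MS = 2000   # approximate text delay stated above

# About 25 frames of audio are heard before the matching text appears.
delay_frames = DELAY_MS // FRAME_MS


def visible_text(token_stream):
    """Join the text tokens and skip padding, so pauses add nothing on screen."""
    return "".join(token for token in token_stream if token != PAD)


stream = ["Hello", ",", PAD, PAD, PAD, " world", "."]
print(delay_frames, repr(visible_text(stream)))
```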

🔒

Fixed-capacity memory β€” runs forever

The model's attention cache is a fixed-capacity ring buffer, pre-allocated at startup and never growing. Following Kyutai's official inference design, GPU memory stays locked at ~5.8GB indefinitely. No memory leaks, no slowdowns, no matter how long you run it.
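
The idea behind a fixed-capacity ring buffer can be sketched in a few lines. This is a plain-Python illustration with a hypothetical `RingCache` class, not the GPU attention cache itself: storage is allocated once, and new entries overwrite the oldest ones.

```python
class RingCache:
    """Fixed-capacity buffer: allocated once, oldest entries are overwritten."""

    def __init__(self, capacity):
        self._slots = [None] * capacity  # allocated up front, never resized
        self._next = 0                   # slot the next entry will overwrite
        self._count = 0                  # number of valid entries so far

    def append(self, entry):
        self._slots[self._next] = entry
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def window(self):
        """Return the stored entries from oldest to newest."""
        capacity = len(self._slots)
        start = (self._next - self._count) % capacity
        return [self._slots[(start + i) % capacity] for i in range(self._count)]


cache = RingCache(capacity=4)
for step in range(10):
    cache.append(step)
print(cache.window())  # only the 4 most recent steps remain
```

However many steps are appended, the storage stays at four slots, which is why memory use stays flat over a long session.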

Accuracy & Speed

| Metric | Value |
| --- | --- |
| Arena Score | 9.4 combined (Professional tier) |
| WER (Word Error Rate) | 6.4%, best-in-class for the model size |
| Delivery | Chunk-based, text arrives ~2 seconds behind speech |
| Language | English |
| Punctuation | Context-aware, generated by the model |
| Capitalisation | Automatic, intelligent |
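
For readers unfamiliar with WER: it is the number of word substitutions, insertions and deletions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a generic implementation with made-up sentences; it is not the benchmark procedure behind the 6.4% figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from transcript
            insertion = current[j - 1] + 1   # extra word in transcript
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One wrong word out of five gives a WER of 20%.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))
```

At 6.4% WER, a transcript would contain roughly 64 word errors per 1,000 reference words.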

Memory & Resource Footprint

| Resource | Usage | Behaviour Over Time |
| --- | --- | --- |
| GPU VRAM | ~5.8GB (Kyutai STT 2.6B) | Stable: fixed-capacity ring buffer, never grows |
| RAM | ~1-2GB (Python process) | Stable |
| Disk | Zero temp files | Audio processed in memory, never written to disk |
| Network | None | Fully offline, no internet required |
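
A back-of-the-envelope check on the VRAM figure. This sketch assumes 16-bit weights (2 bytes per parameter), which this page does not state, so treat the split between weights and everything else as an estimate only.

```python
PARAMETERS = 2_600_000_000   # 2.6B, as stated on this page
BYTES_PER_PARAMETER = 2      # ASSUMPTION: 16-bit weights (not stated on this page)
REPORTED_VRAM_GB = 5.8       # total GPU usage stated on this page

# Decimal gigabytes (1 GB = 1e9 bytes)
weights_gb = PARAMETERS * BYTES_PER_PARAMETER / 1e9
headroom_gb = REPORTED_VRAM_GB - weights_gb

print(f"weights ~{weights_gb:.1f} GB, everything else ~{headroom_gb:.1f} GB")
```

Under that assumption, the weights account for most of the footprint, and the remainder would cover the fixed-size attention cache, the codec and working buffers.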

Recording Limits

♾️

No Recording Limit

VoxBar Kyutai 2.6B uses a fixed-capacity ring buffer for its attention cache. GPU memory stays locked at ~5.8GB indefinitely, with no memory leaks and no slowdowns. Record for hours without interruption.

⏱️

Auto-Stop Behaviour

Silence timeout: 5 minutes of no detected speech triggers auto-stop. The semantic VAD system intelligently distinguishes between actual silence and natural pauses in conversation.
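
The timeout logic can be sketched as a small timer. The class name `SilenceTimeout` is hypothetical, and the `speech_detected` flag is assumed to come from the VAD, which is not modelled here. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class SilenceTimeout:
    """Signals a stop once no speech has been seen for `limit_s` seconds."""

    def __init__(self, limit_s=300.0):   # 5 minutes, as stated above
        self.limit_s = limit_s
        self._last_speech_s = None

    def update(self, now_s, speech_detected):
        """Record the latest observation; return True when recording should stop."""
        if speech_detected or self._last_speech_s is None:
            self._last_speech_s = now_s  # (re)start the silence clock
        return now_s - self._last_speech_s >= self.limit_s


timer = SilenceTimeout()
print(timer.update(0.0, speech_detected=True))     # speech just heard
print(timer.update(299.0, speech_detected=False))  # still inside the window
print(timer.update(300.0, speech_detected=False))  # 5 minutes of silence reached
```

Any detected speech resets the clock, so ordinary pauses in a conversation never trigger the stop.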

🔄

Self-Recovering Pipeline

A built-in staleness monitor watches the pipeline. If inference stalls for any reason, VoxBar automatically resets and reconnects, with no manual restart needed. This keeps long sessions hands-free.
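
One common way to build this kind of watchdog is a heartbeat check, sketched below. The class name, the 10-second threshold and the `reset_fn` callback are all assumptions for illustration; the page does not say how VoxBar's monitor is implemented or tuned.

```python
class StalenessMonitor:
    """Calls `reset_fn` when no progress has been reported for too long."""

    def __init__(self, max_idle_s, reset_fn):
        self.max_idle_s = max_idle_s   # hypothetical threshold
        self.reset_fn = reset_fn       # hypothetical pipeline reset hook
        self._last_progress_s = 0.0

    def heartbeat(self, now_s):
        """The pipeline calls this whenever inference produces output."""
        self._last_progress_s = now_s

    def check(self, now_s):
        """Return True if a reset was triggered on this check."""
        if now_s - self._last_progress_s <= self.max_idle_s:
            return False
        self.reset_fn()
        self._last_progress_s = now_s  # give the fresh pipeline a full window
        return True


resets = []
monitor = StalenessMonitor(max_idle_s=10.0, reset_fn=lambda: resets.append("reset"))
monitor.heartbeat(now_s=0.0)
print(monitor.check(now_s=5.0))    # pipeline is still fresh
print(monitor.check(now_s=12.0))   # stalled, so the reset hook runs
```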

Why VoxBar Kyutai Is Different

What you DON'T need

No internet connection: everything runs locally
No cloud processing: your voice never leaves your machine
No API keys: the model downloads once and runs offline forever
No usage limits: unlimited transcription, forever
No subscriptions: one-time purchase, lifetime license

What makes it unique

Frame-by-frame audio encoding: every 80ms frame is tokenised as it arrives, with text following about 2 seconds behind
System audio capture: transcribe meetings, YouTube, podcasts directly from your PC's audio output
Built-in punctuation: periods and commas generated by the model itself
Lightweight (~5.8GB VRAM): a fixed-capacity memory system that never grows keeps GPU usage flat indefinitely
Semantic VAD: intelligently detects speech vs silence, auto-recovers after pauses
Overlay Mode: transparent overlay sits on top of any app, with adjustable transparency and font sizes
Mid-text editing: click anywhere in your text to insert new speech at that position
Voice commands: say "delete" to remove highlighted text, use voice punctuation and formatting
Self-recovering: built-in staleness monitor auto-resets the pipeline if it stalls, no manual restart needed
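
Mid-text editing and the "delete" command boil down to two string operations. The sketch below uses hypothetical function names and plain character offsets; it is only meant to show the idea, not how the VoxBar editor is built.

```python
def insert_at(text, cursor, spoken):
    """Insert newly transcribed text at the cursor; return (text, new cursor)."""
    cursor = max(0, min(cursor, len(text)))  # keep the cursor inside the text
    return text[:cursor] + spoken + text[cursor:], cursor + len(spoken)


def delete_selection(text, start, end):
    """Remove the highlighted range, as a spoken "delete" would."""
    start, end = sorted((max(0, start), min(end, len(text))))
    return text[:start] + text[end:], start


text, cursor = insert_at("Hello world", 5, ",")
print(text, cursor)
text, cursor = delete_selection("Hello, big world", 6, 10)
print(text, cursor)
```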

Hardware Requirements

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA with 6GB VRAM | NVIDIA with 8GB+ VRAM |
| RAM | 8GB | 16GB |
| Disk | ~5GB (model + app) | SSD |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.11+ | Included in installer |

License & Attribution

VoxBar™ Pro Kyutai 2.6B is powered by Kyutai STT 2.6B, created by Kyutai Labs (Paris) and licensed under CC-BY-4.0.

VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by Kyutai Labs.

Kyutai 2.6B vs Pro Native

| Feature | Kyutai 2.6B | Pro Native |
| --- | --- | --- |
| Arena Score | 9.4 combined | 9.5 combined |
| VRAM | ~5.8GB | ~8.5GB |
| System Audio | Yes (capture meetings, videos) | Microphone only |
| Languages | English | 13 languages |
| Delivery | ~2s delay | Sub-200ms real-time |
| Best for | Users with 6-8GB GPUs who want pro-grade English | Users with 10GB+ GPUs who want multilingual + speed |

Ready for pro-grade English transcription?

One-time purchase. Lifetime license. 2 machines. Zero cloud.

Coming Soon

Secure checkout via Lemon Squeezy / Stripe