⚡ 8.1 Arena — Performance Tier

VoxBar Kyutai 1B

Millisecond-speed voice transcription. Words appear as you speak.

Powered by Kyutai STT 1B, a 1-billion-parameter decoder-only transformer paired with the Mimi neural codec for true frame-by-frame streaming. Supports English and French.

$39 one-time · 🔥 LAUNCH: $19.50 with code EARLYBIRD

🎤 Transcribe your mic — or 🔊 listen to your system audio (meetings, podcasts, videos). All 100% local. Nothing leaves your machine.

<80ms
Latency
1B
Parameters
EN + FR
Languages
2.7GB
VRAM

How It Works

True frame-by-frame streaming: 12.5 frames per second (one every 80ms), processed as you speak.

🎤

Captures your audio

Audio is captured at 24kHz in tiny 80ms frames from your microphone — or switch to system audio mode to capture anything playing on your PC (meetings, videos, podcasts). No virtual cables needed.
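As a rough illustration of the numbers above, 24kHz audio cut into 80ms frames gives 1,920 samples per frame. The sketch below is not VoxBar's capture code; the function name `split_into_frames` and the list-of-floats input are assumptions made for the example.

```python
SAMPLE_RATE_HZ = 24_000   # capture rate stated above
FRAME_MS = 80             # frame length stated above

# 24,000 samples/s * 0.080 s = 1,920 samples per frame
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000


def split_into_frames(samples):
    """Return (full_frames, remainder); the remainder waits for more audio."""
    full = len(samples) - len(samples) % FRAME_SAMPLES
    frames = [samples[i:i + FRAME_SAMPLES] for i in range(0, full, FRAME_SAMPLES)]
    return frames, samples[full:]


# One second of silence: 12 full frames plus half a frame left over.
one_second = [0.0] * SAMPLE_RATE_HZ
frames, remainder = split_into_frames(one_second)
print(FRAME_SAMPLES, len(frames), len(remainder))  # 1920 12 960
```

The leftover half frame shows why a streaming pipeline has to carry a small remainder between reads instead of dropping it.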

🔊

Mimi neural audio codec

Each 80ms audio frame is encoded by Kyutai's Mimi codec into 32 parallel codebook streams — capturing both the meaning of speech and its acoustic characteristics. Mimi operates at 12.5 Hz with causal streaming, producing tokens the instant audio arrives.
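The figures in this step can be checked with simple arithmetic: one frame every 80ms is 12.5 frames per second, and 32 codebooks per frame works out to 400 audio tokens per second. The helper `audio_tokens_for` below is a hypothetical name used only for this back-of-the-envelope sketch.

```python
FRAME_MS = 80     # frame length stated above
CODEBOOKS = 32    # parallel codebook streams stated above

frame_rate_hz = 1000 / FRAME_MS                       # 12.5 frames per second
audio_tokens_per_second = frame_rate_hz * CODEBOOKS   # 400 tokens per second


def audio_tokens_for(seconds):
    """Audio tokens produced for `seconds` of speech (whole frames only)."""
    return int(seconds * frame_rate_hz) * CODEBOOKS


print(frame_rate_hz, audio_tokens_per_second, audio_tokens_for(60))
```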

🧠

1B decoder-only transformer

A 1-billion-parameter autoregressive model converts Mimi's audio tokens into text using greedy decoding: no sampling randomness, just the most confident prediction every frame. With zero-delay alignment between audio and text streams, text tokens emerge with minimal latency and with built-in punctuation and capitalisation.
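Greedy decoding simply means taking the highest-scoring token at every step. The sketch below shows the idea on made-up numbers; the vocabulary, the scores, and the name `greedy_pick` are all hypothetical and are not taken from the Kyutai model.

```python
def greedy_pick(scores):
    """Return the index of the highest score (ties go to the lowest index)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 4-token vocabulary and per-frame scores, for illustration only.
VOCAB = ["<pad>", "hello", "world", "."]
frame_scores = [
    [0.1, 2.3, 0.4, 0.0],
    [0.2, 0.1, 3.1, 0.3],
    [0.0, 0.2, 0.1, 1.9],
]

tokens = [VOCAB[greedy_pick(row)] for row in frame_scores]
print(tokens)  # ['hello', 'world', '.']
```

Because nothing is sampled, the same audio always produces the same text.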

✍️

Words appear instantly

Each text token is decoded and displayed immediately — you see words forming milliseconds after you speak them. During natural pauses, the model outputs padding tokens (silence markers) until speech resumes, keeping the pipeline alive without producing phantom text.
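One way to picture the padding behaviour: the token stream never stops, but only non-padding tokens reach the screen. This is a minimal sketch under that assumption; the `PAD` marker and the function `visible_text` are hypothetical names, not the model's actual token IDs.

```python
PAD = "<pad>"  # hypothetical silence marker; the real model uses its own token


def visible_text(token_stream):
    """Yield only the tokens that should be displayed, skipping padding."""
    for token in token_stream:
        if token != PAD:
            yield token


# A pause in the middle of a sentence produces padding, not phantom text.
stream = ["Hello", PAD, PAD, ",", " world", PAD, "."]
print("".join(visible_text(stream)))  # Hello, world.
```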

🔒

Fixed-capacity memory — runs forever

The model's attention cache is a fixed-capacity ring buffer — pre-allocated at startup, never growing. Following Kyutai's official inference design, GPU memory stays locked at ~2.7GB indefinitely. No memory leaks, no slowdowns, no matter how long you run it.
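The ring-buffer idea can be sketched in a few lines: storage is allocated once, and new entries overwrite the oldest ones when the buffer is full. This is an illustration of the general technique only, not Kyutai's or VoxBar's implementation; the class `RingCache` and its methods are hypothetical.

```python
class RingCache:
    """Fixed-capacity buffer: allocated once, oldest entries overwritten."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._slots = [None] * capacity  # allocated up front, never resized
        self._next = 0                   # slot the next append will write to
        self._count = 0                  # number of valid entries

    def append(self, item):
        self._slots[self._next] = item
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def items(self):
        """Return the stored entries from oldest to newest."""
        cap = len(self._slots)
        start = (self._next - self._count) % cap
        return [self._slots[(start + i) % cap] for i in range(self._count)]


cache = RingCache(3)
for step in range(5):
    cache.append(step)
print(cache.items())  # [2, 3, 4]
```

However many entries are appended, the underlying storage stays at three slots, which is why memory use stays flat over long sessions.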

Accuracy & Speed

Metric | Value
Arena Score | 8.1 combined (Performance tier)
WER (Word Error Rate) | Higher than 2.6B; optimised for speed over accuracy
Delivery | Frame-by-frame; words appear as you speak (<80ms)
Language | English + French
Punctuation | Context-aware, generated by the model
Capitalisation | Automatic, intelligent

Memory & Resource Footprint

Resource | Usage | Behaviour Over Time
GPU VRAM | ~2.7GB (Kyutai STT 1B) | Stable: fixed-capacity ring buffer, never grows
RAM | ~1-2GB (Python process) | Stable
Disk | Zero temp files | Audio processed in memory, never written to disk
Network | None | Fully offline, no internet required

Recording Limits

♾️

No Recording Limit

VoxBar Kyutai 1B uses a fixed-capacity ring buffer for its attention cache. GPU memory stays locked at ~2.7GB indefinitely — the smallest footprint in the lineup. Record for hours without interruption.

⏱️

Auto-Stop Behaviour

Silence timeout: 5 minutes of no detected speech triggers auto-stop.
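A silence timeout like this can be modelled as a timer that restarts whenever speech is detected. The sketch below assumes timestamps in seconds are passed in by the caller (for example from `time.monotonic()`); the class `SilenceTimer` is a hypothetical name, not part of VoxBar.

```python
SILENCE_TIMEOUT_S = 5 * 60  # 5 minutes, as stated above


class SilenceTimer:
    """Tracks time since the last detected speech."""

    def __init__(self, timeout_s=SILENCE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_speech_at = None

    def update(self, now, speech_detected):
        """Return True when the silence timeout has elapsed."""
        if speech_detected or self.last_speech_at is None:
            self.last_speech_at = now  # (re)start the countdown
            return False
        return now - self.last_speech_at >= self.timeout_s


timer = SilenceTimer()
print(timer.update(0.0, True))     # False: speech just heard
print(timer.update(299.0, False))  # False: still under 5 minutes
print(timer.update(300.0, False))  # True: 5 minutes of silence
```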

Why VoxBar Kyutai Is Different

What you DON'T need

No internet connection — everything runs locally
No cloud processing — your voice never leaves your machine
No API keys — the model downloads once and runs offline forever
No usage limits — unlimited transcription, forever
No subscriptions — one-time purchase, lifetime license

What makes it unique

True frame-by-frame streaming — not chunked, not batched
System audio capture — transcribe meetings, YouTube, podcasts directly from your PC's audio output
Built-in punctuation: periods and commas are generated by the model itself
Lightest GPU footprint (~2.7GB VRAM) — uses a fixed-capacity memory system that never grows, keeping GPU usage flat indefinitely
Semantic VAD — intelligently detects speech vs silence, auto-recovers after pauses
Overlay Mode — transparent overlay sits on top of any app with adjustable transparency and font sizes
Mid-text editing — click anywhere in your text to insert new speech at that position
Voice commands — say "delete" to remove highlighted text, use voice punctuation and formatting
Self-recovering — built-in staleness monitor auto-resets the pipeline if it stalls, no manual restart needed
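The self-recovery feature in the list above can be pictured as a watchdog: if the pipeline has produced no output for too long, a reset callback fires. This is a sketch of the general pattern only. The 10-second threshold, the class `StalenessMonitor`, and the `reset_pipeline` callback are assumptions for illustration; the page does not state how VoxBar's monitor is configured.

```python
class StalenessMonitor:
    """Calls a reset callback when no output has arrived for too long."""

    def __init__(self, max_idle_s, reset_pipeline, started_at=0.0):
        self.max_idle_s = max_idle_s          # hypothetical threshold
        self.reset_pipeline = reset_pipeline  # hypothetical reset callback
        self.last_output_at = started_at
        self.resets = 0

    def heartbeat(self, now):
        """Record that the pipeline produced output at time `now`."""
        self.last_output_at = now

    def check(self, now):
        """Reset the pipeline if it looks stalled; return True if it was reset."""
        if now - self.last_output_at > self.max_idle_s:
            self.reset_pipeline()
            self.last_output_at = now
            self.resets += 1
            return True
        return False


events = []
monitor = StalenessMonitor(max_idle_s=10.0, reset_pipeline=lambda: events.append("reset"))
monitor.heartbeat(1.0)
print(monitor.check(5.0))   # False: output arrived 4 seconds ago
print(monitor.check(12.0))  # True: more than 10 seconds without output
```

Padding tokens emitted during silence would count as output in this model, so a quiet room is not mistaken for a stalled pipeline.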

Hardware Requirements

Requirement | Minimum | Recommended
GPU | NVIDIA with 3GB VRAM | NVIDIA with 4GB+ VRAM
RAM | 8GB | 16GB
Disk | ~3GB (model + app) | SSD
OS | Windows 10/11 | Windows 11
Software | Python 3.11+ | Included in installer

License & Attribution

VoxBar™ Kyutai 1B is powered by Kyutai STT 1B, created by Kyutai Labs (Paris) and licensed under CC-BY-4.0.

VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by Kyutai Labs.

Kyutai 1B vs Kyutai 2.6B

Feature | Kyutai 1B | Kyutai 2.6B
Arena Score | 8.1 combined | 9.4 combined
VRAM | ~2.7GB | ~5.8GB
Latency | <80ms frame-by-frame | ~2s chunk delay
Languages | English + French | English
Price | $39 | $59
Best for | Budget GPUs (3-4GB), real-time speed | Higher accuracy, system audio capture

Ready for real-time streaming transcription?

One-time purchase. Lifetime license. 2 machines. Zero cloud.

Coming Soon

Secure checkout via Lemon Squeezy / Stripe