
The Team Behind the Engine: Kyutai & Moshi

Every time you speak into Vox Bar and watch your words appear on screen, you're using technology built by a small team in Paris who believed AI should belong to everyone. This is their story — and our thank you.

The French Open-Science Lab

In November 2023, a new kind of AI research lab launched in Paris. Funded by almost €300 million from Iliad’s Xavier Niel, CMA CGM’s Rodolphe Saadé, and Eric Schmidt, Kyutai isn't a traditional startup. It's a non-profit open-science research laboratory dedicated entirely to building artificial general intelligence — and sharing everything they discover with the world.

Led by Patrick Pérez (who previously headed Valeo.ai) and bringing together top scientists from DeepMind, FAIR (Meta), and Inria, Kyutai’s mission is to democratise AI by releasing its models, training code, and datasets as open source.

Moshi: Re-imagining Voice AI

In July 2024, Kyutai shocked the AI community by unveiling Moshi — the first natively multimodal conversational AI that could listen and speak simultaneously in real time. Unlike previous voice assistants that used a clunky "Speech-to-Text → Text LLM → Text-to-Speech" pipeline, Moshi processed audio natively token-by-token.

The streaming speech technology behind Moshi's real-time understanding is the foundation of what we now call Kyutai STT.

Streaming Token-by-Token Transcription

Most transcription models (like Whisper) take chunks of audio, process them, and return a sentence all at once. This adds latency. The Kyutai STT engine works fundamentally differently.

As you speak, the model predicts text token by token as the audio arrives, so words appear while you are still talking rather than after a whole chunk has been processed. In Vox Bar, the Kyutai 1B model posts text to your screen in under 0.5 seconds, a near-real-time path from your microphone to your application.
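To make the difference concrete, here is a minimal toy sketch in Python. It is not Kyutai's code or API: the frames are plain strings standing in for short slices of audio, and `toy_decode_step`, `transcribe_chunked`, `transcribe_streaming`, and `frames_until_first_text` are hypothetical names invented for this illustration. It only shows why emitting text per frame reaches the screen sooner than waiting for a full chunk.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# Hypothetical stand-in: each "frame" is a string that either carries
# a piece of text or is empty (silence). Real audio frames are samples.
Frame = str
DecodeStep = Callable[[Frame], Optional[str]]


def toy_decode_step(frame: Frame) -> Optional[str]:
    """Toy decoder (an assumption for this sketch): text or None for silence."""
    return frame if frame else None


def transcribe_chunked(
    frames: Iterable[Frame], decode_step: DecodeStep, chunk_size: int
) -> Iterator[str]:
    """Buffer chunk_size frames, then emit all of their text at once."""
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == chunk_size:
            yield "".join(t for t in map(decode_step, buffer) if t is not None)
            buffer = []
    if buffer:  # flush whatever is left at the end of the audio
        yield "".join(t for t in map(decode_step, buffer) if t is not None)


def transcribe_streaming(
    frames: Iterable[Frame], decode_step: DecodeStep
) -> Iterator[str]:
    """Emit each token as soon as the frame that produced it arrives."""
    for frame in frames:
        token = decode_step(frame)
        if token is not None:
            yield token


def frames_until_first_text(
    transcriber: Callable[[Iterable[Frame]], Iterator[str]],
    frames: Iterable[Frame],
) -> int:
    """Count how many frames are consumed before any text is produced."""
    consumed = 0

    def counted() -> Iterator[Frame]:
        nonlocal consumed
        for frame in frames:
            consumed += 1
            yield frame

    for _ in transcriber(counted()):
        break  # stop at the first piece of text
    return consumed


frames = ["Hello", "", " world", "", " again", ""]

streaming_wait = frames_until_first_text(
    lambda f: transcribe_streaming(f, toy_decode_step), frames
)
chunked_wait = frames_until_first_text(
    lambda f: transcribe_chunked(f, toy_decode_step, chunk_size=4), frames
)
print(f"streaming: first text after {streaming_wait} frame(s)")
print(f"chunked:   first text after {chunked_wait} frame(s)")
```

In this toy setup both approaches produce the same words overall. The only thing that changes is how many frames must arrive before anything can be shown, which is the latency the section above describes.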

🇫🇷

Merci, Kyutai

To the researchers at Kyutai — thank you for pushing the boundaries of real-time audio and keeping your groundbreaking work open source.

Vox Bar's lowest-latency engine is built on Kyutai STT.

Experience Kyutai STT for yourself

Vox Bar brings Kyutai's frontier transcription to your desktop. Private. Local. Yours.
