Three of the six engines in the VoxBar suite are built on models created by one team: NVIDIA's NeMo Conversational AI group.
If you've ever used VoxBar Ultra, VoxBar Flash, or VoxBar AI, you've experienced their work firsthand. We wanted to take a moment to talk about who they are and what they've built.
What They Built
The NeMo team develops NVIDIA's open-source toolkit for speech, language, and multimodal AI. It's called NeMo — short for Neural Modules — and it's the framework that trains, fine-tunes, and deploys models like the ones we use.
From that framework, three model families power VoxBar:
Parakeet TDT 0.6B v2 — the engine behind VoxBar Ultra. A 600-million-parameter model that topped the Hugging Face Open ASR Leaderboard at release, with a word error rate as low as 1.69% on LibriSpeech test-clean. It can transcribe 60 minutes of audio in roughly one second. That's not a typo.
Canary 1B Flash — the engine behind VoxBar Flash. An 883-million-parameter multilingual model supporting English, German, French, and Spanish, with both transcription and translation capabilities.
Canary Qwen 2.5B — the engine behind VoxBar AI. This one is special. It combines the Canary 1B Flash speech encoder with Alibaba's Qwen 1.7B language model, creating what NVIDIA calls a Speech-Augmented Language Model (SALM). It doesn't just hear words — it understands context, producing transcription that reads like it was typed by a person.
Why It Matters to Us
All three models we use are open source, released under CC-BY-4.0 — a license that permits commercial use. That's a deliberate choice. They could keep these models proprietary. They could charge for access. Instead, they publish the weights, the training code, and the documentation — and let developers like us build on top of them.
VoxBar exists because of that decision. We didn't train these models. We didn't build the FastConformer architecture or invent the Token-and-Duration Transducer. The NeMo team did that work — years of research, massive compute budgets, and thousands of hours of audio data — and then made it available to everyone.
We took their models and built something we hope is useful: a private, local, no-cloud transcription tool that anyone with an NVIDIA GPU can run on their own machine. But the foundation is theirs.
The Open Source Commitment
What stands out about the NeMo team isn't just the quality of their models — it's the consistency of their open-source commitment. They don't release one model and move on. They iterate publicly:
- Parakeet went from 1.1B → 0.6B v2 → 0.6B v3, nearly halving the parameter count while improving accuracy
- Canary went from 1B → 1B Flash → 1B v2 (expanding to 25 languages) → Qwen 2.5B (adding LLM capabilities)
- Every release includes pre-trained checkpoints, training recipes, and integration with the NeMo toolkit
That's not a marketing exercise. That's a team that genuinely wants developers to use their work.
Attribution
We use three NVIDIA models under the CC-BY-4.0 license, and we're glad to give credit where it's due:
- VoxBar Ultra is powered by nvidia/parakeet-tdt-0.6b-v2
- VoxBar Flash is powered by nvidia/canary-1b-flash
- VoxBar AI is powered by nvidia/canary-qwen-2.5b
All three models were created by the NVIDIA NeMo team and are licensed under CC-BY-4.0.
VoxBar is an independent product and is not affiliated with, endorsed by, or sponsored by NVIDIA. We're just grateful users of their open-source models.