Founders & Vision

The Team Behind the Engine: Alibaba's Qwen

Meet the team who built the language model backbone that gives VoxBar AI its contextual intelligence.

March 2026 · 4 min read

VoxBar AI's engine — the Canary Qwen 2.5B — is a hybrid. Half of it is a speech encoder built by NVIDIA. The other half is a large language model built by a team on the other side of the world: Alibaba's Tongyi Lab, creators of the Qwen model family.

Who They Are

The Qwen team operates within Alibaba Cloud's Tongyi Lab in China. They're led by a group of engineers who have quietly built one of the most successful open-source AI model families in the world:

  • Lin Junyang leads the open-source initiatives
  • Liu Dayiheng serves as lead engineer
  • Hui Binyuan drives the coding model efforts
  • Xu Jin handles audio and multimodal capabilities
  • Wu Chenfei leads the vision model work

By 2025, Qwen had become the most popular open model family globally, with over 40 million downloads and more than 50,000 derivative models built on top of their work.

What They Built

The Qwen family started as a text-only language model and expanded into an entire ecosystem:

  • Qwen 2.5 (September 2024) — over 100 models across language, vision, and audio, with significant improvements in knowledge, coding, and reasoning
  • Qwen 3 (April 2025) — models from 0.6B to 235B parameters, trained on 36 trillion tokens across 119 languages
  • Qwen 3.5 (February 2026) — the latest generation, supporting 201 languages with a new architecture

The model that matters to VoxBar is Qwen3 1.7B, a compact but capable language model that NVIDIA chose as the backbone for Canary Qwen 2.5B. When NVIDIA needed a language model small enough to pair with a speech encoder yet smart enough to produce contextually aware transcription, they picked Qwen.

Why It Matters to VoxBar AI

Most speech recognition models just map sounds to words. They hear "there" and write "there" — even when the context clearly calls for "their" or "they're." They don't understand language. They process audio.

VoxBar AI is different because of Qwen. The Canary Qwen 2.5B model pairs NVIDIA's speech encoder with Qwen's language understanding, creating what NVIDIA calls a Speech-Augmented Language Model. The speech encoder hears the sounds. The Qwen backbone understands the meaning. Together, they produce transcription that reads naturally — with correct word choices, proper punctuation, and contextual intelligence.

That's not something we built. That's something Alibaba's team made possible by releasing their models under the Apache 2.0 license, allowing anyone — including NVIDIA — to integrate Qwen into new architectures.

The Open Source Philosophy

What's remarkable about the Qwen team is the scale of their open-source commitment. Most of their models, including the one we use, are released under Apache 2.0, one of the most permissive licences in open source. The obligations are minimal: keep the copyright and licence notices intact, and you're free to modify, redistribute, and build commercial products on top.

They keep their largest "Max" models proprietary and tied to Alibaba Cloud's services, which is fair enough. But the core model family — the one that developers actually build on — is freely available.

That decision, made in Hangzhou, directly enables a transcription tool running on someone's desktop in London. The open-source ecosystem is genuinely global, and the Qwen team is a significant part of why.

VoxBar is an independent product and is not affiliated with, endorsed by, or sponsored by Alibaba Cloud or the Qwen team. We use the Qwen architecture indirectly through NVIDIA's Canary Qwen 2.5B model.