🚀🚀🚀 Introducing Xiaomi-MiMo-Audio — A BREAKTHROUGH in general-purpose audio intelligence! We scaled pretraining to 100M+ hours and observed true EMERGENCE: few-shot generalization across diverse audio tasks!
🔥 MiMo-Audio-7B-Instruct supercharged with thinking mechanisms + instruction tuning:
✅ Open-source 7B SOTA on MMSU, MMAU, MMAR, MMAU-Pro
✅ Outperforms Gemini-2.5-Flash on audio understanding (MMAU)
✅ Beats GPT-4o-Audio on complex reasoning (Big-Bench-Audio-S2T)
It’s all OPEN: tokenizer, model, evaluation suite, and more to come!
Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
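For quick orientation, here is a minimal loading sketch. It assumes the released checkpoint is compatible with the standard Hugging Face `transformers` auto classes and that the checkpoint id below is correct; the scripts in the repository are the authoritative entry point.

```python
# Loading sketch (assumptions: checkpoint id and transformers compatibility).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-Audio-7B-Instruct"  # assumed Hugging Face checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```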
We posit that the foremost criterion for an audio tokenizer is its reconstruction fidelity, and that its tokens should be amenable to downstream language modeling. Accordingly, we introduce MiMo-Audio-Tokenizer. This 1.2B-parameter model employs a Transformer-based architecture comprising an encoder, a discretization layer, and a decoder, operating at a 25Hz frame rate and generating 200 tokens per second through 8 layers of residual vector quantization (RVQ). By integrating semantic and reconstruction objectives, we trained it from scratch on a 10-million-hour corpus, achieving superior performance in reconstruction quality and facilitating downstream language modeling.
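To make the token-rate arithmetic concrete, here is a minimal residual vector quantization sketch. The frame rate and layer count follow the description above, while the hidden size and codebook size are illustrative placeholders, not the tokenizer's actual configuration. At 25 frames per second with 8 RVQ layers, each second of audio yields 25 × 8 = 200 tokens.

```python
import torch

def rvq_quantize(frames, codebooks):
    """Residual vector quantization: each layer quantizes the residual
    left by the previous layers, emitting one code index per frame.

    frames:    (T, D) encoder outputs at 25 Hz
    codebooks: list of 8 tensors, each (K, D)
    returns:   (T, 8) token indices, i.e. 25 * 8 = 200 tokens per second
    """
    residual = frames
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, K) distances to codebook entries
        idx = dists.argmin(dim=-1)          # nearest entry per frame
        residual = residual - cb[idx]       # pass the remainder to the next layer
        codes.append(idx)
    return torch.stack(codes, dim=-1)       # (T, 8)

# Toy check: 1 second of 25 Hz frames, 8 layers, 1024-entry codebooks (sizes assumed).
codebooks = [torch.randn(1024, 512) for _ in range(8)]
tokens = rvq_quantize(torch.randn(25, 512), codebooks)
assert tokens.shape == (25, 8)              # 200 tokens for this one second
```

The real tokenizer learns its codebooks jointly with the semantic and reconstruction objectives; the sketch only shows how 8 quantization layers at 25 Hz produce the stated 200 tokens per second.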
MiMo-Audio is a unified generative audio-language model that jointly models sequences of text and audio tokens. The model accepts both text and audio tokens as input and autoregressively predicts either text or audio tokens, thereby supporting a comprehensive range of tasks involving arbitrary combinations of text and audio modalities. To enhance the modeling efficiency for high-token-rate (200 tokens/second) sequences and mitigate the length disparity between speech and text modalities, we propose a novel architecture combining a patch encoder, LLM, and patch decoder. The patch encoder aggregates four consecutive timesteps of RVQ tokens into a single patch, downsampling the sequence to a 6.25Hz representation for the LLM. Subsequently, the patch decoder autoregressively generates the full 25Hz RVQ token sequence.
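A minimal sketch of the grouping step described above (the patch size and layer count follow the text; all other sizes are illustrative): four consecutive 25 Hz RVQ frames form one patch, so the LLM operates on a 25 / 4 = 6.25 Hz sequence.

```python
import torch

def patchify(rvq_tokens, patch_size=4):
    """Group consecutive RVQ frames into patches for the LLM.

    rvq_tokens: (T, 8) token indices at 25 Hz, T divisible by patch_size
    returns:    (T // patch_size, patch_size * 8), a 6.25 Hz sequence
    """
    T, n_layers = rvq_tokens.shape
    return rvq_tokens.reshape(T // patch_size, patch_size * n_layers)

frames = torch.randint(0, 1024, (100, 8))   # 4 s of audio at 25 Hz, 8 RVQ layers
patches = patchify(frames)
assert patches.shape == (25, 32)            # 25 patches over 4 s = 6.25 patches/s
```

In the full model, a learned patch encoder embeds each group before it enters the LLM, and the patch decoder autoregressively re-expands each patch back into the 25 Hz RVQ stream; the reshape above illustrates only the 4:1 grouping.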
By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks.
At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models.
The overall performance is as follows.
MiMo-Audio demonstrates comprehensive spoken-dialogue capabilities across multiple domains, engaging in natural conversations with accurate understanding and contextually appropriate responses.
Through generative pretraining on large-scale speech corpora, MiMo-Audio acquires general speech continuation capabilities. Given an audio prompt, it generates coherent and contextually appropriate continuations that preserve key acoustic properties such as speaker identity, prosody, and environmental sounds.
The following are examples of speech continuation across a range of voice styles: news broadcasting, audiobook narration, podcasts, dialect speech, game live streaming, classroom lectures, crosstalk performances, poetry recitation, and radio programs.
We designed few-shot in-context learning evaluation tasks for MiMo-Audio to assess the model's ability to complete speech-to-speech generation tasks relying solely on contextual speech examples without parameter updates. This benchmark aims to systematically evaluate the model's comprehensive potential in speech understanding and generation—we expect to observe emergent in-context learning capabilities similar to those demonstrated by GPT-3 in the text domain.
Tasks include: Style Conversion, Voice Conversion, Speech Translation, and Speech Editing.
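The protocol can be pictured as standard few-shot prompting with audio token spans in place of text examples. Below is a schematic sketch; the special markers and prompt layout are placeholders, not the benchmark's actual format.

```python
def build_fewshot_prompt(examples, query):
    """Interleave (source, target) audio-token pairs as in-context examples,
    then append the query source; the model must continue with the target.

    examples: list of (src_tokens, tgt_tokens) pairs, e.g. for voice conversion
    query:    source tokens whose transformed version the model should generate
    """
    prompt = []
    for src, tgt in examples:
        prompt += ["<src>"] + src + ["<tgt>"] + tgt
    prompt += ["<src>"] + query + ["<tgt>"]   # generation continues from here
    return prompt

demo = build_fewshot_prompt(
    examples=[([101, 102], [201, 202])],      # toy token ids
    query=[103, 104],
)
```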
Owing to its large-scale pretraining, broad data exposure, and carefully designed supervised fine-tuning (SFT) data, the model exhibits strong audio understanding capabilities along with a measure of reasoning ability.
Comprehensive audio captioning capabilities that provide detailed descriptions of audio content across various domains and scenarios.
Advanced audio reasoning capabilities that enable deep understanding and analysis of complex audio content, including context identification and logical inference.
Comprehensive understanding of extended audio content, capable of processing and analyzing lengthy audio sequences with sustained attention and coherent interpretation.
MiMo-Audio also supports instruct-TTS and applies the thinking mode to improve generation quality.
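As a usage illustration, here is a hypothetical instruct-TTS call. The `synthesize` entry point, its parameters, and the instruction string are placeholders rather than the repository's actual API; the demo scripts in the repository show the real interface.

```python
# Hypothetical interface sketch: `synthesize` and its arguments are assumptions,
# not the repository's actual API.
audio_tokens = mimo_audio.synthesize(
    text="Welcome to tonight's broadcast.",
    instruction="A calm news anchor: slow pace, warm and steady tone.",
    thinking=True,  # produce a reasoning trace before emitting audio tokens
)
```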