Dec 12, 2024
OmniAudio is the world's fastest and most efficient audio-language model: a 2.6B-parameter multimodal model that seamlessly processes both text and audio inputs. OmniAudio's architecture integrates three components: Gemma-2-2b, Whisper Turbo, and a custom projector module. Unlike traditional approaches that chain separate ASR and LLM models together, it unifies both capabilities in a single efficient architecture for minimal latency and resource overhead. This enables secure, responsive audio-text processing directly on edge devices such as smartphones, laptops, and robots.
On a 2024 Mac Mini M4 Pro, Qwen2-Audio-7B-Instruct running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while OmniAudio-2.6B through Nexa SDK reaches 35.23 tokens/second with the FP16 GGUF version and 66 tokens/second with the Q4_K_M quantized GGUF version, delivering 5.5x to 10.3x faster performance on consumer hardware.
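For reference, the quoted speedup range follows directly from those measured throughputs:

```python
# Speedup implied by the decoding throughputs measured above.
baseline = 6.38            # Qwen2-Audio-7B-Instruct on 🤗 Transformers, tokens/s
fp16, q4 = 35.23, 66.0     # OmniAudio-2.6B via Nexa SDK, tokens/s
print(f"FP16 GGUF:   {fp16 / baseline:.1f}x faster")   # ~5.5x
print(f"Q4_K_M GGUF: {q4 / baseline:.1f}x faster")     # ~10.3x
```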
Audio-language models like Qwen2-Audio and Moshi are gaining traction, but edge deployment options remain limited. While frameworks like llama.cpp and ollama support text and vision-language models, they lack audio processing capabilities.
We developed Nexa SDK, a GGML-based inference engine in C++, to enable efficient audio-language model deployment on edge devices. Here is how to run OmniAudio with Nexa SDK 👇
Install Nexa SDK, then run this in your terminal:
nexa run omniaudio
Or run it with the local Streamlit UI:
nexa run omniaudio -st
💻 The OmniAudio-2.6B Q4_K_M version requires 1.30 GB of RAM and 1.60 GB of storage.
OmniAudio-2.6B's architecture integrates three components: Gemma-2-2b, Whisper Turbo, and a custom projector module.
Our design leverages the sparsity of the language model's embedding space. The projector module maps Whisper's audio tokens into sequences that dimensionally align with Gemma's text embeddings, enabling effective audio-text fusion while maintaining the language model's original performance.
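In spirit, the projector is a small network that reshapes Whisper's encoder output into Gemma-sized embedding vectors that can sit alongside text token embeddings. The sketch below illustrates the idea in PyTorch; the dimensions and layer layout are illustrative assumptions, not the exact released design.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative projector: maps Whisper encoder features into the
    language model's embedding space. Dimensions and layers are assumptions
    for illustration, not the released architecture."""

    def __init__(self, audio_dim: int = 1280, text_dim: int = 2304):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, num_audio_tokens, audio_dim)
        # returns:     (batch, num_audio_tokens, text_dim), ready to be
        # concatenated with Gemma's text token embeddings.
        return self.proj(audio_feats)

# Example: a short clip yields a handful of encoder frames.
feats = torch.randn(1, 75, 1280)
projected = AudioProjector()(feats)
print(projected.shape)  # torch.Size([1, 75, 2304])
```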
We developed OmniAudio-2.6B through a three-stage training pipeline to ensure robust performance on both transcription and conversational tasks:
The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We enhanced the dataset by introducing a special <|transcribe|> token, enabling the model to distinguish between transcription and completion tasks. This token-based task differentiation is crucial for maintaining consistent performance across different use cases.
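To make the task-token idea concrete, here is a hypothetical sketch of how such a token could be registered and used to format training examples. The prompt template, audio placeholder, and formatting function are our own illustration, not the exact format used in training, and loading Gemma's tokenizer assumes access to the gated checkpoint.

```python
from transformers import AutoTokenizer

# Register the task token so the model can condition on it (illustrative only).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|transcribe|>"]})
# (The language model's embedding matrix must also be resized to the new vocab.)

def format_example(audio_placeholder: str, text: str, transcribe: bool) -> str:
    # With the task token, the target is the literal transcription of the audio;
    # without it, the target is a conversational completion.
    task_token = "<|transcribe|>" if transcribe else ""
    return f"{audio_placeholder}{task_token} {text}"

print(format_example("<audio>", "The weather is lovely today.", transcribe=True))
```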
For instruction tuning, we created a synthetic dataset that is still based on MLS English 10k transcriptions, but here we used a proprietary model to generate a plausible response to each transcription given its context. Constructing rich audio-text pairs this way enables the model to understand and respond to conversational audio inputs effectively.
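A rough sketch of this data-construction step might look like the following, where the client, model name, and prompt wording are placeholders standing in for the proprietary setup:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "The following is a transcription of something a user said out loud. "
    "Write a helpful, conversational response to it.\n\nTranscription: {text}"
)

def synthesize_response(transcription: str) -> str:
    # Ask a strong text model to play the role of the voice assistant.
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder for the proprietary model
        messages=[{"role": "user", "content": PROMPT.format(text=transcription)}],
    )
    return completion.choices[0].message.content

# Each (audio clip, synthesized response) pair becomes a conversational training
# example, while the original (audio clip, transcription) pair is kept for the
# transcription task.
```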
The final stage refined model quality through DPO. Using the GPT-4o API, we evaluated our initial OmniAudio-2.6B model's outputs (serving as the "policy" model) and identified incorrect responses. These were marked as "rejected," while GPT-4o-generated alternative responses served as "preferred" references. To maintain Gemma2's text processing quality, we applied an additional preference training step: we used Gemma2's original text responses as the "gold standard" and trained the model to match this performance level when handling audio inputs.
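For readers unfamiliar with DPO, the objective applied to these preference pairs is the standard one: increase the policy's margin for the "preferred" response over the "rejected" one, relative to a frozen reference model. A minimal PyTorch sketch (with an illustrative beta value, not the one used in training) is:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Reward margins of policy vs. frozen reference for each pair.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO objective: log-sigmoid of the preferred-minus-rejected margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy sequence log-probabilities for a batch of 4 pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```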
Building on our current architecture, we're developing direct audio generation capabilities through enhanced token integration, which will enable the model to produce high-quality audio outputs. We're also implementing function calling support via Octopus_v2 integration, expanding the model's ability to interface with external systems and APIs.
We aim to transform OmniAudio into a comprehensive multimodal system capable of two-way voice communication and of taking action on behalf of users.
Kudos to <Alex>, <Zack>, and the Nexa AI team.
Blog written by <Kai>.