
OmniAudio-2.6B: World's Fastest Audio Language Model for Edge Deployment

Dec 12, 2024

Overview

OmniAudio is the world's fastest and most efficient audio-language model: a 2.6B-parameter multimodal model that seamlessly processes both text and audio inputs. Its architecture integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module. Unlike traditional approaches that chain separate ASR and LLM models together, OmniAudio unifies both capabilities in a single efficient architecture for minimal latency and resource overhead. This enables secure, responsive audio-text processing directly on edge devices such as smartphones, laptops, and robots.

Performance Benchmarks on Consumer Hardware

On a 2024 Mac Mini M4 Pro, Qwen2-Audio-7B-Instruct running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while OmniAudio-2.6B through Nexa SDK reaches 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version, delivering 5.5x to 10.3x faster decoding on consumer hardware.
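The quoted speedup range follows directly from those throughput numbers; a quick arithmetic check:

```python
# Sanity-check of the reported speedups, using the throughput numbers above.
baseline = 6.38   # Qwen2-Audio-7B-Instruct on Transformers, tokens/s
fp16 = 35.23      # OmniAudio-2.6B FP16 GGUF via Nexa SDK, tokens/s
q4_k_m = 66.0     # OmniAudio-2.6B Q4_K_M GGUF via Nexa SDK, tokens/s

print(f"FP16 speedup:   {fp16 / baseline:.1f}x")    # ~5.5x
print(f"Q4_K_M speedup: {q4_k_m / baseline:.1f}x")  # ~10.3x
```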

Use Cases

Voice QA: "How to start a fire without a fire starter?" (demo: offline audio QA result)
Voice-in Conversation: "I had a rough day at work." (demo: OmniAudio's response)
Creative Content Generation: "Write a haiku about autumn leaves." (demo: haiku written by OmniAudio)
Recording Summary: "Can you summarize this meeting recording?" (demo: to-do items generated from the recording)
Changing the Tone of Speaking: "Can you make this more casual?" (demo: rewritten sentences)

Get your hands on OmniAudio-2.6B

Option 1: HuggingFace Space 🤗

NexaAIDev/omni-audio-demo

Option 2: Run OmniAudio-2.6B on your device with Nexa SDK ✨

Audio-language models like Qwen2-Audio and Moshi are gaining traction, but edge deployment options remain limited. While frameworks like llama.cpp and ollama support text and vision-language models, they lack audio processing capabilities.

We developed Nexa SDK, a GGML-based inference engine in C++, to enable efficient audio-language model deployment on edge devices. Here is how to run OmniAudio with Nexa SDK 👇

Install Nexa SDK, then run this in your terminal:

nexa run omniaudio

Or run it with Streamlit local UI:

nexa run omniaudio -st

💻 The OmniAudio-2.6B Q4_K_M version requires 1.30 GB of RAM and 1.60 GB of storage.

Model Architecture

[Figure: OmniAudio-2.6B model architecture]

OmniAudio-2.6B's architecture integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module.

Our design leverages the sparsity of the language model's embedding space. The projector module maps Whisper's audio tokens into sequences that dimensionally align with Gemma's text embeddings, enabling effective audio-text fusion while maintaining the language model's original performance.
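The post does not detail the projector's internals. Below is a minimal, hypothetical PyTorch sketch of the idea, assuming Whisper turbo's 1280-dimensional encoder states, Gemma-2-2b's 2304-dimensional embeddings, and a simple frame-stacking MLP; the real module may differ.

```python
import torch
import torch.nn as nn

# Hypothetical projector sketch. Dimensions are assumptions: Whisper turbo's encoder
# emits 1280-d frames, and Gemma-2-2b uses 2304-d token embeddings.
WHISPER_DIM = 1280
GEMMA_DIM = 2304

class AudioProjector(nn.Module):
    """Maps Whisper encoder states into a shorter sequence aligned with Gemma's embedding space."""

    def __init__(self, in_dim=WHISPER_DIM, out_dim=GEMMA_DIM, downsample=4):
        super().__init__()
        self.downsample = downsample
        # Stack neighboring audio frames to shorten the sequence, then project to Gemma's width.
        self.proj = nn.Sequential(
            nn.Linear(in_dim * downsample, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, audio_states):  # (batch, frames, in_dim)
        b, t, d = audio_states.shape
        t = t - t % self.downsample   # drop trailing frames that don't fill a group
        grouped = audio_states[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(grouped)     # (batch, frames // downsample, out_dim)

audio_embeds = AudioProjector()(torch.randn(1, 1500, WHISPER_DIM))
print(audio_embeds.shape)  # torch.Size([1, 375, 2304]), ready to prepend to Gemma's text embeddings
```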

Training Methodology

We developed OmniAudio-2.6B through a three-stage training pipeline to ensure robust performance on both transcription and conversational tasks:

Pretraining

The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We enhanced the dataset by introducing a special <|transcribe|> token, enabling the model to distinguish between transcription and completion tasks. This token-based task differentiation is crucial for maintaining consistent performance across different use cases.
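The exact prompt template is not shown in the post; here is a hypothetical sketch of how the <|transcribe|> token could switch the model between transcription and completion behavior (the placeholder names and template are assumptions).

```python
# Hypothetical prompt construction. The <|transcribe|> token comes from the post;
# the surrounding template and placeholder names are assumptions for illustration.
AUDIO_PLACEHOLDER = "<|audio|>"  # stands in for the projected audio token sequence

def build_prompt(task: str) -> str:
    if task == "transcription":
        # The special token asks the model for a verbatim transcript.
        return f"{AUDIO_PLACEHOLDER} <|transcribe|>"
    # Without the token, the model treats the audio as a conversational turn to respond to.
    return AUDIO_PLACEHOLDER

print(build_prompt("transcription"))  # <|audio|> <|transcribe|>
print(build_prompt("completion"))     # <|audio|>
```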

Supervised Fine-tuning (SFT)

For instruction tuning, we created a synthetic dataset built on the same MLS English 10k transcriptions: for each clip, a proprietary model samples a plausible follow-up response conditioned on the transcription, turning the corpus into rich audio-text pairs. This synthetic data approach enables the model to understand and process conversational audio inputs effectively.
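A rough sketch of that pipeline, where `sample_response` is a stand-in for the unnamed proprietary model and all names and paths are illustrative:

```python
# Illustrative sketch of the synthetic SFT pair construction; function and file names are made up.
def sample_response(transcription: str) -> str:
    # Placeholder for the proprietary model that generates a natural follow-up
    # response conditioned on the transcription.
    return f"(response conditioned on: {transcription!r})"

def build_sft_pair(audio_path: str, transcription: str) -> dict:
    # The audio clip stays as the model input; the sampled reply becomes the target,
    # turning a plain transcription corpus into conversational audio-text pairs.
    return {
        "audio": audio_path,
        "transcription": transcription,          # kept for reference and filtering
        "target": sample_response(transcription),
    }

print(build_sft_pair("clip_0001.flac", "how do you keep bread from going stale"))
```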

Direct Preference Optimization (DPO)

The final stage refined model quality through DPO. Using the GPT-4o API, we evaluated the initial OmniAudio-2.6B model's outputs (serving as the "policy" model) and identified incorrect responses. These were marked as "rejected," while GPT-4o-generated alternatives served as the "preferred" references. To preserve Gemma2's text processing quality, we applied an additional preference training step: we used Gemma2's original text responses as the "gold standard" and trained the model to match that level of performance when handling audio inputs.
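For reference, here is a minimal sketch of the standard DPO objective this stage optimizes; the log-probabilities would come from the policy and a frozen reference model, and beta is a hyperparameter the post does not specify.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the "preferred" response
    over the "rejected" one, measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)  # scalar loss; smaller means the preference margin is already satisfied
```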

What's Next for OmniAudio

Building on our current architecture, we're developing direct audio generation capabilities through enhanced token integration, which will enable the model to produce high-quality audio outputs. We're also implementing function calling support via Octopus_v2 integration, expanding the model's ability to interface with external systems and APIs.

We aim to transform OmniAudio into a comprehensive multimodal system capable of two-way voice communication and of taking action on behalf of users.

Kudos to <Alex>, <Zack> and Nexa AI team.

Blog written by <Kai>.
