While audio language models are becoming more popular, deploying them on edge devices remains challenging. Popular frameworks like llama.cpp and Ollama support text and vision models but have limited compatibility with audio models.
Qwen2-Audio is a SOTA small-scale multimodal model that handles audio and text inputs. It enables voice interaction without ASR modules, provides audio analysis, and supports Chinese, English, and major European languages.
To get started, install the Nexa SDK and run this in your terminal:
nexa run qwen2audio
Or run it with a local Streamlit UI (Python package required):
nexa run qwen2audio -st
Drag and drop your audio file into the terminal (or enter the file path on Linux). Add a text prompt to guide the analysis, or leave it empty to send the audio directly to the model.
💻 To see how much RAM is needed to run Qwen2-Audio on your device, check the RAM requirements for the different quantization versions listed here; the default q4_K_M version requires 4.2GB of RAM.
🎵 For optimal performance, use 16kHz .wav audio format. Other audio formats and sample rates are supported and will be automatically converted to the required format.
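If you prefer to convert a clip yourself rather than rely on the automatic conversion, a resampling step like the sketch below works. It is only a minimal example, not part of Nexa SDK: it assumes the librosa and soundfile Python packages are installed, and the file names are placeholders.

# Sketch: convert an arbitrary audio file to 16 kHz mono WAV before passing it to the model.
# Assumes `pip install librosa soundfile`; input/output paths are placeholders.
import librosa
import soundfile as sf

input_path = "my_recording.mp3"    # any format librosa can decode
output_path = "my_recording_16k.wav"

# librosa resamples to the requested rate and downmixes to mono
audio, sr = librosa.load(input_path, sr=16000, mono=True)
sf.write(output_path, audio, sr)

print(f"Wrote {output_path} at {sr} Hz, {len(audio)/sr:.1f} s")

You can then drop the resulting .wav file into the terminal as described above.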
For more use cases and model capabilities, check out Qwen's blog.
For developers, server deployment and a Python interface are the next steps. Please follow Nexa SDK for updates and submit an issue for any feature requests.
Kudos to the Nexa AI team.
Blog written by Kai and Ayla.