Jan 31, 2025
Our local inference framework makes deploying generative AI models on-device seamless and efficient. It supports a wide range of chipsets, including AMD, Qualcomm, Intel, NVIDIA, and your own custom silicon, across all major operating systems.
Deploying AI models directly on a device rather than relying on cloud-based APIs provides several advantages:

- Privacy: data stays on the device and never reaches a remote server.
- Latency: no network round trips, so responses start faster.
- Cost: no per-token API fees or cloud GPU bills.
- Reliability: inference keeps working offline or on unstable connections.
With Nexa Edge Inference, developers can efficiently run Gen AI models across various devices with minimal resource consumption.
Nexa AI on-device deployment supports multimodal AI, allowing applications to process and integrate multiple data types:

- Text (chat, summarization, and generation)
- Vision (image understanding and description)
- Audio (speech input and output)
By leveraging NexaQuant, our multimodal models achieve superior compression and acceleration while maintaining state-of-the-art performance.
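NexaQuant itself is proprietary, but the core idea behind low-bit weight compression can be sketched generically. Below is a minimal NumPy illustration of symmetric 4-bit blockwise quantization; it is an illustrative stand-in, not NexaQuant's actual algorithm:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Symmetric 4-bit blockwise quantization (illustrative only)."""
    w = weights.reshape(-1, block_size)
    # One scale per block, mapping the block's max magnitude to the int4 range [-7, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)

# Packed 4-bit storage (2 weights per byte, not shown) plus per-block scales
# cuts weight memory roughly 7-8x versus fp32, at a small accuracy cost.
print("max abs error:", np.abs(w - w_hat).max())
```

A production quantizer would add calibration and accuracy-preserving refinements on top of this basic recipe.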
We provide benchmarks for generative AI models across multiple popular tasks, each tested on different types of devices with varying TOPS performance levels. If you have a device and a use case or generative AI task in mind, find a device with similar performance below to estimate how well yours might handle it (a sketch of how the latency metrics are measured follows the lists below):
Gen AI tasks included:

- Speech-to-speech
- Text-to-text
- Vision-to-text

Device types included:

- Modern laptop chips (GPU, iGPU, and CPU)
- Flagship mobile chips (CPU)
- Embedded IoT systems (CPU)
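The two metrics reported below, TTFT and decoding speed, can be measured with a small harness like the following. It assumes a hypothetical `generate_stream` callable that yields tokens one at a time; any streaming local-inference API can be dropped in:

```python
import time
from typing import Callable, Iterable

def benchmark(generate_stream: Callable[[str], Iterable[str]], prompt: str):
    """Measure time-to-first-token (TTFT) and decoding speed for a
    streaming generator. `generate_stream` is a hypothetical callable
    yielding tokens one at a time."""
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        n_tokens += 1
        if first is None:
            first = time.perf_counter()  # first token arrived
    end = time.perf_counter()
    ttft = first - start if first is not None else float("nan")
    # Decoding speed counts tokens produced after the first one.
    tps = (n_tokens - 1) / (end - first) if n_tokens > 1 else 0.0
    return ttft, tps
```

TTFT is the wait before the first token appears; decoding speed is the sustained rate after that. Both matter for how responsive an on-device assistant feels.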
Evaluating real-time voice interactions with language models: processing audio input to generate audio output.
| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
|---|---|---|---|---|
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.67s | 20.46 tokens/s | ~990MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 1.01s | 19.28 tokens/s | ~990MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 1.89s | 11.88 tokens/s | ~990MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 1.45s | 9.13 tokens/s | ~990MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 6.9s | 4.5 tokens/s | ~990MB |
*Speech-to-speech benchmark uses Moshi with NexaQuant.
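The speech-to-speech loop behind this benchmark can be pictured as a streaming pipeline: capture a short audio frame, run it through the model, and play the result. The sketch below is a generic illustration using the sounddevice library with a placeholder in place of the model call; it is not Moshi's actual API, and the 24 kHz / 80 ms framing is an assumption based on Moshi's published design:

```python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24_000   # assumption: Moshi operates on 24 kHz audio
FRAME = 1_920          # 80 ms frames at 24 kHz

def speech_to_speech(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the model call. A real pipeline would stream the
    frame through a speech-to-speech model such as Moshi."""
    return frame  # echo back, for illustration only

# Full-duplex stream: read microphone frames, write synthesized frames.
with sd.Stream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
               blocksize=FRAME) as stream:
    for _ in range(250):  # ~20 seconds
        frame, _overflowed = stream.read(FRAME)
        stream.write(speech_to_speech(frame))
```

In this setting, TTFT corresponds to the delay between the end of your speech and the first audible frame of the reply.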
Evaluating AI models on generating text from text input: text in, text out.
| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
|---|---|---|---|---|
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.12s | 49.01 tokens/s | ~2580MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 0.19s | 30.54 tokens/s | ~2580MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 0.63s | 14.35 tokens/s | ~2580MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 0.27s | 10.89 tokens/s | ~2580MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 1.27s | 5.31 tokens/s | ~2580MB |
*Text-to-text benchmark uses Llama 3.2 with NexaQuant.
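As a rough point of reference, a similar text-to-text run can be reproduced with an open-source local stack. The sketch below uses llama-cpp-python with a quantized Llama 3.2 GGUF file; the model path is a placeholder, and Nexa's own SDK and NexaQuant runtime will differ:

```python
from llama_cpp import Llama

# Load a locally quantized Llama 3.2 model (path is a placeholder).
llm = Llama(model_path="./llama-3.2-1b-instruct-q4_k_m.gguf", n_ctx=2048)

# Stream tokens so TTFT and decoding speed can be observed directly.
for chunk in llm("Explain on-device inference in one sentence.",
                 max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```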
Evaluating AI's ability to analyze visual inputs, generate responses, extract key visual information, and dynamically direct tools: vision in, text out.
| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
|---|---|---|---|---|
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 2.62s | 86.77 tokens/s | ~1093MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 2.14s | 83.41 tokens/s | ~1093MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 9.43s | 45.65 tokens/s | ~1093MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 7.26s | 27.66 tokens/s | ~1093MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 22.32s | 6.15 tokens/s | ~1093MB |
*Vision-to-text benchmark uses OmniVLM with NexaQuant.
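For a rough open-source analogue of the vision-to-text setup, llama-cpp-python can run LLaVA-style vision-language models locally. The sketch below is illustrative only; OmniVLM's packaging and Nexa's runtime differ, and the model and projector paths are placeholders:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Model and CLIP projector paths are placeholders for a local LLaVA-style VLM.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj.gguf")
llm = Llama(
    model_path="./vlm-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,
)

# Vision in, text out: pass an image plus a text prompt in one message.
resp = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}])
print(resp["choices"][0]["message"]["content"])
```

Note that TTFT is higher for vision tasks because the image must be encoded before the first output token can be produced.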
As a rule of thumb, devices that sustain more than 5 tokens per second deliver a smooth AI experience. Want to estimate performance for your device? Contact us today.