Jan 31, 2025
Nexa AI makes deploying generative AI models on-device seamless and efficient. Our technology supports a wide range of chipsets—including AMD, Qualcomm, Intel, NVIDIA, and your own—across all major operating systems.
Deploying AI models directly on a device rather than relying on cloud-based APIs provides several advantages, including lower latency, stronger data privacy, offline availability, and lower per-query cost.
With Nexa Edge Inference, developers can efficiently run Gen AI models across various devices with minimal resource consumption.
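To give a concrete sense of how little code an on-device text-generation loop needs, here is a minimal sketch using llama-cpp-python as a generic stand-in runtime (not Nexa's SDK; the GGUF file name and parameters below are illustrative assumptions):

```python
# Minimal on-device text generation with llama-cpp-python, shown as a
# generic stand-in runtime; the model file name is a hypothetical example.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.2-1b-instruct.Q4_K_M.gguf",  # local quantized weights
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU/iGPU when one is available
)

out = llm("Explain on-device inference in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```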
Nexa AI on-device deployment supports multimodal AI, allowing applications to process and integrate multiple data types, such as text, audio, and images.
By leveraging NexaQuant, our multimodal models achieve superior compression and acceleration while maintaining state-of-the-art performance.
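To see why aggressive quantization matters for the peak-RAM figures reported below, a back-of-envelope estimate helps. The sketch below is a rough rule of thumb, not NexaQuant's actual memory model:

```python
def approx_model_ram_mb(params_billion: float, bits_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Rough weights-only RAM estimate, with a flat overhead factor standing
    in for the KV cache and runtime buffers. A rule of thumb only."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e6

# A 1B-parameter model at 4 bits/weight lands around 600 MB, the same
# ballpark as the sub-1GB and ~1GB peak-RAM figures in the tables below.
print(f"{approx_model_ram_mb(1.0, 4):.0f} MB")
```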
We provide benchmarks for generative AI models across multiple popular tasks, each tested on different device types with varying TOPS performance levels. If you have a device and a generative AI use case in mind, refer to a device with similar performance below to estimate how well yours might handle it:
Gen AI tasks included: speech-to-speech, text-to-text, and vision-to-text.
Device types included: modern laptop chips (GPU, iGPU, and CPU), flagship mobile chips (CPU), and embedded IoT systems (CPU).
Speech-to-speech: evaluating real-time voice interactions with language models, processing audio input to generate audio output.
Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM
---|---|---|---|---
Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.67s | 20.46 tokens/s | ~990MB
Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 1.01s | 19.28 tokens/s | ~990MB
Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 1.89s | 11.88 tokens/s | ~990MB
Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 1.45s | 9.13 tokens/s | ~990MB
Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 6.9s | 4.5 tokens/s | ~990MB
*speech-to-speech benchmark uses Moshi with NexaQuant
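The two metrics in these tables, TTFT (time to first token) and decoding speed, can be reproduced with a simple timing harness like the sketch below. It is a generic illustration that works with any token-streaming callable, not Nexa's internal benchmark code:

```python
import time

def benchmark(generate_tokens, prompt: str):
    """Measure TTFT and decoding speed for any callable that yields tokens.

    Generic harness for illustration; `generate_tokens` is assumed to be a
    generator function yielding one token per iteration.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate_tokens(prompt):
        if first is None:
            first = time.perf_counter()  # first token arrives: TTFT endpoint
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("inf")
    tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tps
```

With llama-cpp-python, for example, `generate_tokens` could wrap a `stream=True` completion so that each yielded chunk counts as one token.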
Text-to-text: assessing AI models that generate text output from text input.
Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM
---|---|---|---|---
Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.12s | 49.01 tokens/s | ~2580MB
Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 0.19s | 30.54 tokens/s | ~2580MB
Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 0.63s | 14.35 tokens/s | ~2580MB
Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 0.27s | 10.89 tokens/s | ~2580MB
Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 1.27s | 5.31 tokens/s | ~2580MB
*text-to-text benchmark uses llama-3.2 with NexaQuant
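To place your own hardware against these numbers, measure your decoding speed with the harness above and look up the nearest benchmarked device. The helper below simply encodes the text-to-text table and the 5 tokens/s smoothness rule of thumb this post closes with:

```python
# Text-to-text decoding speeds copied from the table above (tokens/s).
TEXT_TO_TEXT_TPS = {
    "Apple M3 Pro GPU": 49.01,
    "AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M)": 30.54,
    "Intel Core Ultra 7 268V": 14.35,
    "Qualcomm Snapdragon 8 Gen 3 (Samsung S24)": 10.89,
    "Raspberry Pi 4 Model B": 5.31,
}

SMOOTH_TPS = 5.0  # rule of thumb from this post: >5 tokens/s feels smooth

def closest_reference(measured_tps: float) -> str:
    """Return the benchmarked chip closest to a measured decoding speed."""
    chip, tps = min(TEXT_TO_TEXT_TPS.items(),
                    key=lambda kv: abs(kv[1] - measured_tps))
    verdict = ("smooth" if measured_tps > SMOOTH_TPS
               else "below the smooth threshold")
    return f"Closest reference: {chip} ({tps} tokens/s); your device is {verdict}."

print(closest_reference(12.0))
```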
Vision-to-text: evaluating AI's ability to analyze visual inputs, generate responses, extract key visual information, and dynamically direct tools (vision in, text out).
Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM
---|---|---|---|---
Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 2.62s | 86.77 tokens/s | ~1093MB
Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 2.14s | 83.41 tokens/s | ~1093MB
Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 9.43s | 45.65 tokens/s | ~1093MB
Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 7.26s | 27.66 tokens/s | ~1093MB
Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 22.32s | 6.15 tokens/s | ~1093MB
*vision-to-text benchmark uses OmniVLM with NexaQuant
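For a sense of what a vision-in, text-out pipeline looks like in code, here is a minimal sketch using llama-cpp-python's LLaVA-style chat handler as a generic stand-in runtime (OmniVLM itself ships through Nexa's stack; the model and projector file names below are hypothetical):

```python
# Vision-to-text sketch with llama-cpp-python's LLaVA chat handler, used as a
# generic stand-in runtime; file names below are hypothetical examples.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="vlm-q4.gguf",
    chat_handler=handler,
    n_ctx=4096,  # larger context to leave room for the image embedding
)

result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "file:///tmp/photo.png"}},
    ],
}])
print(result["choices"][0]["message"]["content"])
```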
As a rule of thumb, a device that sustains more than 5 tokens per second delivers a smooth AI experience. Want to estimate performance for your device? Contact us today.