
On-Device Gen AI Multimodal Benchmarks Across Devices With Nexa Compression and Inference

Jan 31, 2025

TL;DR

Our local inference framework makes deploying generative AI models on-device seamless and efficient. It supports a wide range of chipsets, including AMD, Qualcomm, Intel, NVIDIA, and even your own custom silicon, across all major operating systems. We provide benchmarks for generative AI models on several popular tasks, each tested on device classes spanning a range of TOPS performance levels.

Key Benefits:

  1. Multimodal Capabilities – Support for text, audio, video, and vision-based Gen AI tasks.
  2. Wide Hardware Compatibility – Run AI models on PCs, laptops, mobile devices, and embedded systems.
  3. Leading Performance – With NexaQuant, our model compression technology, models run up to 2.5× faster while requiring 4× less storage and memory, all while maintaining high accuracy.

Why On-Device AI?

Deploying AI models directly on a device rather than relying on cloud-based APIs provides several advantages:

  • Privacy & Security – Data stays on the device, ensuring confidentiality.
  • Lower Costs – No expensive cloud inference fees.
  • Speed & Responsiveness – Low latency inference without internet dependency.
  • Offline Capability – AI applications work even in low-connectivity areas.

With Nexa Edge Inference, developers can efficiently run Gen AI models across various devices with minimal resource consumption.
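
To see what this looks like in practice, the snippet below sketches local text generation with the Nexa SDK's Python bindings. Treat the class name, import path, and streaming interface as assumptions for illustration; check the current Nexa SDK documentation for the exact API on your platform.

```python
# Minimal sketch of on-device text generation with the Nexa SDK.
# NOTE: NexaTextInference, its import path, and create_completion are
# assumed names for illustration; consult the Nexa SDK docs for the
# exact API on your platform.
from nexa.gguf import NexaTextInference  # assumed import path

# Load a NexaQuant-compressed model from local storage; once the model
# file is on the device, no network access is required.
llm = NexaTextInference(model_path="llama3.2")

# Stream tokens as they are decoded, as a chat assistant would.
for chunk in llm.create_completion("Summarize this meeting note:", stream=True):
    print(chunk, end="", flush=True)
```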

New Trend of Multimodal AI Use Cases

Nexa AI on-device deployment supports multimodal AI, allowing applications to process and integrate multiple data types:

  • Text-based AI – Chatbots, document summarization, coding assistants
  • Speech-to-speech AI – Real-time voice translation, AI-powered voice assistants
  • Vision AI – Object detection, image captioning, OCR for document processing

By leveraging NexaQuant, our multimodal models achieve superior compression and acceleration while maintaining state-of-the-art performance.
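
To make the compression claim concrete: shrinking weights from 16-bit floats to roughly 4-bit integers is what yields a ~4× storage and memory reduction. The sketch below is back-of-the-envelope arithmetic under that assumption, not a description of the actual NexaQuant format:

```python
# Approximate weight storage at different precisions. This illustrates
# where a ~4x reduction can come from (16-bit -> 4-bit weights); it is
# not the actual NexaQuant quantization scheme.
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 3e9  # e.g., a ~3B-parameter model such as Llama 3.2 3B
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {model_size_gb(n_params, bits):.2f} GB")
# 16-bit: 6.00 GB, 8-bit: 3.00 GB, 4-bit: 1.50 GB -> 4x smaller than FP16
```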

Performance Benchmarks for Gen AI Tasks Across Devices

We benchmarked generative AI models on several popular tasks, each tested on device classes with varying TOPS performance levels. If you have a device and a use case or generative AI task in mind, refer to a device with similar performance below to estimate how well yours might handle it:

Gen AI tasks covered:

  • Speech-to-speech
  • Text-to-text
  • Vision-to-text

Device types covered:

  • Modern Laptop Chips – Optimized for local AI processing on desktops and laptops.
  • Flagship Mobile Chips – AI models running on smartphones and tablets.
  • Embedded Systems (~4 TOPS) – Low-power devices for edge computing applications.
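
The two headline metrics in the tables below are latency (TTFT, time to first token) and decoding speed (tokens generated per second after the first). If you want to measure them on your own hardware, a small harness like the following works with any local runtime; `dummy_stream` is a stand-in for your runtime's streaming API, not a Nexa SDK function:

```python
import time
from typing import Callable, Iterable

def benchmark_stream(generate_stream: Callable[[str], Iterable[str]],
                     prompt: str) -> tuple[float, float]:
    """Return (TTFT in seconds, decoding speed in tokens/s).

    `generate_stream` can be any callable that yields tokens one at a
    time, e.g. a wrapper around your local inference runtime.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()

    ttft = first_token_at - start
    # Decoding speed counts tokens produced after the first one.
    decode_speed = (n_tokens - 1) / (end - first_token_at)
    return ttft, decode_speed

# Demo with a dummy generator; replace with your runtime's stream.
def dummy_stream(prompt: str):
    for token in prompt.split():
        time.sleep(0.05)  # simulate per-token decode time
        yield token

ttft, speed = benchmark_stream(dummy_stream, "the quick brown fox jumps")
print(f"TTFT: {ttft:.2f}s, decoding speed: {speed:.1f} tokens/s")
```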

Speech-to-Speech Benchmarks

Evaluating real-time voice interactions with language models: audio in, audio out.

| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
| --- | --- | --- | --- | --- |
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.67s | 20.46 tokens/s | ~990MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 1.01s | 19.28 tokens/s | ~990MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 1.89s | 11.88 tokens/s | ~990MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 1.45s | 9.13 tokens/s | ~990MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 6.9s | 4.5 tokens/s | ~990MB |

*Speech-to-speech benchmark uses Moshi with NexaQuant.

Text-to-Text Benchmarks

Assessing AI models for text generation based on text input.

| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
| --- | --- | --- | --- | --- |
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 0.12s | 49.01 tokens/s | ~2580MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 0.19s | 30.54 tokens/s | ~2580MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 0.63s | 14.35 tokens/s | ~2580MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 0.27s | 10.89 tokens/s | ~2580MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 1.27s | 5.31 tokens/s | ~2580MB |

*Text-to-text benchmark uses Llama 3.2 with NexaQuant.

Vision-to-Text Benchmarks

Evaluating a model's ability to analyze visual inputs, generate responses, extract key visual information, and drive tool use: vision in, text out.

| Device Type | Chip & Device | Latency (TTFT) | Decoding Speed | Avg. Peak RAM |
| --- | --- | --- | --- | --- |
| Modern Laptop Chips (GPU) | Apple M3 Pro GPU | 2.62s | 86.77 tokens/s | ~1093MB |
| Modern Laptop Chips (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 2.14s | 83.41 tokens/s | ~1093MB |
| Modern Laptop Chips (CPU) | Intel Core Ultra 7 268V | 9.43s | 45.65 tokens/s | ~1093MB |
| Flagship Mobile Chips (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 7.26s | 27.66 tokens/s | ~1093MB |
| Embedded IoT Systems (CPU) | Raspberry Pi 4 Model B | 22.32s | 6.15 tokens/s | ~1093MB |

*Vision-to-text benchmark uses OmniVLM with NexaQuant.
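
These numbers can be combined to estimate end-to-end latency for a concrete workload, since total time ≈ TTFT + output tokens ÷ decoding speed. Here is a quick sketch using the vision-to-text rows above, assuming a 128-token image caption:

```python
# Estimate end-to-end vision-to-text latency from the table above:
# total_time ~= TTFT + output_tokens / decoding_speed.
devices = {
    "Apple M3 Pro GPU":           (2.62, 86.77),
    "AMD Ryzen AI 9 HX 370 iGPU": (2.14, 83.41),
    "Intel Core Ultra 7 268V":    (9.43, 45.65),
    "Snapdragon 8 Gen 3 (S24)":   (7.26, 27.66),
    "Raspberry Pi 4 Model B":     (22.32, 6.15),
}
output_tokens = 128  # assumed length of a detailed image caption

for name, (ttft, tokens_per_s) in devices.items():
    total = ttft + output_tokens / tokens_per_s
    print(f"{name:<28} ~{total:4.1f}s for a {output_tokens}-token caption")
```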

Can Your Device Run Nexa AI?

As a rule of thumb, a device that decodes more than 5 tokens per second delivers a smooth interactive AI experience; at that rate, a 100-token response finishes in roughly 20 seconds. Want to estimate performance for your device? Contact us today.