
NexaQuant: Llama.cpp-Compatible Model Compression with 100%+ Accuracy Recovery

Dec 31, 2024

NexaQuant makes models 3x lighter while recovering 100%+ of the original accuracy

TL;DR

NexaQuant makes your AI use cases run 3x faster and uses 4x less energy and storage on your devices, with 100%+ accuracy recovery. It supports multimodality (text, audio, vision, image generation, and video), making diverse AI use cases more context-aware, running them more efficiently on device, and unlocking cloud-level AI use cases on power-constrained devices without compromise.

  • As a bonus, it is compatible with llama.cpp and supports diverse hardware platforms, from mobile to desktop.

Why Compress a Language Model?

2024 marks a breakthrough for small-scale language models, with Gemma, Llama, and Qwen model families achieving remarkable capabilities under 10B parameters. However, even smaller models like Llama 3.2 1B and 3B still require significant computational resources, with storage requirements of 2.30GB and 6.00GB, and RAM usage of 3.08GB and 7.11GB respectively. These requirements often exceed the capabilities of typical consumer devices for efficient on-device deployment.

This creates a practical dilemma. Most consumer devices, from smartphones to laptops, struggle to run these models effectively. While cloud deployment offers a potential solution, it comes with significant drawbacks: increased latency, dependency on internet connectivity, substantial scaling costs, and critical privacy concerns when processing sensitive data. To unlock the full potential of AI on everyday devices, we need a better approach to model compression.

Balancing Performance Against Accuracy Remains a Key Challenge

Standard model quantization offers a path forward - using 4-bit weights-only quantization, Llama 3.2 1B can be compressed to just 0.73GB storage and requires 1.38GB RAM. This enables much more efficient and manageable local inference on consumer hardware.
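To make the arithmetic concrete, here is a minimal NumPy sketch of generic symmetric, per-group 4-bit weight quantization. It is illustrative only - llama.cpp's actual Q4_0 format packs weights into fixed-size blocks with its own layout - but the group size of 128 mirrors the gs=128 setting used in the benchmark tables below.

```python
import numpy as np

def quantize_q4_groupwise(weights: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit weights-only quantization, one scale per group.

    Illustrative sketch of the arithmetic only, not llama.cpp's Q4_0 layout.
    """
    flat = weights.astype(np.float32).reshape(-1, group_size)
    # One scale per group of 128 weights, chosen so the absmax maps to the
    # edge of the signed 4-bit range [-8, 7].
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096 * 128).astype(np.float32)
q, s = quantize_q4_groupwise(w)
w_hat = dequantize(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
# 4 bits per weight plus one fp16 scale per 128 weights ~= 4.125 bits,
# roughly a 3.9x reduction versus 16-bit weights.
```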

However, current compression methods significantly degrade model reliability: our experiments show Llama 3.2 1B drops to 86.45% of its original FP16 performance after Q4_0 quantization. This degradation affects critical tasks such as document summarization and accurate question answering.

Introducing NexaQuant

How NexaQuant Works

NexaQuant represents a breakthrough in model compression technology, delivering an advanced optimization pipeline that maintains model intelligence while significantly reducing computational requirements.

At its core, NexaQuant introduces a novel quantization approach specifically designed for transformer-based neural networks. The key innovation lies in its robust handling of outlier values during the quantization process. By incorporating in-house calibration data during compression, NexaQuant optimizes model performance for production environments.
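As a conceptual illustration only (not the actual NexaQuant pipeline, which is not described in this post), calibration-aware quantization is commonly implemented by searching, per weight channel, for a clipping threshold that minimizes reconstruction error on calibration activations, which keeps outliers from dominating the quantization scale. A minimal NumPy sketch of that generic idea:

```python
import numpy as np

def search_clip(w_col: np.ndarray, x_calib: np.ndarray, n_grid: int = 20) -> float:
    """Pick a clipping threshold for one weight column by minimizing output
    error on calibration activations.

    Generic calibration-aware outlier handling; illustrative only.
    """
    ref = x_calib @ w_col                       # full-precision output
    absmax = np.abs(w_col).max()
    best_err, best_clip = np.inf, absmax
    for frac in np.linspace(0.5, 1.0, n_grid):  # try progressively tighter clips
        clip = absmax * frac
        scale = clip / 7.0                      # signed 4-bit range [-8, 7]
        q = np.clip(np.round(np.clip(w_col, -clip, clip) / scale), -8, 7)
        err = np.square(x_calib @ (q * scale) - ref).mean()
        if err < best_err:
            best_err, best_clip = err, clip
    return best_clip

rng = np.random.default_rng(0)
w_col = rng.normal(size=4096)
w_col[:4] *= 20                                 # inject a few outlier weights
x_calib = rng.normal(size=(256, 4096))          # stand-in calibration batch
print("chosen clipping threshold:", search_clip(w_col, x_calib))
```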

Our benchmarks demonstrate NexaQuant's effectiveness: when applied to Llama 3.1/3.2 models (1B, 3B, and 8B variants), it achieves 100% or more of the original BF16 model performance across standard evaluation metrics. This slight improvement over the baseline is consistently reproducible across our test suite.

The technology supports any transformer-based model, including multimodal systems that process vision and audio inputs. While NexaQuant can scale to handle models of any size, we've optimized it specifically for models under 10 billion parameters, a range we've identified as the optimal balance between computational efficiency and practical deployment requirements.

Model Compression Benchmark & Examples

Text Models

3X Lighter Without Accuracy Compromise

Our benchmarks demonstrate NexaQuant's breakthrough in efficiency-performance balance across a comprehensive suite of evaluations. We test models on diverse capabilities including mathematical reasoning (GSM8K), complex instruction following (IFEVAL), reading comprehension (OpenBookQA), and multi-step logical reasoning (ARC).

Llama 3.2 3B Instruct (HF)

| Benchmark | BF16 | Q4_0 (gs=128) | Nexa Q4_0 (gs=128) | Q4_0 retention (% of BF16) | NexaQuant restoration (% of BF16) | Improvement over Q4_0 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| IFEVAL | 60.82 | 57.62 | 62.77 | 94.74% | 103.21% | 8.94% |
| MMLU (5-shot) | 60.77 | 57.63 | 59.07 | 94.83% | 97.20% | 2.5% |
| HellaSwag | 53.22 | 52.62 | 53.57 | 98.87% | 100.66% | 1.81% |
| ARC Challenge | 43.43 | 40.87 | 42.24 | 94.11% | 97.26% | 3.35% |
| ARC Easy | 75.34 | 71.72 | 74.75 | 95.20% | 99.22% | 4.22% |
| PIQA | 76.82 | 76.12 | 77.2 | 99.35% | 100.76% | 1.42% |
| OpenBookQA | 28.8 | 29 | 29.2 | 100.69% | 101.39% | 0.69% |
| GSM8K | 63.92 | 58.99 | 64.75 | 92.29% | 101.30% | 9.76% |
| Total | | | | 96.26% | 100.12% | 4.09% |
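The percentage columns are simple ratios of the benchmark scores. For example, the IFEVAL row of the 3B table above works out as follows:

```python
# IFEVAL scores for Llama 3.2 3B, taken from the table above
bf16, q4_0, nexa_q4_0 = 60.82, 57.62, 62.77

retained_after_q4 = q4_0 / bf16            # ~94.74% of BF16 accuracy kept by plain Q4_0
restoration = nexa_q4_0 / bf16             # ~103.21% of BF16 accuracy with NexaQuant
improvement_over_q4 = nexa_q4_0 / q4_0 - 1 # ~8.94% gain over plain Q4_0

print(f"{retained_after_q4:.2%}  {restoration:.2%}  {improvement_over_q4:.2%}")
```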

Llama 3.1 8B Instruct (HF)

| Benchmark | BF16 | Q4_0 (gs=128) | Nexa Q4_0 (gs=128) | Q4_0 retention (% of BF16) | NexaQuant restoration (% of BF16) | Improvement over Q4_0 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| IFEVAL | 56.2 | 53.79 | 61.86 | 95.71% | 110.07% | 15.00% |
| MMLU (5-shot) | 68.2 | 66.02 | 65.06 | 96.80% | 95.40% | -1.45% |
| HellaSwag | 59.77 | 58.99 | 60.55 | 98.69% | 101.31% | 2.64% |
| ARC Challenge | 52.73 | 52.22 | 52.65 | 99.03% | 99.85% | 0.82% |
| ARC Easy | 81.06 | 81.19 | 82.07 | 100.16% | 101.25% | 1.08% |
| PIQA | 80.2 | 79.76 | 81.45 | 99.45% | 101.56% | 2.12% |
| OpenBookQA | 36 | 33.8 | 36 | 93.89% | 100.00% | 6.51% |
| GSM8K | 74.22 | 71.19 | 72.9 | 95.92% | 98.22% | 2.40% |
| Total | | | | 97.46% | 100.96% | 3.64% |

1B NexaQuant Model for IoT and Wearables

For Llama 3.2 1B, compared to the original model (FP16):

  • 68% smaller storage footprint: 2.30GB → 730MB
  • 55% reduced runtime memory: 3.08GB → 1.38GB

Performance metrics compared to standard Q4_0 quantization:

  • 101.75% accuracy restoration relative to the original FP16 model

NexaQuant unlocks sophisticated AI capabilities on resource-constrained devices while surpassing original model performance.

3B NexaQuant Model for Mobile Devices and Laptops

For Llama 3.2 3B, compared to the original model (FP16):

  • 73% smaller storage footprint: 6.00GB → 1.60GB
  • 66% reduced runtime memory: 7.11GB → 2.44GB

Performance metrics compared to standard Q4_0 quantization:

  • 100.12% accuracy restoration relative to the original FP16 model

This balance of size and capability enables advanced language features on mobile devices without sacrificing computational efficiency.

8B NexaQuant Model for Desktops, Robots and Automobiles

For Llama 3.1 8B, compared to the original model (BF16):

  • 71% smaller storage footprint: 15GB → 4.3GB
  • 67% reduced runtime memory: 15.52GB → 5.09GB

Performance metrics compared to standard Q4_0 quantization:

  • 100.96% accuracy restoration relative to the original BF16 model

NexaQuant makes enterprise-grade language models accessible on standard computing hardware, enabling advanced AI applications without specialized infrastructure.
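The storage and memory reductions quoted for the 1B, 3B, and 8B models are straightforward before/after ratios; a quick check in Python:

```python
# (label, original FP16/BF16 size in GB, NexaQuant size in GB) from the sections above
sizes = [
    ("Llama 3.2 1B storage", 2.30, 0.73),
    ("Llama 3.2 3B storage", 6.00, 1.60),
    ("Llama 3.1 8B storage", 15.0, 4.30),
    ("Llama 3.1 8B runtime memory", 15.52, 5.09),
]

for name, before, after in sizes:
    print(f"{name}: {before:.2f} GB -> {after:.2f} GB "
          f"({1 - after / before:.0%} smaller)")
```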

Vision Models

Multimodal models are typically larger than text-only models, requiring both more parameters and additional components such as visual/audio projectors. This leads to substantially higher storage requirements and runtime memory consumption. NexaQuant works seamlessly with multimodal models such as vision-language models (VLMs), as demonstrated across benchmarks in visual question answering (DocVQA), multimodal reasoning (MMBench), and cross-modal understanding (MMMU).

For Qwen2-VL-2B (visual projector included), compared to the original model (BF16):

  • 49% smaller storage footprint: 4.42GB → 2.27GB
  • 33% reduced runtime memory: 4.40GB → 2.94GB

Performance metrics compared to standard Q4_0 quantization:

  • 97.4% overall accuracy restoration

Examples

In these challenging visual document QA tasks, such as reading invoice numbers from complex tables and extracting population figures from historical documents, NexaQuant-compressed Qwen2-VL-2B maintains perfect accuracy while even the FP16 model struggles with precise number extraction.

Real world VQA cases for vision model with NexaQuant

Audio Language Models

Voice-based assistants are highly sought after, requiring the ability to understand tone, emotion, and nuanced sound for personalized interactions. Instant, real-time feedback is essential, but current audio models are too slow and traditional compression sacrifices accuracy. NexaQuant transforms audio language models, enabling real-time conversations with instant feedback while maintaining advanced capabilities to understand tone, emotion, and rich audio context.

For Qwen2-Audio-7.8B, compared to the original model (BF16), the NexaQuant-compressed model achieves:

  • 29% of the original file size: 14.5 GB → 4.2 GB
  • 25% of the runtime memory needed: 16.80 GB → 4.2 GB
  • 95% overall accuracy restoration compared to the full-sized model
  • 3x faster inference speed

Example 1 (audio clip: audio_115):

NexaQuant

The subject of the sentence is 'many people'

bnb load_in_4bit

The subject of the sentence is 'many people watch TV shows'

Example 2 (audio clip: audio_3):

NexaQuant

The area of a triangle is given by the formula: A = (p * q) / 2, where p and q are the lengths of the two sides of the triangle. In this case, p = 15 cm and q = 30 cm. Substituting these values into the formula gives:

A = (15 * 30) / 2
A = 450 / 2
A = 225 cm^2

Therefore, the area of the triangle is 225 square centimeters.

bnb load_in_4bit

The area of a triangle is given by the formula (1/2) * b * h, where b is the base and h is the height. In this case, the base is 15 units and the height is 30 units, so the area of the triangle is:

(1/2) * 15 * 30 = 450 square units

Therefore, the area of the triangle is 450 square units.

Video Understanding Models

Understanding video unlocks vast potential for vision-based devices, smart cameras, and screen-aware AI. Due to significant privacy concerns, on-device video processing is crucial for driving adoption of more context-aware AI use cases. However, current video language models are too slow for practical use. NexaQuant accelerates video language models by 4x while preserving high accuracy, enabling faster and more secure on-device AI applications.

Compared to the original model (BF16), the NexaQuant-compressed model achieves:

  • 84% of the original file size: 1.79 GB → 1.5 GB
  • 50% of the runtime memory needed: 5.36 GB → 2.72 GB

Performance metrics compared to standard Q4_0 quantization:

  • 3x faster inference speed

Example 1:

Prompt:

Describe what the person is doing in the video in detail.

NexaQuant

The person is walking on a wooden deck, moving their feet in a rhythmic motion. The background features a railing and a body of water, with a bridge and some buildings visible in the distance.

Image Generation Model

Image generation models hold immense potential for creative applications, design tools, and personalized content creation. However, their high computational demands make them slow and less practical for real-time use. On-device processing is key to ensuring privacy and cost-efficiency. With NexaQuant, we can accelerate image generation models by 4x while maintaining high-quality outputs, enabling faster, more secure, and privacy-friendly creative experiences.

For FLUX.1-dev, compared to the original model (BF16), the NexaQuant-compressed model achieves:

  • 27.9% of the original file size: 23.8 GB → 6.64 GB
  • 36% of the runtime memory needed: 34.66 GB → 12.61 GB

Performance metrics compared to standard Q4_0 quantization:

  • 9.6x faster inference speed

Examples with FLUX.1-dev

Prompt 1:

On Mars, an astronaut holds a board that reads NEXA AI

Output:

Example 1 with flux1.dev

Prompt 2:

A dragon perched on a cliff overlooking a futuristic city, with glowing neon lights and flying cars.

Output:

Example 2 flux1.dev

Prompt 3:

A close-up shot of a vintage leather-bound book lying open on a rustic wooden desk, with a quill pen and ink bottle beside it.

Output:

Example 3 flux1.dev

Edge Inference on Any Device and Platform

With Nexa SDK, NexaQuant is compatible with popular local inference frameworks like llama.cpp, enabling efficient deployment across diverse computing platforms. We conducted comprehensive performance testing using Qwen2.5-1.5B-Instruct across multiple consumer devices and platforms.
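Because NexaQuant output is compatible with llama.cpp, a compressed model can be loaded like any other GGUF file. The sketch below uses the community llama-cpp-python bindings with a placeholder file name; the actual file, context size, and thread count depend on your export and device.

```python
from llama_cpp import Llama

# Placeholder path: point this at your NexaQuant-compressed GGUF file.
llm = Llama(
    model_path="qwen2.5-1.5b-instruct-nexaquant-q4_0.gguf",
    n_ctx=2048,      # context window
    n_threads=8,     # CPU threads; tune for your device
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of on-device AI."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```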

Time to First Token (TTFT)

  • iPhone: 0.48 seconds
  • Samsung S24 Ultra: 0.25 seconds
  • HP Laptop with AMD Ryzen AI 9 HX370 CPU: 1.138 seconds
  • MacBook Pro M4 Metal: 0.232 seconds

Decoding (Inference) Speed

  • iPhone: 29.78 tok/s
  • Samsung S24 Ultra: 18.45 tok/s
  • HP Laptop with AMD Ryzen AI 9 HX370 CPU: 51.78 tok/s
  • MacBook Pro M4 Metal: 148.19 tok/s
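Time to first token and decoding speed of the kind reported above can be measured with simple wall-clock timing around a streaming generation call. This is a generic sketch, not our benchmarking harness, reusing the placeholder model file from the previous example.

```python
import time
from llama_cpp import Llama

# Placeholder model file; see the loading example above.
llm = Llama(model_path="qwen2.5-1.5b-instruct-nexaquant-q4_0.gguf", n_ctx=2048)

prompt = "Explain why model quantization reduces memory usage."
start = time.perf_counter()
first_token_at = None
n_tokens = 0

# Stream the completion so the arrival time of the first token can be recorded;
# each streamed chunk corresponds to roughly one generated token.
for chunk in llm(prompt, max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.3f} s")
print(f"Decode speed: {n_tokens / max(end - first_token_at, 1e-9):.1f} tok/s")
```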

These performance metrics demonstrate that tiny but mighty AI is now deployable across existing software and hardware infrastructure. With sub-second startup and robust processing speeds, organizations and developers can integrate multimodal language models directly into their applications - from mobile apps to desktop software - enabling real-time AI capabilities throughout their technology stack.

Unlock More Business Value with Domain-Adapted AI

Smaller, efficient models are ideal for domain-specific fine-tuning, enabling organizations to create tailored AI solutions with their proprietary data.

NexaQuant's compression pipeline is specifically designed to work with custom data, not only preserving but often enhancing performance in specialized domains while reducing hallucination risks.

We're launching an early access program for businesses and developers to explore these capabilities. Contact us to bring powerful, accurate, and efficient AI directly to your edge devices and applications.

Contact Us to Get Started

Looking to optimize Gen AI on your edge devices? Connect with us to explore how NexaQuant can transform your on-device AI with high speed and accuracy.

Contact our team today to access the tools, models, and support to make on-device Gen AI your competitive edge.

Fill out the NexaQuant early access program form
Or, schedule a call with us for a FREE POC 👇
