Dec 31, 2024
NexaQuant makes your AI use cases run up to 3x faster and cuts energy and storage use on your devices by roughly 4x, with 100%+ accuracy recovery. It supports multimodality across text, audio, vision, image generation, and video, making diverse AI use cases more context-aware, running them more efficiently on device, and unlocking cloud-level AI on power-constrained devices without compromise.
2024 marks a breakthrough for small-scale language models, with Gemma, Llama, and Qwen model families achieving remarkable capabilities under 10B parameters. However, even smaller models like Llama 3.2 1B and 3B still require significant computational resources, with storage requirements of 2.30GB and 6.00GB, and RAM usage of 3.08GB and 7.11GB respectively. These requirements often exceed the capabilities of typical consumer devices for efficient on-device deployment.
This creates a practical dilemma. Most consumer devices, from smartphones to laptops, struggle to run these models effectively. While cloud deployment offers a potential solution, it comes with significant drawbacks: increased latency, dependency on internet connectivity, substantial scaling costs, and critical privacy concerns when processing sensitive data. To unlock the full potential of AI on everyday devices, we need a better approach to model compression.
Standard model quantization offers a path forward: with 4-bit weights-only quantization, Llama 3.2 1B can be compressed to just 0.73GB of storage and 1.38GB of RAM. This makes local inference on consumer hardware far more efficient and manageable.
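For a rough sense of where those numbers come from, the sketch below estimates weight storage from parameter count and bits per weight. The ~1.24B parameter count and the ~4.5 effective bits for Q4_0 (weights plus per-group scales) are illustrative assumptions, not exact figures from our pipeline.

```python
# Back-of-the-envelope storage estimate for weights-only quantization.
# Parameter count and effective bits-per-weight are illustrative assumptions.

def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weight tensors alone, ignoring metadata."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

llama_1b_params = 1.24e9  # assumed parameter count for Llama 3.2 1B

print(f"FP16 weights: {weights_size_gb(llama_1b_params, 16):.2f} GB")   # ~2.3 GB
print(f"Q4_0 weights: {weights_size_gb(llama_1b_params, 4.5):.2f} GB")  # well under 1 GB
```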
However, current compression methods significantly degrade model reliability: in our experiments, Llama 3.2 1B retains only 86.45% of its original FP16 performance after Q4_0 quantization. This significantly affects critical tasks like document summarization and accurate question answering.
NexaQuant represents a breakthrough in model compression technology, delivering an advanced optimization pipeline that maintains model intelligence while significantly reducing computational requirements.
At its core, NexaQuant introduces a novel quantization approach specifically designed for transformer-based neural networks. The key innovation lies in its robust handling of outlier values during the quantization process. By incorporating in-house calibration data during compression, NexaQuant optimizes model performance for production environments.
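To see why outliers matter, here is a deliberately simplified toy illustration; it is not NexaQuant's actual algorithm, and the group values, group size, and "keep the top-k outliers in full precision" strategy are assumptions made purely for demonstration. In plain per-group 4-bit quantization, a single large weight stretches the quantization scale and wipes out the resolution left for everything else; handling that weight separately keeps the rest of the group accurate.

```python
import numpy as np

def quantize_group_q4(w: np.ndarray) -> np.ndarray:
    """Plain symmetric 4-bit quantization of one weight group (returned dequantized)."""
    scale = np.abs(w).max() / 7.0            # int4 symmetric range [-7, 7]
    return np.clip(np.round(w / scale), -7, 7) * scale

def quantize_group_outlier_aware(w: np.ndarray, k: int = 1) -> np.ndarray:
    """Toy variant: keep the k largest-magnitude weights in full precision."""
    out = w.copy()
    inliers = np.argsort(np.abs(w))[:-k]     # indices of everything except the outliers
    out[inliers] = quantize_group_q4(w[inliers])
    return out

group = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.02, 0.005, 1.5])  # one outlier
print("plain Q4 mean error:        ", np.abs(group - quantize_group_q4(group)).mean())
print("outlier-aware Q4 mean error:", np.abs(group - quantize_group_outlier_aware(group)).mean())
```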
Our benchmarks demonstrate NexaQuant's effectiveness: when applied to Llama 3.1/3.2 models (1B, 3B, and 8B variants), it recovers 100% or more of the original BF16 model performance on aggregate across standard evaluation metrics. This slight improvement over the baseline is consistently reproducible across our test suite.
The technology supports any transformer-based model, including multimodal systems that process vision and audio inputs. While NexaQuant can scale to handle models of any size, we've optimized it specifically for models under 10 billion parameters, a range we've identified as the optimal balance between computational efficiency and practical deployment requirements.
Our benchmarks demonstrate NexaQuant's breakthrough in efficiency-performance balance across a comprehensive suite of evaluations. We test models on diverse capabilities including mathematical reasoning (GSM8K), complex instruction following (IFEVAL), reading comprehension (OpenBookQA), and multi-step logical reasoning (ARC).
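For readers who want to reproduce this style of evaluation, the open-source lm-evaluation-harness covers these task suites. The snippet below is a generic sketch rather than our exact harness configuration: the model ID, task selection, and batch size are illustrative assumptions.

```python
# Rough sketch: scoring a small model on a couple of the benchmarks above
# with lm-evaluation-harness (pip install lm-eval). Model ID, tasks, and
# batch size are illustrative, not the exact setup behind these tables.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct,dtype=bfloat16",
    tasks=["arc_easy", "gsm8k"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```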
| Benchmark | BF16 | Q4_0 (gs=128) | NexaQuant Q4_0 (gs=128) | Retention after Q4_0 (% of BF16) | Restoration with NexaQuant (% of BF16) | Improvement over standard Q4_0 |
|---|---|---|---|---|---|---|
| IFEVAL | 60.82 | 57.62 | 62.77 | 94.74% | 103.21% | 8.94% |
| MMLU (5-shot) | 60.77 | 57.63 | 59.07 | 94.83% | 97.20% | 2.50% |
| HellaSwag | 53.22 | 52.62 | 53.57 | 98.87% | 100.66% | 1.81% |
| ARC-Challenge | 43.43 | 40.87 | 42.24 | 94.11% | 97.26% | 3.35% |
| ARC-Easy | 75.34 | 71.72 | 74.75 | 95.20% | 99.22% | 4.22% |
| PIQA | 76.82 | 76.12 | 77.20 | 99.35% | 100.76% | 1.42% |
| OpenBookQA | 28.80 | 29.00 | 29.20 | 100.69% | 101.39% | 0.69% |
| GSM8K | 63.92 | 58.99 | 64.75 | 92.29% | 101.30% | 9.76% |
| Total | | | | 96.26% | 100.12% | 4.09% |
| Benchmark | BF16 | Q4_0 (gs=128) | NexaQuant Q4_0 (gs=128) | Retention after Q4_0 (% of BF16) | Restoration with NexaQuant (% of BF16) | Improvement over standard Q4_0 |
|---|---|---|---|---|---|---|
| IFEVAL | 56.20 | 53.79 | 61.86 | 95.71% | 110.07% | 15.00% |
| MMLU (5-shot) | 68.20 | 66.02 | 65.06 | 96.80% | 95.40% | -1.45% |
| HellaSwag | 59.77 | 58.99 | 60.55 | 98.69% | 101.31% | 2.64% |
| ARC-Challenge | 52.73 | 52.22 | 52.65 | 99.03% | 99.85% | 0.82% |
| ARC-Easy | 81.06 | 81.19 | 82.07 | 100.16% | 101.25% | 1.08% |
| PIQA | 80.20 | 79.76 | 81.45 | 99.45% | 101.56% | 2.12% |
| OpenBookQA | 36.00 | 33.80 | 36.00 | 93.89% | 100.00% | 6.51% |
| GSM8K | 74.22 | 71.19 | 72.90 | 95.92% | 98.22% | 2.40% |
| Total | | | | 97.46% | 100.96% | 3.64% |
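The derived columns in both tables come straight from the raw scores. The small helper below is a sketch of our own (the function and its names are purely illustrative), shown only to make the arithmetic explicit; it reproduces the IFEVAL row of the first table.

```python
def quant_metrics(bf16: float, q4: float, nexa: float) -> dict:
    """Derive the percentage columns used in the tables above."""
    return {
        "retention_after_q4": 100 * q4 / bf16,       # plain Q4_0 as % of BF16
        "restoration_nexaquant": 100 * nexa / bf16,  # NexaQuant Q4_0 as % of BF16
        "improvement_over_q4": 100 * (nexa / q4 - 1),
    }

# IFEVAL row from the first table: BF16 60.82, Q4_0 57.62, NexaQuant 62.77
print(quant_metrics(60.82, 57.62, 62.77))
# -> approximately {'retention_after_q4': 94.74, 'restoration_nexaquant': 103.21,
#                   'improvement_over_q4': 8.94}
```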
For Llama 3.2 1B, compared to the original FP16 model:
Performance metrics compared to standard Q4_0 quantization:
NexaQuant unlocks sophisticated AI capabilities on resource-constrained devices while surpassing original model performance.
For Llama 3.2 3B, compared to the original FP16 model:
Performance metrics compared to standard Q4_0 quantization:
This balance of size and capability enables advanced language features on mobile devices without sacrificing computational efficiency.
For Llama 3.1 8B, compared to the original BF16 model:
Performance metrics compared to standard Q4_0 quantization:
NexaQuant makes enterprise-grade language models accessible on standard computing hardware, enabling advanced AI applications without specialized infrastructure.
Multimodal models are typically larger than text-only models, requiring both more parameters and additional components such as visual or audio projectors. This leads to substantially higher storage requirements and runtime memory consumption. NexaQuant works seamlessly with multimodal models such as vision-language models (VLMs), as demonstrated across benchmarks in visual question answering (DocVQA), multimodal reasoning (MMBench), and cross-modal understanding (MMMU).
For Qwen2-VL-2B (with the visual projector included), compared to the original BF16 model:
Performance metrics compared to standard Q4_0 quantization:
In challenging visual document QA tasks, such as reading invoice numbers from complex tables and extracting population figures from historical documents, the NexaQuant-compressed Qwen2-VL-2B maintains perfect accuracy even where the FP16 model struggles with precise number extraction.
Voice-based assistants are highly sought after, requiring the ability to understand tone, emotion, and nuanced sound for personalized interactions. Instant, real-time feedback is essential, but current audio models are too slow and traditional compression sacrifices accuracy. NexaQuant transforms audio language models, enabling real-time conversations with instant feedback while maintaining advanced capabilities to understand tone, emotion, and rich audio context.
For Qwen2-Audio-7.8B, compared to the original BF16 model, the NexaQuant-compressed model achieves:
NexaQuant:
The subject of the sentence is 'many people'
bnb load_in_4bit:
The subject of the sentence is 'many people watch TV shows'
NexaQuant:
The area of a triangle is given by the formula: A = (p * q) / 2, where p and q are the lengths of the two sides of the triangle. In this case, p = 15 cm and q = 30 cm. Substituting these values into the formula gives:
A = (15 * 30) / 2
A = 450 / 2
A = 225 cm^2
Therefore, the area of the triangle is 225 square centimeters.
bnb load_in_4bit:
The area of a triangle is given by the formula (1/2) * b * h, where b is the base and h is the height. In this case, the base is 15 units and the height is 30 units, so the area of the triangle is:
(1/2) * 15 * 30 = 450 square units
Therefore, the area of the triangle is 450 square units.
(Note that (1/2) × 15 × 30 = 225, so the NexaQuant output carries the arithmetic through correctly, while the bnb 4-bit output sets up the same formula but miscomputes the result.)
Understanding video unlocks vast potential for vision-based devices, smart cameras, and screen-aware AI. Due to significant privacy concerns, on-device video processing is crucial for driving adoption of more context-aware AI use cases. However, current video language models are too slow for practical use. NexaQuant accelerates video language models by 4x while preserving high accuracy, enabling faster and more secure on-device AI applications.
Compared to the original BF16 model, the NexaQuant-compressed model achieves:
Performance metrics compared to standard Q4_0 quantization:
Prompt:
Describe what the person is doing in the video in detail.
NexaQuant:
The person is walking on a wooden deck, moving their feet in a rhythmic motion. The background features a railing and a body of water, with a bridge and some buildings visible in the distance.
Image generation models hold immense potential for creative applications, design tools, and personalized content creation. However, their high computational demands make them slow and less practical for real-time use. On-device processing is key to ensuring privacy and cost-efficiency. With NexaQuant, we can accelerate image generation models by 4x while maintaining high-quality outputs, enabling faster, more secure, and privacy-friendly creative experiences.
For FLUX.1-dev, compared to the original BF16 model, the NexaQuant-compressed model achieves:
Performance metrics compared to standard Q4_0 quantization:
Prompt 1:
On Mars, an astronaut holds a board that reads NEXA AI
Output:
Prompt 2:
A dragon perched on a cliff overlooking a futuristic city, with glowing neon lights and flying cars.
Output:
Prompt 3:
A close-up shot of a vintage leather-bound book lying open on a rustic wooden desk, with a quill pen and ink bottle beside it.
Output:
With Nexa SDK, NexaQuant is compatible with popular local inference frameworks like llama.cpp, enabling efficient deployment across diverse computing platforms. We conducted comprehensive performance testing using Qwen2.5-1.5B-Instruct across multiple consumer devices and platforms.
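As an illustration of that deployment path, a NexaQuant-compressed GGUF checkpoint can be loaded like any other llama.cpp-compatible model, for example through the llama-cpp-python bindings. This is a minimal sketch, not official Nexa SDK documentation: the file name, context size, and thread count below are placeholder assumptions.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a NexaQuant-compressed GGUF file.
llm = Llama(
    model_path="./qwen2.5-1.5b-instruct-nexaquant-q4_0.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # tune for the target device
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this document in two sentences."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```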
These performance metrics demonstrate that tiny but mighty AI is now deployable across existing software and hardware infrastructure. With sub-second startup and robust processing speeds, organizations and developers can integrate multimodal language models directly into their applications - from mobile apps to desktop software - enabling real-time AI capabilities throughout their technology stack.
Smaller, efficient models are ideal for domain-specific fine-tuning, enabling organizations to create tailored AI solutions with their proprietary data.
NexaQuant's compression pipeline is specifically designed to work with custom data, not only preserving but often enhancing performance in specialized domains while reducing hallucination risks.
We're launching an early access program for businesses and developers to explore these capabilities. Contact us to bring powerful, accurate, and efficient AI directly to your edge devices and applications.
Looking to optimize Gen AI on your edge devices? Connect with us to explore how NexaQuant can transform your on-device AI with high speed and accuracy.
Contact our team today to access the tools, models, and support to make on-device Gen AI your competitive edge.
Join 8,000+ developers