Feb 18, 2025
DeepSeek-R1 has been making headlines for rivaling OpenAI’s o1 reasoning model while remaining fully open source. Many users want to run the DeepSeek-R1 Distill models locally to ensure data privacy, reduce latency, and keep offline access. However, fitting such a large model onto a personal device typically requires quantization (e.g., Q4_K_M), which often sacrifices accuracy (up to ~22% accuracy loss) and undermines performance on local reasoning tasks.
We’ve solved this trade-off by quantizing the DeepSeek-R1 Distill models to 1/4 of their original size without losing accuracy. In tests on an HP OmniBook AIPC with an AMD Ryzen™ AI 9 HX 370 processor, the NexaQuant DeepSeek-R1-Distill-Qwen-1.5B model maintained full-precision accuracy at a decoding speed of 66.40 tokens per second while using only 1,228 MB of RAM. In contrast, the unquantized model ran at 25.28 tokens per second and used 3,788 MB of RAM.
Here’s a comparison of how a standard Q4_K_M quantization and NexaQuant 4-bit handle a common investment banking brain teaser. NexaQuant preserves accuracy while shrinking the model file to about a quarter of its original size.
Prompt: <A Common Investment Banking Brain Teaser Question>
There is a 6x8 rectangular chocolate bar made up of small 1x1 bits. We want to break it into the 48 bits. We can break one piece of chocolate horizontally or vertically, but cannot break two pieces together! What is the minimum number of breaks required?
Correct Answer: 47 (each break increases the number of pieces by exactly one, so going from 1 piece to 48 pieces takes 47 breaks)
Prompt:
A stick is broken into 3 parts, by choosing 2 points randomly along its length. With what probability can it form a triangle?
Correct Answer: 1/4
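For readers who want to verify the answer, here is a brief sketch of the standard argument:

```latex
% Break points X, Y ~ Uniform(0,1), independent. By symmetry, condition on X < Y,
% so the pieces have lengths X, Y - X, and 1 - Y (total length 1).
% The pieces form a triangle iff every piece is shorter than 1/2:
%     X < 1/2,   Y - X < 1/2,   Y > 1/2.
% Within the half-square {X < Y} these constraints describe a triangle of area 1/8,
% so accounting for both orderings:
\[
  P(\text{triangle}) = 2 \cdot \tfrac{1}{8} = \tfrac{1}{4}.
\]
```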
DeepSeek-R1 Distill models have proven their ability to tackle complex reasoning tasks, such as brain teasers from investment banking interviews, using just 1.5B parameters, something larger models like Llama3.2-3B struggle to achieve.
The benchmarks show that the NexaQuant 4-bit models preserve the reasoning and general capabilities of the original 16-bit models, delivering uncompromised performance in a significantly smaller memory and storage footprint.
| Benchmark | Full 16-bit | llama.cpp 4-bit | NexaQuant 4-bit |
| --- | --- | --- | --- |
| HellaSwag | 35.81 | 34.31 | 34.60 |
| MMLU | 37.31 | 35.49 | 37.41 |
| ARC_easy | 67.55 | 54.20 | 65.53 |
| MathQA | 41.04 | 28.51 | 39.87 |
| PIQA | 65.56 | 61.70 | 65.07 |
| IFEval | 18.78 | 16.52 | 21.84 |
| Benchmark | Full 16-bit | llama.cpp 4-bit | NexaQuant 4-bit |
| --- | --- | --- | --- |
| HellaSwag | 57.07 | 52.12 | 54.56 |
| MMLU | 55.59 | 52.82 | 54.94 |
| ARC_easy | 74.49 | 69.32 | 71.72 |
| MathQA | 35.34 | 30.00 | 32.46 |
| PIQA | 78.56 | 76.09 | 77.68 |
| IFEval | 36.26 | 35.35 | 34.12 |
We tested speed and memory usage on an HP OmniBook AIPC with an AMD Ryzen AI 9 HX 370 processor, using a 512-token prompt and generating 128 tokens. As the tables below show, NexaQuant’s Q4_0 quantization delivers faster inference and dramatically lower memory usage than both the standard GGUF quantizations and the full-precision model, without sacrificing accuracy. These results highlight NexaQuant’s ability to bring high-performance AI to personal devices.
| Configuration | TTFT (s) | Prefilling speed (tokens/s) | Decoding speed (tokens/s) | Peak RAM (MB) | Model size (GB) |
| --- | --- | --- | --- | --- | --- |
| Standard Q4_K_M | 0.75 | 683.21 | 62.51 | 1536 | 1.04 |
| Standard Q3_K_L | 0.74 | 690.66 | 53.85 | 1433 | 0.94 |
| NexaQuant 4-bit | 0.70 | 729.71 | 66.40 | 1228 | 0.99 |
| Full precision | 0.81 | 633.48 | 25.28 | 3788 | 3.31 |
| Configuration | TTFT (s) | Prefilling speed (tokens/s) | Decoding speed (tokens/s) | Peak RAM (MB) | Model size (GB) |
| --- | --- | --- | --- | --- | --- |
| Standard Q4_K_M | 5.64 | 90.72 | 16.20 | 5324 | 4.58 |
| Standard Q3_K_L | 5.66 | 90.35 | 13.75 | 4812 | 4.02 |
| NexaQuant 4-bit | 3.96 | 129.14 | 17.20 | 5017 | 4.35 |
| Full precision | 13.59 | 37.67 | 5.30 | 15564 | 14.9 |
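If you want to reproduce this kind of measurement on your own machine, here is a minimal sketch using llama-cpp-python (one of the llama.cpp-based runtimes NexaQuant works with); the GGUF filename and prompt are placeholders rather than our exact benchmark harness:

```python
# Rough TTFT and decoding-speed measurement for a local GGUF model.
# Assumes llama-cpp-python is installed (pip install llama-cpp-python);
# the model path is a placeholder for whichever GGUF file you downloaded.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant.gguf",  # placeholder path
    n_ctx=1024,
    verbose=False,
)

prompt = "Explain why breaking a 6x8 chocolate bar into 1x1 pieces takes exactly 47 breaks."

start = time.perf_counter()
first_token_time = None
tokens_generated = 0

# Stream the completion so we can record when the first token arrives (TTFT)
# and how quickly the remaining tokens are decoded.
for _chunk in llm(prompt, max_tokens=128, stream=True):
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now
    tokens_generated += 1
end = time.perf_counter()

print(f"TTFT: {first_token_time - start:.2f} s")
print(f"Decoding speed: {tokens_generated / (end - first_token_time):.2f} tokens/s")
```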
NexaQuant is compatible with Nexa-SDK, Ollama, LM Studio, Jan.ai, AnythingLLM, and any llama.cpp-based project.
For instructions on how to run the model, see our Hugging Face README.
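As a quick illustration (the README remains the authoritative guide), the following llama-cpp-python sketch pulls a GGUF from Hugging Face and runs a chat completion; the repo id and filename pattern are placeholders you should replace with the ones listed on the model page:

```python
# Minimal chat example with llama-cpp-python; any llama.cpp-based runtime works similarly.
# The repo_id and filename below are illustrative placeholders -- check the Hugging Face
# README for the exact NexaQuant GGUF name before running.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="NexaAI/DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant",  # placeholder repo id
    filename="*q4_0.gguf",                                     # placeholder filename pattern
    n_ctx=2048,
)

messages = [
    {
        "role": "user",
        "content": "A stick is broken into 3 parts at 2 random points along its length. "
                   "With what probability can the parts form a triangle?",
    }
]
reply = llm.create_chat_completion(messages=messages, max_tokens=256)
print(reply["choices"][0]["message"]["content"])
```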
P.S. If you liked our work, feel free to ⭐ star us at github.com/NexaAI/nexa-SDK or follow us on Twitter (X).
Interested in running DeepSeek R1 Distill models on your own devices? We offer deployment solutions with CPU, GPU, and NPU acceleration—plus cross-platform compatibility—so your engineering team can stay focused on what they do best.
Need to compress your own fine-tuned DeepSeek-R1 Distill model for specific use cases? We can help you achieve strong reasoning capabilities in a fraction of the time.
Ready to get started? Schedule a call to learn more about our end-to-end support.
Join our Discord server for feedback and discussion.
Follow us on Twitter (X) and subscribe to our newsletter below.