
Nexa Quantized DeepSeek R1 Distill Model With Full Quality Recovery

Feb 18, 2025

Background + Overview

DeepSeek-R1 has been making headlines for rivaling OpenAI’s o1 reasoning model while remaining fully open-source. Many users want to run the DeepSeek distilled models locally to ensure data privacy, reduce latency, and maintain offline access. However, fitting such a large model onto personal devices typically requires quantization (e.g., Q4_K_M), which often sacrifices accuracy (up to ~22% accuracy loss) and undermines performance on local reasoning tasks.
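To see where that accuracy loss comes from, here is a simplified sketch (our illustration, not the exact llama.cpp implementation) of symmetric 4-bit block quantization in the spirit of Q4_0: each block of 32 weights shares a single scale, and every weight is rounded to a 4-bit integer.

```python
import numpy as np

def quantize_q4_0_like(weights: np.ndarray, block_size: int = 32):
    """Simplified symmetric 4-bit block quantization (Q4_0-style sketch).

    Each block of 32 weights shares one fp16 scale; weights are rounded to
    integers in [-8, 7]. Real llama.cpp kernels also pack two 4-bit values
    per byte; that packing is skipped here for clarity.
    """
    blocks = weights.reshape(-1, block_size)
    # Scale chosen so the largest-magnitude weight in a block lands near the edge of the 4-bit range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7)  # 4-bit integer codes
    return q.astype(np.int8), scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

# Rounding error like this, accumulated over billions of weights,
# is what degrades accuracy in naive 4-bit quantization.
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_0_like(w)
print("mean abs reconstruction error:", np.abs(dequantize(q, s) - w).mean())
```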

We’ve solved the trade-off by quantizing the DeepSeek R1 Distilled model to 1/4 of its original size—without losing any accuracy. In tests on an HP OmniBook AIPC with an AMD Ryzen™ AI 9 HX 370 processor, the NexaQuant DeepSeek-R1-Distill-Qwen-1.5B model maintained full-precision accuracy at a decoding speed of 66.40 tokens per second while using only 1,228 MB of RAM. In contrast, the unquantized model ran at 25.28 tokens per second and used 3,788 MB of RAM.

Use Case Demo

Here’s a comparison of how a standard Q4_K_M quantization and NexaQuant 4-bit handle a common investment banking brain teaser. NexaQuant stays accurate while shrinking the model file to roughly a quarter of its original size.

For the DeepSeek-R1-Distill-Qwen-1.5B model:
Prompt (a common investment banking brain teaser):
There is a 6x8 rectangular chocolate bar made up of small 1x1 bits. We want to break it into its 48 bits. We can break one piece of chocolate horizontally or vertically, but cannot break two pieces together. What is the minimum number of breaks required?

Correct Answer: 47 (each break increases the number of pieces by exactly one, so going from 1 piece to 48 pieces takes 47 breaks)

For the DeepSeek-R1-Distill-Llama-8B model:
Prompt:
A stick is broken into 3 parts by choosing 2 points uniformly at random along its length. What is the probability that the three pieces can form a triangle?

Correct Answer: 1/4
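As a quick sanity check on that answer (our own addition, not part of the original demo), a short Monte Carlo simulation confirms the 1/4 result: the three pieces form a triangle exactly when no piece is longer than half the stick.

```python
import random

def triangle_probability(trials: int = 1_000_000) -> float:
    """Monte Carlo check of the stick-breaking brain teaser.

    Break a unit stick at two uniformly random points; the three pieces
    form a triangle iff no piece is longer than 1/2.
    """
    hits = 0
    for _ in range(trials):
        a, b = sorted((random.random(), random.random()))
        pieces = (a, b - a, 1 - b)
        if max(pieces) < 0.5:
            hits += 1
    return hits / trials

print(triangle_probability())  # ~0.25, matching the expected answer of 1/4
```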

Distill-R1 models have proven their ability to tackle complex reasoning tasks, such as brain teasers from investment banking interviews, using just 1.5B parameters—something larger models like Llama3.2-3B struggle to achieve.

Benchmarks

The benchmarks show that the NexaQuant 4-bit models preserve the reasoning and general capabilities of the original 16-bit models, delivering uncompromised performance in a significantly smaller memory and storage footprint.

Reasoning capacity:
[Charts comparing reasoning benchmarks for DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant and DeepSeek-R1-Distill-Llama-8B-NexaQuant]
General capacity:

DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant

| Benchmark | Full 16-bit | llama.cpp 4-bit | NexaQuant 4-bit |
| --- | --- | --- | --- |
| HellaSwag | 35.81 | 34.31 | 34.60 |
| MMLU | 37.31 | 35.49 | 37.41 |
| ARC_easy | 67.55 | 54.20 | 65.53 |
| MathQA | 41.04 | 28.51 | 39.87 |
| PIQA | 65.56 | 61.70 | 65.07 |
| IFEval | 18.78 | 16.52 | 21.84 |

DeepSeek-R1-Distill-Llama-8B-NexaQuant

| Benchmark | Full 16-bit | llama.cpp 4-bit | NexaQuant 4-bit |
| --- | --- | --- | --- |
| HellaSwag | 57.07 | 52.12 | 54.56 |
| MMLU | 55.59 | 52.82 | 54.94 |
| ARC_easy | 74.49 | 69.32 | 71.72 |
| MathQA | 35.34 | 30.00 | 32.46 |
| PIQA | 78.56 | 76.09 | 77.68 |
| IFEval | 36.26 | 35.35 | 34.12 |

On-Device Performance

We tested speed and memory usage on an HP OmniBook AIPC with an AMD Ryzen™ AI 9 HX 370 processor, using a 512-token prompt and generating 128 tokens. By leveraging NexaQuant’s Q4_0 quantization, you can achieve:

  • Compared to Standard Q4_K_M:
    • 7.14% faster time to first token (TTFT)
    • 6.81% faster prefill speed
    • 6.22% faster decoding speed
    • 20.05% lower RAM usage
  • Compared to Full Precision:
    • 15.71% faster TTFT
    • 15.19% faster prefill speed
    • 162.66% faster decoding speed
    • 67.58% lower RAM usage

These improvements highlight NexaQuant’s ability to deliver high-performance AI on personal devices, combining faster inference and dramatically reduced memory usage—without sacrificing accuracy.
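For a rough idea of how such numbers are measured, here is a minimal sketch using the llama-cpp-python streaming API (our own illustration; the GGUF filename and prompt are hypothetical, and the benchmark above used a 512-token prompt):

```python
import time
from llama_cpp import Llama

# Hypothetical filename: substitute the NexaQuant GGUF you downloaded.
llm = Llama(model_path="DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant-4bit.gguf",
            n_ctx=1024, verbose=False)

prompt = "Explain the chocolate bar brain teaser."  # the benchmark used a ~512-token prompt

start = time.perf_counter()
first_token_time = None
n_generated = 0
for _ in llm(prompt, max_tokens=128, stream=True):
    if first_token_time is None:
        first_token_time = time.perf_counter()   # time to first token
    n_generated += 1
end = time.perf_counter()

ttft = first_token_time - start
decode_speed = (n_generated - 1) / (end - first_token_time)
print(f"TTFT: {ttft:.2f} s, decoding speed: {decode_speed:.2f} tokens/s")
```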

DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant

| Variant | TTFT (s) | Prefill speed (tokens/s) | Decoding speed (tokens/s) | Peak RAM (MB) | Model size (GB) |
| --- | --- | --- | --- | --- | --- |
| Standard Q4_K_M | 0.75 | 683.21 | 62.51 | 1536 | 1.04 |
| Standard Q3_K_L | 0.74 | 690.66 | 53.85 | 1433 | 0.94 |
| NexaQuant 4-bit | 0.70 | 729.71 | 66.40 | 1228 | 0.99 |
| Full precision | 0.81 | 633.48 | 25.28 | 3788 | 3.31 |
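The relative improvements listed earlier follow directly from these measurements. A small sketch of the arithmetic, with the values copied from the table above:

```python
# Measured values for DeepSeek-R1-Distill-Qwen-1.5B (table above).
nexaquant = {"ttft": 0.70, "prefill": 729.71, "decode": 66.40, "ram": 1228}
q4_k_m    = {"ttft": 0.75, "prefill": 683.21, "decode": 62.51, "ram": 1536}
full      = {"ttft": 0.81, "prefill": 633.48, "decode": 25.28, "ram": 3788}

def compare(base: dict, name: str) -> None:
    print(f"NexaQuant vs {name}:")
    print(f"  {base['ttft'] / nexaquant['ttft'] - 1:.2%} faster TTFT")
    print(f"  {nexaquant['prefill'] / base['prefill'] - 1:.2%} faster prefill")
    print(f"  {nexaquant['decode'] / base['decode'] - 1:.2%} faster decoding")
    print(f"  {1 - nexaquant['ram'] / base['ram']:.2%} lower RAM usage")

compare(q4_k_m, "Standard Q4_K_M")   # ~7.14%, 6.81%, 6.22%, 20.05%
compare(full, "Full precision")      # ~15.71%, 15.19%, 162.66%, 67.58%
```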

DeepSeek-R1-Distill-Llama-8B-NexaQuant

| Variant | TTFT (s) | Prefill speed (tokens/s) | Decoding speed (tokens/s) | Peak RAM (MB) | Model size (GB) |
| --- | --- | --- | --- | --- | --- |
| Standard Q4_K_M | 5.64 | 90.72 | 16.20 | 5324 | 4.58 |
| Standard Q3_K_L | 5.66 | 90.35 | 13.75 | 4812 | 4.02 |
| NexaQuant 4-bit | 3.96 | 129.14 | 17.20 | 5017 | 4.35 |
| Full precision | 13.59 | 37.67 | 5.30 | 15564 | 14.9 |

How to run locally

NexaQuant is compatible with Nexa-SDK, Ollama, LM Studio, Jan.ai, AnythingLLM, and any other llama.cpp-based project.

For instructions on how to run the model, see our Hugging Face README.
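For a rough idea of what this looks like in practice, here is a minimal sketch using the llama-cpp-python bindings (not the official instructions; the GGUF filename is hypothetical, so substitute the file you actually downloaded):

```python
from llama_cpp import Llama

# Hypothetical filename: point this at your downloaded NexaQuant GGUF.
llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant-4bit.gguf",
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads; tune for your machine
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "A stick is broken into 3 parts by choosing 2 points "
                   "uniformly at random along its length. What is the "
                   "probability that the pieces can form a triangle?",
    }],
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
```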

P.S. If you liked our work, feel free to ⭐ star us at github.com/NexaAI/nexa-SDK or follow us on Twitter (X).

Contact Us

Interested in running DeepSeek R1 Distill models on your own devices? We offer deployment solutions with CPU, GPU, and NPU acceleration—plus cross-platform compatibility—so your engineering team can stay focused on what they do best.

Need to compress your own fine-tuned DeepSeek-Distill-R1 model for specific use cases? We can help you achieve strong reasoning capabilities in a fraction of the time.

Ready to get started? Schedule a call to learn more about our end-to-end support.

Next Steps

  1. This model is built for complex problem-solving, which is why it can produce a long chain of thought even for simple questions. We recognize this and are working on improving it in the next update.
  2. Run inference with the NexaQuant DeepSeek-R1 distilled models on NPUs.

Thank you!

Join our Discord server for feedback and discussion.

Follow us on Twitter (X) and subscribe to our newsletter below.