What can you do with tiny (1B/3B) LLMs in a local RAG system?

Nov 1, 2024

Intro

Smaller language models are gaining traction, particularly for local RAG (Retrieval-Augmented Generation) applications. As more people (not just developers) seek privacy-conscious ways to interact with their documents without relying on cloud services like Anthropic's Claude or OpenAI's GPT APIs, local solutions are becoming increasingly attractive. To understand the practical capabilities of these smaller models, we built and tested a local RAG system on a 2021 MacBook Pro (M1 Pro) 14". Here's what we discovered...

Technical Stack

We built the system using:

  • Nomic's embedding model
  • Llama3.2 3B Instruct
  • LangChain for the RAG workflow
  • Nexa SDK for embedding & inference
  • Chroma DB

The complete code and technical stack are available on GitHub.
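To make the moving parts concrete, here is a minimal sketch of how such a pipeline can be wired together. The LangChain loader, splitter, and Chroma calls below are real APIs; `LocalEmbeddings` and the `embed_fn`/`llm_fn` callables are hypothetical stand-ins for the Nexa SDK's embedding and inference calls (the exact interface is in the repo):

```python
# Minimal local RAG sketch. LocalEmbeddings / embed_fn / llm_fn are
# hypothetical stand-ins for the Nexa SDK embedding and inference calls.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class LocalEmbeddings(Embeddings):
    """Adapter that routes LangChain embedding calls to a local model."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: str -> list[float]

    def embed_documents(self, texts):
        return [self.embed_fn(t) for t in texts]

    def embed_query(self, text):
        return self.embed_fn(text)


def build_rag(pdf_path: str, embed_fn, llm_fn):
    # 1. Load the PDF and split it into overlapping chunks.
    pages = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_documents(pages)

    # 2. Embed the chunks locally and index them in Chroma.
    store = Chroma.from_documents(chunks, embedding=LocalEmbeddings(embed_fn))
    retriever = store.as_retriever(search_kwargs={"k": 4})

    # 3. Answer questions by stuffing retrieved chunks into the prompt.
    def ask(question: str) -> str:
        docs = retriever.get_relevant_documents(question)
        context = "\n\n".join(d.page_content for d in docs)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return llm_fn(prompt)  # e.g. Llama3.2 3B served via the Nexa SDK

    return ask
```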

Performance Insights

What Works Well

The system's basic performance exceeded our expectations, particularly when tested with NVIDIA's Q2 2025 financial report (9 pages of complex financial data):

(Demo: asking two questions in a single query, Claude vs. the local RAG system)

  • Lightning-Fast PDF Processing: Document loading takes under 2 seconds
  • Competitive Speed: Simple information retrieval slightly outpaces Claude 3.5 Sonnet in the Claude web app
  • Decent Context Understanding: Successfully combines information from different parts of the same document

For straightforward queries like "What's NVIDIA's total revenue?", the system performs very well. Think of it as an enhanced Ctrl/Command+F with comprehension capabilities.
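In "enhanced Ctrl+F" terms, the win is that retrieval surfaces the right chunk even when the wording differs from the document. Continuing the sketch above (and assuming the retriever is exposed rather than kept inside `build_rag`, a one-line change), you can inspect what backs such a lookup:

```python
# Peek at which chunks the retriever surfaces for a simple lookup.
for doc in retriever.get_relevant_documents("What's NVIDIA's total revenue?"):
    page = doc.metadata.get("page")
    print(f"p.{page}: {doc.page_content[:80]}...")
```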

Limitations

As expected, the smaller models (in this case, Llama3.2 3B) show their limitations with complex analytical tasks. Asking for year-over-year growth comparisons between segments or trend analysis leads to unreliable outputs.

Pushing Small Models' Limits with LoRA

Building a search-optimized fine-tune or LoRA takes significant time, so as a proof of concept we trained task-specific adapters for generating pie charts and column charts. Think of it like giving the model different "hats" to wear for different tasks 🎩.
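Our demo serves models through the Nexa SDK, but as a general illustration of the "hats" idea, here is roughly what attaching and swapping LoRA adapters on one base model looks like with Hugging Face PEFT (the adapter paths are hypothetical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B-Instruct"
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the first adapter, then register a second one on the same base.
model = PeftModel.from_pretrained(base, "adapters/pie-chart", adapter_name="pie_chart")
model.load_adapter("adapters/column-chart", adapter_name="column_chart")

# Swap "hats" without reloading the 3B base weights.
model.set_adapter("column_chart")
```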

Task Routing System

We first implemented task routing with the Octopus_v2 action model (a minimal sketch of the routing logic follows the list below):

  • <pdf> or <document> tags trigger RAG for document search
  • "column chart" or "pie chart" keywords activate visualization LoRA
  • Regular chat uses the base model
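In code, the routing is a thin dispatch layer in front of the models. This sketch uses plain keyword checks; the real router is the Octopus_v2 action model, so treat this as the effective behavior rather than its implementation:

```python
def route(query: str) -> str:
    """Dispatch a user query to RAG, a visualization LoRA, or plain chat."""
    q = query.lower()
    if "<pdf>" in q or "<document>" in q:
        return "rag"            # document search via the retriever
    if "column chart" in q or "pie chart" in q:
        return "visualization"  # switch to the chart-generating LoRA
    return "chat"               # fall back to the base model

assert route("<pdf> what was total revenue?") == "rag"
assert route("make a pie chart of segment revenue") == "visualization"
assert route("hello there") == "chat"
```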

And it works! For example:

  1. Ask about revenue numbers from the PDF → gets the data via RAG
  2. Say "make a pie chart" → switches to visualization mode and uses the previous data to generate the chart

(Demo: generating a column chart from previous data; the GPU is working hard)
(Demo: generating a pie chart from previous data; blame Llama3.2 for the wrong title)
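For a sense of what visualization mode produces, the adapter's job amounts to emitting plotting code over the numbers retrieved earlier. Something in this spirit, with placeholder values rather than figures from the report:

```python
import matplotlib.pyplot as plt

# Placeholder segment data; in the demo these values come from the RAG step.
segments = {"Segment A": 40, "Segment B": 30, "Segment C": 20, "Segment D": 10}

plt.pie(segments.values(), labels=segments.keys(), autopct="%1.1f%%")
plt.title("Revenue by segment")  # the title is the part Llama3.2 sometimes gets wrong
plt.savefig("pie_chart.png")
```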

The LoRAs are pretty basic (trained on small batches of data) and far from robust, but they hint at something interesting: you could have one small base model (3B) with different LoRA "plugins" for specific tasks in a local RAG system. Again, it's like a lightweight model that puts on different hats when needed.

Try It Yourself

We've open-sourced everything; here is the link. A few things to know:

  • Use <pdf> tag to trigger RAG
  • Say "column chart" or "pie chart" for visualizations
  • Needs about 10 GB of RAM

What's Next

Working on:

  1. Getting it to understand images/graphs in documents
  2. Making the LoRA switching more efficient (just one parent model)
  3. Teaching it to break down complex questions better with multi-step reasoning or simple CoT

Small LLMs excel at basic document Q&A while struggling with complex analysis. However, their real potential lies in how easily they can be fine-tuned at this scale - especially with LoRA. By treating these models like a Swiss Army knife, adding task-specific "hats" through LoRA, we can build powerful, fully on-device solutions for most daily tasks. The future of local AI might not be about having one massive model, but rather a lightweight base model with swappable specialized adapters.

Kudos to <David>, <Perry>, and the Nexa AI team.

Blog written by <Kai>.
