Nov 1, 2024
Smaller language models are gaining traction, particularly for local RAG (Retrieval-Augmented Generation) applications. As more people (not just developers) seek privacy-conscious ways to interact with their documents without relying on cloud services like Anthropic's Claude or OpenAI's GPT APIs, local solutions are becoming increasingly attractive. To understand the practical capabilities of these smaller models, we built and tested a local RAG system on a 2021 MacBook Pro (M1 Pro) 14". Here's what we discovered...
We built the system with a small, fully local stack; the complete code and technical details are available on GitHub.
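At a high level the pipeline is standard RAG: split the report into chunks, embed them, retrieve the chunks closest to the question, and let the local model answer from that context. Here is a minimal sketch of that idea, assuming the Ollama Python client, a local embedding model, and a brute-force in-memory search; the client and model names are illustrative stand-ins, not necessarily the repo's exact stack.

```python
# Minimal local RAG sketch (illustrative; not the exact code from the repo).
# Assumes the `ollama` Python client with a local embedding and chat model pulled.
import ollama

def embed(texts, model="nomic-embed-text"):
    return [ollama.embeddings(model=model, prompt=t)["embedding"] for t in texts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def answer(question, chunks, top_k=3):
    # Rank document chunks by similarity to the question, then answer from the top hits.
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    ranked = sorted(zip(chunks, chunk_vecs), key=lambda p: cosine(q_vec, p[1]), reverse=True)
    context = "\n\n".join(chunk for chunk, _ in ranked[:top_k])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    reply = ollama.chat(model="llama3.2:3b", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```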
The system's basic performance exceeded our expectations, particularly when tested against NVIDIA's Q2 2025 financial report (9 pages of dense financial data).
For straightforward queries like "What's NVIDIA's total revenue?", the system performs very well. Think of it as an enhanced Ctrl/Command+F with comprehension capabilities.
As expected, the smaller models (in this case, Llama3.2 3B) show their limitations with complex analytical tasks. Asking for year-over-year growth comparisons between segments or trend analysis leads to unreliable outputs.
Building a search-optimized fine-tune or LoRA takes significant time, so as a proof of concept we trained task-specific adapters for generating pie charts and column charts. Think of it like giving the model different "hats" to wear for different tasks 🎩.
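Wrapping a 3B base model with such an adapter is only a few lines with Hugging Face PEFT. The sketch below is purely illustrative: the base model name, target modules, and hyperparameters are assumptions, and the actual fine-tuning loop over (prompt, chart-spec) pairs is omitted.

```python
# Sketch of wrapping a small base model with a task-specific LoRA adapter using
# Hugging Face PEFT. Names and hyperparameters are illustrative placeholders;
# the training loop itself is a standard causal-LM fine-tune and is omitted here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.2-3B-Instruct"   # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=8,                                  # low rank keeps the adapter tiny
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # a few million trainable params vs. ~3B frozen

# ... fine-tune on (prompt -> chart spec) pairs here, then save just the adapter:
model.save_pretrained("adapters/pie_chart")
```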
We first implemented task routing with the Octopus_v2 action model, which decides whether a query should go to plain Q&A or to one of the chart adapters.
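Conceptually, the routing step looks like the sketch below. It is simplified: the real system relies on Octopus_v2's function-token output, whereas here a plain classification prompt is used, and the router model name and adapter paths are placeholders.

```python
# Conceptual sketch of task routing: a small "action" model picks which task a
# query belongs to, and the pipeline dispatches to the matching LoRA adapter.
# Model name, prompt, and adapter paths are simplified placeholders.
import ollama

ACTIONS = {
    "plain_qa": None,                       # answered by the base RAG pipeline
    "pie_chart": "adapters/pie_chart",      # LoRA that emits pie-chart specs
    "column_chart": "adapters/column_chart",
}

def route(query: str) -> str:
    prompt = (
        "Classify this request as one of: plain_qa, pie_chart, column_chart.\n"
        f"Request: {query}\nAnswer with the label only."
    )
    reply = ollama.chat(model="octopus-v2",  # placeholder name for the router model
                        messages=[{"role": "user", "content": prompt}])
    label = reply["message"]["content"].strip().lower()
    return label if label in ACTIONS else "plain_qa"  # fall back to plain Q&A

# e.g. route("Show NVIDIA's revenue mix as a pie chart")  ->  "pie_chart"
```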
And it works! Chart requests get routed to the matching adapter, while ordinary questions fall through to plain Q&A.
The LoRAs are pretty basic (trained on small batches of data) and far from robust, but they hint at something interesting: you could have one small base model (3B) with different LoRA "plugins" for specific tasks in a local RAG system. Again, it is kind of like having a lightweight model that can wear different hats or shoes when needed.
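PEFT already supports this "plugin" pattern: keep one set of base weights in memory, register several named adapters, and switch between them per request. A rough sketch, reusing the same illustrative base model and adapter paths as above:

```python
# Sketch of the "one base model, many LoRA plugins" idea with PEFT's adapter API:
# load each task adapter once, then swap them in and out per request.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B-Instruct"   # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the first adapter, then register the second one on the same base weights.
model = PeftModel.from_pretrained(base, "adapters/pie_chart", adapter_name="pie_chart")
model.load_adapter("adapters/column_chart", adapter_name="column_chart")

def generate(prompt: str, task: str) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    if task == "plain_qa":
        with model.disable_adapter():          # plain Q&A runs on the bare base model
            out = model.generate(ids, max_new_tokens=256)
    else:
        model.set_adapter(task)                # swap in the task-specific "hat"
        out = model.generate(ids, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```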
We've open-sourced everything (the GitHub link is above). A few things to know:
Working on:
Small LLMs excel at basic document Q&A while struggling with complex analysis. However, their real potential lies in how easily their small scale lets them be fine-tuned, especially with LoRA. By treating these models like a Swiss Army knife and adding task-specific "hats" through LoRA, we can build powerful, fully on-device solutions for most daily tasks. The future of local AI might not be about having one massive model, but rather a lightweight base model with swappable specialized adapters.
Kudos to <David>, <Perry>, and the Nexa AI team.
Blog written by <Kai>.