Revolutionizing On-Device Language Models for Long Contexts
At Nexa AI, we're committed to pushing the boundaries of what's possible with on-device language models. With the growing demand for privacy, reduced latency, and offline functionality, deploying language models directly on devices has become increasingly important.
Mobile devices have limited computational resources and battery life. Processing long contexts requires substantial memory and computational power, which can rapidly drain battery life and degrade user experience due to slow response times. This is particularly problematic for real-time applications like voice assistants and interactive chatbots.
Why did we build Squid? We saw an urgent need for a solution that maintains the accuracy and capabilities of language models while significantly reducing their energy footprint and improving response times on resource-constrained devices. Our goal was to enable sophisticated AI functionalities on edge devices without compromising performance or draining battery life.
Viewing Long Text as a New Type of Data
Inspired by models that process both images and text, we started treating long pieces of text as a separate kind of data. This allowed us to apply specialized techniques to handle lengthy texts more efficiently.
Two-Part Model Architecture
Squid uses a unique architecture consisting of two interconnected models:
1. Small Model: a compact decoder that reads the full long context and compresses it into a short sequence of memory tokens.
2. Large Model: a larger decoder that receives this compressed representation together with the user's query and generates the response.
This setup allows Squid to process long texts much more efficiently, as the large model doesn't need to handle the entire lengthy text directly.
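To make this division of labor concrete, here is a minimal, hypothetical sketch of the inference flow. The module names, hidden sizes, and memory-token count are illustrative placeholders, not Squid's actual configuration:

```python
import torch

NUM_MEMORY_TOKENS = 16                   # assumed compression budget
SMALL_HIDDEN = 512                       # assumed small-model hidden size

def small_model_compress(context_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the small decoder: reads the full long context and
    emits a short, fixed-length sequence of memory embeddings."""
    batch = context_ids.shape[0]
    return torch.randn(batch, NUM_MEMORY_TOKENS, SMALL_HIDDEN)   # placeholder output

def large_model_generate(memory_embeds: torch.Tensor, query_ids: torch.Tensor) -> str:
    """Stand-in for the large decoder: conditions on the compressed memory
    plus the user query, never on the raw long context."""
    return "<generated answer>"                                   # placeholder output

context_ids = torch.randint(0, 32000, (1, 8000))    # an 8,000-token document
query_ids = torch.randint(0, 32000, (1, 32))        # a short user question
memory = small_model_compress(context_ids)           # 8,000 tokens -> 16 embeddings
answer = large_model_generate(memory, query_ids)     # large model sees 16 + 32 positions
```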
Memory Tokens
To help the models communicate effectively, we introduced special "memory tokens." Think of these as placeholders or bookmarks that capture important information from the long text. The small model creates these memory tokens, which the large model then uses to understand the context and generate appropriate responses.
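As a rough illustration, memory tokens can be implemented as a small set of learnable embeddings appended after the long context, whose final hidden states become the compressed summary. The sketch below makes that assumption; the class and parameter names are ours, not from the released code:

```python
import torch
import torch.nn as nn

class MemoryTokenCompressor(nn.Module):
    """Wraps a small decoder and extracts memory-token summaries (illustrative)."""

    def __init__(self, small_decoder: nn.Module, hidden_size: int, num_memory_tokens: int = 16):
        super().__init__()
        self.small_decoder = small_decoder           # any module mapping (B, L, H) -> (B, L, H)
        # Learnable "bookmark" embeddings shared across all inputs.
        self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, hidden_size) * 0.02)

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        # context_embeds: (batch, context_len, hidden_size)
        batch = context_embeds.shape[0]
        mem = self.memory_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Append the memory tokens after the context so they can attend to all of it.
        hidden = self.small_decoder(torch.cat([context_embeds, mem], dim=1))
        # Keep only the final hidden states at the memory-token positions;
        # these vectors are the compressed summary handed to the large model.
        return hidden[:, -mem.shape[1]:, :]

# Toy usage with an identity "decoder" just to show the shapes involved.
compressor = MemoryTokenCompressor(nn.Identity(), hidden_size=512)
summary = compressor(torch.randn(1, 8000, 512))      # (1, 8000, 512) -> (1, 16, 512)
```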
Bridging the Models
A special component called the "projector" helps translate information between the small and large models. It ensures that the compressed summaries and memory tokens created by the small model are in a format that the large model can easily understand.
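One common way to build such a projector, borrowed from vision-language adapters, is a small MLP that maps the small model's hidden size to the large model's embedding size. The design below is an assumption for illustration; Squid's actual projector may differ:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps memory embeddings from the small model's hidden space into the
    large model's embedding space (sizes here are placeholders)."""

    def __init__(self, small_hidden: int = 512, large_hidden: int = 3584):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(small_hidden, large_hidden),
            nn.GELU(),
            nn.Linear(large_hidden, large_hidden),
        )

    def forward(self, memory_embeds: torch.Tensor) -> torch.Tensor:
        return self.net(memory_embeds)

# The projected memory embeddings can then be prepended to the query's token
# embeddings and fed to the large model as ordinary input embeddings, e.g.:
#   inputs_embeds = torch.cat([projector(memory), query_embeds], dim=1)
#   outputs = large_model(inputs_embeds=inputs_embeds)
projector = Projector()
projected = projector(torch.randn(1, 16, 512))       # (1, 16, 512) -> (1, 16, 3584)
```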
Training in Three Stages
To make Squid as effective as possible, we trained it in three stages:
1. Restoration Training
Goal: Teach the large model to reconstruct the original long text from the compressed summaries provided by the small model.
How it works: The small model summarizes the long text, and the large model tries to rebuild the original text from this summary. This ensures that important details aren't lost in the compression process.
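Concretely, this can be framed as an ordinary next-token cross-entropy loss in which the target is the original long text itself. The sketch below assumes a Hugging Face-style causal LM that accepts inputs_embeds; the function and variable names are ours, not from the Squid codebase:

```python
import torch
import torch.nn.functional as F

def restoration_loss(large_model, memory_embeds, context_ids, embed_tokens):
    """memory_embeds: (B, M, H) projected summaries from the small model.
    context_ids:   (B, L) token ids of the original long text to reconstruct.
    embed_tokens:  the large model's input embedding layer."""
    context_embeds = embed_tokens(context_ids)                         # (B, L, H)
    inputs_embeds = torch.cat([memory_embeds, context_embeds], dim=1)  # memory first
    logits = large_model(inputs_embeds=inputs_embeds).logits           # (B, M + L, V)
    # Shift by one so each position predicts the next token of the original text;
    # only the positions that reconstruct the context contribute to the loss.
    m = memory_embeds.shape[1]
    pred = logits[:, m - 1:-1, :]                                      # predictions for context tokens
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), context_ids.reshape(-1))
```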
2. Continual Training
Goal: Improve the model's ability to continue a piece of text seamlessly.
How it works: We feed the models a part of the text and train them to generate the next part. This helps the model produce coherent continuations, making it great for tasks like story generation or extending conversations.
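In this stage the setup stays the same and only the target changes: one part of a document is compressed, and the large model is trained to generate what follows. A minimal, assumed data-preparation helper (split point and lengths are illustrative):

```python
def make_continuation_example(token_ids: list[int], context_len: int = 2048):
    """Split a document: the first part is compressed by the small model,
    the remainder is what the large model must learn to continue."""
    context_ids = token_ids[:context_len]   # goes through memory-token compression
    target_ids = token_ids[context_len:]    # supervision for the large model
    return context_ids, target_ids
```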
3. Instruction Fine-Tuning
Goal: Enhance the model's ability to follow user instructions and answer questions accurately.
How it works: We fine-tune the model using a large set of question-and-answer pairs, ensuring it can provide helpful and relevant responses across various topics.
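A common recipe for this stage is to compress the document into memory tokens, keep the instruction as plain text, and apply the loss only to the answer tokens. The label-masking helper below is a sketch of that idea, not Squid's actual data pipeline:

```python
IGNORE_INDEX = -100   # positions with this label are excluded from the loss

def make_sft_labels(instruction_ids: list[int], answer_ids: list[int]):
    """Concatenate instruction and answer tokens, supervising only the answer."""
    input_ids = instruction_ids + answer_ids
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(answer_ids)
    return input_ids, labels
```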
Evaluation
A crucial part of developing Squid was rigorously testing it to ensure it performs well across different tasks.
Composition of the Dataset
Types of Questions Tested
We wanted to make sure Squid could handle a variety of tasks, so we included six types of questions:
1. Contextual QA
2. Numeric QA
3. Rephrasing
4. Summarization
5. Title / Keywords
6. Continuation
Performance Gains
Latency Reduction:
Energy Efficiency:
Accuracy Across Tasks
Overall Correctness:
Category-Specific Performance:
Category | Correctness (%)
---|---
Contextual QA | 97.76
Numeric QA | 98.53
Rephrasing | 99.22
Summarization | 99.62
Title / Keywords | 100.00
Continuation | 100.00
Comparative Analysis
System 1 | System 2 | Win (%) | Lose (%) | Tie (%) | Win + Tie (%) |
---|---|---|---|---|---|
Squid | AutoCompressor | 95.1 | 0 | 4.9 | 100 |
Squid | Qwen2-7B | 23.6 | 32.2 | 44.2 | 67.8 |
Against AutoCompressor (based on Llama-2-7b): Squid won 95.1% of head-to-head comparisons and tied the remaining 4.9%, with no losses.
Against Qwen2-7B: Squid won 23.6% and tied 44.2% of comparisons, for a combined win-or-tie rate of 67.8%.
Squid represents a significant advancement in the development of energy-efficient, on-device language models capable of processing long contexts without sacrificing performance. By introducing a new modality and leveraging a dual-decoder architecture, we've achieved substantial improvements in both energy consumption and latency.
Implications:
We invite the AI community to explore Squid and contribute to its development. The model is publicly available at our Hugging Face repository.
Join us in pushing the boundaries of on-device language modeling at nexa.ai. Together, we can make AI more efficient, accessible, and impactful.