Nov 12, 2024
Squid is a breakthrough language model by Nexa AI that efficiently processes long texts on resource-constrained devices, delivering substantial reductions in energy consumption and latency compared with feeding the full long context to a standard model.
At Nexa AI, we're committed to pushing the boundaries of what's possible with on-device language models. With the growing demand for privacy, reduced latency, and offline functionality, deploying language models directly on devices has become increasingly important.
Mobile devices have limited computational resources and battery life. Processing long contexts requires substantial memory and computational power, which can rapidly drain battery life and degrade user experience due to slow response times. This is particularly problematic for real-time applications like voice assistants and interactive chatbots.
Why did we build Squid? We saw an urgent need for a solution that maintains the accuracy and capabilities of language models while significantly reducing their energy footprint and improving response times on resource-constrained devices. Our goal was to enable sophisticated AI functionalities on edge devices without compromising performance or draining battery life.
Inspired by models that process both images and text, we started treating long pieces of text as a separate kind of data. This allowed us to apply specialized techniques to handle lengthy texts more efficiently.
Squid uses a unique architecture consisting of two interconnected models:
1. Small Model: a compact decoder that reads the long input and compresses it into a short sequence of memory tokens.
2. Large Model: a larger decoder that generates the final response, conditioned on those compressed memory tokens and the user's query rather than the full text.
This setup allows Squid to process long texts much more efficiently, as the large model doesn't need to handle the entire lengthy text directly.
To help the models communicate effectively, we introduced special "memory tokens." Think of these as placeholders or bookmarks that capture important information from the long text. The small model creates these memory tokens, which the large model then uses to understand the context and generate appropriate responses.
A special component called the "projector" helps translate information between the small and large models. It ensures that the compressed summaries and memory tokens created by the small model are in a format that the large model can easily understand.
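To make this flow concrete, here is a minimal PyTorch sketch of the dual-decoder layout with memory tokens and a projector. The module names, hidden sizes (896 and 3584), and the choice of 16 memory tokens are illustrative assumptions, not the actual Nexa AI implementation, and the two decoders are replaced with identity stand-ins so the example runs end to end:

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Maps the small model's hidden states into the large model's embedding space."""

    def __init__(self, small_dim: int, large_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(small_dim, large_dim),
            nn.GELU(),
            nn.Linear(large_dim, large_dim),
        )

    def forward(self, memory_states: torch.Tensor) -> torch.Tensor:
        return self.proj(memory_states)


class DualDecoderSketch(nn.Module):
    """Toy version of the small-model -> memory tokens -> projector -> large-model flow."""

    def __init__(self, small_model, large_model,
                 small_dim=896, large_dim=3584, num_memory_tokens=16):
        super().__init__()
        self.small_model = small_model    # compresses the long context
        self.large_model = large_model    # generates the final response
        self.projector = Projector(small_dim, large_dim)
        # Learnable memory tokens appended after the long context on the small side.
        self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, small_dim))

    def forward(self, long_ctx_embeds: torch.Tensor, query_embeds: torch.Tensor):
        batch = long_ctx_embeds.size(0)
        mem = self.memory_tokens.unsqueeze(0).expand(batch, -1, -1)
        # 1) Small model reads [long context ; memory tokens].
        small_hidden = self.small_model(torch.cat([long_ctx_embeds, mem], dim=1))
        # 2) Keep only the hidden states at the memory-token positions.
        mem_states = small_hidden[:, -mem.size(1):, :]
        # 3) Translate them into the large model's embedding space.
        mem_embeds = self.projector(mem_states)
        # 4) Large model sees the compressed context plus the short query only.
        return self.large_model(torch.cat([mem_embeds, query_embeds], dim=1))


# Toy usage with identity stand-ins for the two decoders.
model = DualDecoderSketch(nn.Identity(), nn.Identity())
ctx = torch.randn(1, 512, 896)     # embedded long context (512 tokens)
query = torch.randn(1, 12, 3584)   # embedded user query (12 tokens)
out = model(ctx, query)            # shape: (1, 16 + 12, 3584)
```

The key point of the design is visible in step 4: however long the original document is, the large model only ever attends over a fixed, small number of memory tokens plus the query.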
To make Squid as effective as possible, we trained it in three stages:
Stage 1: Restoration Training
Goal: Teach the large model to reconstruct the original long text from the compressed summaries provided by the small model.
How it works: The small model summarizes the long text, and the large model tries to rebuild the original text from this summary. This ensures that important details aren't lost in the compression process.
Stage 2: Continuation Training
Goal: Improve the model's ability to continue a piece of text seamlessly.
How it works: We feed the models a part of the text and train them to generate the next part. This helps the model produce coherent continuations, making it great for tasks like story generation or extending conversations.
Stage 3: Instruction Fine-Tuning
Goal: Enhance the model's ability to follow user instructions and answer questions accurately.
How it works: We fine-tune the model using a large set of question-and-answer pairs, ensuring it can provide helpful and relevant responses across various topics.
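All three stages optimize the same kind of objective, next-token prediction conditioned on the compressed context; only the target text changes from stage to stage. A minimal sketch, assuming a hypothetical model interface that returns vocabulary logits for the target tokens under teacher forcing:

```python
import torch.nn.functional as F


def stage_loss(model, long_ctx_embeds, query_embeds, target_ids):
    """Next-token cross-entropy shared by all three training stages.

    Stage 1 (restoration): target_ids is the original long text itself, so the
        large model must be able to rebuild what the memory tokens compressed.
    Stage 2 (continuation): target_ids is the text that follows the context.
    Stage 3 (instruction fine-tuning): target_ids is the reference answer to a
        question about the context.
    """
    # Hypothetical interface: per-position vocabulary logits for the target
    # tokens, given the compressed long context and the (possibly empty) query.
    logits = model(long_ctx_embeds, query_embeds, target_ids[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * length, vocab)
        target_ids[:, 1:].reshape(-1),         # targets shifted by one position
    )
```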
A crucial part of developing Squid was rigorously testing it to ensure it performs well across different tasks.
We wanted to make sure Squid could handle a variety of tasks, so we included six types of questions:
- Contextual Questions (~56% of samples)
- Numeric Questions (~9%)
- Rephrasing Tasks (~7%)
- Summarization (~7%)
- Title or Keywords Extraction (~14%)
- Continuation Tasks (~7%)
| Category | Correctness (%) |
|---|---|
| Contextual QA | 97.76 |
| Numeric QA | 98.53 |
| Rephrasing | 99.22 |
| Summarization | 99.62 |
| Title / Keywords | 100.00 |
| Continuation | 100.00 |
Weighted Average Correctness: 98.53% across all question categories.
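The weighted average follows directly from the per-category scores and the approximate sample proportions listed above; a quick check in Python:

```python
# Reproducing the weighted average from the table above, using the approximate
# category proportions listed earlier (~56%, ~9%, ~7%, ~7%, ~14%, ~7%).
weights = {"contextual": 0.56, "numeric": 0.09, "rephrasing": 0.07,
           "summarization": 0.07, "title_keywords": 0.14, "continuation": 0.07}
scores = {"contextual": 97.76, "numeric": 98.53, "rephrasing": 99.22,
          "summarization": 99.62, "title_keywords": 100.00, "continuation": 100.00}

weighted_avg = sum(weights[k] * scores[k] for k in weights)
print(f"{weighted_avg:.2f}%")   # -> 98.53%
```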
| System 1 | System 2 | Win (%) | Lose (%) | Tie (%) | Win + Tie (%) |
|---|---|---|---|---|---|
| Squid | AutoCompressor | 95.1 | 0 | 4.9 | 100 |
| Squid | Qwen2-7B | 23.6 | 32.2 | 44.2 | 67.8 |
vs. AutoCompressor: Squid wins 95.1% of head-to-head comparisons and never loses, for a 100% win-plus-tie rate.
vs. Qwen2-7B: Squid wins or ties 67.8% of comparisons, even though it conditions on the compressed context rather than the full text.
Squid represents a significant advancement in the development of energy-efficient, on-device language models capable of processing long contexts without sacrificing performance. By introducing a new modality and leveraging a dual-decoder architecture, we've achieved substantial improvements in both energy consumption and latency, gains that matter most for real-time, on-device applications such as voice assistants and interactive chatbots, where privacy, responsiveness, and battery life are critical.
We invite the AI community to explore Squid and contribute to its development. The model is publicly available at our Hugging Face repository.
Join us in pushing the boundaries of on-device language modeling at nexa.ai. Together, we can make AI more efficient, accessible, and impactful.
Paper: Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Kudos to Alex, Zack, Shuo, Ethan, and the Nexa AI team.
Blog written by Kai.