Squid

Nov 12, 2024

TL;DR

Squid is a breakthrough language model by Nexa AI that efficiently processes long texts on resource-constrained devices, achieving:

  • 10x improvement in energy efficiency
  • 5x reduction in processing time
  • High accuracy across various tasks
  • You can get your hands on Squid 👉 on our Hugging Face repository

Introduction

At Nexa AI, we're committed to pushing the boundaries of what's possible with on-device language models. With the growing demand for privacy, reduced latency, and offline functionality, deploying language models directly on devices has become increasingly important.

Mobile devices have limited computational resources and battery life. Processing long contexts requires substantial memory and computational power, which can rapidly drain battery life and degrade user experience due to slow response times. This is particularly problematic for real-time applications like voice assistants and interactive chatbots.

Why did we build Squid? We saw an urgent need for a solution that maintains the accuracy and capabilities of language models while significantly reducing their energy footprint and improving response times on resource-constrained devices. Our goal was to enable sophisticated AI functionalities on edge devices without compromising performance or draining battery life.

Squid: A New Approach to Handling Long Texts

Viewing Long Text as a New Type of Data

Inspired by models that process both images and text, we started treating long pieces of text as a separate kind of data. This allowed us to apply specialized techniques to handle lengthy texts more efficiently.

Two-Part Model Architecture

Squid uses a unique architecture consisting of two interconnected models:

Squid's small-large model architecture

1. Small Model:

  • Size: 0.5 billion parameters (a measure of model complexity).
  • Role: Compresses long texts into concise representations.
  • Function: Acts like a summarizer, capturing the essence of the long text without needing to process every word in detail.

2. Large Model:

  • Size: 7 billion parameters.
  • Role: Takes the compressed summary from the small model, along with the user's question or prompt, to generate accurate and relevant responses.
  • Function: Handles the heavy lifting of understanding the user's request and providing detailed answers.

This setup allows Squid to process long texts much more efficiently, as the large model doesn't need to handle the entire lengthy text directly.
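
To make the data flow concrete, here is a minimal sketch of two-stage inference in PyTorch. Every name, the Hugging Face-style model interface, and the choice of 16 summary states are illustrative assumptions on our part, not Squid's actual code:

```python
import torch

def answer_long_text(small_model, projector, large_model,
                     long_text_ids, prompt_ids, k=16):
    """Two-stage inference sketch. All names, the Hugging Face-style
    interfaces, and k=16 summary states are illustrative assumptions."""
    # Stage 1: the 0.5B model reads the full long text once and keeps only
    # a short run of summary states (far fewer than the input tokens).
    summary = small_model(long_text_ids).last_hidden_state[:, -k:, :]

    # Stage 2: the 7B model answers the prompt conditioned on the projected
    # summary, so it never attends over the raw long text at all.
    memory = projector(summary)  # map into the large model's hidden size
    prompt_embeds = large_model.get_input_embeddings()(prompt_ids)
    inputs = torch.cat([memory, prompt_embeds], dim=1)
    return large_model.generate(inputs_embeds=inputs, max_new_tokens=256)
```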

Memory Tokens

To help the models communicate effectively, we introduced special "memory tokens." Think of these as placeholders or bookmarks that capture important information from the long text. The small model creates these memory tokens, which the large model then uses to understand the context and generate appropriate responses.
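
One plausible way to implement such memory tokens, sketched below, is to append a fixed number of learned placeholder embeddings after the long text and read back their final hidden states; the details of Squid's actual mechanism may differ:

```python
import torch
import torch.nn as nn

class MemoryTokenCompressor(nn.Module):
    """Hedged sketch of a memory-token mechanism: append k learned
    placeholder embeddings after the long text; after the small model
    runs, their final hidden states hold the compressed context."""

    def __init__(self, small_model, hidden_size, k=16):
        super().__init__()
        self.small_model = small_model
        self.k = k
        # One learned "bookmark" embedding per memory slot.
        self.memory_embeds = nn.Parameter(torch.randn(k, hidden_size) * 0.02)

    def forward(self, text_embeds):                    # (batch, seq, hidden)
        mem = self.memory_embeds.expand(text_embeds.size(0), -1, -1)
        # The memory slots attend back over the text, so each slot can
        # "read" and store part of the long context.
        out = self.small_model(inputs_embeds=torch.cat([text_embeds, mem], dim=1))
        return out.last_hidden_state[:, -self.k:, :]   # (batch, k, hidden)
```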

Bridging the Models

A special component called the "projector" helps translate information between the small and large models. It ensures that the compressed summaries and memory tokens created by the small model are in a format that the large model can easily understand.
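
This post doesn't specify the projector's architecture; a common choice in the vision-language models that inspired this design is a small MLP between the two hidden sizes. The dimensions below assume Qwen2-family backbones (896 for Qwen2-0.5B, 3584 for Qwen2-7B) and are illustrative only:

```python
import torch.nn as nn

# Illustrative projector: maps small-model hidden states into the large
# model's embedding space. The sizes and two-layer MLP shape are assumptions.
projector = nn.Sequential(
    nn.Linear(896, 3584),
    nn.GELU(),
    nn.Linear(3584, 3584),
)
```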

Our Training Process

To make Squid as effective as possible, we trained it in three stages:

1. Restoration Training

Goal: Teach the large model to reconstruct the original long text from the compressed summaries provided by the small model.

How it works: The small model summarizes the long text, and the large model tries to rebuild the original text from this summary. This ensures that important details aren't lost in the compression process.
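
In training-loop terms, restoration is a next-token objective whose target is the original passage itself. A hedged sketch, reusing the names from the pipeline above and assuming the two models share a tokenizer:

```python
import torch

def restoration_loss(small_model, projector, large_model, passage_ids, k=16):
    """Sketch of stage 1: the large model must rebuild the passage from
    the compressed summary alone. Names and k=16 are assumptions."""
    summary = small_model(passage_ids).last_hidden_state[:, -k:, :]
    memory = projector(summary)                     # -> large hidden size

    # Teacher forcing: memory tokens first, then the passage itself.
    passage_embeds = large_model.get_input_embeddings()(passage_ids)
    inputs = torch.cat([memory, passage_embeds], dim=1)

    # Supervise only the passage positions; -100 masks the memory slots.
    ignore = torch.full(memory.shape[:2], -100,
                        dtype=torch.long, device=passage_ids.device)
    labels = torch.cat([ignore, passage_ids], dim=1)
    return large_model(inputs_embeds=inputs, labels=labels).loss
```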

2. Continual Training

Goal: Improve the model's ability to continue a piece of text seamlessly.

How it works: We feed the models a part of the text and train them to generate the next part. This helps the model produce coherent continuations, making it great for tasks like story generation or extending conversations.
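
This stage keeps the same loss shape; only the supervision target changes. A sketch under the same assumptions as above:

```python
import torch

def continuation_loss(small_model, projector, large_model,
                      first_half_ids, second_half_ids, k=16):
    """Sketch of stage 2: compress the first half of a document and train
    the large model to generate the second half."""
    memory = projector(small_model(first_half_ids).last_hidden_state[:, -k:, :])
    cont_embeds = large_model.get_input_embeddings()(second_half_ids)
    inputs = torch.cat([memory, cont_embeds], dim=1)
    # Mask the memory slots; supervise only the continuation tokens.
    ignore = torch.full(memory.shape[:2], -100,
                        dtype=torch.long, device=second_half_ids.device)
    labels = torch.cat([ignore, second_half_ids], dim=1)
    return large_model(inputs_embeds=inputs, labels=labels).loss
```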

3. Instruction Fine-Tuning

Goal: Enhance the model's ability to follow user instructions and answer questions accurately.

How it works: We fine-tune the model using a large set of question-and-answer pairs, ensuring it can provide helpful and relevant responses across various topics.
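
The final stage reuses the same pattern, now masking everything except the answer tokens (again a sketch under assumed names, not Nexa's released code):

```python
import torch

def instruction_loss(small_model, projector, large_model,
                     context_ids, instruction_ids, answer_ids, k=16):
    """Sketch of stage 3: compress the context, feed the instruction as
    plain tokens, and supervise only the answer tokens."""
    memory = projector(small_model(context_ids).last_hidden_state[:, -k:, :])
    embed = large_model.get_input_embeddings()
    inputs = torch.cat([memory, embed(instruction_ids), embed(answer_ids)], dim=1)
    # -100 masks both the memory slots and the instruction tokens.
    ignore_mem = torch.full(memory.shape[:2], -100,
                            dtype=torch.long, device=answer_ids.device)
    ignore_ins = torch.full_like(instruction_ids, -100)
    labels = torch.cat([ignore_mem, ignore_ins, answer_ids], dim=1)
    return large_model(inputs_embeds=inputs, labels=labels).loss
```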

In-Depth Look at Our Testing Dataset

A crucial part of developing Squid was rigorously testing it to ensure it performs well across different tasks.

Composition of the Dataset

  • Total Samples: 3,740, each pairing a long text with a user prompt and an expected response.
  • Source: Extracted from the well-known Prompt-with-Context (PWC) dataset.
  • Selection Criteria: We focused on samples where the long text was less than 512 words, matching Squid's optimal input size (a sketch of this step follows below).
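
For illustration, the selection step might look like the following; the field names are assumptions, not PWC's actual schema:

```python
# Hypothetical filtering pass over PWC samples; the field names
# ("context", "prompt", "response") are assumptions for illustration.
def select_eval_samples(pwc_samples, max_words=512):
    return [s for s in pwc_samples
            if len(s["context"].split()) < max_words]
```
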
Types of Questions Tested

We wanted to make sure Squid could handle a variety of tasks, so we included six types of questions:

Contextual Questions (~56% of samples):

  • Questions seeking specific information from the text.
  • Example: "Explain the significance of Red Hat’s acquisition of NooBaa."

Numeric Questions (~9%):

  • Questions requiring precise numerical answers.
  • Example: "What is the overall length and diameter of the Stainless Phantom M2 .30 Cal. Sound Suppression System?"

Rephrasing Tasks (~7%):

  • Requests to rewrite the text in different words.
  • Example: "Rephrase the above text."

Summarization (~7%):

  • Tasks asking for a brief summary of the text.
  • Example: "Summarize the above text."

Title or Keywords Extraction (~14%):

  • Requests to generate a title or extract key terms.
  • Example: "Write a title for the above text."

Continuation Tasks (~7%):

  • Asking the model to continue the text.
  • Example: "Write a paragraph that follows the above text."

The Results

Performance Gains

  • Inference Time: 4.32 s per query, versus 20.71 s for the Qwen2-7B baseline (roughly a 4.8x speedup)
  • Energy Efficiency: 10x improvement

Accuracy Across Tasks

| Category         | Correctness (%) |
|------------------|-----------------|
| Contextual QA    | 97.76           |
| Numeric QA       | 98.53           |
| Rephrasing       | 99.22           |
| Summarization    | 99.62           |
| Title / Keywords | 100.00          |
| Continuation     | 100.00          |

Weighted Average Correctness: 98.53% across all question categories.
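
That figure follows directly from the category mix reported in the dataset section; a quick sanity check (the shares are rounded, so the match is approximate):

```python
# Approximate category shares (contextual, numeric, rephrasing,
# summarization, title/keywords, continuation) and per-category correctness.
shares = [0.56, 0.09, 0.07, 0.07, 0.14, 0.07]
scores = [97.76, 98.53, 99.22, 99.62, 100.0, 100.0]
print(f"{sum(w * s for w, s in zip(shares, scores)):.2f}%")  # -> 98.53%
```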

Comparative Analysis

| System 1 | System 2       | Win (%) | Lose (%) | Tie (%) | Win + Tie (%) |
|----------|----------------|---------|----------|---------|---------------|
| Squid    | AutoCompressor | 95.1    | 0        | 4.9     | 100           |
| Squid    | Qwen2-7B       | 23.6    | 32.2     | 44.2    | 67.8          |

vs. AutoCompressor:

  • 95.1% win rate
  • 4.9% tie rate

vs. Qwen2-7B:

  • 23.6% win rate
  • 44.2% tie rate
  • 67.8% combined win-tie rate

Why Squid Matters

Squid represents a significant advancement in the development of energy-efficient, on-device language models capable of processing long contexts without sacrificing performance. By introducing a new modality and leveraging a dual-decoder architecture, we've achieved substantial improvements in both energy consumption and latency. Here are some key applications and benefits:

Edge Computing

  • Runs advanced AI features on phones without draining the battery
  • Powers smart home devices without constant cloud access
  • Enables AI features on wearables like smartwatches and fitness trackers

Privacy & Security

  • Keeps sensitive business communications completely private
  • Processes medical data while maintaining HIPAA compliance
  • Works reliably in areas with poor internet connectivity

Daily User Benefits

  • Instant responses from virtual assistants, even offline
  • Smoother chatbot conversations in customer service
  • Quick text processing for emails and documents on the go

Get Started with Squid

We invite the AI community to explore Squid and contribute to its development. The model is publicly available at our Hugging Face repository.
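
If you want to try loading it, a standard transformers pattern should be a reasonable starting point. Note that the repository id below is a placeholder; the model card in our Hugging Face repo is the authoritative reference for the real id and usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository id below is a hypothetical placeholder; check the
# Hugging Face repo's model card for the real id and usage instructions.
repo_id = "NexaAIDev/Squid"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```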

Join us in pushing the boundaries of on-device language modeling at nexa.ai. Together, we can make AI more efficient, accessible, and impactful.

Read the paper: "Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models"

Kudos to Alex, Zack, Shuo, Ethan, and the Nexa AI team.

Blog written by Kai.
