Squid: Revolutionizing On-Device Language Models for Long Contexts


TL;DR

  • Squid is a groundbreaking language model developed by Nexa AI to efficiently process long pieces of text on devices with limited resources.
    • Treats lengthy text as a new type of data, similar to how models handle images and text together.
    • Uses a two-part architecture with a small and a large model working together.
    • Achieves a 10x improvement in energy efficiency.
    • Offers a 5x reduction in processing time compared to traditional methods.
    • Maintains high accuracy across various tasks involving long texts.
  • The model is publicly available on our Hugging Face repository.

Introduction

At Nexa AI, we're committed to pushing the boundaries of what's possible with on-device language models. With the growing demand for privacy, reduced latency, and offline functionality, deploying language models directly on devices has become increasingly important.

Mobile devices have limited computational resources and battery life. Processing long contexts requires substantial memory and computational power, which can rapidly drain battery life and degrade user experience due to slow response times. This is particularly problematic for real-time applications like voice assistants and interactive chatbots.

Why did we build Squid? We saw an urgent need for a solution that maintains the accuracy and capabilities of language models while significantly reducing their energy footprint and improving response times on resource-constrained devices. Our goal was to enable sophisticated AI functionalities on edge devices without compromising performance or draining battery life.

Squid: A New Approach to Handling Long Texts

Viewing Long Text as a New Type of Data

Inspired by models that process both images and text, we started treating long pieces of text as a separate kind of data. This allowed us to apply specialized techniques to handle lengthy texts more efficiently.

Two-Part Model Architecture

Squid uses a unique architecture consisting of two interconnected models:


1. Small Model:

  • Size: 0.5 billion parameters (a measure of model complexity).
  • Role: Compresses long texts into concise representations.
  • Function: Acts like a summarizer, capturing the essence of the long text without needing to process every word in detail.

2. Large Model:

  • Size: 7 billion parameters.
  • Role: Takes the compressed summary from the small model, along with the user's question or prompt, to generate accurate and relevant responses.
  • Function: Handles the heavy lifting of understanding the user's request and providing detailed answers.

This setup allows Squid to process long texts much more efficiently, as the large model doesn't need to handle the entire lengthy text directly.
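
To see why this matters, here's a rough back-of-the-envelope sketch in Python. The token counts, memory size, quadratic cost model, and the 0.5/7 per-token scaling below are illustrative assumptions, not measurements from Squid.

```python
# Back-of-the-envelope illustration of why the split helps. Every number here
# (token counts, memory size, the quadratic cost model, the 0.5/7 per-token
# scaling) is an assumption for exposition, not a measurement from Squid.

T_long   = 2000   # tokens in the long context (assumed)
T_prompt = 50     # tokens in the user prompt (assumed)
n_memory = 32     # compressed memory tokens produced by the small model (assumed)

# Self-attention work grows roughly with the square of the sequence length,
# and the 0.5B model is far cheaper per token than the 7B model.
baseline_cost = (T_long + T_prompt) ** 2              # 7B model attends over everything
squid_cost    = (T_long ** 2) * (0.5 / 7)             # 0.5B model reads the long text
squid_cost   += (n_memory + T_prompt) ** 2            # 7B model reads only the compressed summary

print(f"baseline ~ {baseline_cost:,} cost units")
print(f"squid    ~ {squid_cost:,.0f} cost units (~{baseline_cost / squid_cost:.0f}x fewer)")
```

This crude estimate ignores generation length, constant overheads, and memory bandwidth, so it overstates the gain; the measured end-to-end results reported below show a 4.79× latency improvement.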

Memory Tokens

To help the models communicate effectively, we introduced special "memory tokens." Think of these as placeholders or bookmarks that capture important information from the long text. The small model creates these memory tokens, which the large model then uses to understand the context and generate appropriate responses.

Bridging the Models

A special component called the "projector" helps translate information between the small and large models. It ensures that the compressed summaries and memory tokens created by the small model are in a format that the large model can easily understand.
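
Here is a minimal PyTorch sketch of how the memory tokens and projector could fit together. The module structure, hidden sizes, and number of memory tokens are assumptions chosen for illustration; they are not Squid's released configuration.

```python
# Minimal sketch of the compressor -> memory tokens -> projector path.
# Hidden sizes, layer choices, and n_memory are assumptions, not Squid's actual config.
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Stand-in for the small 0.5B model: appends learned memory tokens to the
    long-context embeddings and keeps only their final hidden states."""
    def __init__(self, d_small=896, n_memory=32):
        super().__init__()
        self.memory_tokens = nn.Parameter(torch.randn(n_memory, d_small) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_small, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, context_embeds):                 # (B, T_long, d_small)
        B = context_embeds.size(0)
        mem = self.memory_tokens.expand(B, -1, -1)     # (B, n_memory, d_small)
        hidden = self.encoder(torch.cat([context_embeds, mem], dim=1))
        return hidden[:, -mem.size(1):]                # keep only the memory-token slots

class Projector(nn.Module):
    """Maps the small model's memory-token states into the large model's embedding space."""
    def __init__(self, d_small=896, d_large=3584):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_small, d_large), nn.GELU(), nn.Linear(d_large, d_large)
        )

    def forward(self, memory_states):                  # (B, n_memory, d_small)
        return self.mlp(memory_states)                 # ready to prepend to the 7B model's inputs

# Toy usage: one "document" of 1,000 token embeddings is compressed to 32 memory vectors.
compressor, projector = ContextCompressor(), Projector()
ctx = torch.randn(1, 1000, 896)
memory = projector(compressor(ctx))
print(memory.shape)                                    # torch.Size([1, 32, 3584])
```

In this sketch the projector is a simple two-layer MLP, mirroring how vision-language models bridge an image encoder to a text decoder; Squid's actual bridging layer may differ.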

Our Training Process

To make Squid as effective as possible, we trained it in three stages:

1. Restoration Training

Goal: Teach the large model to reconstruct the original long text from the compressed summaries provided by the small model.

How it works: The small model summarizes the long text, and the large model tries to rebuild the original text from this summary. This ensures that important details aren't lost in the compression process.

2. Continual Training

Goal: Improve the model's ability to continue a piece of text seamlessly.

How it works: We feed the models a part of the text and train them to generate the next part. This helps the model produce coherent continuations, making it great for tasks like story generation or extending conversations.

3. Instruction Fine-Tuning

Goal: Enhance the model's ability to follow user instructions and answer questions accurately.

How it works: We fine-tune the model using a large set of question-and-answer pairs, ensuring it can provide helpful and relevant responses across various topics.
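
To make the three stages concrete, here is a small sketch of how training examples for each stage might be assembled. The field names, the split heuristic, and the loss-masking convention are illustrative assumptions, not the actual training pipeline.

```python
# Sketch of how training examples for the three stages might be assembled.
# Field names, the split heuristic, and the loss-masking convention are
# assumptions for illustration, not the actual Squid training pipeline.

def restoration_example(long_text: str) -> dict:
    # Stage 1: the small model compresses long_text; the large model must
    # reproduce the original text from the memory tokens alone.
    return {"context": long_text, "prompt": "", "target": long_text}

def continuation_example(long_text: str) -> dict:
    # Stage 2: compress the first part, train the large model to generate the rest.
    split = len(long_text) // 2
    return {"context": long_text[:split], "prompt": "", "target": long_text[split:]}

def instruction_example(long_text: str, question: str, answer: str) -> dict:
    # Stage 3: compress the context, condition on the user's question, and apply
    # the loss only to the answer tokens (the prompt itself is not a target).
    return {"context": long_text, "prompt": question, "target": answer}

# In all three stages the "context" field goes through the small model and projector,
# while "prompt" and "target" are ordinary tokens for the 7B model; the next-token
# cross-entropy loss is computed on "target" only.
```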

In-Depth Look at Our Testing Dataset

A crucial part of developing Squid was rigorously testing it to ensure it performs well across different tasks.

Composition of the Dataset

  • Total Samples: 3,740 samples, each consisting of a long text, a user prompt, and an expected response.
  • Source: Extracted from a well-known dataset called Prompt-with-Context (PWC).
  • Selection Criteria:
    • We focused on samples where the long text was less than 512 words, matching Squid's optimal input size.
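
The selection step amounts to a simple word-count filter, sketched below. The sample structure (dicts with context, prompt, and response fields) is an assumption about how the PWC data is organized, used only to illustrate the filtering.

```python
# Sketch of the selection step: keep only samples whose long text is under 512 words.
# The sample fields below are assumptions about the PWC data layout.

def select_eval_samples(samples: list[dict], max_words: int = 512) -> list[dict]:
    return [s for s in samples if len(s["context"].split()) < max_words]

# Toy usage
samples = [
    {"context": "short text " * 10, "prompt": "Summarize the above text.", "response": "..."},
    {"context": "word " * 600,      "prompt": "Rephrase the above text.",  "response": "..."},
]
kept = select_eval_samples(samples)
print(len(kept))   # 1 -- the 600-word context is filtered out
```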

Types of Questions Tested

We wanted to make sure Squid could handle a variety of tasks, so we included six types of questions:

  1. Contextual Questions (~56% of samples):
    • Questions seeking specific information from the text.
    • Example: "Explain the significance of Red Hat’s acquisition of NooBaa."
  2. Numeric Questions (~9%):
    • Questions requiring precise numerical answers.
    • Example: "What is the overall length and diameter of the Stainless Phantom M2 .30 Cal. Sound Suppression System?"
  3. Rephrasing Tasks (~7%):
    • Requests to rewrite the text in different words.
    • Example: "Rephrase the above text."
  4. Summarization (~7%):
    • Tasks asking for a brief summary of the text.
    • Example: "Summarize the above text."
  5. Title or Keywords Extraction (~14%):
    • Requests to generate a title or extract key terms.
    • Example: "Write a title for the above text."
  6. Continuation Tasks (~7%):
    • Asking the model to continue the text.
    • Example: "Write a paragraph that follows the above text."

The Results

Performance Gains

Latency Reduction:

  • Squid's Average Inference Time: 4.32 seconds.
  • Baseline (Qwen2-7B) Inference Time: 20.71 seconds.
  • Improvement Factor: 4.79× faster than the baseline.
  • Implication: Significant enhancement in responsiveness, crucial for real-time applications.

Energy Efficiency:

  • Achieved a 10-fold improvement in energy efficiency over conventional methods.
  • Benefit: Prolonged battery life and reduced energy consumption on devices with limited power resources.

Accuracy Across Tasks

Overall Correctness:

  • Weighted Average Correctness: 98.53% across all question categories.

Category-Specific Performance:

Category            Correctness (%)
Contextual QA       97.76
Numeric QA          98.53
Rephrasing          99.22
Summarization       99.62
Title / Keywords    100.00
Continuation        100.00
  • Noteworthy Observations:
    • Achieved perfect correctness in "Title / Keywords" and "Continuation" categories.
    • Maintained high accuracy in "Numeric QA," which typically demands precise answers.
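
As a quick sanity check, the overall figure can be approximately reproduced from the per-category correctness scores and the rounded category shares quoted earlier:

```python
# Recomputing the weighted average correctness from the per-category numbers
# and the approximate category shares quoted above (shares are rounded).
shares = {  # fraction of the 3,740 test samples per category (approximate)
    "Contextual QA": 0.56, "Numeric QA": 0.09, "Rephrasing": 0.07,
    "Summarization": 0.07, "Title / Keywords": 0.14, "Continuation": 0.07,
}
correctness = {
    "Contextual QA": 97.76, "Numeric QA": 98.53, "Rephrasing": 99.22,
    "Summarization": 99.62, "Title / Keywords": 100.00, "Continuation": 100.00,
}
weighted = sum(shares[c] * correctness[c] for c in shares)
print(f"{weighted:.2f}%")   # ~98.53%, matching the reported overall correctness
```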

Comparative Analysis

System 1   System 2         Win (%)   Lose (%)   Tie (%)   Win + Tie (%)
Squid      AutoCompressor   95.1      0.0        4.9       100.0
Squid      Qwen2-7B         23.6      32.2       44.2      67.8

Against AutoCompressor (based on Llama-2-7b):

  • Results:
    • Squid won 95.1% of the comparisons.
    • Tied in 4.9%, with 0% losses.
  • Interpretation:
    • Suggests that AutoCompressor may overfit its training data.
    • Squid demonstrates superior generalization and consistent performance.

Against Qwen2-7B:

  • Results:
    • Squid won 23.6% of the comparisons.
    • Tied in 44.2% and lost 32.2%.
    • Combined win-tie rate of 67.8%.
  • Significance:
    • Despite using compressed tokens, Squid performs comparably to the larger baseline model.
    • Highlights Squid's ability to maintain performance while reducing computational demands.

Why Squid Matters

Squid represents a significant advancement in the development of energy-efficient, on-device language models capable of processing long contexts without sacrificing performance. By introducing a new modality and leveraging a dual-decoder architecture, we've achieved substantial improvements in both energy consumption and latency.

Implications:

  • Edge Devices:
    • Enhances the capabilities of mobile phones, IoT devices, and wearables.
    • Enables sophisticated AI functionalities without overburdening limited resources.
  • Privacy and Offline Functionality:
    • Supports applications where data privacy is paramount.
    • Allows for reliable operation without constant internet connectivity.
  • User Experience:
    • Reduces latency, leading to more responsive and interactive AI applications.
    • Improves satisfaction in real-time communication tools like voice assistants and chatbots.

Get Started with Squid

We invite the AI community to explore Squid and contribute to its development. The model is publicly available at our Hugging Face repository.

Join us in pushing the boundaries of on-device language modeling at nexa.ai. Together, we can make AI more efficient, accessible, and impactful.