
Octopus v2

Nov 12, 2024

TL;DR

  • Octopus v2 is an on-device language model with 0.5B or 2B parameters that outperforms GPT-4 in both accuracy and latency for function calling tasks.
  • It uses "functional tokens" that map user queries directly to device functions, reducing the required context length by 95%.
  • We have released the full model weights for developers to customize and deploy Octopus v2 for a wide range of on-device AI applications. Check it on the On-Device Model Hub.

Evaluation

We compared different variants of Octopus v2 models (Octopus-0 to Octopus-3, with varying training configurations and dataset sizes) against leading models including GPT-4, GPT-3.5 (with and without RAG), and Llama-7B with RAG.

Our evaluation focused on Android system function calls, then expanded to 20 vehicle function calls and additional tests with the Yelp and DoorDash APIs.

In terms of accuracy, Octopus-0 achieved the highest at 99.524%, outperforming GPT-4 (98.571%) and GPT-3.5 (97.143% without RAG, 98.095% with RAG). Llama-7B-RAG showed the lowest accuracy at 68.095%.

For inference time, Octopus models demonstrated significantly lower latency, around 0.36-0.38 seconds per function call, compared to GPT-4 (1.02s), GPT-3.5 (1.18s without RAG, 1.97s with RAG), and Llama-7B-RAG (13.46s).

Introduction

Task automation and function calling have long been dominated by large, cloud-based language models. While powerful, these solutions raise concerns about availability, privacy, and cost.

Octopus v2 tackles these issues head-on. We've developed 0.5B and 2B parameter models that match cloud-based AI in function calling while running locally on consumer and IoT devices.

This blog post focuses primarily on the 2B version of Octopus v2, which we have open-sourced on HuggingFace.
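
As a rough sketch, the released checkpoint can be loaded with Hugging Face transformers. The repo id and prompt wording below are assumptions for illustration; check the model card on the Hub for the exact names and recommended prompt format.

```python
# Minimal sketch: loading the released 2B checkpoint with Hugging Face transformers.
# The repo id and prompt format below are illustrative -- see the model card for the exact usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NexaAIDev/Octopus-v2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Below is the query from the user, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: Take a selfie for me with the front camera\n\nResponse:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```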

Performance Limitations of Edge Devices

Until now, deploying large language models (LLMs) for task automation and function calling on edge devices has faced significant hurdles:

  • Computational Constraints: Edge devices lack the processing power of cloud servers, making it difficult to run complex AI models efficiently.
  • Memory Limitations: State-of-the-art language models often exceed the available memory on most consumer and IoT devices.
  • Power Consumption: Running large AI models can quickly drain battery life, limiting their practical utility on portable devices.
  • Connectivity Dependence: Reliance on cloud processing necessitates a stable internet connection, restricting functionality in areas with poor or no connectivity.

While running small models like Gemma and LLaMA locally offers advantages in responsiveness, privacy, and affordability, their capabilities in task automation and function calling have lagged significantly behind cloud-based frontier models like GPT-4. This performance gap has limited the potential for advanced AI applications on edge devices.

Fast and Accurate Function-calling with Functional Tokens

Functional tokens focus on a fixed set of actions, greatly improving the accuracy and efficiency with which small models take actions.

Octopus v2 combines the two-step process of function invocation — function selection and parameter generation — into a unified language model to achieve faster inference speeds and improved system efficiency.

To further enhance accuracy and efficiency, Octopus v2 introduces functional tokens. These are unique tokens added to the model's vocabulary, each corresponding to a specific device operation or action. This transforms function selection into a straightforward single-token classification task, significantly reducing the required context length compared with traditional retrieval-based methods.
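
A minimal sketch of how such tokens could be registered, assuming a Gemma-2B base and illustrative token names (the actual token names and count used by Octopus v2 may differ):

```python
# Sketch: registering functional tokens (one per device action) as new vocabulary entries.
# The base checkpoint and token names are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# One special token per supported function, plus a terminator for the call.
functional_tokens = [f"<nexa_{i}>" for i in range(20)] + ["<nexa_end>"]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})

# Grow the embedding and output layers so each action becomes a single predictable token.
model.resize_token_embeddings(len(tokenizer))
```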

The model is trained on a dataset that includes function descriptions, allowing it to understand the meaning of these specialized tokens. The prompt template is designed to accommodate single, parallel, and nested function calls. During inference, Octopus v2 uses the special token <nexa_end> to signify the end of a function call, streamlining the process.
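
Continuing the previous sketch, <nexa_end> can act as the stop token at inference time so that generation ends as soon as one complete function call has been produced. The query and the decoded output shown are purely illustrative:

```python
# Sketch: stopping generation at <nexa_end> so the output is exactly one function call.
# Assumes `model` and `tokenizer` from the previous sketch, already extended with functional tokens.
nexa_end_id = tokenizer.convert_tokens_to_ids("<nexa_end>")

query = "Turn on do-not-disturb mode until 7am tomorrow"
prompt = f"Query: {query}\n\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=nexa_end_id,  # generation ends once the function call is complete
)
call = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(call)  # e.g. "<nexa_7>(mode='on', until='07:00')<nexa_end>" -- illustrative output only
```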

By focusing on a fixed set of actions, Octopus v2 effectively turns function calling into a standard completion task. As a result, even smaller models can efficiently perform complex operations on edge devices.

Dataset

The dataset comprises 20 carefully selected Android APIs, chosen based on usability, usage frequency, and technical implementation complexity. These APIs are organized into three categories:

  • Android System API: Includes essential system-level functions for basic mobile operations, such as making calls, texting, setting alarms, modifying screen brightness, creating calendar entries, managing Bluetooth, enabling do-not-disturb mode, and taking photos. Highly sensitive tasks are excluded.
  • Android API: Focuses on APIs from pre-installed Google apps like YouTube, Chrome, Gmail, and Maps. This covers functionalities such as accessing trending news, retrieving weather updates, searching for YouTube content, and map navigation.
  • Android Smart Device Management API: Extends to the Google Home ecosystem, improving smart home device management. Functions include adjusting a Nest Thermostat, managing media playback on Google Nest devices, and controlling door locks.

Figure: Training data pipeline for Octopus v2

The dataset creation process, utilizing the selected APIs, involves three key phases: (1) generating relevant queries and their associated function call arguments with Google Gemini, (2) developing irrelevant queries to create negative samples, and (3) implementing verification to check and, if necessary, regenerate function calls for accuracy. This pipeline ensures a balanced, high-quality dataset for training, validation, and testing that reflects real-world use cases.
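
To make the sample structure concrete, here is a rough sketch of what a positive and a negative example produced by such a pipeline might look like. The field names, function description, and response wording are assumptions for illustration, not the exact schema used internally:

```python
# Illustrative shape of training samples (field names and contents are assumed, not the real schema).
# Positive samples pair a query with a function call terminated by <nexa_end>;
# negative samples map irrelevant queries to a response indicating no function applies.
positive_example = {
    "function_description": (
        "def take_a_photo(camera='back', resolution='1080p'):\n"
        '    """Captures a photo using the specified camera and resolution settings."""'
    ),
    "query": "Take a selfie with the front camera",
    "response": "<nexa_0>(camera='front', resolution='1080p')<nexa_end>",
}

negative_example = {
    "query": "Tell me a joke about octopuses",
    "response": "None of the available functions can handle this request.",
}
```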

For Octopus v2 2B, we use Google Gemma-2B as the pretrained base and train it both with full model training and with LoRA (Low-Rank Adaptation).
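
For reference, a minimal sketch of a LoRA setup on a Gemma-2B base using the peft library. The rank, alpha, and target modules shown are illustrative defaults, not the exact hyperparameters used for Octopus v2:

```python
# Sketch of a LoRA fine-tuning setup on Gemma-2B with peft.
# Hyperparameters below are illustrative defaults, not the values used for Octopus v2.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are updated
```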

What's Next

[Paper - Octopus v2: On-device language model for super agent]

Kudos to <Alex>, <Zack>, and the Nexa AI team.

Blog written by <Kai>.
