On-Device 0.5B LLMs: Voice/Text In, Action Out, Outperforming GPT-4 in Function Calling
We compared different variants of Octopus v2 models (Octopus-0 to Octopus-3, with varying training configurations and dataset sizes) against leading models including GPT-4, GPT-3.5 (with and without RAG), and Llama-7B with RAG.
Our evaluation focused on Android system function calls, then expanded to 20 vehicle function calls and additional tests with the Yelp and DoorDash APIs.
In terms of accuracy, Octopus-0 achieved the highest at 99.524%, outperforming GPT-4 (98.571%) and GPT-3.5 (97.143% without RAG, 98.095% with RAG). Llama-7B-RAG showed the lowest accuracy at 68.095%.
For inference time, Octopus models demonstrated significantly lower latency, around 0.36-0.38 seconds per function call, compared to GPT-4 (1.02s), GPT-3.5 (1.18s without RAG, 1.97s with RAG), and Llama-7B-RAG (13.46s).
Task automation and function calling have long been dominated by large, cloud-based language models. While powerful, these solutions raise concerns about availability, privacy, and cost.
Octopus v2 tackles these issues head-on. We've developed 0.5B and 2B parameter models that match cloud-based AI in function calling while running entirely on local, consumer, and IoT devices.
This blog post focuses primarily on the 2B version of Octopus v2, which we are open-sourcing on Hugging Face.
Until now, deploying large language models (LLMs) for task automation and function calling on edge devices has faced a significant hurdle:
While running small models like Gemma and LLaMA locally offers advantages in responsiveness, privacy, and affordability, their capabilities in task automation and function calling have lagged significantly behind cloud-based frontier models like GPT-4. This performance gap has limited the potential for advanced AI applications on edge devices.
Octopus v2 combines the two-step process of function invocation — function selection and parameter generation — into a unified language model to achieve faster inference speeds and improved system efficiency.
To further enhance accuracy and efficiency, Octopus v2 introduces Functional Tokens: unique tokens added to the model's vocabulary, each corresponding to a specific device operation or action. This transforms function selection into a straightforward single-token classification task and significantly reduces the required context length compared to traditional retrieval-based methods.
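As a rough illustration, here is a minimal sketch (not the released training code) of how functional tokens could be registered with a Hugging Face tokenizer and model; the `<nexa_i>` naming follows the Octopus v2 convention, but everything else in the snippet is an assumption.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One functional token per supported device action (e.g. 20 Android APIs),
# plus the <nexa_end> marker used to terminate a call.
functional_tokens = [f"<nexa_{i}>" for i in range(20)] + ["<nexa_end>"]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})

# Grow the embedding matrix so the new tokens get trainable embeddings.
model.resize_token_embeddings(len(tokenizer))

# Function selection now amounts to predicting a single functional token,
# instead of retrieving and ranking full function descriptions in context.
```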
The model is trained on a dataset that includes function descriptions, allowing it to understand the meaning of these specialized tokens. The prompt template is designed to accommodate single, parallel, and nested function calls. During inference, Octopus v2 uses the special token <nexa_end> to signify the end of a function call, streamlining the process.
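Continuing the sketch above (reusing the `tokenizer` and `model` it defines), inference might look like the following; the prompt wording is paraphrased from the paper's template and the example output is purely illustrative.

```python
query = "Take a selfie with the front camera"

# Paraphrased prompt template; the released model may use slightly different wording.
prompt = (
    "Below is the query from the user. Please call the correct function "
    "and generate the parameters for the call.\n\n"
    f"Query: {query}\n\nResponse:"
)

inputs = tokenizer(prompt, return_tensors="pt")
end_id = tokenizer.convert_tokens_to_ids("<nexa_end>")

# Decoding stops as soon as the model emits <nexa_end>, so a completed call
# such as "<nexa_4>(camera='front')<nexa_end>" costs only a handful of tokens.
output = model.generate(**inputs, max_new_tokens=64, eos_token_id=end_id)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```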
By focusing on a fixed set of actions, Octopus v2 effectively turns function calling into a standard completion task. As a result, even smaller models can efficiently perform complex operations on edge devices.
The dataset comprises 20 carefully selected Android APIs, chosen based on usability, usage frequency, and technical implementation complexity. These APIs are organized into three categories: Android system APIs, Android app APIs, and Android smart device management APIs.
The dataset creation process, using the selected APIs, involves three key phases: (1) generating relevant queries and their associated function-call arguments with Google Gemini, (2) developing irrelevant queries to create negative samples, and (3) verifying each generated function call and regenerating it if necessary. This pipeline yields a balanced, high-quality dataset for training, validation, and testing that reflects real-world use cases.
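The sketch below outlines that three-phase pipeline using the google-generativeai SDK; the prompts, the Gemini model name, and the `verify_call()` helper are assumptions for illustration, not the production pipeline.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

def gen(prompt: str) -> str:
    return gemini.generate_content(prompt).text

def build_examples(api_spec: dict, all_api_names: list[str]) -> list[dict]:
    examples = []

    # Phase 1: a relevant query plus its function-call arguments.
    raw = gen(
        "Given this API description, write one realistic user query and the "
        "exact function call (name and arguments) that answers it, as JSON "
        'with keys "query" and "call":\n' + json.dumps(api_spec)
    )
    sample = json.loads(raw)

    # Phase 3: verification; drop (or regenerate) calls that do not match the spec.
    if verify_call(sample["call"], api_spec):  # verify_call is a hypothetical checker
        examples.append({"query": sample["query"], "label": sample["call"]})

    # Phase 2: an irrelevant query becomes a negative sample.
    neg = gen("Write a user request unrelated to any of: " + ", ".join(all_api_names))
    examples.append({"query": neg, "label": "<no_function>"})

    return examples
```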
For Octopus v2 2B, we use Google Gemma-2B as the pretrained base and train it with both full-model fine-tuning and LoRA (Low-Rank Adaptation).
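For the LoRA path, a minimal PEFT setup might look like the sketch below; the rank, target modules, and other hyperparameters are common defaults, not the values used for the released model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Functional tokens must exist before training so their embeddings can be learned.
tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"<nexa_{i}>" for i in range(20)] + ["<nexa_end>"]}
)
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # keep new token embeddings trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapters and the saved modules are trainable
```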