TL;DR
- Efficient: Octopus v3 is less than half the size of the 2B-parameter Octopus v2. It runs efficiently on edge devices, including those as small as a Raspberry Pi.
- Accurate: With its functional tokens, Octopus v3 achieves function-calling accuracy on par with GPT-4V and GPT-4.
- Multimodal: Octopus v3 supports both text and image inputs.
- Multilingual: Octopus v3 understands both English and Chinese.
Introduction
Octopus v3 represents a significant advance in on-device multimodal AI and AI-agent applications. While its predecessor, Octopus v2, focused on text-based interactions and outperformed GPT-4 in function-calling speed and accuracy, v3 extends those capabilities to visual inputs and multilingual support.
This post explores Octopus v3's technical design and assesses its performance in real-world scenarios, comparing this compact, on-device model to larger, cloud-based alternatives like GPT-4V.
Training Methods
We developed Octopus v3 with these key techniques to add multimodal capabilities while keeping the model compact:
- Visual Information Processing: We use CLIP (Contrastive Language-Image Pre-training) for image encoding. CLIP's ability to align visual and textual representations makes it well suited to processing visual data alongside text in a multimodal model (see the encoding sketch after this list).
- Functional Tokens: We retained the functional-token approach from Octopus v2. It represents each specific function as a dedicated token, enabling the model to understand and execute a wide range of tasks efficiently and responsively (see the functional-token sketch below).
- Multi-stage Training: Our process begins with separate training of the causal language model and the image encoder. We then merge these components and perform alignment training to synchronize image and text processing (see the merge sketch below). After integrating functional-token learning from Octopus v2, we apply reinforcement learning with another large language model as the reward model, which refines the model's ability to turn multimodal input (vision + text) into concrete actions.
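To make the encoding step concrete, here is a minimal sketch of image encoding with CLIP via the Hugging Face transformers API. The checkpoint name and input file are assumptions; the report does not specify them.

```python
# Minimal sketch: encoding an image with a CLIP vision tower.
# The checkpoint is an assumption; any CLIP vision variant would work here.
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint
processor = CLIPImageProcessor.from_pretrained(checkpoint)
encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # placeholder input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
patch_embeds = encoder(pixel_values=pixel_values).last_hidden_state
# patch_embeds has shape (1, num_patches + 1, hidden_dim); these are the
# image features the language model consumes after projection.
```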
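The functional-token mechanism can be sketched as follows, assuming a Hugging Face tokenizer. The token names and function registry are illustrative, modeled on the `<nexa_i>` style of Octopus v2, not the exact vocabulary.

```python
# Sketch of functional tokens: each device action becomes one dedicated token,
# so the model selects a function in a single decoding step.
# Token names and the registry below are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in base tokenizer

token_to_function = {
    "<nexa_0>": "send_email",
    "<nexa_1>": "send_text",
    "<nexa_2>": "google_search",
}
tokenizer.add_special_tokens(
    {"additional_special_tokens": list(token_to_function)}
)
# The model's embedding table is then resized to cover the new tokens; at
# inference, the runtime maps an emitted token back to the function to invoke.
```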
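For the merge step, one plausible wiring (an assumption on our part, in the style of LLaVA-like architectures) is a linear projector that maps CLIP patch embeddings into the language model's embedding space:

```python
# Sketch of merging the image encoder and causal LM with a projection layer.
# The fusion scheme (prepending projected image tokens) is an assumption.
import torch
import torch.nn as nn

class MultimodalAgent(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g., CLIP vision tower
        self.language_model = language_model            # causal LM with functional tokens
        self.projector = nn.Linear(vision_dim, lm_dim)  # trained during alignment

    def forward(self, pixel_values, text_embeds):
        patches = self.vision_encoder(pixel_values).last_hidden_state
        image_embeds = self.projector(patches)          # map into the LM's space
        fused = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```

During alignment training, a common recipe (again, an assumption here) is to update only the projector while the encoder stays frozen, then unfreeze more of the model in later stages.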
Model Evaluation
To assess Octopus v3's multimodal and function-calling performance, we compared it with a combination of GPT-4V and GPT-4 across 10 common smartphone use cases:
- Send email: Given an image with the text "THE ERA OF ON-DEVICE AI AGENT IS COMING!" and instructions to email about AI progress, Octopus v3 composed a concise email capturing the main idea in both English and Chinese and called for a send email action.
- Send text message: When shown an image of the Golden Gate Bridge, Octopus v3 accurately described the scene for a text message in both English and Chinese and called for a send text action.
- Google search: Presented with an image of the Great Wall of China and asked about its history, Octopus v3 generated accurate search queries in both English and Chinese and called for a search action.
- Amazon purchase: When shown an image of a compact dishwasher, Octopus v3 accurately described the product's color, size, category, and application, and called for a purchase action.
- Smart recycle: Given an image of plastic water bottles, Octopus v3 correctly identified the items and classified their material for recycling.
- Lost and found reporting: When shown an image of a computer mouse, Octopus v3 provided a detailed description of the item in both English and Chinese.
- Interior design suggestions: Presented with an image of a modern living room, Octopus v3 offered style suggestions that aligned with the room's existing aesthetics.
- Instacart shopping: When shown an image of a pineapple and asked to order two, Octopus v3 correctly identified the item and quantity for purchase in both English and Chinese and called for an Instacart purchase action.
- DoorDash ordering: Given an image of Mexican cuisine and instructions to deliver to a specific location, Octopus v3 accurately described the dish and included the delivery address in both English and Chinese and called for a DoorDash ordering action.
- Animal care: Presented with an image of a long-haired dog and asked for care instructions, Octopus v3 correctly identified the animal and the type of care needed (grooming) in both English and Chinese.
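To make "called for an action" concrete, the sketch below shows how a functional-token response might be dispatched on-device. The response format, token, and argument values are illustrative assumptions in the style of Octopus v2, not the model's exact output.

```python
# Illustrative only: dispatching a functional-token response for the email case.
# The response format and argument names are assumptions.
import re

token_to_function = {"<nexa_0>": "send_email"}  # registry from the earlier sketch

response = "<nexa_0>('AI progress', 'The era of on-device AI agents is coming!')<nexa_end>"

match = re.match(r"(<nexa_\d+>)\((.*)\)<nexa_end>", response)
if match:
    token, raw_args = match.groups()
    action = token_to_function[token]  # -> "send_email"
    # The on-device runtime would now parse raw_args and call the local email API.
```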
In all these cases, Octopus v3, with fewer than 1B parameters, demonstrated function-calling accuracy comparable to the cloud-based, large-scale GPT-4V and GPT-4 combination. It handled both English and Chinese inputs proficiently, processing visual and textual information to generate accurate responses and actions.
This versatility across complex multimodal tasks in varied domains underscores Octopus v3's potential as a powerful AI agent for everyday smartphone use. Importantly, Octopus v3 runs entirely on-device and requires no internet connection to operate.
What's Next
[Paper - Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent]
Kudos to <Alex>, <Zack>, and the Nexa AI team.
Blog written by <Kai>.