Nov 15, 2024
[Nov 21, 2024] Omnivision-968M upgraded 🚀 with improved art analysis, scene comprehension, style recognition, color perception, and world knowledge. Preview live on 👉 Hugging Face Space. Model files are updated at 🤗 NexaAIDev/omnivision-968M.
OmniVision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs and is optimized for edge devices. Building on LLaVA's architecture, it features:
(OmniVision generated description for an image)
(OmniVision could assist with memory by looking up images)
(OmniVision analyzed food images and generated recipes)
(OmniVision identified the correct HDMI port location)
Install the Nexa SDK, then run this in your terminal:
nexa run omnivision
Or run it with Streamlit local UI:
nexa run omnivision -st
💻 OmniVision FP16 version requires 988 MB RAM and 948 MB storage space.
OmniVision's architecture consists of three key components:
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
We developed OmniVision through a three-stage training pipeline:
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
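The stage-one freezing rule can be sketched in plain Python so that it runs without a deep learning framework. The `Param` class and the parameter names (`vision_encoder.*`, `projector.*`, `language_model.*`) are hypothetical stand-ins, not names from the OmniVision codebase; in a framework such as PyTorch the same idea amounts to toggling each parameter's `requires_grad` flag.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter with a trainable flag."""
    name: str
    trainable: bool = True


def freeze_all_but(params, trainable_prefix):
    """Mark only parameters whose name starts with `trainable_prefix` as trainable."""
    for p in params:
        p.trainable = p.name.startswith(trainable_prefix)
    return [p.name for p in params if p.trainable]


# Hypothetical parameter names for the three components.
params = [
    Param("vision_encoder.layer0.weight"),
    Param("projector.fc1.weight"),
    Param("projector.fc2.weight"),
    Param("language_model.layer0.weight"),
]

# Stage one: only the projection layer learns; everything else stays frozen.
stage1_trainable = freeze_all_but(params, "projector.")
print(stage1_trainable)
```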
We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images for the model to generate more contextually appropriate responses.
The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections that keep high semantic similarity with the original responses and focus specifically on accuracy-critical elements. Each corrected output and its original form a chosen-rejected pair, with the correction as the chosen response. The fine-tuning thus targets essential improvements to model output without altering the model's core response characteristics.
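A minimal sketch of how such pairs might be assembled, assuming the teacher's correction is the chosen response and the base model's output is the rejected one. `difflib.SequenceMatcher` is used here only as a crude, surface-level stand-in for a semantic similarity measure, and the 0.8 threshold is an illustrative assumption; neither comes from the OmniVision pipeline.

```python
from difflib import SequenceMatcher


def build_preference_pairs(samples, min_similarity=0.8):
    """Build DPO pairs from (prompt, original, corrected) triples.

    Keeps a pair only if the correction actually changes the text and stays
    close to the original. The 0.8 threshold is an illustrative assumption.
    """
    pairs = []
    for prompt, original, corrected in samples:
        if original == corrected:
            continue  # nothing to learn from an unchanged response
        similarity = SequenceMatcher(None, original, corrected).ratio()
        if similarity >= min_similarity:
            pairs.append({
                "prompt": prompt,
                "chosen": corrected,   # teacher's minimally edited version
                "rejected": original,  # base model's own output
                "similarity": round(similarity, 3),
            })
    return pairs


# Made-up examples: a minimal edit, a full rewrite, and an unchanged response.
samples = [
    ("Describe the image.",
     "A red car is parked next to two bicycles.",
     "A red car is parked next to three bicycles."),
    ("Describe the image.",
     "A dog on a beach.",
     "The photo shows a crowded city street at night with neon signs."),
    ("Describe the image.",
     "A cat sleeps on a sofa.",
     "A cat sleeps on a sofa."),
]

pairs = build_preference_pairs(samples)
print(len(pairs))
```

Only the first sample survives the filter: the second is a rewrite rather than a minimal edit, and the third has no correction at all.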
Processing image tokens creates significant computational overhead in edge deployment of multimodal models. In the standard LLaVA architecture, each image generates 729 tokens (27x27), leading to high latency and computational costs. We developed a reshaping mechanism in the projection stage that transforms image embeddings from [batch_size, 729, hidden_size] to [batch_size, 81, hidden_size*9]. This cuts the token count by 9x without compromising model performance; in our experiments, the compression in fact improved it substantially. Our analysis suggests the improvement stems from how the base Qwen model handles shorter sequences: the compressed format gives it a more concentrated representation of the same information.
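The shape change described above can be illustrated in plain Python. This is a sketch under one assumption: the text does not say which nine embeddings are merged, so the example groups nine consecutive tokens in row-major order, which is what a plain reshape of a contiguous tensor would do. The batch dimension is omitted and `hidden_size = 4` is a toy value.

```python
def compress_tokens(embeddings, group=9):
    """Merge every `group` consecutive token embeddings into one longer vector.

    `embeddings` is a list of tokens, each a list of `hidden_size` floats.
    Returns len(embeddings) // group tokens of size hidden_size * group.
    """
    if len(embeddings) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed = []
    for start in range(0, len(embeddings), group):
        merged = []
        for token in embeddings[start:start + group]:
            merged.extend(token)  # concatenate along the hidden dimension
        compressed.append(merged)
    return compressed


hidden_size = 4  # toy value; the real hidden size is not given here
# One image: 729 tokens (27 x 27), each filled with its own index for tracing.
image_tokens = [[float(i)] * hidden_size for i in range(729)]

compressed = compress_tokens(image_tokens)
print(len(image_tokens), len(image_tokens[0]))  # 729 4
print(len(compressed), len(compressed[0]))      # 81 36
```

The total number of values is unchanged (729 x hidden_size equals 81 x 9 x hidden_size), so the reshape itself discards nothing. What shrinks is the length of the sequence the language model has to attend over.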
Traditional DPO methods can lead to significant shifts in model behavior. Our DPO implementation uses minimal-edit pairs for training. The teacher model makes small, targeted improvements to the base model's outputs while preserving their original structure. This approach ensures precise quality improvements without disrupting the model's core capabilities.
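For reference, the standard DPO objective for a single chosen-rejected pair can be written in a few lines. This is the textbook loss, not necessarily the exact variant used for OmniVision: the value `beta=0.1` and the log-probabilities below are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair of sequence log-probs."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as softplus(-margin), computed stably
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
untrained = dpo_loss(-12.0, -12.5, -12.0, -12.5)

# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-11.0, -13.5, -12.0, -12.5)

print(untrained, improved)
```

With minimal-edit pairs, the chosen and rejected responses share every token up to the first edit. Those tokens contribute identically to both sequence log-probabilities and cancel in the margin, so the training signal comes from the edited region onward.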
The figure below shows how OmniVision performs against nanoLLAVA:
We evaluated OmniVision on a series of benchmark datasets: MM-VET, ChartQA, MMMU, ScienceQA, and POPE.
Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
---|---|---|---|
MM-VET | 27.5 | 23.9 | 49.5 |
ChartQA (Test) | 59.2 | N/A | 73.5 |
MMMU (Test) | 41.8 | 28.6 | 41.1 |
MMMU (Eval) | 39.9 | 30.4 | 41.1 |
ScienceQA (Eval) | 62.2 | 59.0 | N/A |
ScienceQA (Test) | 64.5 | 59.0 | N/A |
POPE | 89.4 | 84.1 | N/A |
On every task where both models were evaluated, OmniVision outperforms nanoLLAVA, previously the world's smallest vision-language model.
OmniVision is in early development, and we are working to address its current limitations:
In the long term, we aim to develop OmniVision as a fully optimized, production-ready solution for edge AI multimodal applications.
Kudos to <Alex>, <Zack> and the Nexa AI team.
Blog written by <Kai>, <Alan>.