Run Multimodal AI Models on Your Local Devices
After releasing the Octopus series of models, we tried running AI models on laptops, Android phones, and iPhones. To support multi-modality, we experimented with large vision-language models (VLMs), text-to-speech (TTS) models, and automatic speech recognition (ASR) models. We tried multiple solutions, including llama.cpp, Ollama, ONNX Runtime, MLC-LLM, MLX, and more, but came away frustrated. First, most on-device serving frameworks couldn't support tasks like image generation, ASR, and TTS. Second, there were many different file formats, such as GGUF and ONNX, with no unified solution. Moreover, most inference engines were CPU-only, with limited CUDA and Metal support, resulting in response delays and rapid battery drain. This wasn't just an inconvenience; it was a dealbreaker for on-device AI.
It became clear that existing tools couldn't handle real-time processing efficiently, suffered from poor battery performance, and relied too heavily on constant internet access. That's what sparked the development of Nexa SDK. We set out to create a toolkit that could handle real-world applications without compromising speed, efficiency, or privacy, making on-device AI truly practical, whether on your phone, laptop, or other edge devices.
Step 1: Download and Install Nexa SDK as a Python Package with CLI
Setting up Nexa SDK is straightforward! Choose your device and copy the matching install command from GitHub or from our local AI model hub interface, then run it in your terminal.
Here is an example command that installs Nexa SDK with GPU (CUDA) support in Windows PowerShell:
$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
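If you don't need GPU acceleration, the CMake flags above can be dropped. A minimal CPU-only install of the same package (flags other than the package name are conventional pip options, not Nexa-specific) looks like:

```shell
# CPU-only install; skips the CUDA build flags shown above
pip install nexaai --prefer-binary --no-cache-dir
```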
See the documentation for guidance on choosing between the CPU and GPU versions.
Step 2: Build with a Simple CLI
Running models with Nexa SDK is simple! Here is an example of running Gemma 1.1 2B:
nexa run gemma-1.1-2b-instruct:q4_0
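The same CLI can also be driven programmatically. Below is a minimal sketch that builds a `nexa run` invocation and launches it with Python's standard library; the `build_nexa_cmd` helper is our own illustration, not part of the SDK:

```python
import shutil
import subprocess


def build_nexa_cmd(model: str, streamlit: bool = False) -> list:
    """Build a `nexa run` invocation for a model tag as shown in the docs."""
    cmd = ["nexa", "run", model]
    if streamlit:
        cmd.append("-st")  # launch the local web UI alongside the model
    return cmd


cmd = build_nexa_cmd("gemma-1.1-2b-instruct:q4_0")
# Only invoke the CLI if it is actually installed on this machine.
if shutil.which("nexa"):
    subprocess.run(cmd, check=True)
```

This is handy for scripting model launches from a larger application, while keeping the CLI itself as the single entry point.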
Here are example commands for running different kinds of models with Nexa SDK:
nexa run llama3.1                  # text generation
nexa run llama3.1 -st              # text generation with the local Streamlit UI
nexa run faster-whisper-tiny -st   # speech-to-text (ASR)
nexa run llava-llama3 -st          # vision-language chat
nexa run lcm-dreamshaper           # image generation
nexa server llama3.1               # start a local server
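Once `nexa server` is running, you can query it over HTTP. The sketch below assumes an OpenAI-style completion endpoint on localhost port 8000; the exact path, port, and payload fields should be checked against the project documentation:

```python
import json
import urllib.request

# Assumed endpoint and port; verify against the Nexa SDK server docs.
SERVER_URL = "http://localhost:8000/v1/completions"


def make_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build a minimal completion request in the OpenAI-compatible style."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}


def complete(prompt: str) -> str:
    """POST a prompt to the local server and return the generated text."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(make_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# With the server running, call e.g.:
#   complete("Explain on-device AI in one sentence.")
```

Because everything stays on localhost, no data leaves the device.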
For more information and guidance, visit https://github.com/NexaAI/nexa-sdk.
| Feature | Nexa SDK | Ollama | Optimum | LM Studio |
|---|---|---|---|---|
| GGML Support | ✅ | ✅ | ❌ | ✅ |
| ONNX Support | ✅ | ❌ | ✅ | ❌ |
| Text Generation | ✅ | ✅ | ✅ | ✅ |
| Image Generation | ✅ | ❌ | ❌ | ❌ |
| Vision-Language Models | ✅ | ✅ | ✅ | ✅ |
| Text-to-Speech | ✅ | ❌ | ✅ | ❌ |
| Server Capability | ✅ | ✅ | ✅ | ✅ |
| User Interface | ✅ | ❌ | ❌ | ✅ |
Here are some examples using Nexa SDK:
Using the Nexa SDK, this project creates a local interactive AI character that supports voice input, voice output, and local profile-image generation, all powered by the Llama3 Uncensored model and all without an internet connection.
In this example, the Nexa SDK powers a sophisticated financial query system with on-device processing to ensure data privacy. Key features include adjustable parameters such as model selection, temperature, max tokens, top-k, and top-p, allowing responses to be fine-tuned to user needs.
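The tunable parameters mentioned above can be grouped into a single configuration object. The sketch below is a hypothetical container whose field names mirror common sampling options; it is not an actual Nexa SDK API:

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerationConfig:
    # Hypothetical config mirroring the parameters listed above;
    # field names follow common sampling conventions, not a specific Nexa API.
    model: str = "llama3.1"
    temperature: float = 0.7
    max_tokens: int = 256
    top_k: int = 40
    top_p: float = 0.95


# Lower temperature for more deterministic answers to financial queries.
cfg = GenerationConfig(temperature=0.2)
payload = asdict(cfg)  # ready to serialize into a request body
```

Keeping the knobs in one dataclass makes it easy to expose them in a UI and serialize them into a request.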
Go to Nexa SDK Examples to explore and contribute more use case examples!
Benchmark toolkit
A benchmark toolkit will be introduced to help users evaluate and optimize model performance.
Mobile and browser support
Support for mobile and browser platforms will expand the SDK's accessibility and usability across different devices.
More multimodal support (audio/image/video) and more integration with other tools (OpenWebUI/Mem0)
Increased multimodal support will enable integration of audio, image, and video capabilities, broadening the range of applications. Additionally, the SDK will offer enhanced integration with other tools, such as OpenWebUI and Mem0, to streamline workflows and improve interoperability. These developments aim to make the Nexa SDK more versatile and user-friendly.
More exciting use case examples that showcase the capabilities of Nexa SDK
The roadmap also includes more use case examples of running Qwen 2.5, LLaMA, Phi-3.5, Whisper, Flux, and Stable Diffusion models efficiently on laptops, mobile devices, and edge devices using Nexa SDK, showcasing its versatility.
Follow us on Twitter and join our Discord to stay updated with release notes and be part of the discussion.
For collaboration opportunities, contact us at: octopus@nexa.ai.