
Nexa SDK Tutorial: A Comprehensive On-Device AI Inference Toolkit

Sep 9, 2024

TL;DR

  • Nexa SDK is a comprehensive toolkit for on-device AI deployment
  • Supports ONNX and GGML models
  • Integrated conversion engine for custom GGML quantized models
  • Offers text generation, image generation, vision-language models (VLM), and text-to-speech (TTS) capabilities
  • Features OpenAI-compatible API and Streamlit UI
  • Runs on any Python environment with GPU acceleration support

Why We Built Nexa SDK - A Multimodal On-device AI Inference Toolkit

After releasing the Octopus series models, we tried running AI models on laptops, Android phones, and iPhones. To support multi-modality, we experimented with large vision-language models (VLMs), TTS models, and ASR models. We tried multiple solutions, including llama.cpp, ollama, onnxruntime, MLC-LLM, and MLX, but kept hitting the same frustrations. First, most on-device serving frameworks couldn't handle tasks like image generation, ASR, and TTS. Second, models came in many different file formats, such as GGUF and ONNX, with no unified way to serve them. Moreover, most inference engines were CPU-only, with limited CUDA and Metal support, resulting in slow responses and rapid battery drain. This wasn't just an inconvenience; it was a dealbreaker for on-device AI.

It became clear that existing tools couldn't handle real-time processing efficiently, suffered from poor battery performance, and relied too heavily on constant internet access. That's what sparked the development of Nexa SDK. We set out to create a toolkit that could handle real-world applications without compromising speed, efficiency, or privacy, making on-device AI truly practical, whether on your phone, laptop, or other edge devices.

The Problem Nexa SDK Solves

  1. Multi-modality: text generation, image generation, vision-language models (VLM), and audio processing within a single toolkit
  2. Multi-format: ONNX and GGML models optimized to run on local devices
  3. API support: FastAPI server with an OpenAI-compatible JSON format

Getting Started with Nexa SDK

Step 1: Download and Install Nexa SDK as a Python Package with CLI

Setting up Nexa SDK is straightforward! Choose your device, copy the right CLI command from GitHub or from our local AI model hub interface, and run it in your terminal to install Nexa SDK.

Here is an example command to install Nexa SDK with GPU (CUDA) support in Windows PowerShell:

$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

See the documentation for choosing between the CPU and GPU versions.
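If you don't need GPU acceleration, the plain CPU build can be installed directly from PyPI:

pip install nexaai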

Step 2: Build with simple CLI

Running with Nexa is simple! Here is an example of running Gemma 1.1 2B with Nexa SDK.

nexa run gemma-1.1-2b-instruct:q4_0
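The same model can also be loaded from Python instead of the CLI. The snippet below is a minimal sketch based on our reading of the SDK's nexa.gguf interface; the class and parameter names (NexaTextInference, model_path, create_chat_completion) are assumptions that may vary across versions, so check the SDK documentation for the exact API.

# Minimal sketch of the Python interface (names assumed; verify against the SDK docs)
from nexa.gguf import NexaTextInference

# Load the same quantized Gemma model used in the CLI example above
llm = NexaTextInference(
    model_path="gemma-1.1-2b-instruct:q4_0",
    temperature=0.7,
    max_new_tokens=256,
)

# OpenAI-style chat completion against the locally loaded model
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain on-device AI in one sentence."}]
)
print(response["choices"][0]["message"]["content"])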

Here are some examples of building with different kinds of models using Nexa SDK:

  • Install the SDK and run an LLM

nexa run llama3.1

  • Run an LLM with the Streamlit UI (--streamlit / -st)

nexa run llama3.1 -st

  • Run Faster Whisper (ASR) with the Streamlit UI

nexa run faster-whisper-tiny -st

  • Run a vision-language model with the Streamlit UI

nexa run llava-llama3 -st

  • Run image generation

nexa run lcm-dreamshaper

  • Start the API server

nexa server llama3.1
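Once the server is up, any OpenAI-style client can talk to it. The snippet below is a hedged sketch: the base URL (http://localhost:8000) and the /v1/chat/completions route are assumptions about the defaults, so use the address and routes that nexa server prints on startup.

# Hedged sketch: querying the local OpenAI-compatible server started above.
# The host, port, and route are assumed defaults; adjust them to what `nexa server` reports.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Give me one tip for running LLMs on a laptop."}],
        "temperature": 0.7,
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])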

For more information and guidance, visit https://github.com/NexaAI/nexa-sdk.

Key Features & Multimodal AI Use Case Examples

Compared with toolkits such as Ollama, Optimum, and LM Studio, Nexa SDK covers GGML support, ONNX support, text generation, image generation, vision-language models, text-to-speech, server capability, and a built-in user interface within a single package.

Here are some examples using Nexa SDK:

  • On-device AI Soulmate

    Using the Nexa SDK, this project creates a local, interactive AI character that supports voice input, voice output, and local profile-image generation, all powered by the Llama3 Uncensored model and running entirely without an internet connection.
  • On-device Financial Advisor

    In this example, the Nexa SDK powers a sophisticated financial query system with on-device processing to ensure data privacy. Key features include adjustable parameters such as model selection, temperature, max tokens, top-k, and top-p, allowing for fine-tuned responses based on user needs.

Go to Nexa SDK Examples to explore and contribute more use case examples!

What's Next

  1. Benchmark toolkit
    A benchmark toolkit will be introduced to help users evaluate and optimize model performance.
  2. Mobile and browser support
    Support for mobile and browser platforms will expand the SDK's accessibility and usability across different devices.
  3. More multimodal support (audio/image/video) and more integrations with other tools (OpenWebUI/Mem0)
    Increased multimodal support will add audio, image, and video capabilities, broadening the range of applications. The SDK will also offer deeper integration with tools such as OpenWebUI and Mem0 to streamline workflows and improve interoperability.
  4. More use case examples that showcase the capability of Nexa SDK
    The roadmap also includes more use case examples of running models such as Qwen 2.5, LLaMA, Phi3.5, Whisper, Flux, and Stable Diffusion efficiently on laptops, mobile devices, and edge devices with Nexa SDK.

Follow us on Twitter and join our Discord to stay updated with release notes and be part of the discussion.


For collaboration opportunities, contact us at: octopus@nexa.ai.

Kudos to the Nexa AI team.

Blog written by Zack, Yin, and Ayla.
