Tracked Repositories
131 open-source AI inference repositories across 61 organizations.
HuggingFace
HuggingFace Transformers — state-of-the-art NLP/ML model library (~140K stars; pipeline sketch below)
HuggingFace Diffusers — diffusion model inference & training (Stable Diffusion, Flux, etc.)
HuggingFace Candle — minimalist Rust ML framework for inference; targets browser WASM and GPU, zero Python dependency
HuggingFace TGI — LLM serving (archived March 2026, read-only)
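A minimal sketch of the Transformers pipeline API referenced above; the task string and input text are illustrative, and a small default model is downloaded on first use.

    # Sketch: Transformers pipeline API; assumes `pip install transformers`.
    from transformers import pipeline

    # "sentiment-analysis" pulls a small default model; any task or
    # explicit model id works the same way.
    classifier = pipeline("sentiment-analysis")
    print(classifier("This serving engine is impressively fast."))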
TensorFlow
TensorFlow — industry-standard deep learning framework with XLA compilation backend
TensorFlow Serving — high-performance gRPC/REST serving for TF models (multi-version, canary, batching)
TFLite Micro — TensorFlow Lite for microcontrollers and embedded devices
Ollama
User-friendly local LLM runner built on llama.cpp (~167K stars)
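Ollama also exposes a local REST API (default port 11434); a rough sketch of a non-streaming call, assuming a model named "llama3" has already been pulled:

    # Sketch: POST to a local Ollama server; stdlib only.
    import json
    import urllib.request

    body = json.dumps({
        "model": "llama3",   # assumes `ollama pull llama3` was run
        "prompt": "Why is the sky blue?",
        "stream": False,     # one JSON object instead of a chunk stream
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["response"])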
ggml-org
llama.cpp — high-performance LLM inference in C/C++ (CPU + GPU)
whisper.cpp — high-performance Whisper speech recognition in C/C++
Open WebUI
Self-hosted ChatGPT alternative with built-in RAG, offline-capable (~104K stars)
Meta / PyTorch
PyTorch — primary ML framework; torch.compile + AOTInductor for production inference optimization (sketch below)
ExecuTorch — PyTorch's portable execution framework for on-device inference
TorchServe — production PyTorch model serving (archived August 2025)
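A hedged sketch of the torch.compile inference path named above; the toy module stands in for a real model:

    # Sketch: torch.compile JIT-compiles the module via TorchInductor on
    # the first call; subsequent calls reuse the compiled artifact.
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 4),
    ).eval()
    compiled = torch.compile(model)

    with torch.inference_mode():
        out = compiled(torch.randn(8, 16))
    print(out.shape)  # torch.Size([8, 4])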
DeepSeek AI
Reference inference code for DeepSeek-V3 (671B MoE); includes FP8 training framework
vLLM Project
vLLM — most widely adopted open-source LLM serving engine; PagedAttention, continuous batching (sketch below)
vLLM community plugin for Intel Gaudi accelerators
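A minimal offline-batch sketch of vLLM's Python API; the model id is an arbitrary small placeholder:

    # Sketch: vLLM offline generation; PagedAttention and continuous
    # batching are handled inside the engine.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # placeholder model id
    params = SamplingParams(temperature=0.8, max_tokens=64)
    for out in llm.generate(["The key idea behind PagedAttention is"], params):
        print(out.outputs[0].text)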
Nomic AI
GPT4All — desktop AI app + SDK for running LLMs locally (~73K stars)
Google AI Edge
MediaPipe — cross-platform ML pipeline framework (vision, audio, NLP)
AI Edge model gallery
LiteRT for language model inference
Official Gemma model cookbook — recipes, fine-tuning, deployment guides
Sample apps using MediaPipe
XNNPACK — highly optimized neural network operators library (ARM, x86, WASM)
LiteRT — Google's Lite Runtime (successor to TensorFlow Lite; sketch below)
Model Explorer — model visualization and exploration tool
ai-edge-torch — LiteRT integration with PyTorch
Sample code for LiteRT
Quantization tooling for AI Edge models
Sample models for AI Edge
AI Edge APIs — upstream repo deleted (404), local copy retained
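A sketch of running a .tflite model with the long-standing Interpreter API (LiteRT ships a compatible Interpreter under the ai_edge_litert package); "model.tflite" is a placeholder path:

    # Sketch: load, allocate, set input, invoke, read output.
    import numpy as np
    import tensorflow as tf

    interp = tf.lite.Interpreter(model_path="model.tflite")  # placeholder
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    interp.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interp.invoke()
    print(interp.get_tensor(out["index"]).shape)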
Apple / ML-Explore
MLX — array framework for ML on Apple silicon (Python; sketch below)
mlx-examples — example models and applications using MLX
Reverse engineering of the Apple Neural Engine (ANE): hardware ops, memory layout, firmware interactions
coremltools — tools for converting & running models with Core ML
mlx-lm — LLM inference and fine-tuning with MLX
mlx-swift-examples — example apps using MLX Swift
mlx-swift — Swift bindings for MLX
mlx-data — efficient data loading for MLX
LLM inference in Swift via MLX
mlx-c — C bindings for MLX
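A tiny sketch of MLX's lazy, NumPy-like Python API (see the MLX entry above); shapes are arbitrary:

    # Sketch: MLX builds a lazy graph; mx.eval materializes it on the
    # default device (the GPU on Apple silicon).
    import mlx.core as mx

    a = mx.random.normal((4, 8))
    b = mx.random.normal((8, 2))
    c = mx.matmul(a, b)  # no computation yet
    mx.eval(c)           # forces evaluation
    print(c.shape)       # (4, 2)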
Oobabooga
Gradio web UI for LLMs — multi-backend (llama.cpp, ExLlamaV2, transformers) (~43K stars)
Mudler (LocalAI)
Free, open-source OpenAI drop-in replacement — runs locally, no GPU required (~36K stars)
BerriAI
Unified OpenAI-compatible proxy for 100+ LLM providers (vLLM, Ollama, Bedrock, Azure, etc.)
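A sketch of LiteLLM's single completion() entry point; the provider-prefixed model string is an assumption, and the same OpenAI-shaped call routes to any configured backend:

    # Sketch: one call shape for many providers; swap the model string
    # for "gpt-4o", "bedrock/...", a vLLM endpoint, etc.
    from litellm import completion

    resp = completion(
        model="ollama/llama3",  # assumes a local Ollama serving this model
        messages=[{"role": "user", "content": "One line on continuous batching."}],
    )
    print(resp.choices[0].message.content)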
Exo Explore
Run LLMs distributed across heterogeneous devices (Mac, iPhone, etc.)
Ray Project
Distributed AI compute engine; Ray Serve handles online and async batch inference (~39K stars)
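A hedged sketch of a Ray Serve HTTP deployment; the echo handler stands in for a real model:

    # Sketch: a Ray Serve deployment served over HTTP on port 8000.
    from ray import serve
    from starlette.requests import Request

    @serve.deployment
    class Echo:
        async def __call__(self, request: Request) -> dict:
            return {"echo": await request.json()}

    serve.run(Echo.bind())  # POST JSON to http://127.0.0.1:8000/
    # Keep the process alive; serving stops when the driver exits.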
DeepSpeed AI
Microsoft DeepSpeed — distributed training and inference (ZeRO, MII, FastGen)
Microsoft / ONNX
ONNX Runtime — Microsoft's cross-platform, high-performance ONNX inference engine (sketch below)
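A minimal ONNX Runtime sketch; "model.onnx" is a placeholder, and dynamic dimensions are naively pinned to 1:

    # Sketch: CPU inference session over an exported ONNX graph.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    info = sess.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in info.shape]  # pin dynamic dims
    outputs = sess.run(None, {info.name: np.zeros(shape, dtype=np.float32)})
    print([o.shape for o in outputs])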
LM-Sys
FastChat — LLM serving framework and home of Chatbot Arena (~37K stars)
JAX (Google DeepMind)
Composable NumPy transformations (JIT, grad, vmap) compiled via XLA to GPUs and TPUs — primary DeepMind research/production runtime
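The three transformations compose; a small sketch computing a jit-compiled, per-example gradient:

    # Sketch: grad of a scalar loss, vmapped over a batch, jit-compiled.
    import jax
    import jax.numpy as jnp

    def loss(w, x):
        return jnp.sum((x @ w) ** 2)

    # Map over the batch axis of x only; w is shared across examples.
    grad_fn = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))
    w = jnp.ones((3, 2))
    xs = jnp.ones((5, 4, 3))
    print(grad_fn(w, xs).shape)  # (5, 3, 2): one gradient per example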
Miscellaneous
MLC LLM — MLC's universal LLM deployment engine (multi-backend)
Tile-based ML language and compiler
Community on-device LLM project
vLLM-style inference on Apple silicon via MLX
Tencent
ncnn — high-performance neural network inference for mobile (Android/iOS)
TNN (Tencent Neural Network) — mobile and edge inference
NVIDIA
TensorRT-LLM — NVIDIA's optimized LLM inference library (GPU)
TensorRT — NVIDIA's high-performance deep learning inference SDK (GPU)
C++ LLM/VLM inference runtime for Jetson and NVIDIA edge devices
SGLang
High-throughput LLM/VLM serving with RadixAttention and structured generation
Mozilla AI
Llamafile — single-file LLM executables via Cosmopolitan Libc; zero install, all platforms (~21K stars)
Triton Language (OpenAI)
Python-like GPU kernel language used by vLLM's FlashAttention kernels and PyTorch's TorchInductor backend (sketch below)
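A hedged vector-add sketch in the Triton language (requires an NVIDIA GPU); it is illustrative, not a kernel taken from vLLM or TorchInductor:

    # Sketch: each program instance handles one BLOCK-sized tile.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n  # guard the ragged final tile
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    n = 4096
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    assert torch.allclose(out, x + y)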
MLC AI
WebLLM — high-performance LLM inference in web browsers via WebGPU
KVCache AI
KTransformers — CPU-GPU hybrid inference; runs DeepSeek 671B on 14 GB VRAM + 382 GB DRAM with a large reported speedup over llama.cpp
Alibaba
MNN — Alibaba's neural network inference framework for mobile & edge
Apache
Apache TVM ML compiler — auto-tunes models for any hardware target
Apache TVM Foreign Function Interface for deep learning compilation
Blaizzy (Community MLX)
mlx-audio — audio models (TTS, ASR) with MLX
mlx-vlm — vision-language models on Apple silicon via MLX
Swift audio inference using MLX
Text embedding models with MLX
Video model inference with MLX
RunAnywhere
RunAnywhere SDKs for on-device inference deployment
RunAnywhere CLI tool
OpenVINO Toolkit / Intel
OpenVINO — Intel's toolkit for optimizing & deploying deep learning on Intel hardware (sketch below)
Neural Network Compression Framework — quantization, pruning, sparsity for OpenVINO
OpenVINO GenAI — generative AI layer with speculative decoding & KV-cache opt
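A sketch of the OpenVINO 2.x Python API; the IR path and input shape are placeholder assumptions:

    # Sketch: read an IR model, compile for CPU, run one inference.
    import numpy as np
    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")            # placeholder IR path
    compiled = core.compile_model(model, "CPU")
    dummy = np.zeros([1, 3, 224, 224], np.float32)  # assumed input shape
    result = compiled(dummy)
    print(result[compiled.output(0)].shape)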
K2 / Next-gen ASR
sherpa-onnx — ONNX-based runtime for ASR, TTS, VAD, and keyword spotting
Intel
Intel IPEX-LLM — local LLM acceleration on Intel hardware (archived Jan 2026, read-only)
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity
jundot
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
Mistral AI
mistral-inference — official minimal inference library for all Mistral models (7B, Mixtral, Pixtral)
Triton Inference Server
NVIDIA Triton — production multi-model inference server (HTTP/gRPC, multi-backend)
Dusty-NV (NVIDIA Jetson)
jetson-inference — DNN inference library & tutorials for NVIDIA Jetson
BentoML
Unified serving framework: real-time APIs, task queues, batching, multi-model chains
Nexa AI
Unified SDK for running LLMs and multimodal models locally
InternLM / Shanghai AI Lab
LMDeploy — high-throughput LLM serving with TurboMind engine (C++/CUDA)
PaddlePaddle (Baidu)
Paddle Lite — lightweight inference engine for mobile & embedded from PaddlePaddle
AI Dynamo (NVIDIA)
Datacenter-scale distributed inference serving framework (Rust + Python, disaggregated prefill/decode, engine-agnostic)
ArgMax
WhisperKit — on-device Whisper inference for Apple platforms (Swift)
Python tooling for WhisperKit model optimization
On-device AI benchmarking framework
Swift playground for ArgMax SDK
Osaurus
Native macOS AI agent harness in Swift — any model, persistent memory, autonomous execution, MCP server, MLX + Apple Neural Engine, fully offline
Cactus Compute
Cactus core edge inference framework
React Native bindings for Cactus
Flutter bindings for Cactus
Kotlin/Android bindings for Cactus
Demo chat app using Cactus
turboderp (ExLlamaV2)
High-performance EXL2-quantized inference for consumer NVIDIA GPUs
OpenNMT
CTranslate2 — fast C++ inference for Transformer models; INT8/INT16 CPU quantization, multi-platform
OpenXLA
XLA — compiler for JAX, TF, PyTorch targeting GPU, TPU, and CPU from a unified IR
Luminal AI
Rust-based deep learning compiler with a small static graph IR for fast, portable inference (CUDA, Metal, CPU)
Liquid AI
Examples, tutorials and apps for Liquid AI LFM + LEAP SDK
Speech-to-Speech audio models by Liquid AI
Minimal fine-tuning repo for LFM2, fully open-source
Example apps for LeapSDK
Liquid AI documentation
Fluid Inference
On-device audio inference framework
Fluid Inference core runtime
Rust text processing library for inference
Try Mirai
Uzu — Mirai's on-device inference runtime
Mirai's LLaMA-based on-device model
Swift SDK for Uzu
UbiquitousLearning
mllm — multimodal LLM inference framework for mobile & edge
Qualcomm
Qualcomm AI Hub Models — state-of-the-art ML models optimized for Qualcomm Snapdragon NPU/DSP/QNN deployment
Sample apps and tutorials for deploying models on Qualcomm hardware (TFLite, ONNX, QNN)
ARM Software
Arm NN — neural network inference SDK for Arm CPUs and Mali GPUs
AMD ROCm
AITER — AI Tensor Engine for ROCm; centralized repo for high-performance AI operators on AMD Instinct GPUs
MIGraphX — AMD's graph inference engine for MI-series GPUs
ROCm fork of FlashAttention with Composable Kernel (CK) and Triton backends
NimbleEdge
NimbleEdge's deliteAI on-device inference framework
NimbleEdge fork of ExecuTorch with edge optimizations
Picovoice
picoLLM — Picovoice's on-device LLM inference engine
Zetic AI
MLange sample applications
MLange extension library
MLange SDK documentation
iOS extension framework for MLange
iOS framework for MLange