Tiny-vLLM – high-performance LLM inference engine in C++ and CUDA
AI/ML
Tiny-vLLM is a lightweight, high-performance inference engine for large language models, implemented in C++ and CUDA for speed and efficiency on GPU hardware. It targets developers and researchers who need a minimal, low-latency alternative to larger frameworks when deploying LLMs in resource-constrained or production environments. The project aims to show that a compact, hand-optimized codebase can rival or exceed the performance of major inference libraries while offering greater transparency and control.