LLM inference server built from scratch: a C++ model loader + Go gRPC API.
Status: 🚧 In Development (Q1 2026)
A learning project that removes the "magic" from LLM inference by building each layer manually:
- Foundations → Transformer math in NumPy, validated against GPT-2
- C++ Core → GGUF parser + mmap weight loader
- Go Server → gRPC API with metrics and security middleware
- Production → Docker deployment with KV cache optimization
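The Foundations phase centers on implementing attention by hand. As an illustrative sketch (the function name and shapes here are assumptions, not the project's actual code), single-head scaled dot-product attention in NumPy looks like:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (T, T) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # rows sum to 1
    return weights @ v                                    # (T, d) weighted values

# Toy inputs: sequence length T=4, head dimension d=8
rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = rng.normal(size=(3, T, d))
out = attention(q, k, v)
assert out.shape == (T, d)
```

A version like this is easy to validate token-by-token against a PyTorch or GPT-2 reference before moving it into the C++ core.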
- Single-head attention implementation
- Multi-head attention with PyTorch validation
- GGUF format parser (C++)
- Memory-mapped weight loading
- Go gRPC server
- Observability (Prometheus/Grafana)
- Security middleware
- Docker deployment
- KV cache optimization
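The KV cache item above is the key serving optimization: during decoding, each new token's attention reuses the keys and values of all previous tokens instead of recomputing them. A minimal single-head sketch (names and structure are illustrative, not the project's implementation):

```python
import numpy as np

def decode_step(q_new, k_new, v_new, cache):
    # Append this step's key/value to the cache, so each decode step
    # only projects the newest token instead of the whole sequence.
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    K = np.stack(cache["k"])                          # (t, d) keys so far
    V = np.stack(cache["v"])                          # (t, d) values so far
    scores = K @ q_new / np.sqrt(q_new.shape[-1])     # (t,) similarities
    w = np.exp(scores - scores.max())
    w /= w.sum()                                      # softmax over past tokens
    return w @ V                                      # (d,) attended output

# Simulate 5 decode steps with head dimension d=8
rng = np.random.default_rng(0)
d = 8
cache = {"k": [], "v": []}
for _ in range(5):
    q, k, v = rng.normal(size=(3, d))
    out = decode_step(q, k, v, cache)
assert len(cache["k"]) == 5
```

This turns per-step attention cost from O(T·d) recomputation of all projections into a single-row update, at the price of holding the cache in memory for the lifetime of the request.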
- Python 3.10+ (foundations)
- CMake 3.20+, CUDA 12.x (C++ core)
- Go 1.21+ (server)
- Docker with NVIDIA runtime (deployment)