The Zenith of ML Platforms - AI-first MLOps platform for 2026 with LLMs, RAG, agents, real-time monitoring, and beautiful UI. Surpasses Vertex AI, SageMaker, Azure ML.

bhanukaranwal/Zenith

🚀 Zenith - The Zenith of Machine Learning Platforms

Python 3.11+ | FastAPI | React 19 | License: Apache 2.0 | Docker | Kubernetes

The ultimate open-source, AI-first MLOps platform for 2026, combining enterprise-grade ML lifecycle management with cutting-edge LLM, RAG, and agent capabilities. Built to surpass Vertex AI, SageMaker, Azure ML, Databricks, MLflow, W&B, and more.

🎯 Architecture Overview

┌────────────────────────────────────────────────────────────────────────────────────────┐
│                                  React 19 Frontend UI                                  │
│                Experiments │ Models │ Deployments │ Monitoring │ Agents                │
└───────────────────────────────────────────┬────────────────────────────────────────────┘
                                            │ REST API / WebSocket
┌───────────────────────────────────────────▼────────────────────────────────────────────┐
│                                FastAPI Backend (Async)                                 │
│               Auth │ Projects │ Datasets │ Features │ Training │ Deploy                │
└───┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬───┘
    │        │        │        │        │        │        │        │        │        │
    ▼        ▼        ▼        ▼        ▼        ▼        ▼        ▼        ▼        ▼
┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐┌───────┐
│  PG   ││ Redis ││  S3/  ││Triton ││ vLLM  ││Celery ││Jupyter││ OTel  ││Vector ││Feature│
│  SQL  ││ Cache ││ Blob  ││ Serve ││  GPU  ││Worker ││  Lab  ││Export ││  DB   ││ Store │
└───────┘└───────┘└───────┘└───────┘└───────┘└───────┘└───────┘└───────┘└───────┘└───────┘

✨ Feature Comparison

| Feature | Zenith | Vertex AI | SageMaker | Azure ML | Databricks | MLflow | W&B |
|---|---|---|---|---|---|---|---|
| Open Source | ✅ | ❌ | ❌ | ❌ | Partial | ✅ | ❌ |
| LLM-Native | ✅ | ✅ | ✅ | ✅ | ✅ | Partial | ✅ |
| Agent Orchestration | ✅ | Partial | ❌ | Partial | ✅ | ❌ | ❌ |
| Prompt Playground | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| RAG Pipeline Builder | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| OpenTelemetry Native | ✅ | Partial | Partial | Partial | ❌ | ✅ | ❌ |
| Feature Store (Online) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Real-time Drift Detection | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | Partial |
| LLM-as-Judge Eval | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ |
| LoRA/QLoRA Fine-tuning | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Collaborative UI | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Self-Hosted | ✅ | ❌ | ❌ | ❌ | Partial | ✅ | ❌ |
| Cost | Free | $$$ | $$$ | $$$ | $$$ | Free | $$ |

🎁 Core Capabilities

ML/LLM Lifecycle Management

  • Data Versioning: Immutable dataset snapshots with lineage tracking
  • Feature Store: Online (Redis) + Offline (Parquet/Delta) with point-in-time joins
  • Experiment Tracking: Parameters, metrics, artifacts, prompts, traces with real-time visualization
  • Model Registry: Staging/production promotion with approval workflows and A/B testing
  • Distributed Training: PyTorch FSDP/DDP, Hugging Face Accelerate, multi-GPU support
  • Hyperparameter Optimization: Optuna Bayesian optimization + prompt search
  • Deployment: Batch/real-time/streaming with autoscaling and canary releases
  • Monitoring: Drift detection (Evidently), performance metrics, cost tracking
  • Explainability: SHAP values, attention visualization, feature importance
  • Governance: Bias detection, PII scanning, audit logs, RBAC
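
To make the Model Registry bullet concrete, here is a minimal sketch of staging/production promotion gated by approvals. The stage names, the `ModelVersion` class, and the `promote` helper are all hypothetical illustrations; they are not taken from the Zenith codebase, whose registry may work differently.

```python
from dataclasses import dataclass, field

# Hypothetical stage graph; Zenith's real registry stages may differ.
ALLOWED_TRANSITIONS = {
    "none": {"staging"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "none"
    approvals: set[str] = field(default_factory=set)


def approve(mv: ModelVersion, reviewer: str) -> None:
    """Record that a reviewer signed off on this version."""
    mv.approvals.add(reviewer)


def promote(mv: ModelVersion, target: str, required_approvals: int = 1) -> ModelVersion:
    """Move a version to `target`, enforcing the stage graph and approvals."""
    if target not in ALLOWED_TRANSITIONS[mv.stage]:
        raise ValueError(f"cannot move from {mv.stage!r} to {target!r}")
    if target == "production" and len(mv.approvals) < required_approvals:
        raise PermissionError("production promotion needs more approvals")
    mv.stage = target
    return mv
```

In this sketch a version must pass through staging and collect at least one approval before it can reach production.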

LLM & Agent Features

  • Prompt Playground: Interactive testing with multiple models, temperature control, few-shot examples
  • RAG Pipeline Builder: Visual editor for embedding, retrieval, reranking, generation
  • Agent Orchestration: LangGraph/CrewAI-style workflows with tool integration
  • Chain Tracing: OpenTelemetry-based distributed traces for complex LLM chains
  • LLM-as-Judge: Automated quality scoring using models such as GPT-4 or Claude as evaluators
  • Fine-tuning: LoRA, QLoRA with monitoring and automatic checkpoint management
  • Vector Search: Integrated embedding storage and semantic search
  • Hallucination Detection: Confidence scoring and fact verification
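
The retrieval step behind the Vector Search and RAG bullets can be sketched with plain cosine similarity over an in-memory index. This is only an illustration under assumed names (`cosine_similarity`, `retrieve`, and the toy vectors are hypothetical); a real deployment would use an embedding model and a vector database rather than hand-written lists.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy two-dimensional "embeddings", purely for illustration.
toy_index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}
top_docs = retrieve([1.0, 0.1], toy_index, k=2)
```

A reranking stage would then reorder `top_docs` with a more expensive scorer before generation.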

Developer Experience

  • Modern UI: React 19 + Tailwind + shadcn/ui with dark mode
  • Real-time Collaboration: Live experiment updates, shared notebooks
  • Jupyter Integration: Embedded JupyterLab with SDK pre-installed
  • REST + Python SDK: Comprehensive APIs for all operations
  • OpenTelemetry Export: Send traces to Datadog, Grafana, Jaeger
  • Plugin System: Custom evaluators, metrics, retrievers, agents
  • One-command Deploy: Docker Compose or Kubernetes Helm
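
One common way to build a plugin system like the one listed above is a decorator-based registry. The sketch below shows that pattern for a custom evaluator; `register_evaluator`, `EVALUATORS`, and `run_evaluator` are hypothetical names, and Zenith's actual plugin interface is not documented in this README.

```python
from typing import Callable

# Hypothetical registry mapping evaluator names to scoring functions.
EVALUATORS: dict[str, Callable[[str, str], float]] = {}


def register_evaluator(name: str):
    """Decorator that registers a scoring function under `name`."""
    def decorator(fn: Callable[[str, str], float]):
        if name in EVALUATORS:
            raise ValueError(f"evaluator {name!r} is already registered")
        EVALUATORS[name] = fn
        return fn
    return decorator


@register_evaluator("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the texts match, ignoring case and outer whitespace."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def run_evaluator(name: str, prediction: str, reference: str) -> float:
    """Look up a registered evaluator by name and apply it."""
    return EVALUATORS[name](prediction, reference)
```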

🚀 Quick Start

Prerequisites

  • Docker 24+ & Docker Compose 2.20+
  • 16GB RAM minimum (32GB recommended)
  • NVIDIA GPU (optional, for LLM inference)

Installation

git clone https://github.com/yourusername/zenith-ml.git
cd zenith-ml

cp .env.example .env

docker-compose up -d

docker-compose logs -f backend

Access Points

First Steps

from zenith import ZenithClient

client = ZenithClient("http://localhost:8000")

project = client.create_project(name="my-first-project", description="Testing Zenith capabilities")

experiment = client.start_experiment(project_id=project.id, name="baseline-model")

client.log_params({"learning_rate": 0.001, "batch_size": 32})
client.log_metrics({"accuracy": 0.95, "loss": 0.12})

# `model` is assumed to be a trained PyTorch model defined earlier in your script.
client.log_model(model, name="my-model", framework="pytorch")

📊 Feature Deep Dive

Experiment Tracking

  • MLflow-compatible API with superior UI
  • Real-time metric streaming with WebSocket
  • Side-by-side run comparison with diff views
  • Nested runs for hyperparameter sweeps
  • Artifact versioning with S3/MinIO backend
  • Git integration for code versioning
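
As a rough illustration of what a side-by-side run comparison computes, the sketch below diffs the logged parameters of two runs. The `diff_runs` helper and the sample dictionaries are hypothetical and independent of the Zenith SDK.

```python
def diff_runs(run_a: dict, run_b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(run_a) | set(run_b))
    return {k: (run_a.get(k), run_b.get(k)) for k in keys if run_a.get(k) != run_b.get(k)}


baseline = {"learning_rate": 0.001, "batch_size": 32, "optimizer": "adam"}
candidate = {"learning_rate": 0.01, "batch_size": 32, "warmup_steps": 500}
changes = diff_runs(baseline, candidate)
```

Keys present in only one run show up with `None` on the other side, which is usually what a diff view needs to highlight.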

Feature Store

  • Online serving with Redis (<10ms latency)
  • Offline storage with Parquet/Delta Lake
  • Point-in-time correct joins for time-series
  • Feature transformation pipelines
  • Schema evolution and validation
  • Feature lineage and impact analysis
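
A point-in-time correct join attaches to each training label the latest feature value known at or before the label's timestamp, so no future information leaks into training. The following standard-library sketch shows the idea; `point_in_time_join` and the data layout are assumptions made for illustration, not Zenith's feature store API.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_history):
    """Join each label with the newest feature value at or before its timestamp.

    labels: list of (entity_id, event_ts, label)
    feature_history: dict of entity_id -> list of (feature_ts, value), sorted by feature_ts
    """
    rows = []
    for entity_id, event_ts, label in labels:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        # Number of feature rows whose timestamp is <= event_ts.
        idx = bisect_right(timestamps, event_ts)
        value = history[idx - 1][1] if idx > 0 else None
        rows.append((entity_id, event_ts, value, label))
    return rows


history = {"u1": [(10, 0.1), (20, 0.2), (30, 0.3)]}
labels = [("u1", 25, 1), ("u1", 5, 0), ("u1", 30, 1)]
joined = point_in_time_join(labels, history)
```

The label at time 5 gets `None` because no feature value existed yet; filling it with a later value would be leakage.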

Model Deployment

  • Triton Inference Server integration
  • vLLM for high-throughput LLM serving
  • FastAPI endpoints with automatic OpenAPI
  • A/B testing and canary deployments
  • Autoscaling based on latency/throughput
  • Multi-model serving with routing
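
Canary and A/B deployments usually rely on stable traffic splitting: the same caller should keep hitting the same variant. A minimal sketch of hash-based routing follows; `choose_variant` and the variant names are hypothetical, and a production router would also handle health checks and rollback.

```python
import hashlib


def choose_variant(request_key: str, canary_percent: int) -> str:
    """Deterministically route a key to "canary" or "stable".

    The key (for example a user id) is hashed into one of 100 buckets, and
    buckets below `canary_percent` are served by the canary model.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step widens the rollout while callers already on the canary stay on it.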

Monitoring & Observability

  • Data drift detection (Evidently AI)
  • Model performance degradation alerts
  • LLM-specific metrics (hallucination rate, toxicity)
  • OpenTelemetry traces for debugging
  • Cost tracking per model/endpoint
  • Real-time dashboards with Recharts
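
To give a feel for what a data drift check measures, here is a small sketch of the Population Stability Index (PSI), one common drift statistic. It is a stand-in written with the standard library, not Evidently's implementation, and the `psi` function with its binning choices is an assumption for illustration.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference sample; empty bins are floored at
    `eps` so the logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = histogram(reference)
    cur_p = histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
no_drift = psi(reference, reference)
drift = psi(reference, shifted)
```

Identical samples score zero, and the score grows as the current distribution moves away from the reference. Alert thresholds are a judgment call and should be tuned per feature.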

Agent & RAG Workflows

  • Visual workflow builder for agent orchestration
  • Pre-built RAG templates (Q&A, summarization, etc.)
  • Multi-hop reasoning with chain-of-thought
  • Tool calling with automatic schema generation
  • Human-in-the-loop approvals
  • Workflow versioning and rollback
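
Automatic schema generation for tool calling can be approximated by inspecting a Python function's signature and type hints. The sketch below produces a JSON-Schema-like description; `tool_schema`, the type mapping, and the `search_docs` example tool are hypothetical and only cover simple scalar parameters.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python types to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(fn) -> dict:
    """Build a tool description from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unknown or missing annotations fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def search_docs(query: str, top_k: int = 5) -> str:
    """Search the document index and return the best matches."""
    return f"results for {query!r} (top {top_k})"


schema = tool_schema(search_docs)
```

Parameters without defaults are marked as required, so the model knows which arguments it must supply.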

🏗️ Project Structure

zenith-ml/
├── backend/      # FastAPI application
├── frontend/     # React 19 UI
├── jupyter/      # JupyterLab configuration
├── inference/    # Triton models
├── kubernetes/   # Helm charts
├── scripts/      # Utility scripts
├── examples/     # End-to-end tutorials
├── tests/        # Test suite
└── docs/         # Documentation

🛠️ Technology Stack

  • Backend: FastAPI, SQLAlchemy 2, asyncpg, Redis, Celery
  • Frontend: React 19, Vite, TypeScript, Tailwind CSS, shadcn/ui, Zustand, TanStack Query
  • ML: PyTorch, Transformers, Accelerate, PEFT, Optuna, Evidently
  • Inference: Triton, vLLM, llama.cpp
  • Observability: OpenTelemetry, Prometheus, Grafana
  • Storage: PostgreSQL, Redis, S3/MinIO
  • Orchestration: Kubernetes, Celery, RQ

📚 Examples

  • Tabular ML: XGBoost with feature store and drift monitoring
  • Computer Vision: ResNet fine-tuning with distributed training
  • LLM Fine-tuning: LoRA on Llama 3 for domain adaptation
  • RAG Agent: Question-answering with retrieval and reranking
  • Multi-modal: CLIP for image-text matching with monitoring

🀝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

📄 License

Apache License 2.0 - see LICENSE file

🌟 Star History

⭐ Star us on GitHub to support the project!

📧 Support


Built with ❤️ for the ML/AI community