
ThemisDB LLM Plugin System - Quick Start

Version: 1.3.0
Release: December 2025


🚀 Quick Start

1. Set up llama.cpp

# Automated setup (creates a local clone; do not commit it)
bash scripts/setup-llamacpp.sh

# Or manually (in the repository root)
git clone https://github.com/ggerganov/llama.cpp.git llama.cpp

2. Build with LLM support

# CPU-only build
cmake -B build -DTHEMIS_ENABLE_LLM=ON
cmake --build build

# With CUDA (NVIDIA GPU)
cmake -B build \
    -DTHEMIS_ENABLE_LLM=ON \
    -DTHEMIS_ENABLE_CUDA=ON
cmake --build build

# With Metal (Apple Silicon)
cmake -B build \
    -DTHEMIS_ENABLE_LLM=ON \
    -DTHEMIS_ENABLE_METAL=ON
cmake --build build

3. Download a model

mkdir -p models

# Example: Mistral 7B Instruct (Q4 quantized, ~4 GB)
# Download from HuggingFace:
# https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

4. Configuration

cp config/llm_config.example.yaml config/llm_config.yaml

# Edit llm_config.yaml:
# - Set model.path to the downloaded model file
# - Configure the number of GPU layers (n_layers)
# - Optional: LoRA adapters

5. Start ThemisDB with LLM support

./build/themis_server --config config/llm_config.yaml

📚 Documentation

Document                                Description
LLM_PLUGIN_DEVELOPMENT_GUIDE.md         Complete developer guide for plugin development
LLAMA_CPP_INTEGRATION.md                llama.cpp integration details
AI_ECOSYSTEM_SHARDING_ARCHITECTURE.md   Distributed sharding architecture (roadmap)

🧩 Plugin Architecture

┌─────────────────────────────────────────────────────┐
│          ThemisDB LLM Plugin System                 │
├─────────────────────────────────────────────────────┤
│                                                      │
│  ILLMPlugin Interface                               │
│    ↓                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────┐  │
│  │ LlamaCpp     │  │   vLLM       │  │  Custom  │  │
│  │   Plugin     │  │   Plugin     │  │  Plugin  │  │
│  └──────────────┘  └──────────────┘  └──────────┘  │
│         ↓                  ↓                ↓        │
│  ┌──────────────────────────────────────────────┐  │
│  │       LLMPluginManager                       │  │
│  │  - Plugin Discovery & Loading                │  │
│  │  - Model Management                          │  │
│  │  - LoRA Coordination                         │  │
│  └──────────────────────────────────────────────┘  │
│                                                      │
└─────────────────────────────────────────────────────┘
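
Every backend (llama.cpp, vLLM, custom) implements a common ILLMPlugin interface that the LLMPluginManager discovers, loads, and coordinates. As a rough orientation, here is a minimal sketch of what such an interface could look like, derived only from the calls used in the examples below (generate, loadLoRA); the actual ThemisDB header may declare different or additional methods:

#include <string>

// Sketch only: InferenceRequest/InferenceResponse are the types used in the
// examples below; their exact definitions live in the ThemisDB headers.
struct InferenceRequest;
struct InferenceResponse;

class ILLMPlugin {
public:
    virtual ~ILLMPlugin() = default;

    // Run a single inference request and return the generated text plus stats.
    virtual InferenceResponse generate(const InferenceRequest& request) = 0;

    // Attach a LoRA adapter: adapter id, adapter file path, scaling factor.
    virtual bool loadLoRA(const std::string& adapter_id,
                          const std::string& path,
                          float scale) = 0;
};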

💡 Code Examples

Ollama-Style: Lazy Model Loading

#include "llm/model_loader.h"

// Lazy loader setup
LazyModelLoader::Config config;
config.max_models = 3;           // Keep up to 3 models in memory
config.max_vram_mb = 24576;      // 24 GB budget
config.model_ttl = std::chrono::seconds(1800);  // 30 min TTL

LazyModelLoader loader(config);

// First request: Loads model lazily (~2-3 seconds)
auto* model = loader.getOrLoadModel(
    "mistral-7b",
    "/models/mistral-7b-instruct-q4.gguf"
);

// Subsequent requests: Instant (cache hit!)
auto* same_model = loader.getOrLoadModel("mistral-7b", "");
// ~0ms!

// Pin important models to prevent eviction
loader.pinModel("mistral-7b");

vLLM-Style: Multi-LoRA Management

#include "llm/multi_lora_manager.h"

// Multi-LoRA setup
MultiLoRAManager::Config config;
config.max_lora_slots = 16;      // Up to 16 LoRAs
config.max_lora_vram_mb = 2048;  // 2 GB for LoRAs
config.enable_multi_lora_batch = true;

MultiLoRAManager lora_mgr(config);

// Load multiple LoRAs for same base model
lora_mgr.loadLoRA("legal-qa", "/loras/legal-qa-v1.bin", "mistral-7b");
lora_mgr.loadLoRA("medical-diag", "/loras/medical-v1.bin", "mistral-7b");
lora_mgr.loadLoRA("code-assist", "/loras/code-v1.bin", "mistral-7b");

// Use different LoRAs per request
InferenceRequest req1;
req1.prompt = "Legal question";
req1.lora_adapter_id = "legal-qa";

InferenceRequest req2;
req2.prompt = "Medical question";
req2.lora_adapter_id = "medical-diag";

// Fast LoRA switching (~5ms)
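
The two requests are then executed like any other inference request. A short sketch, assuming they go through the same LLMPluginManager::generate() path shown in the "Generate text" example below and that the manager selects the adapter named in lora_adapter_id:

auto& manager = LLMPluginManager::instance();

// The adapter is resolved per request via lora_adapter_id, so back-to-back
// calls with different adapters only pay the ~5ms switch cost, not a reload.
auto legal_answer   = manager.generate(req1);   // uses "legal-qa"
auto medical_answer = manager.generate(req2);   // uses "medical-diag"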

Register a plugin

#include "llm/llm_plugin_manager.h"
#include "llm/llamacpp_plugin.h"

// Plugin erstellen und konfigurieren
json config = {
    {"model_path", "/models/mistral-7b-instruct-q4.gguf"},
    {"n_gpu_layers", 32},
    {"n_ctx", 4096}
};

createLlamaCppPlugin("llamacpp", config["model_path"], config);
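
After registration the plugin is available from the manager by name, as the LoRA example further down shows. A quick check, assuming getPlugin() returns a null pointer for unknown names:

auto& manager = LLMPluginManager::instance();
if (manager.getPlugin("llamacpp") == nullptr) {
    std::cerr << "llamacpp plugin was not registered" << std::endl;
}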

Generate text

auto& manager = LLMPluginManager::instance();

InferenceRequest request;
request.prompt = "Was ist ThemisDB?";
request.max_tokens = 512;
request.temperature = 0.7f;

auto response = manager.generate(request);
std::cout << "Response: " << response.text << std::endl;
std::cout << "Tokens: " << response.tokens_generated << std::endl;
std::cout << "Time: " << response.inference_time_ms << "ms" << std::endl;

RAG (Retrieval-Augmented Generation)

// Retrieve documents from ThemisDB (vector search)
RAGContext context;
context.query = "Legal aspects of data storage";
context.documents = {
    {.content = "Document 1 content...", .source = "doc1.pdf", .relevance_score = 0.95},
    {.content = "Document 2 content...", .source = "doc2.pdf", .relevance_score = 0.87}
};

InferenceRequest request;
request.prompt = context.query;

auto response = manager.generateRAG(context, request);

Load a LoRA adapter

auto* plugin = manager.getPlugin("llamacpp");

// Load the LoRA adapter
plugin->loadLoRA(
    "legal-qa-v1",
    "/loras/legal-qa-v1.bin",
    1.0f  // scale
);

// Run inference with the LoRA
InferenceRequest request;
request.prompt = "Legal question...";
request.lora_adapter_id = "legal-qa-v1";

auto response = plugin->generate(request);

🎯 Features

✅ Implemented (v1.3.0)

  • ✅ Plugin-based architecture
  • ✅ llama.cpp Integration (Reference Implementation)
  • ✅ Model Loading (GGUF Format)
  • ✅ LoRA Adapter Support
  • ✅ GPU Acceleration (CUDA, Metal, Vulkan, HIP)
  • ✅ RAG Integration
  • ✅ Memory Management & Statistics
  • ✅ Multi-Plugin Support
  • ✅ Ollama-style Lazy Loading (v1.3.0)
  • ✅ vLLM-style Multi-LoRA (v1.3.0)

🚧 Roadmap (Future)

  • 🚧 HTTP API Endpoints
  • 🚧 Streaming Generation
  • 🚧 Batch Inference
  • 🚧 Distributed Sharding (etcd + gRPC)
  • 🚧 Cross-Shard LoRA Transfer
  • 🚧 Federated RAG Queries
  • 🚧 vLLM Plugin Implementation
  • 🚧 Model Replication (Raft Consensus)

🔧 Build Options

Option                 Description                   Default
THEMIS_ENABLE_LLM      Enable LLM plugin support     OFF
THEMIS_ENABLE_CUDA     CUDA GPU support              OFF
THEMIS_ENABLE_METAL    Metal GPU support (macOS)     OFF
THEMIS_ENABLE_VULKAN   Vulkan GPU support            OFF
THEMIS_ENABLE_HIP      AMD HIP/ROCm support          OFF

📊 Performance

Typical latency (Mistral-7B Q4, RTX 4090)

Operation                       Latency         Throughput
Model loading                   ~2-3 seconds    -
Text generation (512 tokens)    ~300 ms         ~1700 tokens/s
RAG query (10 docs)             ~320 ms         -
LoRA loading                    ~50 ms          -
LoRA switch                     ~5 ms           -
Embedding (512 tokens)          ~5 ms           -

VRAM Usage

Model         Quantization   VRAM
Phi-3-Mini    Q4_K_M         ~2 GB
Mistral-7B    Q4_K_M         ~4 GB
Llama-2-13B   Q4_K_M         ~8 GB
Llama-3-70B   Q4_K_M         ~40 GB

🆘 Troubleshooting

Model does not load

# Check the model format (must be GGUF)
file models/mistral-7b.gguf

# Check permissions
ls -la models/

# Test with a smaller model
# Phi-3-Mini: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf

Slow inference

# Increase the number of GPU layers in the config
n_layers: 32  # → 35 or more

# Check GPU utilization
nvidia-smi  # (CUDA)
# Or
sudo powermetrics --samplers gpu_power  # (Metal/macOS)

# Reduce the context size if possible
n_ctx: 4096  # → 2048
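
To verify whether a configuration change actually helped, the timing fields returned with every response (see "Generate text" above) can be turned into an effective throughput figure; a small sketch using those fields:

// Derive effective throughput from the response fields shown earlier
// (tokens_generated, inference_time_ms).
auto response = manager.generate(request);
double tokens_per_sec =
    response.tokens_generated / (response.inference_time_ms / 1000.0);
std::cout << "Throughput: " << tokens_per_sec << " tokens/s" << std::endl;

As a rough reference, Mistral-7B Q4 on an RTX 4090 reaches about 1700 tokens/s (see the performance table above); values far below that usually indicate too few GPU layers or a CPU fallback.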

Build errors

# Local llama.cpp clone missing?
ls -la ./llama.cpp
# If missing: create a local clone (do not commit it)
git clone https://github.com/ggerganov/llama.cpp.git llama.cpp

# CUDA not found?
export CUDA_PATH=/usr/local/cuda
cmake -B build -DTHEMIS_ENABLE_LLM=ON -DTHEMIS_ENABLE_CUDA=ON

Windows/MSVC build with LLM

# Recommended: script for an MSVC release build with LLM support
powershell -File scripts/build-themis-server-llm.ps1

# Sanity check
./build-msvc/bin/themis_server.exe --help

Notes:

  • The script uses Visual Studio 2022 (-G "Visual Studio 17 2022") and the x64 architecture (-A x64).
  • The vcpkg toolchain is included; llama.cpp/ is a local clone and is excluded via .gitignore/.dockerignore.

🔗 Links

Full documentation: https://makr-code.github.io/ThemisDB/

📝 License

ThemisDB: MIT License
llama.cpp: MIT License


Version: 1.3.0
Last Updated: December 2025
Status: Production Ready (Reference Implementation)
