NEXT_STEPS_ANALYSIS

GitHub Actions edited this page Jan 2, 2026 · 1 revision

ThemisDB - Next Steps Analysis

As of: December 5, 2025
Version: 1.0.0
Category: Development


Date: November 17, 2025 (updated after the AQL 100% sprint)
Basis: code analysis + todo list + implementation summary
Status after the AQL 100% sprint: 65% overall implementation


Executive Summary

With the AQL 100% sprint finished (Phase 1 complete), the next logical steps are:

✅ COMPLETED:

  1. AQL Advanced Features: 100% COMPLETE (November 17, 2025)
    • LET/Variable Bindings ✅
    • OR/NOT Operators ✅
    • Window Functions ✅
    • CTEs (WITH clause) ✅
    • Subqueries ✅
    • Advanced Aggregations ✅

🎯 Priority 1 (immediately, Q4 2025):

  1. Content Pipeline (30% → 80%, 1-2 weeks)
  2. Incremental Backups (0% → 90%, 1 week)
  3. Admin Tools MVP (27% → 70%, 2-3 weeks)

🎯 Priority 2 (Q1 2026):

  4. HSM/eIDAS PKI (docs available → production, 2 weeks)
  5. Security Hardening (45% → 80%, 2-3 weeks)


Sprint 1 Results (November 17, 2025)

✅ AQL 100% - FULLY IMPLEMENTED

Commits: 5
Lines of code: +5,012
Tests: +70
Duration: 1 day

Implemented features:

  1. LET/Variable Bindings (608 lines, 25+ tests)

    • LetEvaluator class
    • Arithmetic operations (+, -, *, /, %)
    • String functions (CONCAT, SUBSTRING, UPPER, LOWER)
    • Math functions (ABS, MIN, MAX, CEIL, FLOOR, ROUND)
    • Nested field access (doc.address.city)
    • Array indexing (doc.tags[0])
    • Variable chaining (LET x = ..., LET y = x * 2)
  2. OR/NOT Operators (159 lines, 15+ tests)

    • De Morgan's Laws transformation
    • NOT (A OR B) = (NOT A) AND (NOT B)
    • NOT (A AND B) = (NOT A) OR (NOT B)
    • NEQ conversion: A != B = (A < B) OR (A > B)
    • Double negation elimination
    • Index merge for OR queries
  3. Window Functions (800+ lines, 20+ tests)

    • ROW_NUMBER(), RANK(), DENSE_RANK()
    • LAG(expr, offset), LEAD(expr, offset)
    • FIRST_VALUE(expr), LAST_VALUE(expr)
    • PARTITION BY (multi-column)
    • ORDER BY (multi-column, ASC/DESC)
    • Frame definitions (ROWS/RANGE BETWEEN ... AND ...)
  4. CTEs (WITH clause) (200+ lines)

    • Common Table Expressions
    • Temporary named result sets
    • Non-recursive CTEs (full stub)
    • Recursive CTEs (Phase 2 placeholder)
  5. Subqueries (200+ lines)

    • Scalar subqueries: (SELECT value)
    • IN subqueries: value IN (SELECT ...)
    • EXISTS/NOT EXISTS
    • Correlated subqueries (Phase 2 placeholder)
  6. Advanced Aggregations (300+ lines, 25+ tests)

    • PERCENTILE(expr, p), MEDIAN(expr)
    • STDDEV(expr), STDDEV_POP(expr)
    • VARIANCE(expr), VAR_POP(expr)
    • IQR(expr), MAD(expr), RANGE(expr)
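
The OR/NOT rewrite rules listed under item 2 can be sketched as a small NOT-pushdown pass. This is only an illustration under stated assumptions: the `Expr` tree, `Kind`, `cmp`, `node`, `negateOp`, `normalize` and `toString` are hypothetical names, not ThemisDB's real AST or API, and the sketch assumes plain two-valued logic (no NULL handling).

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical minimal filter AST (not ThemisDB's real AST).
enum class Kind { Cmp, And, Or, Not };

struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

struct Expr {
    Kind kind;
    std::string lhs, op, rhs;  // used by Kind::Cmp, e.g. "a" "<" "1"
    ExprPtr left, right;       // children; Kind::Not uses only `left`
};

ExprPtr cmp(const std::string& lhs, const std::string& op, const std::string& rhs) {
    return std::make_shared<Expr>(Expr{Kind::Cmp, lhs, op, rhs, nullptr, nullptr});
}

ExprPtr node(Kind kind, ExprPtr left, ExprPtr right = nullptr) {
    return std::make_shared<Expr>(Expr{kind, "", "", "", left, right});
}

// Returns the comparison operator that is true exactly when `op` is false.
std::string negateOp(const std::string& op) {
    if (op == "==") return "!=";
    if (op == "!=") return "==";
    if (op == "<") return ">=";
    if (op == "<=") return ">";
    if (op == ">") return "<=";
    assert(op == ">=");
    return "<";
}

// Pushes NOT down to the comparisons so that the result contains no NOT nodes.
ExprPtr normalize(const ExprPtr& e, bool negated = false) {
    switch (e->kind) {
    case Kind::Not:  // double negation cancels out by flipping the flag twice
        return normalize(e->left, !negated);
    case Kind::And:
    case Kind::Or: {
        // De Morgan: a negated AND becomes OR (and vice versa) over negated children.
        const bool isAnd = (e->kind == Kind::And);
        const Kind kind = (isAnd != negated) ? Kind::And : Kind::Or;
        return node(kind, normalize(e->left, negated), normalize(e->right, negated));
    }
    case Kind::Cmp: {
        const std::string op = negated ? negateOp(e->op) : e->op;
        if (op == "!=")  // NEQ conversion: A != B = (A < B) OR (A > B)
            return node(Kind::Or, cmp(e->lhs, "<", e->rhs), cmp(e->lhs, ">", e->rhs));
        return cmp(e->lhs, op, e->rhs);
    }
    }
    return e;  // unreachable; keeps compilers quiet
}

std::string toString(const ExprPtr& e) {
    switch (e->kind) {
    case Kind::Cmp: return e->lhs + " " + e->op + " " + e->rhs;
    case Kind::Not: return "NOT (" + toString(e->left) + ")";
    case Kind::And: return "(" + toString(e->left) + " AND " + toString(e->right) + ")";
    case Kind::Or:  return "(" + toString(e->left) + " OR " + toString(e->right) + ")";
    }
    return "";
}
```

For example, normalizing `NOT (a == 1 OR b < 2)` with this sketch yields `((a < 1 OR a > 1) AND b >= 2)`, a NOT-free form whose range predicates an index scan can serve.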

Detailed Analysis (Updated)

Implementation steps:

  1. LET Evaluator (4-6h)

    // src/query/let_evaluator.cpp
    class LetEvaluator {
        std::unordered_map<std::string, nlohmann::json> bindings_;
    public:
        void evaluateLet(const LetNode& node, const nlohmann::json& current_doc);
        nlohmann::json resolveVariable(const std::string& var_name);
    };
  2. Integration in Query Engine (2-3h)

    • Add LET evaluator to query execution pipeline
    • Variable resolution in FILTER/RETURN expressions
  3. Tests (3-4h)

    • Unit tests: LET with arithmetic, string ops, nested objects
    • Integration tests: LET + FILTER, LET in Joins
    • Edge cases: Undefined variables, circular dependencies
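
The variable chaining and the edge cases from the steps above can be illustrated with a deliberately simplified stand-in for the evaluator. This is a sketch, not the actual `LetEvaluator`: the class `LetScope` and its methods `let` and `resolve` are hypothetical, and values are plain `double`s instead of `nlohmann::json` so that the example has no external dependencies.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the LET evaluator (doubles instead of JSON).
class LetScope {
    std::unordered_map<std::string, double> bindings_;

public:
    // An expression is any callable that may read earlier bindings from the scope.
    using Expr = std::function<double(const LetScope&)>;

    // Evaluates `expr` against the bindings defined so far and stores the result.
    void let(const std::string& name, const Expr& expr) {
        if (bindings_.count(name) != 0)
            throw std::invalid_argument("variable already defined: " + name);
        bindings_.emplace(name, expr(*this));
    }

    // Returns the bound value; throws if the variable was never defined.
    double resolve(const std::string& name) const {
        const auto it = bindings_.find(name);
        if (it == bindings_.end())
            throw std::out_of_range("undefined variable: " + name);
        return it->second;
    }
};
```

With this design, `LET x = 21` followed by `LET y = x * 2` is two `let` calls, where the second expression calls `resolve("x")`. Because each binding is evaluated eagerly and in order, a circular definition cannot be built: a reference to a variable that has not been bound yet fails as an undefined variable.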

DoD:

  • ✅ LET bindings work in FOR/FILTER/RETURN
  • ✅ Multiple LETs per query
  • ✅ LETs can reference earlier LETs
  • ✅ 15+ tests PASSING

Files to change:

  • src/query/aql_translator.cpp - LET evaluation logic
  • src/query/query_engine.cpp - Variable resolution

1. Complete the Content Pipeline (HIGHEST PRIORITY)

Status: 30% implemented, base schema in place
Impact: RAG/hybrid-search workloads blocked
Effort: 1-2 weeks

1.1 Advanced Extraction (PDF/DOCX/XLSX)

Code status:

// ✅ Text processor in place (src/content/text_processor.cpp)
// ✅ Mock CLIP processor (src/content/mock_clip_processor.cpp)
// ❌ No real PDF/DOCX parsers

TODO markers in the code:

  • src/api/http_server.cpp:4 - "TODO: Implement in Phase 4, Task 11"
  • Content pipeline consists of mockups only

Implementation steps:

  1. PDF Extraction (6-8h)

    • Library: poppler-cpp or pdfium
    • Text + Metadata (author, created, pages)
    • Image extraction for multi-modal use
  2. DOCX Extraction (4-6h)

    • Library: libxml2 (OpenXML parsing)
    • Text + Styles + Metadata
  3. XLSX Extraction (4-6h)

    • Library: xlnt or libxlsx
    • Tables → JSON/CSV
  4. Tests (4-5h)

    • Real-world PDFs (100+ pages)
    • Complex DOCX (images, tables, formulas)
    • Large XLSX (10k rows)

DoD:

  • ✅ PDF/DOCX/XLSX extraction works
  • ✅ Metadata preservation
  • ✅ Error handling for corrupted files
  • ✅ Integration with ContentManager

Files to change:

  • src/content/pdf_processor.cpp - NEW
  • src/content/docx_processor.cpp - NEW
  • src/content/xlsx_processor.cpp - NEW
  • CMakeLists.txt - Add poppler/libxml2/xlnt
  • vcpkg.json - Add dependencies

1.2 Chunking Optimization

Code status:

// ⚠️ Basic chunking in place
// ❌ No semantic chunking strategies

Implementation steps:

  1. Semantic Chunking (6-8h)

    • Sentence-level chunking (NLTK/spaCy)
    • Paragraph-preserving chunking
    • Sliding window with overlap
  2. Chunk Metadata (3-4h)

    • Position tracking (start_offset, end_offset)
    • Parent-child relationships
    • Chunk embeddings
  3. Batch Upload Optimization (4-6h)

    • Parallel chunk processing (Intel TBB)
    • RocksDB WriteBatch for bulk inserts
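
The sliding window with overlap and the position tracking (`start_offset`, `end_offset`) from the steps above could look roughly like this. It is a minimal sketch: the `Chunk` struct and the `chunkWithOverlap` function are hypothetical names, and the offsets are byte offsets, so a production version would additionally have to avoid splitting multi-byte UTF-8 characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk record carrying the position metadata named above.
struct Chunk {
    std::size_t start_offset;  // inclusive byte offset into the source text
    std::size_t end_offset;    // exclusive byte offset into the source text
    std::string text;
};

// Splits `text` into windows of at most `size` bytes; consecutive windows
// share `overlap` bytes so that context is not lost at chunk boundaries.
std::vector<Chunk> chunkWithOverlap(const std::string& text,
                                    std::size_t size,
                                    std::size_t overlap) {
    assert(size > 0 && overlap < size);  // otherwise the window never advances
    std::vector<Chunk> chunks;
    const std::size_t step = size - overlap;
    for (std::size_t start = 0; start < text.size(); start += step) {
        const std::size_t end = std::min(start + size, text.size());
        chunks.push_back({start, end, text.substr(start, end - start)});
        if (end == text.size()) break;  // the last window reached the end
    }
    return chunks;
}
```

For the input `"abcdefghij"` with `size = 4` and `overlap = 1`, the sketch produces the three chunks `"abcd"`, `"defg"` and `"ghij"` with offsets [0, 4), [3, 7) and [6, 10).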

DoD:

  • ✅ 3 chunking strategies (fixed-size, sentence, paragraph)
  • ✅ Chunk metadata complete
  • ✅ 10x faster bulk upload
  • ✅ Tests PASSING

Files to change:

  • src/content/chunking_strategy.cpp - NEW
  • src/content/content_manager.cpp - Batch optimization
  • tests/test_chunking.cpp - NEW

3. Admin Tools MVP (MEDIUM)

Status: 27% implemented (only AuditLogViewer is in production use)
Impact: Operations, compliance, GDPR
Effort: 2-3 weeks

3.1 Tool Status Audit

Current tools (WPF, .NET 8):

| Tool | Code Status | Backend API | Tests | % |
|---|---|---|---|---|
| AuditLogViewer | ✅ Implemented | /audit/logs | | 90% |
| SAGAVerifier | ✅ Implemented | /saga/batches | ⚠️ Minimal | 70% |
| PIIManager | ✅ Implemented | /pii/* | ⚠️ Minimal | 60% |
| KeyRotationDashboard | ✅ MVP (demo data) | /keys/* | | 40% |
| RetentionManager | ✅ MVP (demo data) | ⚠️ Partial | | 30% |
| ClassificationDashboard | ✅ MVP (demo data) | /classification/* | | 40% |
| ComplianceReports | ✅ MVP (demo data) | /reports/* | | 40% |

Average: 27% (pulled down heavily by missing tests and missing real backend integration)

3.2 Kritische Gaps

Backend API availability:

  • /pii/* - AVAILABLE (implemented in the Critical Sprint)
  • /keys/* - AVAILABLE
  • /classification/* - AVAILABLE
  • ⚠️ /retention/* - PARTIAL (ContinuousAggregateManager exists, no HTTP endpoint)
  • /reports/* - AVAILABLE

Action Items:

  1. Retention API Endpoint (4-6h)

    // src/server/http_server.cpp
    CROW_ROUTE(app, "/api/retention/policies").methods("GET"_method)
    CROW_ROUTE(app, "/api/retention/policies").methods("POST"_method)
    CROW_ROUTE(app, "/api/retention/execute").methods("POST"_method)
  2. Integration Tests (8-10h)

    • E2E tests for each tool
    • Mock Backend → Real Backend migration
  3. Deployment Scripts (3-4h)

    • MSI Installer (WiX Toolset)
    • Auto-Update mechanism

DoD:

  • ✅ All 7 tools connected to the live backend
  • ✅ Integration tests PASSING
  • ✅ Deployment-ready MSI

Files to change:

  • src/server/http_server.cpp - Retention endpoints
  • tools/*/ViewModels/*.cs - Remove mock data
  • tools/deployment/build.ps1 - NEW

4. Incremental Backups (CRITICAL for Production)

Status: 0% implemented (RocksDB checkpoints only)
Impact: Data loss prevention, disaster recovery
Effort: 1 week

4.1 WAL-Archiving

Code status:

// ✅ RocksDB checkpoints implemented
// ❌ No WAL archiving
// ❌ No point-in-time recovery

TODO markers:

  • docs/development/todo.md:60 - "Inkrementelle Backups / WAL-Archiving — TODO"

Implementation steps:

  1. WAL Archive Manager (8-10h)

    class WALArchiveManager {
        void archiveWAL(const std::string& wal_file, const std::string& archive_path);
        void restoreFromWAL(const std::string& archive_path, uint64_t target_timestamp);
        std::vector<WALFile> listArchivedWALs();
    };
  2. Incremental Backup (6-8h)

    • Copy only changed WAL files since last backup
    • Manifest file (backup_manifest.json) with timestamps
  3. Point-in-Time Recovery (8-10h)

    • Restore checkpoint + replay WAL files until target timestamp
    • Verify data integrity after recovery
  4. Automated Backup Jobs (4-6h)

    • Cron-style scheduler (every 6h, daily, weekly)
    • Retention policy (keep last 7 dailies, 4 weeklies, 12 monthlies)
  5. Cloud Storage Integration (6-8h)

    • S3 upload via aws-sdk-cpp
    • Azure Blob Storage via azure-storage-cpp
    • Google Cloud Storage via google-cloud-cpp
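
Step 3 (restore a checkpoint, then replay WAL files up to the target timestamp) needs a replay plan first. The following sketch shows one way to pick the archived segments; the `WalSegment` struct and the `selectForReplay` function are hypothetical, and the sketch assumes each archived file records the timestamps of its first and last entry, for example in the backup manifest.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one archived WAL file.
struct WalSegment {
    std::string path;
    std::uint64_t first_ts;  // timestamp of the first record in the file
    std::uint64_t last_ts;   // timestamp of the last record in the file
};

// Returns the segments to replay, oldest first, to move a restored checkpoint
// taken at `checkpoint_ts` forward to `target_ts`.
std::vector<WalSegment> selectForReplay(std::vector<WalSegment> archived,
                                        std::uint64_t checkpoint_ts,
                                        std::uint64_t target_ts) {
    assert(checkpoint_ts <= target_ts);
    std::sort(archived.begin(), archived.end(),
              [](const WalSegment& a, const WalSegment& b) {
                  return a.first_ts < b.first_ts;
              });
    std::vector<WalSegment> plan;
    for (const WalSegment& seg : archived) {
        if (seg.last_ts <= checkpoint_ts) continue;  // already in the checkpoint
        if (seg.first_ts > target_ts) break;         // entirely after the target
        plan.push_back(seg);
    }
    return plan;
}
```

The first and last segment of the plan may straddle the checkpoint or the target, so the replay itself still has to skip or stop at record level; the selection only decides which files must be fetched from the archive.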

DoD:

  • ✅ Incremental backups work
  • ✅ Point-in-Time Recovery tested
  • ✅ S3/Azure/GCS upload
  • ✅ Automated schedules
  • ✅ Restore tests PASSING

Files to change:

  • include/backup/wal_archive_manager.h - NEW
  • src/backup/wal_archive_manager.cpp - NEW
  • src/backup/backup_scheduler.cpp - NEW
  • src/server/http_server.cpp - Backup endpoints
  • tests/test_backup_restore.cpp - NEW

5. HSM/eIDAS PKI Production-Ready (HIGH)

Status: Docs available (1,111 lines), no HSM integration
Impact: Qualified eIDAS signatures for production
Effort: 2 weeks

5.1 Vault HSM Integration

Code status:

// ✅ VaultKeyProvider in place (src/security/vault_key_provider.cpp)
// ✅ PKIClient in place (src/security/vcc_pki_client.cpp)
// ❌ No HSM integration

TODO markers:

  • src/security/vcc_pki_client.cpp:348 - "TODO: Implement full X.509 chain validation"
  • docs/development/todo.md:60 - "eIDAS-konforme Signaturen / PKI Integration (Produktiv-Ready mit HSM) — TODO"

Implementation steps:

  1. Vault Transit Engine (6-8h)

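    #include <stdexcept>

    class VaultHSMProvider : public PKIClient {
    public:
        std::string sign(const std::string& data) override {
            // POST /v1/transit/sign/my-key
            // HSM-backed signing; placeholder until the Vault request is implemented
            throw std::logic_error("VaultHSMProvider::sign is not implemented yet");
        }
    };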
  2. X.509 Chain Validation (4-6h)

    • OpenSSL X509_verify_cert()
    • CRL checking
    • OCSP validation
  3. Qualified Timestamp Authority (6-8h)

    • RFC 3161 timestamp requests
    • Timestamp verification
    • Integration with SAGA events
  4. eIDAS Compliance Tests (8-10h)

    • Qualified signature validation
    • Timestamp validation
    • Full audit trail test

DoD:

  • ✅ Vault Transit Engine integration
  • ✅ X.509 chain validation
  • ✅ Qualified TSA integration
  • ✅ eIDAS compliance validated
  • ✅ Production deployment guide

Files to change:

  • src/security/vault_hsm_provider.cpp - NEW
  • src/security/vcc_pki_client.cpp - X.509 validation
  • src/utils/timestamp_authority.cpp - NEW
  • tests/test_eidas_compliance.cpp - NEW

Priority Matrix

| Task | Business Value | Technical Complexity | Effort | Priority |
|---|---|---|---|---|
| LET/Subqueries | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 2-3 days | P0 |
| OR/NOT Index-Merge | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 3-4 days | P0 |
| PDF/DOCX Extraction | ⭐⭐⭐⭐ | ⭐⭐⭐ | 2-3 days | P1 |
| Incremental Backups | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 5-7 days | P1 |
| Admin Tools Integration | ⭐⭐⭐ | ⭐⭐ | 3-4 days | P2 |
| Hash-Join | ⭐⭐⭐ | ⭐⭐⭐⭐ | 4-5 days | P2 |
| HSM/eIDAS | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 10-12 days | P2 |
| Chunking Optimization | ⭐⭐⭐ | ⭐⭐ | 2-3 days | P3 |

Recommended Roadmap

Sprint 1 (Week 1-2): AQL Advanced Features

Goal: AQL from 65% to 85%

  • Day 1-3: Implement LET/subqueries + tests
  • Day 4-7: OR/NOT with index merge
  • Day 8-10: Advanced joins (hash-join foundation)

Deliverable: AQL production-ready for complex queries
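
As a reference point for the hash-join item in this sprint, a classic in-memory equi-join can be sketched as follows. The `Row` struct and the `hashJoin` function are hypothetical and unrelated to the actual query engine. The idea is to build a hash table on one input (ideally the smaller one) and probe it with the other, which takes expected O(n + m) time instead of the O(n * m) of a nested-loop join.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type: an integer join key plus a payload.
struct Row {
    int key;
    std::string value;
};

// Equi-join on `key`. Returns (build row, probe row) pairs in probe order.
std::vector<std::pair<Row, Row>> hashJoin(const std::vector<Row>& build,
                                          const std::vector<Row>& probe) {
    // Build phase: index every build row by its key (duplicate keys allowed).
    std::unordered_multimap<int, const Row*> table;
    table.reserve(build.size());
    for (const Row& row : build) table.emplace(row.key, &row);

    // Probe phase: look up each probe row and emit one pair per match.
    std::vector<std::pair<Row, Row>> out;
    for (const Row& row : probe) {
        const auto range = table.equal_range(row.key);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(*it->second, row);
    }
    return out;
}
```

Joining the build side `{1:"a", 2:"b"}` with the probe side `{2:"x", 3:"y", 2:"z"}` yields the pairs `(b, x)` and `(b, z)`; the probe row with key 3 has no partner and is dropped, as in an inner join.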


Sprint 2 (Week 3-4): Content Pipeline + Backups

Goal: Content 30% → 60%, Backups 0% → 90%

  • Day 1-4: PDF/DOCX/XLSX Extraction
  • Day 5-6: Chunking Optimization
  • Day 7-10: WAL-Archiving + Point-in-Time Recovery

Deliverable: RAG-Ready Content Pipeline, Production Backups


Sprint 3 (Week 5-6): Admin Tools + HSM

Goal: Admin Tools 27% → 70%, HSM Integration

  • Day 1-4: Admin Tools Backend-Integration + Tests
  • Day 5-10: Vault HSM + eIDAS Compliance

Deliverable: Operations-Ready Admin Suite, Qualified Signatures


Prioritized Code TODOs

CRITICAL (Sprint 1)

  1. src/query/aql_translator.cpp:31 - LET execution
  2. src/query/query_optimizer.cpp - OR cost model
  3. src/index/secondary_index.cpp - Index merge utilities
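
One possible shape for the index merge utilities: if each branch of an OR is answered by its own index scan, the per-branch results can be combined with a sorted-list union. The sketch below assumes that every scan returns a sorted, duplicate-free list of document IDs; `DocIds` and `mergeOr` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Sorted, duplicate-free document IDs as returned by one index scan (assumption).
using DocIds = std::vector<std::uint64_t>;

// Union of two index scan results: documents matching branch A OR branch B.
// Runs in linear time and keeps the output sorted and duplicate-free.
DocIds mergeOr(const DocIds& a, const DocIds& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    DocIds out;
    out.reserve(a.size() + b.size());
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

Because the output stays sorted, the function can be applied repeatedly to fold in further OR branches, and a document matched by several branches still appears only once.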

HIGH (Sprint 2)

  1. src/content/pdf_processor.cpp - NEW (PDF extraction)
  2. src/backup/wal_archive_manager.cpp - NEW (WAL archiving)
  3. src/server/http_server.cpp - Retention endpoints

MEDIUM (Sprint 3)

  1. src/security/vault_hsm_provider.cpp - NEW (HSM integration)
  2. src/security/vcc_pki_client.cpp:348 - X.509 validation
  3. tools/*/ViewModels/*.cs - Remove mock data

Success Metrics

Sprint 1 Goals:

  • ✅ AQL: 85% implementation (up from 65%)
  • ✅ LET: 15+ tests PASSING
  • ✅ OR: 20+ tests PASSING
  • ✅ Hash-Join: 10x speedup on large joins

Sprint 2 Goals:

  • ✅ Content: 60% implementation (up from 30%)
  • ✅ PDF/DOCX: Real-world extraction works
  • ✅ Backups: Point-in-Time Recovery validated
  • ✅ Automated backup jobs running

Sprint 3 Goals:

  • ✅ Admin Tools: 70% implementation (up from 27%)
  • ✅ All 7 tools with live backends
  • ✅ HSM: Vault Transit Engine integrated
  • ✅ eIDAS: Qualified signatures validated

Overall Target: 70% overall implementation (up from 61%)


Dependencies

External libraries to install:

  • poppler-cpp (PDF extraction)
  • libxml2 (DOCX extraction)
  • xlnt (XLSX extraction)
  • aws-sdk-cpp (S3 backups)
  • azure-storage-cpp (Azure backups)
  • google-cloud-cpp (GCS backups)

vcpkg.json Updates:

{
  "dependencies": [
    "poppler",
    "libxml2",
    "xlnt",
    "aws-sdk-cpp[s3]",
    "azure-storage-cpp",
    "google-cloud-cpp[storage]"
  ]
}

Risks & Mitigations

| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| LET implementation is complex | HIGH | MEDIUM | Start with simple expressions, extend step by step |
| Index-merge performance | MEDIUM | LOW | Run benchmarks in parallel with development |
| PDF library integration | MEDIUM | MEDIUM | POC with poppler before full integration |
| HSM costs | HIGH | LOW | Dev environment with a mock HSM, production tests separately |
| Backup storage costs | MEDIUM | MEDIUM | Implement retention policies (auto-delete old backups) |
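
The retention mitigation, combined with the policy from the backup section (keep the last 7 dailies, 4 weeklies, 12 monthlies), could be sketched like this. All names (`Backup`, `selectBackupsToKeep`) are hypothetical, and the sketch makes two simplifying assumptions: a backup is identified by an integer day number, and weeks and months are approximated by epoch-aligned 7-day and 30-day windows rather than calendar units.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical backup record; `day` counts days since an arbitrary epoch.
struct Backup {
    std::string id;
    long day;
};

// Returns the IDs to keep; everything else is eligible for auto-deletion.
std::set<std::string> selectBackupsToKeep(std::vector<Backup> backups,
                                          std::size_t dailies = 7,
                                          std::size_t weeklies = 4,
                                          std::size_t monthlies = 12) {
    // Newest first, so the first backup seen in a bucket is its newest one.
    std::sort(backups.begin(), backups.end(),
              [](const Backup& a, const Backup& b) { return a.day > b.day; });
    std::set<std::string> keep;
    auto keepNewestPerBucket = [&](long bucketDays, std::size_t maxBuckets) {
        std::set<long> seen;
        for (const Backup& b : backups) {
            assert(b.day >= 0);
            const long bucket = b.day / bucketDays;
            if (seen.count(bucket) != 0) continue;  // bucket already covered
            if (seen.size() == maxBuckets) break;   // enough buckets kept
            seen.insert(bucket);
            keep.insert(b.id);
        }
    };
    keepNewestPerBucket(1, dailies);
    keepNewestPerBucket(7, weeklies);    // 7-day windows approximate weeks
    keepNewestPerBucket(30, monthlies);  // 30-day windows approximate months
    return keep;
}
```

A backup is kept if any of the three tiers selects it, so one backup can serve as daily, weekly and monthly copy at the same time, which keeps the number of stored backups below the sum of the three limits.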

Conclusion

Recommended next steps (in order):

  1. NOW: LET/Subqueries (3 days) - BLOCKER for production
  2. THEN: OR/NOT Index-Merge (4 days) - BLOCKER for complex queries
  3. IN PARALLEL: Incremental Backups (5 days) - CRITICAL for production
  4. AFTERWARDS: Content Pipeline (3 days) - Enables RAG
  5. LATER: Admin Tools + HSM (2 weeks) - Operations excellence

Total effort: ~6 weeks for all P0/P1 tasks
Expected outcome: 70% overall implementation, production-ready AQL, operations excellence

ThemisDB Documentation

Version: 1.3.0 | As of: December 2025




Full documentation: https://makr-code.github.io/ThemisDB/
