stemming
As of: December 5, 2025
Version: 1.0.0
Category: Search
Status: ✅ Implemented (v1.1) – Per-Index Configuration
Themis supports optional stemming for fulltext indexes to improve text matching by reducing words to their root form. This increases recall by matching different word forms (e.g., "running" matches "run", "runs").
| Language | Code | Algorithm | Examples |
|---|---|---|---|
| English | en | Porter Subset | running→run, cats→cat, played→play |
| German | de | Suffix Removal | laufen→lauf, machte→macht, gruppen→grupp |
| None | none | No stemming | Exact token matching only |
HTTP API:
POST /index/create
{
"table": "articles",
"column": "content",
"type": "fulltext",
"config": {
"stemming_enabled": true,
"language": "de",
"stopwords_enabled": true // optional: remove stopwords
}
}
C++ API:
SecondaryIndexManager::FulltextConfig config;
config.stemming_enabled = true;
config.language = "en";
auto status = indexMgr.createFulltextIndex("articles", "content", config);
Without a config field, stemming is disabled by default:
POST /index/create
{
"table": "articles",
"column": "content",
"type": "fulltext"
}
# Equivalent to:
# "config": {"stemming_enabled": false, "language": "none"}
When a document is indexed with stemming enabled:
- Tokenization: Text is split on whitespace and punctuation
- Lowercase: Tokens are converted to lowercase
- Stemming: Tokens are reduced to their stem form (if enabled)
- Storage: Stemmed tokens are stored in the inverted index
Note: If stopwords are enabled, stopwords are filtered out before stemming. If umlaut normalization is enabled (German), normalization occurs before tokenization.
Example (English):
Input: "Machine learning algorithms are optimizing systems"
Tokens: ["machine", "learning", "algorithms", "are", "optimizing", "systems"]
Stems: ["machin", "learn", "algorithm", "are", "optim", "system"]
Example (German):
Input: "Die Maschinen lernen aus vergangenen Fehlern"
Tokens: ["die", "maschinen", "lernen", "aus", "vergangenen", "fehlern"]
Stems: ["die", "maschin", "lern", "aus", "vergangen", "fehl"]
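The indexing steps above can be sketched as a small analysis function. This is a minimal illustration, not the Themis implementation: `tokenizeLower` and `analyze` are hypothetical names, the tokenizer only understands ASCII letters and digits (other bytes such as UTF-8 umlauts are passed through unchanged), and the stemmer is supplied by the caller.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: split on anything that is not an ASCII letter or digit
// and lowercase each character. Bytes >= 0x80 (e.g. UTF-8 umlauts) are kept
// as part of the token and left unchanged.
std::vector<std::string> tokenizeLower(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c >= 0x80) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical pipeline: tokenize + lowercase, drop stopwords, then stem.
// Passing an empty std::function disables stemming (exact tokens are kept).
std::vector<std::string> analyze(
    const std::string& text,
    const std::unordered_set<std::string>& stopwords,
    const std::function<std::string(const std::string&)>& stem) {
    std::vector<std::string> out;
    for (const auto& token : tokenizeLower(text)) {
        if (stopwords.count(token) > 0) continue;  // filtered before stemming
        out.push_back(stem ? stem(token) : token);
    }
    return out;
}
```

The same function would be applied to documents at index time and to query strings at search time, so that both sides produce comparable tokens.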
When searching with stemming enabled:
- Query tokens are processed identically to index tokens
- Stemmed query terms match stemmed index terms
- BM25 scoring uses stemmed token statistics
Example Query:
POST /search/fulltext
{
"table": "articles",
"column": "content",
"query": "learning optimization",
"limit": 10
}
With stemming enabled (language: "en"):
- Query stems to: ["learn", "optim"]
- Matches documents containing: "learning", "learned", "learns", "optimize", "optimizing", "optimization"
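Why identical processing on both sides matters can be shown with a toy in-memory inverted index. Everything here is an assumption for illustration: `TinyIndex` and `toyStem` are hypothetical, the toy stemmer only strips a trailing "ing" or "s", tokens are split on whitespace, and multi-term queries use OR semantics, which the text above does not specify.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>

using Normalizer = std::function<std::string(const std::string&)>;

// Toy stand-in for the real stemmer: strips a trailing "ing" or "s".
std::string toyStem(const std::string& t) {
    if (t.size() > 5 && t.compare(t.size() - 3, 3, "ing") == 0) {
        return t.substr(0, t.size() - 3);
    }
    if (t.size() > 3 && t.back() == 's') return t.substr(0, t.size() - 1);
    return t;
}

// Hypothetical inverted index: normalized token -> set of primary keys.
class TinyIndex {
public:
    explicit TinyIndex(Normalizer normalize) : normalize_(std::move(normalize)) {}

    void add(const std::string& pk, const std::string& text) {
        std::istringstream in(text);
        std::string token;
        while (in >> token) postings_[normalize_(token)].insert(pk);
    }

    // Query tokens go through the same normalizer as indexed tokens.
    // Assumed OR semantics: a document matches if any query term matches.
    std::set<std::string> search(const std::string& query) const {
        std::set<std::string> hits;
        std::istringstream in(query);
        std::string token;
        while (in >> token) {
            auto it = postings_.find(normalize_(token));
            if (it != postings_.end()) {
                hits.insert(it->second.begin(), it->second.end());
            }
        }
        return hits;
    }

private:
    Normalizer normalize_;
    std::map<std::string, std::set<std::string>> postings_;
};
```

With `toyStem` as the normalizer, a document containing "learning" is found by the query "learns" because both reduce to "learn". With an identity normalizer the same query finds nothing.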
Implements a simplified version of the Porter Stemmer:
Step 1a - Plurals:
- `sses` → `ss` (caresses → caress)
- `ies` → `i` (ponies → poni)
- `s` → (removed) (cats → cat)
Step 1b - Past Tense and Participles:
- `eed` → `ee` (agreed → agree)
- `ed` → (removed) (played → play)
- `ing` → (removed) (running → run, with double consonant removal)
Step 1c - Y suffix:
- `y` → `i` (happy → happi, only if preceded by consonant)
Step 2 - Common Suffixes:
- `ational` → `ate` (relational → relate)
- `ation` → `ate` (activation → activate)
- `ness` → (removed) (goodness → good)
- `enci` → `enc` (valenci → valenc)
Limitations:
- Simplified subset (not full Porter)
- No Step 3-5 transformations
- Minimum word length: 3 characters
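As an illustration of the rule ordering, here is a minimal sketch of Steps 1a and 1b only. It is not the code in Themis. The `ss` guard, the "stem must contain a vowel" check, and the exemption of `l`, `s`, `z` from double-consonant removal are borrowed from the original Porter algorithm, and the vowel check for `eed` is a rough stand-in for Porter's measure condition. Steps 1c and 2 and Porter's post-removal fixes (for example restoring a trailing `e`) are omitted. All function names are hypothetical.

```cpp
#include <cassert>
#include <string>

namespace {

bool endsWith(const std::string& w, const std::string& suffix) {
    return w.size() >= suffix.size() &&
           w.compare(w.size() - suffix.size(), suffix.size(), suffix) == 0;
}

bool isPlainVowel(char c) {
    return std::string("aeiou").find(c) != std::string::npos;
}

// 'y' counts as a vowel unless it is the first letter (simplification).
bool hasVowel(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isPlainVowel(s[i]) || (s[i] == 'y' && i > 0)) return true;
    }
    return false;
}

// Undo consonant doubling ("runn" -> "run"); l, s, z stay doubled.
void dropDoubledConsonant(std::string& w) {
    const std::size_t n = w.size();
    if (n >= 2 && w[n - 1] == w[n - 2] && !isPlainVowel(w[n - 1]) &&
        w[n - 1] != 'l' && w[n - 1] != 's' && w[n - 1] != 'z') {
        w.pop_back();
    }
}

// Remove `suffix` only if the remaining stem contains a vowel.
bool stripIfStemHasVowel(std::string& w, const std::string& suffix) {
    if (!endsWith(w, suffix)) return false;
    const std::string stem = w.substr(0, w.size() - suffix.size());
    if (!hasVowel(stem)) return false;
    w = stem;
    return true;
}

}  // namespace

// Expects a lowercase ASCII word.
std::string stemEnglishSketch(std::string w) {
    if (w.size() < 3) return w;  // minimum word length: 3 characters

    // Step 1a: plurals.
    if (endsWith(w, "sses") || endsWith(w, "ies")) {
        w.erase(w.size() - 2);
    } else if (!endsWith(w, "ss") && endsWith(w, "s")) {
        w.pop_back();
    }

    // Step 1b: -eed, -ed, -ing.
    if (endsWith(w, "eed")) {
        if (hasVowel(w.substr(0, w.size() - 3))) w.pop_back();
    } else if (stripIfStemHasVowel(w, "ed") || stripIfStemHasVowel(w, "ing")) {
        dropDoubledConsonant(w);
    }
    return w;
}
```

The vowel check is what keeps short words such as "sing" or "feed" intact, since their would-be stems ("s", "f") contain no vowel.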
Removes common German suffixes in order:
Plurals and Declension:
- `ern`, `em`, `en`, `er`, `es`, `e`, `s`
Derivational Suffixes:
- `ung` (Handlung → Handl)
- `heit` (Freiheit → Frei)
- `keit` (Möglichkeit → Möglich)
- `lich` (freundlich → freund)
Limitations:
- No umlaut normalization (ä, ö, ü unchanged)
- No compound word splitting
- No strong verb handling (irregular forms)
- Order-dependent (may over-stem in edge cases)
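The ordered suffix removal can be sketched as follows. This is an assumption-laden illustration rather than the shipped stemmer: it strips at most one inflectional suffix and then at most one derivational suffix, and it requires at least three bytes to remain (`kMinStemLength` is a hypothetical guard, chosen so that short words like "die" and "aus" stay intact, as in the example output earlier). The function names are hypothetical and input is expected to be lowercase.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

namespace {

constexpr std::size_t kMinStemLength = 3;  // assumption, not from the source

// Remove the first suffix in `suffixes` that matches and leaves a long
// enough stem. List order matters: "ern" is tried before "en" and "e".
template <std::size_t N>
bool stripFirstMatch(std::string& word,
                     const std::array<const char*, N>& suffixes) {
    for (const char* s : suffixes) {
        const std::string suffix(s);
        if (word.size() >= suffix.size() + kMinStemLength &&
            word.compare(word.size() - suffix.size(), suffix.size(),
                         suffix) == 0) {
            word.erase(word.size() - suffix.size());
            return true;
        }
    }
    return false;
}

}  // namespace

std::string stemGermanSketch(std::string word) {
    static const std::array<const char*, 7> inflectional = {
        "ern", "em", "en", "er", "es", "e", "s"};
    static const std::array<const char*, 4> derivational = {
        "ung", "heit", "keit", "lich"};
    stripFirstMatch(word, inflectional);   // plurals and declension
    stripFirstMatch(word, derivational);   // derivational suffixes
    return word;
}
```

Because only the first matching suffix in each list is removed, "vergangenen" becomes "vergangen" rather than "vergang", while "handlungen" loses both "en" and "ung". Umlauts are left untouched, consistent with the limitations listed above.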
Stemming configuration is persisted in RocksDB:
Key: ftidxmeta:table:column
Value (JSON):
{
"type": "fulltext",
"stemming_enabled": true,
"language": "de"
}
Stemmed tokens are stored in the same index keys as non-stemmed:
- Presence: `ftidx:table:column:token:PK` → "" (token is stemmed if config enabled)
- Term Frequency: `fttf:table:column:token:PK` → count
- Doc Length: `ftdlen:table:column:PK` → total_tokens
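The key layouts listed above can be written down as small builder functions. The prefixes and field order are taken from this page. The `ftkeys` namespace and function names are hypothetical, and the sketch assumes that table, column, and PK values contain no `:` characters.

```cpp
#include <cassert>
#include <string>

namespace ftkeys {

// ftidxmeta:table:column -> JSON config
inline std::string metaKey(const std::string& table, const std::string& column) {
    return "ftidxmeta:" + table + ":" + column;
}

// ftidx:table:column:token:PK -> "" (presence marker)
inline std::string presenceKey(const std::string& table, const std::string& column,
                               const std::string& token, const std::string& pk) {
    return "ftidx:" + table + ":" + column + ":" + token + ":" + pk;
}

// fttf:table:column:token:PK -> term count
inline std::string termFrequencyKey(const std::string& table, const std::string& column,
                                    const std::string& token, const std::string& pk) {
    return "fttf:" + table + ":" + column + ":" + token + ":" + pk;
}

// ftdlen:table:column:PK -> total token count of the document
inline std::string docLengthKey(const std::string& table, const std::string& column,
                                const std::string& pk) {
    return "ftdlen:" + table + ":" + column + ":" + pk;
}

}  // namespace ftkeys
```

With stemming enabled, the `token` argument would be the stemmed form (for example "learn" rather than "learning"), which is why changing the config requires re-indexing.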
Indexes created before stemming support:
- Behavior: Config lookup returns `{stemming_enabled: false, language: "none"}`
- Migration: Recreate index with `POST /index/create` and new config
- No Auto-Migration: Existing indexes remain unchanged

API compatibility:
- `POST /index/create` without `config` field → no stemming (default)
- Query API unchanged: `/search/fulltext` automatically uses index config
- C++ API: `createFulltextIndex(table, column)` → default config
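The fallback behavior for older indexes amounts to "missing metadata means default config". A minimal sketch, using an in-memory map as a stand-in for the RocksDB metadata lookup. `FulltextConfigSketch` and `loadConfigOrDefault` are hypothetical names, and the real code would parse the stored JSON value rather than hold a struct.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical mirror of the persisted config. The defaults match what the
// lookup reports for indexes created before stemming support.
struct FulltextConfigSketch {
    bool stemming_enabled = false;
    std::string language = "none";
};

// Stand-in for the metadata store: key "ftidxmeta:table:column" -> config.
using MetaStore = std::unordered_map<std::string, FulltextConfigSketch>;

FulltextConfigSketch loadConfigOrDefault(const MetaStore& store,
                                         const std::string& table,
                                         const std::string& column) {
    const auto it = store.find("ftidxmeta:" + table + ":" + column);
    // No metadata entry: behave as an index without stemming.
    return it == store.end() ? FulltextConfigSketch{} : it->second;
}
```

Since the query path reads the same config, existing callers of `/search/fulltext` need no changes when an index is later recreated with stemming.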
Index size:
- Reduction: Stemming typically reduces unique token count by 10-30%
- Compression: Fewer unique tokens → better RocksDB compression
- Trade-off: Slight increase in false positives (over-matching)

Query latency:
- Impact: Negligible (stemming overhead < 1% of total query time)
- Optimization: Stemmer uses in-memory string manipulation
- Caching: Not needed (stemming is fast enough)

Indexing time:
- Impact: +5-10% for large datasets (stemming overhead)
- Mitigation: Rebuild only needed when changing config
See tests/test_stemming.cpp:
// English stemming
EXPECT_EQ(Stemmer::stem("cats", EN), "cat");
EXPECT_EQ(Stemmer::stem("running", EN), "run");
EXPECT_EQ(Stemmer::stem("relational", EN), "relate");
// German stemming
EXPECT_EQ(Stemmer::stem("laufen", DE), "lauf");
EXPECT_EQ(Stemmer::stem("machte", DE), "macht");
EXPECT_EQ(Stemmer::stem("wirkung", DE), "wirk");
// Create index with stemming
FulltextConfig config{true, "en"};
indexMgr->createFulltextIndex("articles", "content", config);
// Insert document
BaseEntity doc("doc1");
doc.setField("content", "running dogs");
indexMgr->put("articles", doc);
// Query with base form
auto [status, results] = indexMgr->scanFulltext("articles", "content", "run");
EXPECT_EQ(results.size(), 1); // Matches "running"
✅ Enable stemming when:
- Content is in a supported language (EN/DE)
- Recall is more important than precision
- Users search with different word forms
- Text contains morphological variations (verbs, plurals)
❌ Disable stemming when:
- Exact matching is required (e.g., product codes, technical terms)
- Content is multilingual without dominant language
- Domain-specific terminology should not be normalized
- Precision is critical (avoid false positives)
- Monolingual content: Use appropriate language code (`en`, `de`)
- Mixed content: Choose dominant language or use `none`
- Unknown language: Use `none` (exact matching)
To change stemming config:
- Drop existing index: `POST /index/drop`
- Create new index with config: `POST /index/create`
- Data will be automatically re-indexed on next entity update
- Optional: Trigger rebuild via `POST /index/rebuild`
- Umlaut normalization: ä→a, ö→o, ü→u for German (implemented in v1.3)
- More languages: FR, ES, IT, NL via Snowball integration
- Custom stemmers: Plugin interface for domain-specific rules
- Compound word splitting (German): "Fußballweltmeisterschaft" → ["fußball", "welt", "meisterschaft"]
- Lemmatization: More accurate than stemming ("better" → "good")
- N-grams: Partial matching and typo tolerance
- Phonetic matching: Soundex/Metaphone for fuzzy search
# Create index with German stemming
POST /index/create
{
"table": "gesetze",
"column": "text",
"type": "fulltext",
"config": {"stemming_enabled": true, "language": "de"}
}
# Insert document
PUT /entities/gesetze/bgb123
{"text": "Die Verträge müssen schriftlich geschlossen werden"}
# Search (matches "Vertrag", "Vertrags", etc.; matching "Verträge" additionally requires umlaut normalization, since ä is otherwise left unchanged)
POST /search/fulltext
{
"table": "gesetze",
"column": "text",
"query": "Vertrag schriftlich",
"limit": 20
}
# Create index with English stemming
POST /index/create
{
"table": "docs",
"column": "content",
"type": "fulltext",
"config": {"stemming_enabled": true, "language": "en"}
}
# Insert documents
PUT /entities/docs/ml101
{"content": "Machine learning algorithms optimize neural networks"}
PUT /entities/docs/ml102
{"content": "Optimizing machine learned models for production"}
# Search (matches both documents)
POST /search/fulltext
{
"table": "docs",
"column": "content",
"query": "optimize learning",
"limit": 10
}
# Response:
# [
# {"pk": "ml102", "score": 9.42}, # "Optimizing...learned"
# {"pk": "ml101", "score": 8.15} # "learning...optimize"
# ]
Problem: Query returns empty results after enabling stemming
Diagnosis:
- Check if index was recreated with new config
- Verify documents were re-indexed after config change
- Test with non-stemmed query (exact token match)
Solution:
# Rebuild index to apply stemming to existing documents
POST /index/rebuild
{"table": "docs", "column": "content"}
Problem: Query matches unrelated documents
Cause: Over-stemming (common with aggressive algorithms)
Example:
- "university" → "univers"
- "universal" → "univers"
- Both match despite different meanings
Solution:
- Disable stemming if precision is critical
- Use exact phrases with quotes (future feature)
- Add domain-specific stopwords
Problem: Poor results for multilingual content
Cause: Single-language stemmer applied to mixed content
Solution:
- Create separate indexes per language
- Use language detection to route queries
- Fall back to `language: "none"` for mixed content
- Porter Stemmer: Martin Porter, 1980
- Snowball Algorithms: tartarus.org/martin/PorterStemmer/
- BM25 Ranking: See `docs/search/fulltext_api.md`
- HTTP API: See `openapi/openapi.yaml`
Last Updated: 2025-11-02
Version: v1.1
Status: Production Ready