⚠️ **EXPERIMENTAL**: This is a proof-of-concept. It demonstrates feasibility and measures performance, but the approach is not production-ready. See Caveats below.
This project explores whether Neo4j's existing Lucene fulltext index can support learned sparse vector search (SPLADE) using the "fake words" trick — without any changes to Neo4j itself.
It includes:
- A demo notebook comparing BM25, SPLADE, and dense vector search side by side
- Scalability benchmarks up to 500K documents
- Accuracy evaluation on the MS MARCO benchmark (9,700+ queries)
- A custom Lucene analyzer plugin that reduces storage by 73%
```shell
# Clone and install
git clone https://github.com/neo4j-field/neo4j-sparse-vector-search.git
cd neo4j-sparse-vector-search
uv sync

# Run the demo
uv run jupyter notebook demo_sparse_vectors.ipynb
```

Requirements:
- A running Neo4j instance with fulltext index support (e.g. `bolt://localhost:7687`)
- Python 3.10+
- ~5 minutes to run the demo (1,000 documents)
SPLADE is a neural model that expands a query into semantically related terms:
```
Query:             "automobile prices"
BM25 searches for: automobile, prices
SPLADE expands to: car:2.26, price:2.15, vehicle:1.94, cost:1.85, auto:1.58, ...
```
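SPLADE derives these term weights from a BERT masked-language-model head using the standard SPLADE pooling, w_j = max_i log(1 + relu(logit_ij)). A minimal NumPy sketch with toy logits (a real model produces a seq_len × ~30K-vocabulary matrix; the numbers here are illustrative, chosen to resemble the example above):

```python
import numpy as np

# Toy MLM logits: 3 query token positions x 5 vocabulary terms.
# A real SPLADE model produces (seq_len x ~30k vocab) logits from BERT.
vocab = ["car", "price", "vehicle", "cost", "banana"]
logits = np.array([
    [8.6, 1.2, 6.0, 0.5, -3.0],   # position 1
    [0.3, 7.5, 1.1, 5.4, -2.1],   # position 2
    [2.0, 3.9, 4.2, 2.8, -4.4],   # position 3
])

# Standard SPLADE pooling: w_j = max_i log(1 + relu(logit_ij)).
# relu + log keep the resulting vector sparse and non-negative.
weights = np.log1p(np.maximum(logits, 0.0)).max(axis=0)

# Keep only the nonzero terms -- this is the sparse vector.
sparse_vec = {t: float(round(w, 2)) for t, w in zip(vocab, weights) if w > 0}
print(sparse_vec)
# → {'car': 2.26, 'price': 2.14, 'vehicle': 1.95, 'cost': 1.86}
```

Terms the model considers irrelevant (here `banana`) get zero weight and drop out entirely, which is what keeps the vectors sparse enough for an inverted index.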
The fake words trick stores these weighted terms in a fulltext-indexed property by repeating tokens proportionally to their weight, letting Lucene's inverted index handle the rest.
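The quantize-and-repeat step can be sketched in a few lines of Python (the helper name and the ×10 scale factor are illustrative choices, not taken from the project's notebooks):

```python
def to_fake_words(weights: dict[str, float], scale: int = 10) -> str:
    """Encode a sparse vector as repeated tokens for a fulltext index.

    Each float weight is quantized to an integer count (losing some
    precision -- see the quantization caveat) and the token is repeated
    that many times, so Lucene's term frequencies mirror the weights.
    """
    tokens = []
    for term, w in weights.items():
        count = max(1, round(w * scale))
        tokens.extend([term] * count)
    return " ".join(tokens)

text = to_fake_words({"car": 2.26, "price": 2.14})
# "car" repeated 23 times, "price" repeated 21 times
```

Storing this string on a node property and covering it with a fulltext index is all that is needed; no Neo4j changes are involved.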
Our custom analyzer stores a compact token:count format instead of repeated tokens, reducing property storage by 73% while producing an identical Lucene index.
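Comparing the compact encoding with the naive one shows where the savings come from (helper name and scale factor are again illustrative; the exact serialization is defined by the plugin):

```python
def to_compact(weights: dict[str, float], scale: int = 10) -> str:
    """Encode a sparse vector as token:count pairs. The custom analyzer
    expands these back into repeated tokens at index time, so the
    resulting Lucene index is identical to the naive encoding."""
    return " ".join(f"{t}:{max(1, round(w * scale))}" for t, w in weights.items())

vec = {"car": 2.26, "price": 2.14, "vehicle": 1.94}

compact = to_compact(vec)
# "car:23 price:21 vehicle:19"

# Naive fake-words encoding of the same vector, for comparison.
naive = " ".join(t for t, w in vec.items() for _ in range(max(1, round(w * 10))))

print(len(compact), len(naive))  # the compact form is several times smaller
```

The property stores each token once with its count instead of repeating it, which is where the 73% storage reduction comes from on real SPLADE vectors.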
| Method | MRR@10 | Recall@10 | vs BM25 |
|---|---|---|---|
| BM25 (keyword) | 0.363 | 0.731 | baseline |
| SPLADE (fake words) | 0.495 | 0.884 | +36% MRR |
| SPLADE (custom analyzer) | 0.495 | 0.885 | identical |
SPLADE wins on 69% of queries where it disagrees with BM25. It finds documents using synonyms and related concepts that keyword search misses.
| Metric | BM25 | SPLADE (custom analyzer) |
|---|---|---|
| Query latency (p50) | 10 ms | 19 ms |
| Ingestion throughput | 3,105 docs/s | 4,898 docs/s |
| Fulltext index size | 76 MB | 402 MB |
| Property storage | 202 MB (text) | 511 MB (compact) |
| Metric | Naive fake words | Custom analyzer | Savings |
|---|---|---|---|
| Property size (500K) | 1,902 MB | 511 MB | 73% |
| Lucene index size | ~400 MB | ~400 MB | Same |
| Ingestion speed | 3,105 docs/s | 4,898 docs/s | 58% faster |
```
demo_sparse_vectors.ipynb                 ← START HERE — interactive introduction
neo4j-sparse-analyzer/                    Custom Lucene analyzer plugin (Java)
notebooks/                                Benchmarks and evaluation
├── README.md                             Guide to each notebook
├── sparse_vector_poc.ipynb               Original POC (archived)
├── scalability_benchmark.ipynb
├── scalability_benchmark_analyzer.ipynb
├── accuracy_evaluation.ipynb
└── *.json                                Pre-computed results
```
The neo4j-sparse-analyzer/ directory contains a Java plugin that registers a sparse-vector analyzer with Neo4j. It parses token:count format and expands into repeated tokens at index time.
Benefits: 73% less property storage, 58% faster ingestion, identical search results.
A pre-built JAR is included; copy it into your Neo4j plugins/ folder and restart the server. See neo4j-sparse-analyzer/README.md for details.
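Once the plugin is installed, usage might look like the following sketch. The analyzer name `sparse-vector`, the `Document` label, the `sparse_vector` property, and the query-side boost encoding are all assumptions for illustration; check neo4j-sparse-analyzer/README.md for the actual registered names. The snippet only composes the Cypher statements, since running them needs a live database:

```python
# Index creation: point the fulltext index at the analyzer registered by
# the plugin (assumed name "sparse-vector").
CREATE_INDEX = """
CREATE FULLTEXT INDEX sparseVectors IF NOT EXISTS
FOR (d:Document) ON EACH [d.sparse_vector]
OPTIONS {indexConfig: {`fulltext.analyzer`: 'sparse-vector'}}
"""

def build_query(weights: dict[str, float]) -> str:
    """One plausible query-side encoding: Lucene boost syntax (term^weight)."""
    return " OR ".join(f"{t}^{w:.2f}" for t, w in weights.items())

SEARCH = """
CALL db.index.fulltext.queryNodes('sparseVectors', $q)
YIELD node, score
RETURN node.title AS title, score
ORDER BY score DESC LIMIT 10
"""

q = build_query({"car": 2.26, "price": 2.15, "vehicle": 1.94})
# q == "car^2.26 OR price^2.15 OR vehicle^1.94"

# With a running Neo4j instance (uncomment to execute):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
#     driver.execute_query(CREATE_INDEX)
#     records, _, _ = driver.execute_query(SEARCH, q=q)
```

`db.index.fulltext.queryNodes` accepts Lucene query syntax, so boosted OR-queries like the one above work against any fulltext index.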
| Issue | Description | Status |
|---|---|---|
| BM25 scoring mismatch | Lucene uses BM25 (term saturation, IDF, length norm) instead of dot-product scoring that SPLADE was trained for. | Open — would require a custom Lucene Similarity plugin + Neo4j core changes to expose it |
| No native integration | This is a workaround, not a supported feature. May break across Neo4j versions. | Open — inherent to the approach |
| Issue | Description | Status |
|---|---|---|
| Storage overhead | SPLADE fulltext index is ~5x larger than BM25 text index (402 MB vs 76 MB at 500K). | Tested — manageable at 500K. Custom analyzer reduces property storage by 73%. |
| Ranking precision loss | Float-to-integer quantization loses precision. | Tested — MRR@10 of 0.495 suggests acceptable quality |
| Query-time latency | SPLADE model inference adds ~20-50ms per query. | Tested — total round-trip 25-70ms depending on corpus size |
| Model dependency | Requires SPLADE model at index and query time. Model updates require full re-indexing. | Open |
| Issue | Description | Status |
|---|---|---|
| TooManyClauses | Lucene's boolean clause limit (1,024 by default). | Tested — max 69 query terms, well within limits |
| String store performance | Large property strings stored externally. | Mitigated — custom analyzer reduces to ~1 KB avg (73% savings) |
| Tokenization mismatch | BERT tokenizer differs from Lucene's analyzer. | Solved — custom analyzer uses whitespace on pre-tokenized input |
- Custom Lucene similarity — Replace BM25 with dot-product scoring. Requires Neo4j core changes to expose a `fulltext.similarity` configuration. Combined with the custom analyzer, this could address the main remaining accuracy caveat.
- Hybrid search testing — Combine BM25 + SPLADE + dense vectors and measure accuracy on MS MARCO. The demo notebook shows the approach; a rigorous evaluation with the full dataset is a natural next step.
- Larger-scale testing — Test beyond 500K documents (1M+, 5M+) to identify breaking points for latency, memory, and index size.