
# Sparse Vector Search in Neo4j

> ⚠️ **EXPERIMENTAL** — this is a proof of concept. It demonstrates feasibility and measures performance, but the approach is not production-ready. See "Known caveats and limitations" below.

## What is this?

This project explores whether Neo4j's existing Lucene fulltext index can support learned sparse vector search (SPLADE) using the "fake words" trick — without any changes to Neo4j itself.

It includes:

  • A demo notebook comparing BM25, SPLADE, and dense vector search side by side
  • Scalability benchmarks up to 500K documents
  • Accuracy evaluation on the MS MARCO benchmark (9,700+ queries)
  • A custom Lucene analyzer plugin that reduces storage by 73%

## Quick start

```bash
# Clone and install
git clone https://github.com/neo4j-field/neo4j-sparse-vector-search.git
cd neo4j-sparse-vector-search
uv sync

# Run the demo
uv run jupyter notebook demo_sparse_vectors.ipynb
```

Requirements:

- Neo4j instance with fulltext index support (`bolt://localhost:7687`)
- Python 3.10+
- ~5 minutes to run the demo (1,000 documents)

## How it works

SPLADE is a neural model that expands a query into semantically related terms:

```text
Query: "automobile prices"

BM25 searches for:   automobile, prices
SPLADE expands to:   car:2.26, price:2.15, vehicle:1.94, cost:1.85, auto:1.58, ...
```

The fake words trick stores these weighted terms in a fulltext-indexed property by repeating tokens proportionally to their weight, letting Lucene's inverted index handle the rest.
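The encoding can be sketched in a few lines of Python. This is illustrative: the scale factor used for quantization is an assumption, not the value the project actually uses.

```python
# Illustrative sketch of the "fake words" encoding. SPLADE assigns each
# expansion term a float weight; quantizing the weight to an integer repeat
# count makes Lucene's term frequency stand in for the weight. The scale
# factor of 10 is an assumption for illustration.

def encode_fake_words(weights: dict[str, float], scale: int = 10) -> str:
    """Turn {token: weight} into a space-separated string of repeated tokens."""
    parts: list[str] = []
    for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        count = max(1, round(weight * scale))  # float weight -> integer repeats
        parts.extend([token] * count)
    return " ".join(parts)

sparse = {"car": 2.26, "price": 2.15, "vehicle": 1.94}
doc_text = encode_fake_words(sparse)
print(doc_text.split().count("car"))  # 23 — weight 2.26 quantized at scale 10
```

Because the weight is baked into the term frequency, an ordinary fulltext query over this property already ranks higher-weighted expansion terms more strongly.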

Our custom analyzer stores a compact `token:count` format instead of repeated tokens, reducing property storage by 73% while producing an identical Lucene index.
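The compact format and the expansion the analyzer performs at index time can be sketched as follows. The exact format details live in the plugin, so treat the specifics here as assumptions.

```python
# Sketch of the compact token:count property format. The property stores
# "car:23 vehicle:19 ..." once, and the custom analyzer expands each count
# into repeated tokens at index time, yielding the same Lucene index as the
# naive repeated-token encoding. Details are illustrative.

def encode_compact(weights: dict[str, float], scale: int = 10) -> str:
    """Turn {token: weight} into a space-separated token:count string."""
    return " ".join(
        f"{token}:{max(1, round(weight * scale))}"
        for token, weight in sorted(weights.items(), key=lambda kv: -kv[1])
    )

def expand_at_index_time(compact: str) -> list[str]:
    """What the analyzer emits into the index for a token:count string."""
    tokens: list[str] = []
    for pair in compact.split():
        token, count = pair.rsplit(":", 1)
        tokens.extend([token] * int(count))
    return tokens

compact = encode_compact({"car": 2.26, "vehicle": 1.94})
naive = " ".join(expand_at_index_time(compact))
print(compact)                    # car:23 vehicle:19
print(len(compact) < len(naive))  # the compact property is far smaller
```

The stored property shrinks roughly in proportion to the average repeat count, which is where the 73% property-storage savings comes from.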

## Key findings

### Accuracy (MS MARCO, 9,706 queries, 81K passages)

| Method | MRR@10 | Recall@10 | vs BM25 |
|---|---|---|---|
| BM25 (keyword) | 0.363 | 0.731 | baseline |
| SPLADE (fake words) | 0.495 | 0.884 | +36% MRR |
| SPLADE (custom analyzer) | 0.495 | 0.885 | identical |

SPLADE wins on 69% of the queries where it disagrees with BM25: it finds documents through synonyms and related concepts that keyword search misses.

### Scalability (500K documents, MacBook Pro M1, 64 GB RAM)

| Metric | BM25 | SPLADE (custom analyzer) |
|---|---|---|
| Query latency (p50) | 10 ms | 19 ms |
| Ingestion throughput | 3,105 docs/s | 4,898 docs/s |
| Fulltext index size | 76 MB | 402 MB |
| Property storage | 202 MB (text) | 511 MB (compact) |

### Storage comparison (custom analyzer vs. naive fake words)

| | Naive fake words | Custom analyzer | Savings |
|---|---|---|---|
| Property size (500K docs) | 1,902 MB | 511 MB | 73% |
| Lucene index size | ~400 MB | ~400 MB | same |
| Ingestion speed | 3,105 docs/s | 4,898 docs/s | 58% faster |

## Project structure

```text
demo_sparse_vectors.ipynb      ← START HERE — interactive introduction
neo4j-sparse-analyzer/         Custom Lucene analyzer plugin (Java)
notebooks/                     Benchmarks and evaluation
  ├── README.md                Guide to each notebook
  ├── sparse_vector_poc.ipynb  Original POC (archived)
  ├── scalability_benchmark.ipynb
  ├── scalability_benchmark_analyzer.ipynb
  ├── accuracy_evaluation.ipynb
  └── *.json                   Pre-computed results
```

## Custom analyzer plugin

The `neo4j-sparse-analyzer/` directory contains a Java plugin that registers a sparse-vector analyzer with Neo4j. It parses the `token:count` format and expands it into repeated tokens at index time.

Benefits: 73% less property storage, 58% faster ingestion, identical search results.

A pre-built JAR is included — just copy it to your Neo4j `plugins/` folder and restart. See `neo4j-sparse-analyzer/README.md` for details.
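With the plugin installed, wiring it up might look like the following Cypher. The analyzer name (`sparse-vector`), index name, label, and property names here are assumptions for illustration — check `neo4j-sparse-analyzer/README.md` for the exact identifiers.

```cypher
// Create a fulltext index that uses the custom analyzer (names illustrative).
CREATE FULLTEXT INDEX sparseVectors FOR (d:Document) ON EACH [d.sparse]
OPTIONS { indexConfig: { `fulltext.analyzer`: 'sparse-vector' } };

// Query with SPLADE-expanded terms through the standard fulltext procedure.
CALL db.index.fulltext.queryNodes('sparseVectors', 'car price vehicle auto')
YIELD node, score
RETURN node.title AS title, score
ORDER BY score DESC LIMIT 10;
```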

## Known caveats and limitations

### High severity

| Issue | Description | Status |
|---|---|---|
| BM25 scoring mismatch | Lucene scores with BM25 (term saturation, IDF, length normalization) rather than the dot-product scoring SPLADE was trained for. | Open — would require a custom Lucene `Similarity` plugin plus Neo4j core changes to expose it |
| No native integration | This is a workaround, not a supported feature, and may break across Neo4j versions. | Open — inherent to the approach |
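To make the scoring mismatch concrete, here is a sketch of the score SPLADE is trained for — a dot product over sparse weight vectors. All weights below are made up for illustration.

```python
# SPLADE's intended score is a dot product over sparse weight vectors.
# Lucene instead applies BM25 (IDF, term-frequency saturation, length
# normalization) to the repeated fake-word tokens, which distorts this.
# All weights below are illustrative.

def splade_score(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product over the terms the query and document share."""
    return sum(weight * doc.get(token, 0.0) for token, weight in query.items())

query = {"car": 2.26, "price": 2.15, "vehicle": 1.94}
doc_about_cars = {"car": 1.8, "vehicle": 1.2, "engine": 0.9}
doc_about_costs = {"price": 0.9, "cost": 1.1}

print(round(splade_score(query, doc_about_cars), 3))   # 6.396
print(round(splade_score(query, doc_about_costs), 3))  # 1.935
```

BM25's saturation and IDF reweighting can rank these two documents differently from the dot product, which is why a custom `Similarity` would be needed to close the gap.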

### Medium severity

| Issue | Description | Status |
|---|---|---|
| Storage overhead | The SPLADE fulltext index is ~5x larger than the BM25 text index (402 MB vs. 76 MB at 500K docs). | Tested — manageable at 500K; the custom analyzer reduces property storage by 73% |
| Ranking precision loss | Float-to-integer quantization loses precision. | Tested — MRR@10 of 0.495 suggests acceptable quality |
| Query-time latency | SPLADE model inference adds ~20–50 ms per query. | Tested — total round trip of 25–70 ms depending on corpus size |
| Model dependency | Requires the SPLADE model at both index and query time; model updates require full re-indexing. | Open |

### Resolved or non-issues

| Issue | Description | Status |
|---|---|---|
| TooManyClauses | Lucene's boolean clause limit (~1,024 clauses). | Tested — max 69 query terms, well within limits |
| String store performance | Large property strings are stored externally. | Mitigated — the custom analyzer reduces the average property to ~1 KB (73% savings) |
| Tokenization mismatch | The BERT tokenizer differs from Lucene's analyzer. | Solved — the custom analyzer uses whitespace tokenization on pre-tokenized input |

## Possible next steps

  1. Custom Lucene similarity — Replace BM25 with dot-product scoring. Requires Neo4j core changes to expose fulltext.similarity configuration. Combined with the custom analyzer, this could address the main remaining accuracy caveat.

  2. Hybrid search testing — Combine BM25 + SPLADE + dense vectors and measure accuracy on MS MARCO. The demo notebook shows the approach; a rigorous evaluation with the full dataset is a natural next step.

  3. Larger scale testing — Test beyond 500K documents (1M+, 5M+) to identify breaking points for latency, memory, and index size.
