⚠️ **EXPERIMENTAL**: This is a proof-of-concept. It demonstrates feasibility and measures performance, but the approach is not production-ready. See Caveats below.
This project explores whether Neo4j's existing Lucene fulltext index can support learned sparse vector search (SPLADE) using the "fake words" trick — without any changes to Neo4j itself.
It includes:
- A demo notebook comparing BM25, SPLADE, and dense vector search side by side
- Scalability benchmarks up to 500K documents
- Accuracy evaluation on the MS MARCO benchmark (9,700+ queries)
- A custom Lucene analyzer plugin that reduces storage by 73%
```shell
# Clone and install
git clone https://github.com/neo4j-field/neo4j-sparse-vector-search.git
cd neo4j-sparse-vector-search
uv sync

# Run the demo
uv run jupyter notebook demo_sparse_vectors.ipynb
```

Requirements:
- A running Neo4j instance with fulltext index support (e.g. `bolt://localhost:7687`)
- Python 3.10+
- ~5 minutes to run the demo (1,000 documents)
SPLADE is a neural model that expands a query into semantically related terms:
```
Query:             "automobile prices"
BM25 searches for: automobile, prices
SPLADE expands to: car:2.26, price:2.15, vehicle:1.94, cost:1.85, auto:1.58, ...
```
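SPLADE derives these term weights from a BERT masked-language-model head using the standard SPLADE pooling, w_j = max_i log(1 + relu(logit_ij)). A minimal NumPy sketch with toy logits (a real model produces a seq_len × ~30K-vocabulary matrix; the numbers here are illustrative, chosen to resemble the example above):

```python
import numpy as np

# Toy MLM logits: 3 query token positions x 5 vocabulary terms.
# A real SPLADE model produces (seq_len x ~30k vocab) logits from BERT.
vocab = ["car", "price", "vehicle", "cost", "banana"]
logits = np.array([
    [8.6, 1.2, 6.0, 0.5, -3.0],   # position 1
    [0.3, 7.5, 1.1, 5.4, -2.1],   # position 2
    [2.0, 3.9, 4.2, 2.8, -4.4],   # position 3
])

# Standard SPLADE pooling: w_j = max_i log(1 + relu(logit_ij)).
# relu + log keep the resulting vector sparse and non-negative.
weights = np.log1p(np.maximum(logits, 0.0)).max(axis=0)

# Keep only the nonzero terms -- this is the sparse vector.
sparse_vec = {t: float(round(w, 2)) for t, w in zip(vocab, weights) if w > 0}
print(sparse_vec)
# → {'car': 2.26, 'price': 2.14, 'vehicle': 1.95, 'cost': 1.86}
```

Terms the model considers irrelevant (here `banana`) get zero weight and drop out entirely, which is what keeps the vectors sparse enough for an inverted index.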
The fake words trick stores these weighted terms in a fulltext-indexed property by repeating tokens proportionally to their weight, letting Lucene's inverted index handle the rest.
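The quantize-and-repeat step can be sketched in a few lines of Python (the helper name and the ×10 scale factor are illustrative choices, not taken from the project's notebooks):

```python
def to_fake_words(weights: dict[str, float], scale: int = 10) -> str:
    """Encode a sparse vector as repeated tokens for a fulltext index.

    Each float weight is quantized to an integer count (losing some
    precision -- see the quantization caveat) and the token is repeated
    that many times, so Lucene's term frequencies mirror the weights.
    """
    tokens = []
    for term, w in weights.items():
        count = max(1, round(w * scale))
        tokens.extend([term] * count)
    return " ".join(tokens)

text = to_fake_words({"car": 2.26, "price": 2.14})
# "car" repeated 23 times, "price" repeated 21 times
```

Storing this string on a node property and covering it with a fulltext index is all that is needed; no Neo4j changes are involved.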
Our custom analyzer stores a compact token:count format instead of repeated tokens, reducing property storage by 73% while producing an identical Lucene index.
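Comparing the compact encoding with the naive one shows where the savings come from (helper name and scale factor are again illustrative; the exact serialization is defined by the plugin):

```python
def to_compact(weights: dict[str, float], scale: int = 10) -> str:
    """Encode a sparse vector as token:count pairs. The custom analyzer
    expands these back into repeated tokens at index time, so the
    resulting Lucene index is identical to the naive encoding."""
    return " ".join(f"{t}:{max(1, round(w * scale))}" for t, w in weights.items())

vec = {"car": 2.26, "price": 2.14, "vehicle": 1.94}

compact = to_compact(vec)
# "car:23 price:21 vehicle:19"

# Naive fake-words encoding of the same vector, for comparison.
naive = " ".join(t for t, w in vec.items() for _ in range(max(1, round(w * 10))))

print(len(compact), len(naive))  # the compact form is several times smaller
```

The property stores each token once with its count instead of repeating it, which is where the 73% storage reduction comes from on real SPLADE vectors.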
| Method | MRR@10 | Recall@10 | vs BM25 |
|---|---|---|---|
| BM25 (keyword) | 0.363 | 0.731 | baseline |
| SPLADE (fake words) | 0.495 | 0.884 | +36% MRR |
| SPLADE (custom analyzer) | 0.495 | 0.885 | identical |
SPLADE wins on 69% of queries where it disagrees with BM25. It finds documents using synonyms and related concepts that keyword search misses.
| Metric | BM25 | SPLADE (custom analyzer) |
|---|---|---|
| Query latency (p50) | 10 ms | 19 ms |
| Ingestion throughput | 3,105 docs/s | 4,898 docs/s |
| Fulltext index size | 76 MB | 402 MB |
| Property storage | 202 MB (text) | 511 MB (compact) |
| Metric | Naive fake words | Custom analyzer | Savings |
|---|---|---|---|
| Property size (500K) | 1,902 MB | 511 MB | 73% |
| Lucene index size | ~400 MB | ~400 MB | Same |
| Ingestion speed | 3,105 docs/s | 4,898 docs/s | 58% faster |
```
demo_sparse_vectors.ipynb                 ← START HERE — interactive introduction
neo4j-sparse-analyzer/                    Custom Lucene analyzer plugin (Java)
notebooks/                                Benchmarks and evaluation
├── README.md                             Guide to each notebook
├── sparse_vector_poc.ipynb               Original POC (archived)
├── scalability_benchmark.ipynb
├── scalability_benchmark_analyzer.ipynb
├── accuracy_evaluation.ipynb
└── *.json                                Pre-computed results
```
The neo4j-sparse-analyzer/ directory contains a Java plugin that registers a sparse-vector analyzer with Neo4j. It parses token:count format and expands into repeated tokens at index time.
Benefits: 73% less property storage, 58% faster ingestion, identical search results.
A pre-built JAR is included; copy it into your Neo4j plugins/ folder and restart the server. See neo4j-sparse-analyzer/README.md for details.
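Once the plugin is installed, usage might look like the following sketch. The analyzer name `sparse-vector`, the `Document` label, the `sparse_vector` property, and the query-side boost encoding are all assumptions for illustration; check neo4j-sparse-analyzer/README.md for the actual registered names. The snippet only composes the Cypher statements, since running them needs a live database:

```python
# Index creation: point the fulltext index at the analyzer registered by
# the plugin (assumed name "sparse-vector").
CREATE_INDEX = """
CREATE FULLTEXT INDEX sparseVectors IF NOT EXISTS
FOR (d:Document) ON EACH [d.sparse_vector]
OPTIONS {indexConfig: {`fulltext.analyzer`: 'sparse-vector'}}
"""

def build_query(weights: dict[str, float]) -> str:
    """One plausible query-side encoding: Lucene boost syntax (term^weight)."""
    return " OR ".join(f"{t}^{w:.2f}" for t, w in weights.items())

SEARCH = """
CALL db.index.fulltext.queryNodes('sparseVectors', $q)
YIELD node, score
RETURN node.title AS title, score
ORDER BY score DESC LIMIT 10
"""

q = build_query({"car": 2.26, "price": 2.15, "vehicle": 1.94})
# q == "car^2.26 OR price^2.15 OR vehicle^1.94"

# With a running Neo4j instance (uncomment to execute):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
#     driver.execute_query(CREATE_INDEX)
#     records, _, _ = driver.execute_query(SEARCH, q=q)
```

`db.index.fulltext.queryNodes` accepts Lucene query syntax, so boosted OR-queries like the one above work against any fulltext index.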
| Issue | Description | Status |
|---|---|---|
| BM25 scoring mismatch | Lucene uses BM25 (term saturation, IDF, length norm) instead of dot-product scoring that SPLADE was trained for. | Open — would require a custom Lucene Similarity plugin + Neo4j core changes to expose it |
| No native integration | This is a workaround, not a supported feature. May break across Neo4j versions. | Open — inherent to the approach |
| Issue | Description | Status |
|---|---|---|
| Storage overhead | SPLADE fulltext index is ~5x larger than BM25 text index (402 MB vs 76 MB at 500K). | Tested — manageable at 500K. Custom analyzer reduces property storage by 73%. |
| Ranking precision loss | Float-to-integer quantization loses precision. | Tested — MRR@10 of 0.495 suggests acceptable quality |
| Query-time latency | SPLADE model inference adds ~20-50ms per query. | Tested — total round-trip 25-70ms depending on corpus size |
| Model dependency | Requires SPLADE model at index and query time. Model updates require full re-indexing. | Open |
| Issue | Description | Status |
|---|---|---|
| TooManyClauses | Lucene's boolean clause limit (1,024 by default). | Tested — max 69 query terms, well within limits |
| String store performance | Large property strings stored externally. | Mitigated — custom analyzer reduces to ~1 KB avg (73% savings) |
| Tokenization mismatch | BERT tokenizer differs from Lucene's analyzer. | Solved — custom analyzer uses whitespace on pre-tokenized input |
- Custom Lucene similarity — Replace BM25 with dot-product scoring. Requires Neo4j core changes to expose a `fulltext.similarity` configuration. Combined with the custom analyzer, this could address the main remaining accuracy caveat.
- Hybrid search testing — Combine BM25 + SPLADE + dense vectors and measure accuracy on MS MARCO. The demo notebook shows the approach; a rigorous evaluation with the full dataset is a natural next step.
- Larger-scale testing — Test beyond 500K documents (1M+, 5M+) to identify breaking points for latency, memory, and index size.