An educational Information Retrieval system built in Python (by ClaudeCode, no glory or blame for me), designed to demonstrate text search fundamentals -- from inverted indexes and classical ranking models to neural dense embeddings. Built with a clear two-pipeline architecture (storage and search) and a rich interactive CLI.
- 5 ranking algorithms: Boolean (TF), TF-IDF, BM25, Vector Space Model (cosine similarity), Dense Embeddings (sentence-transformers)
- Bilingual support (Italian + English) with automatic language detection
- Query expansion via bilingual thesaurus with toggle control
- Boolean operators (AND/OR) with multiple syntax forms (`&&`, `||`, `OR` keyword, prefix notation)
- Rich CLI interface with colored tables, panels, and spinners (powered by Rich)
- 2D/3D vector proximity visualization using matplotlib and plotext (in-terminal)
- Benchmarking suite: 100 queries across all ranking modes with comparative timing tables
- Persistent inverted index serialized to JSON
- 32-test suite covering preprocessing, storage, and search (keyword + vector)

pip install -r requirements.txt
python main.py

On first launch the system automatically downloads the required NLTK resources (stopwords, punkt), builds the inverted index from the document collection, and persists it to index/inverted_index.json.
- Python 3.8+
- nltk >= 3.8
- rich >= 13.0
- matplotlib >= 3.7
- scikit-learn >= 1.3
- sentence-transformers >= 2.2
- plotext >= 5.2
The system follows a two-pipeline design that separates indexing from retrieval.
Storage Pipeline -- runs once (or on +rebuild):
documents/ --> data_collection --> preprocessing --> inverted index --> JSON persistence
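
The storage pipeline above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: `tokenize`, `build_inverted_index`, `save_index`, and `load_index` are hypothetical names, the regex tokenizer is a stand-in for the real preprocessing stage, and the layout of the project's JSON file may differ from the term-to-postings mapping assumed here.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lowercase and split on non-word characters (stand-in for real preprocessing)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to a {doc_id: term frequency} postings dict."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def save_index(index, path):
    """Persist the index as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8")


def load_index(path):
    """Load a previously saved index from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Two toy documents (hypothetical content, not the real collection).
docs = {
    "01": "Mars is a planet in the solar system",
    "13": "Mars was the Roman god of war",
}
index = build_inverted_index(docs)
```

Because the document IDs are strings, the index survives a JSON round trip unchanged, so a later run can call `load_index` instead of rebuilding.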
Search Pipeline -- runs on every query:
raw query --> query_parser --> query_preprocessor --> query_representation
--> comparison --> ranking --> result_formatter (Rich output)
For the dense mode, the search pipeline bypasses the keyword stages and routes the raw query directly to the sentence-transformers ranker.
| Mode | Algorithm | Description |
|---|---|---|
| `boolean` | Term Frequency (TF) | Pure term-frequency sum |
| `tfidf` | TF-IDF | Penalizes common terms via inverse document frequency |
| `bm25` | Okapi BM25 (k1=1.5, b=0.75) | Probabilistic model with length normalization |
| `vsm` | Vector Space Model + Cosine Similarity | TF-IDF weighted vectors with cosine similarity |
| `dense` | Dense Embeddings | sentence-transformers embeddings + cosine similarity |
Switch modes at runtime with +mode <name>.
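
To make the `bm25` row concrete, here is a minimal sketch of Okapi BM25 scoring with the k1=1.5 and b=0.75 values listed above. The function name `bm25_scores` and the toy documents are hypothetical, and the IDF variant used here, log(1 + (N - df + 0.5) / (df + 0.5)), is an assumption: the project may use a different IDF formula.

```python
import math


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score tokenized documents against a query with Okapi BM25.

    docs maps doc_id -> list of tokens. The IDF formula is an assumed variant.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in docs.values() if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Toy tokenized documents (hypothetical, not the real collection).
docs = {
    "short": ["mars", "planet"],
    "long": ["mars", "planet", "red", "dust", "orbit", "sun"],
    "other": ["bank", "river"],
}
scores = bm25_scores(["mars"], docs)
```

Both `short` and `long` contain the query term once, but the shorter document scores higher because of the length normalization controlled by `b`.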
| Command | Description |
|---|---|
| `+help` | Show all available commands |
| `+mode <name>` | Switch ranking mode (boolean, tfidf, bm25, vsm, dense) |
| `+thesaurus` | Toggle query expansion on/off |
| `+thesaurus on\|off` | Explicitly enable or disable query expansion |
| `+d<N>` | Display document N from the last search results |
| `+d<N> <term>` | Display document N with occurrences of term highlighted |
| `+index` | Show the full inverted index |
| `+index <term>` | Filter the inverted index by term |
| `+threshold <val>` | Set minimum cosine threshold for dense mode (default 0.0) |
| `+topk <N>` | Set max results for dense mode (default 10, 0 = all) |
| `+bench` | Benchmark: 100 queries x all modes, ranked by speed |
| `+plot` | 2D vector proximity chart (matplotlib) |
| `+plot3d` | 3D vector proximity chart (matplotlib, rotatable) |
| `+plotext` | 2D vector proximity chart (in-terminal via plotext) |
| `+info` | Index statistics and current configuration |
| `+rebuild` | Rebuild the inverted index from scratch |
| `+rebuild dense` | Rebuild dense embeddings |
| `+clear` | Clear the screen |
| `+exit` / `+quit` | Exit the program |
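
The `+threshold` and `+topk` settings can be read as a post-filter on the dense ranker's output. The sketch below illustrates that reading. `filter_results` is a hypothetical helper, and treating the threshold as inclusive (score >= threshold) is an assumption, since the table does not specify it.

```python
def filter_results(scored, threshold=0.0, topk=10):
    """Apply a minimum-score cutoff, then cap the number of results.

    scored is a list of (doc_id, cosine_score) pairs.
    topk=0 means "return everything above the threshold".
    """
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if topk == 0 else kept[:topk]


# Hypothetical cosine scores for three documents.
scored = [("a", 0.9), ("b", 0.2), ("c", 0.55)]
strict = filter_results(scored, threshold=0.5, topk=0)
best = filter_results(scored, topk=1)
```

With the defaults (threshold 0.0, top 10) all three toy results are returned, sorted by score.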
| Syntax | Operator | Example |
|---|---|---|
| `term1 term2` | AND | `planet mars` |
| `term1 && term2` | AND | `planet && mars` |
| `term1 \|\| term2` | OR | `planet \|\| mars` |
| `term1 OR term2` | OR | `planet OR mars` |
| `OR:term1 term2 term3` | OR | `OR:planet mars sun` |
| `AND:term1 term2` | AND | `AND:intelligenza artificiale` |
Default operator is AND (all terms must appear). The OR operator matches documents containing at least one term.
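
The syntax forms above could be normalized to an (operator, terms) pair along the following lines. This is a sketch under stated assumptions, not the project's `query_parser`: `parse_query` and `match_documents` are hypothetical names, the `OR` keyword is assumed to be case-sensitive, and a query mixing `&&` with `||` is simply treated as OR, because the source does not describe precedence.

```python
import re


def parse_query(raw):
    """Return (operator, terms) for one of the supported syntax forms."""
    text = raw.strip()
    # Prefix notation: "OR:planet mars sun" or "AND:intelligenza artificiale".
    prefix = re.match(r"(AND|OR):\s*(.*)", text)
    if prefix:
        return prefix.group(1), prefix.group(2).split()
    # "&&" is the default operator spelled out; "||" is an alias for OR.
    tokens = text.replace("&&", " ").replace("||", " OR ").split()
    operator = "OR" if "OR" in tokens else "AND"
    terms = [token for token in tokens if token != "OR"]
    return operator, terms


def match_documents(index, operator, terms):
    """AND intersects the postings of all terms; OR takes their union."""
    postings = [set(index.get(term, {})) for term in terms]
    if not postings:
        return set()
    if operator == "OR":
        return set.union(*postings)
    return set.intersection(*postings)


# Toy index in term -> {doc_id: tf} form (hypothetical data).
index = {"planet": {"01": 2}, "mars": {"01": 1, "13": 3}}
and_hits = match_documents(index, *parse_query("planet mars"))
or_hits = match_documents(index, *parse_query("planet || mars"))
```

In the toy index only document `01` contains both terms, so the AND query returns it alone, while the OR query also returns `13`.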
The documents/ directory contains 15 text files:
| Range | Category | Topics |
|---|---|---|
| 01--05 | English | Solar System, Climate Change, Python Programming, Space Exploration, Machine Learning |
| 06--10 | Italian | Intelligenza Artificiale, Storia di Roma, Cucina Italiana, Musica Classica, Rinascimento |
| 11--15 | Polysemic | Banks & Rivers, Python the Snake, Mars Mythology, Cells (biology/prison), Mercury Element |
The polysemic documents are designed to test disambiguation and ranking precision across ambiguous terms (bank, python, mars, cell, mercury).
Run the full 32-test suite:

python test.py

Output follows a compact one-line-per-test format. Tests requiring sentence-transformers are automatically skipped if the library is not installed. Results are also logged to test.log.
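
One common way to skip tests when an optional dependency is missing is shown below. It assumes a `unittest`-style suite, which the source does not confirm: the project's test.py may use its own runner, and `HAS_SENTENCE_TRANSFORMERS` and `DenseSearchTest` are hypothetical names.

```python
import importlib.util
import unittest

# find_spec returns None when the package is not installed; nothing is imported.
HAS_SENTENCE_TRANSFORMERS = (
    importlib.util.find_spec("sentence_transformers") is not None
)


class DenseSearchTest(unittest.TestCase):
    @unittest.skipUnless(
        HAS_SENTENCE_TRANSFORMERS, "sentence-transformers not installed"
    )
    def test_dense_dependency_available(self):
        # A real test would build embeddings here; this one only checks the guard.
        self.assertTrue(HAS_SENTENCE_TRANSFORMERS)


def run_suite():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DenseSearchTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` reports the test as skipped rather than failed when the library is absent, which keeps the keyword-only tests usable on a minimal install.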
- documentation.md -- comprehensive technical documentation (English)
- DOCUMENTAZIONE.md -- comprehensive technical documentation (Italian)
See LICENSE file for details.
