subnetdusk/didactic_search_engine

Didactic Search Engine

An educational Information Retrieval system written in Python (by Claude Code, so no glory or blame for me), designed to demonstrate text search fundamentals, from inverted indexes and classical ranking models to neural dense embeddings. It has a clear two-pipeline architecture (storage and search) and a rich interactive CLI.

CLI Screenshot

Features

  • 5 ranking algorithms: Boolean (TF), TF-IDF, BM25, Vector Space Model (cosine similarity), Dense Embeddings (sentence-transformers)
  • Bilingual support (Italian + English) with automatic language detection
  • Query expansion via bilingual thesaurus with toggle control
  • Boolean operators (AND/OR) with multiple syntax forms (&&, ||, OR keyword, prefix notation)
  • Rich CLI interface with colored tables, panels, and spinners (powered by Rich)
  • 2D/3D vector proximity visualization using matplotlib and plotext (in-terminal)
  • Benchmarking suite: 100 queries across all ranking modes with comparative timing tables
  • Persistent inverted index serialized to JSON
  • 32-test suite covering preprocessing, storage, and search (keyword + vector)

Quick Start

pip install -r requirements.txt
python main.py

On first launch the system automatically downloads required NLTK resources (stopwords, punkt), builds the inverted index from the document collection, and persists it to index/inverted_index.json.
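The load-or-build behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `load_or_build_index`, the `toy_build` helper, and the term -> {document id -> frequency} layout are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def load_or_build_index(index_path, build_fn):
    """Return the index stored at index_path, building and saving it if absent."""
    index_path = Path(index_path)
    if index_path.exists():
        with index_path.open(encoding="utf-8") as fh:
            return json.load(fh)
    index = build_fn()
    index_path.parent.mkdir(parents=True, exist_ok=True)
    with index_path.open("w", encoding="utf-8") as fh:
        json.dump(index, fh, ensure_ascii=False, indent=2)
    return index


def toy_build():
    # Hypothetical layout: term -> {document id -> term frequency}.
    return {"mars": {"04": 3, "13": 5}, "planet": {"01": 7}}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index" / "inverted_index.json"
    first = load_or_build_index(path, toy_build)    # builds and writes the file
    second = load_or_build_index(path, lambda: {})  # reads the file back
```

Keeping document ids as strings avoids a common JSON pitfall: integer dictionary keys come back as strings after a round trip.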

Requirements

  • Python 3.8+
  • nltk >= 3.8
  • rich >= 13.0
  • matplotlib >= 3.7
  • scikit-learn >= 1.3
  • sentence-transformers >= 2.2
  • plotext >= 5.2

Architecture

The system follows a two-pipeline design that separates indexing from retrieval.

Storage Pipeline -- runs once (or on +rebuild):

documents/ --> data_collection --> preprocessing --> inverted index --> JSON persistence
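As a rough illustration of the preprocessing and indexing stages, the sketch below builds a small inverted index in memory. The regex tokenizer and the tiny `STOPWORDS` set are stand-ins chosen to keep the example dependency-free (the project itself relies on NLTK), and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny stand-in stopword list; the project itself relies on NLTK stopwords.
STOPWORDS = {"the", "a", "is", "of", "and", "il", "la", "di", "e"}


def preprocess(text):
    """Lowercase, keep runs of letters, drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return dict(index)


docs = {
    "01": "Mars is the fourth planet of the Solar System",
    "13": "Mars, the Roman god of war; Mars and Venus",
}
index = build_inverted_index(docs)
print(index["mars"])
```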

Search Pipeline -- runs on every query:

raw query --> query_parser --> query_preprocessor --> query_representation
          --> comparison --> ranking --> result_formatter (Rich output)

For the dense mode, the search pipeline bypasses the keyword stages and routes the raw query directly to the sentence-transformers ranker.
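The routing described above might look something like the sketch below. Everything here is illustrative: `search`, `keyword_search`, and `dense_search` are hypothetical names, and a fake two-dimensional encoder stands in for a real sentence-transformers model so the example runs without extra dependencies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_search(raw_query, mode):
    # Placeholder for parser -> preprocessor -> representation -> comparison -> ranking.
    return [("keyword pipeline", mode, raw_query)]


def dense_search(raw_query, encode, doc_vectors):
    """Rank documents by cosine similarity between query and document embeddings."""
    q = encode(raw_query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search(raw_query, mode, encode=None, doc_vectors=None):
    if mode == "dense":
        # The raw query goes straight to the embedding ranker; no keyword stages.
        return dense_search(raw_query, encode, doc_vectors)
    return keyword_search(raw_query, mode)


def fake_encode(text):
    # Assumption: a toy encoder replacing a real embedding model.
    return [0.0, 1.0]


doc_vectors = {"01": [1.0, 0.0], "13": [0.6, 0.8]}
dense_results = search("god of war", "dense", fake_encode, doc_vectors)
keyword_results = search("planet mars", "bm25")
```

In a real setup, `encode` would wrap the embedding model's encoding call and `doc_vectors` would hold the precomputed document embeddings.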

Ranking Modes

Mode     Algorithm                         Description
boolean  Term Frequency (TF)               Pure term-frequency sum
tfidf    TF-IDF                            Penalizes common terms via inverse document frequency
bm25     Okapi BM25 (k1=1.5, b=0.75)       Probabilistic model with length normalization
vsm      Vector Space Model + Cosine Sim.  TF-IDF weighted vectors with cosine similarity
dense    Dense Embeddings                  sentence-transformers embeddings + cosine similarity

Switch modes at runtime with +mode <name>.
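To make the bm25 row more concrete, here is a minimal sketch of Okapi BM25 scoring with the k1 and b values listed for that mode. The IDF variant, the function name, and the index layout (term -> {doc_id: term frequency}) are assumptions; the project's own implementation may differ.

```python
import math

K1, B = 1.5, 0.75  # parameter values quoted for the bm25 mode


def bm25_scores(query_terms, index, doc_lengths):
    """Score documents with Okapi BM25.

    index: term -> {doc_id: term frequency}; doc_lengths: doc_id -> token count.
    """
    n_docs = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / n_docs
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Assumed IDF variant (always non-negative); the project may use another.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = 1 - B + B * doc_lengths[doc_id] / avgdl
            gain = idf * tf * (K1 + 1) / (tf + K1 * norm)
            scores[doc_id] = scores.get(doc_id, 0.0) + gain
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


index = {"mars": {"04": 2, "13": 2}, "planet": {"04": 1}}
doc_lengths = {"04": 100, "13": 50, "01": 150}
ranked = bm25_scores(["mars"], index, doc_lengths)
```

Both documents contain "mars" twice, but document 13 is shorter than average, so length normalization ranks it first.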

CLI Commands

Command            Description
+help              Show all available commands
+mode <name>       Switch ranking mode (boolean, tfidf, bm25, vsm, dense)
+thesaurus         Toggle query expansion on/off
+thesaurus on|off  Explicitly enable or disable query expansion
+d<N>              Display document N from the last search results
+d<N> <term>       Display document N with occurrences of term highlighted
+index             Show the full inverted index
+index <term>      Filter the inverted index by term
+threshold <val>   Set minimum cosine threshold for dense mode (default 0.0)
+topk <N>          Set max results for dense mode (default 10, 0 = all)
+bench             Benchmark: 100 queries x all modes, ranked by speed
+plot              2D vector proximity chart (matplotlib)
+plot3d            3D vector proximity chart (matplotlib, rotatable)
+plotext           2D vector proximity chart (in-terminal via plotext)
+info              Index statistics and current configuration
+rebuild           Rebuild the inverted index from scratch
+rebuild dense     Rebuild dense embeddings
+clear             Clear the screen
+exit / +quit      Exit the program
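The interplay of +threshold and +topk in dense mode can be pictured with a small filter like the one below. This is only a sketch: the function name `apply_cutoffs` is hypothetical, and treating the threshold as inclusive (score >= threshold) is an assumption.

```python
def apply_cutoffs(scored, threshold=0.0, topk=10):
    """Keep results with score >= threshold, then truncate to topk (0 means no limit)."""
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if topk == 0 else kept[:topk]


# Made-up cosine scores for four documents.
scored = [("01", 0.12), ("13", 0.80), ("04", 0.55), ("11", -0.05)]

defaults = apply_cutoffs(scored)                 # threshold 0.0, topk 10
strict = apply_cutoffs(scored, threshold=0.5)    # as after "+threshold 0.5"
best_only = apply_cutoffs(scored, topk=1)        # as after "+topk 1"
everything = apply_cutoffs(scored, threshold=-1.0, topk=0)
```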

Query Syntax

Syntax                Operator  Example
term1 term2           AND       planet mars
term1 && term2        AND       planet && mars
term1 || term2        OR        planet || mars
term1 OR term2        OR        planet OR mars
OR:term1 term2 term3  OR        OR:planet mars sun
AND:term1 term2       AND       AND:intelligenza artificiale
AND:term1 term2 AND AND:intelligenza artificiale

Default operator is AND (all terms must appear). The OR operator matches documents containing at least one term.
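One way to handle the syntax forms above is sketched below. It is a simplified illustration with hypothetical function names: it assumes the OR keyword is uppercase, does not handle queries that mix operators, and uses the same term -> {doc_id: frequency} index layout as a working assumption.

```python
import re


def parse_query(raw):
    """Return (operator, terms) for a raw query string."""
    text = raw.strip()
    prefix = re.match(r"^(AND|OR):\s*", text)
    if prefix:
        # Prefix notation, e.g. "OR:planet mars sun".
        operator = prefix.group(1)
        terms = text[prefix.end():].split()
    elif "||" in text or re.search(r"\bOR\b", text):
        operator = "OR"
        terms = re.sub(r"\|\||\bOR\b", " ", text).split()
    else:
        # Plain whitespace or "&&" both mean AND, the default operator.
        operator = "AND"
        terms = text.replace("&&", " ").split()
    return operator, [t.lower() for t in terms]


def matching_docs(operator, terms, index):
    """AND intersects the posting lists, OR unites them."""
    postings = [set(index.get(t, {})) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings) if operator == "AND" else set.union(*postings)


index = {"planet": {"01": 1, "04": 1}, "mars": {"04": 2, "13": 2}}
and_hits = matching_docs(*parse_query("planet && mars"), index)
or_hits = matching_docs(*parse_query("planet || mars"), index)
```

With this toy index, the AND query matches only the document containing both terms, while the OR query matches every document containing either.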

Document Collection

The documents/ directory contains 15 text files:

Range   Category   Topics
01--05  English    Solar System, Climate Change, Python Programming, Space Exploration, Machine Learning
06--10  Italian    Intelligenza Artificiale, Storia di Roma, Cucina Italiana, Musica Classica, Rinascimento
11--15  Polysemic  Banks & Rivers, Python the Snake, Mars Mythology, Cells (biology/prison), Mercury Element

The polysemic documents are designed to test disambiguation and ranking precision across ambiguous terms (bank, python, mars, cell, mercury).

Testing

Run the full 32-test suite:

python test.py

Output follows a compact one-line-per-test format. Tests requiring sentence-transformers are automatically skipped if the library is not installed. Results are also logged to test.log.
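Skipping tests when an optional dependency is missing can be done in several ways. The sketch below shows one common pattern using the standard library's unittest; whether test.py is built on unittest is an assumption, and the test classes here are made up for illustration.

```python
import importlib.util
import unittest

# True only when the optional dependency can be found on the import path.
HAS_ST = importlib.util.find_spec("sentence_transformers") is not None


class KeywordTests(unittest.TestCase):
    def test_split(self):
        self.assertEqual("planet mars".split(), ["planet", "mars"])


class DenseTests(unittest.TestCase):
    @unittest.skipUnless(HAS_ST, "sentence-transformers not installed")
    def test_dense_available(self):
        self.assertTrue(HAS_ST)


loader = unittest.defaultTestLoader
suite = unittest.TestSuite(
    loader.loadTestsFromTestCase(case) for case in (KeywordTests, DenseTests)
)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

A skipped test is reported separately from failures, so the run still counts as successful on machines without the library.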

Documentation

License

See LICENSE file for details.