Didactic Search Engine

An educational Information Retrieval system built in Python (written by Claude Code, so the author claims neither glory nor blame), designed to demonstrate text search fundamentals, from inverted indexes and classical ranking models to neural dense embeddings. It uses a two-pipeline architecture (storage and search) and a rich interactive CLI.

Features

5 ranking algorithms: Boolean (TF), TF-IDF, BM25, Vector Space Model (cosine similarity), Dense Embeddings (sentence-transformers)
Bilingual support (Italian + English) with automatic language detection
Query expansion via bilingual thesaurus with toggle control
Boolean operators (AND/OR) with multiple syntax forms (&&, ||, OR keyword, prefix notation)
Rich CLI interface with colored tables, panels, and spinners (powered by Rich)
2D/3D vector proximity visualization using matplotlib and plotext (in-terminal)
Benchmarking suite: 100 queries across all ranking modes with comparative timing tables
Persistent inverted index serialized to JSON
32-test suite covering preprocessing, storage, and search (keyword + vector)

Quick Start

pip install -r requirements.txt
python main.py

On first launch the system automatically downloads required NLTK resources (stopwords, punkt), builds the inverted index from the document collection, and persists it to index/inverted_index.json.

Requirements

Python 3.8+
nltk >= 3.8
rich >= 13.0
matplotlib >= 3.7
scikit-learn >= 1.3
sentence-transformers >= 2.2
plotext >= 5.2

Architecture

The system follows a two-pipeline design that separates indexing from retrieval.

Storage Pipeline -- runs once (or on +rebuild):

documents/ --> data_collection --> preprocessing --> inverted index --> JSON persistence
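
As a rough illustration of the storage pipeline, the sketch below builds a term-to-postings map and serializes it to JSON. The function names (`tokenize`, `build_index`, `save_index`, `load_index`) and the `{term: {doc_id: frequency}}` layout are assumptions made for this example, not the project's actual modules, and the tokenizer is a simplified stand-in for the real preprocessing (no stopword removal or language detection).

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lowercase and keep runs of letters (simplified stand-in for preprocessing)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term_frequency} (hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


def save_index(index, path):
    """Write the index as UTF-8 JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8")


def load_index(path):
    """Read an index previously written by save_index."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Two toy documents; the IDs are strings so they survive the JSON round trip.
docs = {
    "01": "Mars is a planet in the solar system",
    "13": "Mars was the Roman god of war",
}
index = build_index(docs)
```

Calling `save_index(index, "index/inverted_index.json")` would then write the file to the location mentioned in Quick Start.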

Search Pipeline -- runs on every query:

raw query --> query_parser --> query_preprocessor --> query_representation
          --> comparison --> ranking --> result_formatter (Rich output)

For the dense mode, the search pipeline bypasses the keyword stages and routes the raw query directly to the sentence-transformers ranker.
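
The routing described above can be pictured as a small dispatcher. This is only a sketch: `run_stages`, `search`, and the toy stages are hypothetical names chosen for the example, and the real stages (query_parser, query_preprocessor, and so on) are considerably richer.

```python
from functools import reduce


def run_stages(value, stages):
    """Feed the value through each stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


def search(raw_query, mode, keyword_stages, dense_ranker):
    """Dense mode skips the keyword stages and receives the raw query as-is."""
    if mode == "dense":
        return dense_ranker(raw_query)
    return run_stages(raw_query, keyword_stages)


# Toy stand-ins: lowercase then split for keyword modes, an echo for dense mode.
keyword_stages = [str.lower, str.split]


def dense_ranker(query):
    return [("dense", query)]


keyword_result = search("Planet Mars", "bm25", keyword_stages, dense_ranker)
dense_result = search("Planet Mars", "dense", keyword_stages, dense_ranker)
```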

Ranking Modes

Mode	Algorithm	Description
`boolean`	Term Frequency (TF)	Pure term-frequency sum
`tfidf`	TF-IDF	Penalizes common terms via inverse document freq.
`bm25`	Okapi BM25 (k1=1.5, b=0.75)	Probabilistic model with length normalization
`vsm`	Vector Space Model + Cosine Sim.	TF-IDF weighted vectors with cosine similarity
`dense`	Dense Embeddings	sentence-transformers embeddings + cosine similarity

Switch modes at runtime with +mode <name>.
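
To make the `bm25` row concrete, here is a minimal sketch of Okapi BM25 scoring with the parameters listed in the table (k1=1.5, b=0.75). The IDF variant used here, `log(1 + (N - df + 0.5) / (df + 0.5))`, is an assumption; the project may use a different one. `bm25_scores` is a hypothetical name, and the input is assumed to be already tokenized.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score tokenized documents ({doc_id: [tokens]}) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "short": ["mars", "planet"],
    "long": ["mars", "god", "war", "roman", "myth", "temple"],
    "other": ["python", "snake"],
}
scores = bm25_scores(["mars"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Both "short" and "long" contain the query term once, but the shorter document ranks higher because of the length normalization controlled by `b`.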

CLI Commands

Command	Description
`+help`	Show all available commands
`+mode <name>`	Switch ranking mode (boolean, tfidf, bm25, vsm, dense)
`+thesaurus`	Toggle query expansion on/off
`+thesaurus on|off`	Explicitly enable or disable query expansion
`+d<N>`	Display document N from the last search results
`+d<N> <term>`	Display document N with occurrences of term highlighted
`+index`	Show the full inverted index
`+index <term>`	Filter the inverted index by term
`+threshold <val>`	Set minimum cosine threshold for dense mode (default 0.0)
`+topk <N>`	Set max results for dense mode (default 10, 0 = all)
`+bench`	Benchmark: 100 queries x all modes, ranked by speed
`+plot`	2D vector proximity chart (matplotlib)
`+plot3d`	3D vector proximity chart (matplotlib, rotatable)
`+plotext`	2D vector proximity chart (in-terminal via plotext)
`+info`	Index statistics and current configuration
`+rebuild`	Rebuild the inverted index from scratch
`+rebuild dense`	Rebuild dense embeddings
`+clear`	Clear the screen
`+exit` / `+quit`	Exit the program
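
The `+threshold` and `+topk` settings can be read as a post-filter on the dense-mode scores. The sketch below shows one plausible interpretation; `filter_dense_results` is a hypothetical name, and whether the real comparison is `>=` or `>` is an assumption.

```python
def filter_dense_results(scored, threshold=0.0, top_k=10):
    """Keep (doc_id, cosine) pairs at or above threshold, best first.

    top_k == 0 means "return everything that passed the threshold".
    """
    kept = sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return kept if top_k == 0 else kept[:top_k]


scored = [("a", 0.9), ("b", 0.1), ("c", 0.5)]
above_threshold = filter_dense_results(scored, threshold=0.2)
best_only = filter_dense_results(scored, top_k=1)
everything = filter_dense_results(scored, top_k=0)
```

Since cosine similarity can be negative, even the default threshold of 0.0 would drop documents that point away from the query under this interpretation.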

Query Syntax

Syntax	Operator	Example
`term1 term2`	AND	`planet mars`
`term1 && term2`	AND	`planet && mars`
`term1 || term2`	OR	`planet || mars`
`term1 OR term2`	OR	`planet OR mars`
`OR:term1 term2 term3`	OR	`OR:planet mars sun`
`AND:term1 term2`	AND	`AND:intelligenza artificiale`

Default operator is AND (all terms must appear). The OR operator matches documents containing at least one term.
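
A minimal sketch of how these syntax forms might be parsed and matched against an inverted index follows. `parse_query` and `match` are hypothetical names, the index layout `{term: {doc_id: frequency}}` is assumed, and queries that mix AND and OR in one expression are not handled.

```python
import re


def parse_query(raw):
    """Return (operator, terms) for the syntax forms in the table above."""
    text = raw.strip()
    # Prefix notation: "OR:a b c" or "AND:a b".
    prefix = re.match(r"^(AND|OR):\s*(.*)$", text)
    if prefix:
        return prefix.group(1), prefix.group(2).split()
    # Infix OR: "a || b" or "a OR b" (the keyword is case-sensitive here).
    if "||" in text or re.search(r"\bOR\b", text):
        parts = re.split(r"\s*\|\|\s*|\s+OR\s+", text)
        return "OR", [term for part in parts for term in part.split()]
    # Everything else is AND; "&&" is only an explicit separator.
    return "AND", text.replace("&&", " ").split()


def match(operator, terms, index):
    """AND intersects the posting sets, OR unites them."""
    postings = [set(index.get(term, {})) for term in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


index = {"planet": {"01": 1}, "mars": {"01": 1, "13": 1}}
and_hits = match(*parse_query("planet && mars"), index)
or_hits = match(*parse_query("OR:planet mars"), index)
```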

Document Collection

The documents/ directory contains 15 text files:

Range	Category	Topics
01--05	English	Solar System, Climate Change, Python Programming, Space Exploration, Machine Learning
06--10	Italian	Intelligenza Artificiale, Storia di Roma, Cucina Italiana, Musica Classica, Rinascimento
11--15	Polysemic	Banks & Rivers, Python the Snake, Mars Mythology, Cells (biology/prison), Mercury Element

The polysemic documents are designed to test disambiguation and ranking precision across ambiguous terms (bank, python, mars, cell, mercury).

Testing

Run the full 32-test suite:

python test.py

Output follows a compact one-line-per-test format. Tests requiring sentence-transformers are automatically skipped if the library is not installed. Results are also logged to test.log.
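
One common way to skip tests when an optional dependency is missing is shown below, using the standard `unittest` skip decorators. This is only a sketch of the idea: the project's test.py may implement skipping differently, and `DenseSearchTest` and `HAS_ST` are hypothetical names.

```python
import importlib.util
import unittest

# True only if the optional package can be located, without importing it.
HAS_ST = importlib.util.find_spec("sentence_transformers") is not None


class DenseSearchTest(unittest.TestCase):
    @unittest.skipUnless(HAS_ST, "sentence-transformers not installed")
    def test_model_class_available(self):
        # Imported lazily so the module loads even when the package is absent.
        import sentence_transformers

        self.assertTrue(hasattr(sentence_transformers, "SentenceTransformer"))
```

Skipped tests are reported as skipped rather than failed, so the rest of the suite still runs on a machine without the dense-mode dependencies.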

Documentation

documentation.md -- comprehensive technical documentation (English)
DOCUMENTAZIONE.md -- comprehensive technical documentation (Italian)

License

See LICENSE file for details.
