A simple Python implementation of statistical n-gram language models, built from scratch, with Good-Turing and Kneser–Ney smoothing.
- Text corpus preprocessing and tokenization
- Flexible n-gram order (n ≥ 1)
- Good-Turing smoothing with back-off
- Kneser–Ney smoothing
- Sentence probability and log-probability scoring
- Perplexity-based evaluation
- Model saving/loading and n-gram probability export
Good-Turing smoothing redistributes probability mass from seen n-grams to unseen n-grams.
- Instead of trusting raw counts, it adjusts counts using frequency-of-frequencies
- If an n-gram appears r times, its adjusted count depends on how many n-grams appear r + 1 times (see the sketch after this list)
- Unseen n-grams receive probability mass based on how many n-grams were seen exactly once
- For higher-order n-grams, the model backs off to lower-order models when data is sparse
➡️ Best for handling rare and unseen n-grams in small or sparse datasets
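A minimal sketch of the count adjustment described above (an illustration with a hypothetical helper name, not this project's exact code):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Sketch of Good-Turing count adjustment (hypothetical helper).

    ngram_counts: dict mapping an n-gram tuple to its raw count.
    Returns (adjusted_counts, unseen_mass), where adjusted counts follow
    c* = (c + 1) * N_{c+1} / N_c and N_r is the number of distinct n-grams
    seen exactly r times.
    """
    freq_of_freq = Counter(ngram_counts.values())  # N_r: frequency of frequencies
    total = sum(ngram_counts.values())             # total observed n-gram tokens

    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq[c + 1]
        # When N_{c+1} is zero the estimate is undefined; keep the raw count
        # (a common practical fallback).
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c

    # Probability mass reserved for all unseen n-grams: N_1 / N.
    unseen_mass = freq_of_freq[1] / total if total else 0.0
    return adjusted, unseen_mass
```

The reserved mass is what gets spread over unseen n-grams; for higher-order models, the back-off step then decides which lower-order estimate fills in when a context is too sparse.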
Kneser–Ney smoothing improves probability estimates by focusing on how words appear in different contexts.
- Uses absolute discounting: subtracts a fixed value from each observed n-gram count
- The removed probability mass is redistributed to lower-order models
- Lower-order probabilities are based on continuation counts (how many different contexts a word appears in); see the sketch after this list
- This makes common but context-specific words less dominant
➡️ Especially effective for higher-order n-grams and realistic language modeling
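Below is a minimal sketch of interpolated Kneser–Ney for the bigram case (an illustration with a hypothetical function name, not this project's exact code): the observed count is discounted by a fixed amount, and the freed mass is handed to the continuation probability.

```python
from collections import defaultdict

def kneser_ney_bigram_prob(bigram_counts, prev, word, discount=0.75):
    """Sketch of interpolated Kneser-Ney probability P(word | prev) for bigrams.

    bigram_counts: dict mapping (w_prev, w) tuples to raw counts.
    """
    # Continuation sets: which distinct left contexts does each word follow?
    left_contexts = defaultdict(set)
    for (w_prev, w) in bigram_counts:
        left_contexts[w].add(w_prev)
    total_bigram_types = len(bigram_counts)

    # Lower-order continuation probability: fraction of bigram types ending in `word`.
    p_continuation = len(left_contexts.get(word, ())) / total_bigram_types

    # Count of the context and the number of distinct words that follow it.
    context_count = sum(c for (w_prev, _), c in bigram_counts.items() if w_prev == prev)
    if context_count == 0:
        # Unseen context: fall back to the continuation probability alone.
        return p_continuation
    follower_types = sum(1 for (w_prev, _) in bigram_counts if w_prev == prev)

    # Absolute discounting on the higher-order term.
    higher_order = max(bigram_counts.get((prev, word), 0) - discount, 0) / context_count

    # Back-off weight: redistributes the discounted mass to the lower-order model.
    backoff_weight = discount * follower_types / context_count

    return higher_order + backoff_weight * p_continuation
```

With this interpolation, the probabilities for a fixed seen context sum to one over the training vocabulary, which is why the fixed discount can be subtracted safely.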
corpus/
├── file1.text.txt
├── file2.text.txt
- Python 3.8+
- No external libraries
Train Good-Turing model: python main.py corpus/ gt -n 3 -o gt_model.pkl
Train Kneser–Ney model: python main.py corpus/ kn -n 3 -o kn_model.pkl
Score a sentence: python main.py corpus/ gt -n 3 --sentence "This is a test sentence."
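Since models are saved with `-o` as pickle files, a trained model can be reloaded without retraining. A minimal sketch, assuming the pickle holds a model object with a sentence-scoring method; the method name `sentence_logprob` is hypothetical and the real interface depends on this project's code:

```python
import pickle

# Load a model previously trained with: python main.py corpus/ gt -n 3 -o gt_model.pkl
with open("gt_model.pkl", "rb") as f:
    model = pickle.load(f)

# Hypothetical scoring call; the actual method name depends on the implementation.
print(model.sentence_logprob("This is a test sentence."))
```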
- Model quality is measured using perplexity
- Lower perplexity means a better language model (see the sketch below)
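A minimal sketch of how perplexity relates to log-probabilities (assuming natural-log sentence scores; the project's log base and token accounting may differ):

```python
import math

def perplexity(total_logprob, token_count):
    """Perplexity from the summed natural-log probability of a test set.

    total_logprob: sum of log P(sentence) over all test sentences (natural log).
    token_count: total number of tokens scored (including end-of-sentence
    markers, if the model uses them).
    """
    return math.exp(-total_logprob / token_count)
```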
Educational project for learning:
- NLP fundamentals
- Classical language modeling
- Smoothing techniques in language models
- Good (1953) — Good-Turing estimation
- Kneser & Ney (1995) — Improved back-off language models
- Jurafsky & Martin — Speech and Language Processing
For educational and research use.