A simple Python implementation of statistical n-gram language models, built from scratch, with Good-Turing and Kneser–Ney smoothing.
- Text corpus preprocessing and tokenization
- Flexible n-gram order (n ≥ 1)
- Good-Turing smoothing with back-off
- Kneser–Ney smoothing
- Sentence probability and log-probability scoring
- Perplexity-based evaluation
- Model saving/loading and n-gram probability export
Good-Turing smoothing redistributes probability mass from seen n-grams to unseen n-grams.
- Instead of trusting raw counts, it adjusts counts using frequency-of-frequencies
- If an n-gram appears r times, its adjusted count depends on how many n-grams appear r + 1 times (see the sketch after this list)
- Unseen n-grams receive probability mass based on how many n-grams were seen exactly once
- For higher-order n-grams, the model backs off to lower-order models when data is sparse
➡️ Best for handling rare and unseen n-grams in small or sparse datasets
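A minimal sketch of the count adjustment described above (an illustration with a hypothetical helper name, not this project's exact code):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Sketch of Good-Turing count adjustment (hypothetical helper).

    ngram_counts: dict mapping an n-gram tuple to its raw count.
    Returns (adjusted_counts, unseen_mass), where adjusted counts follow
    c* = (c + 1) * N_{c+1} / N_c and N_r is the number of distinct n-grams
    seen exactly r times.
    """
    freq_of_freq = Counter(ngram_counts.values())  # N_r: frequency of frequencies
    total = sum(ngram_counts.values())             # total observed n-gram tokens

    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq[c + 1]
        # When N_{c+1} is zero the estimate is undefined; keep the raw count
        # (a common practical fallback).
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c

    # Probability mass reserved for all unseen n-grams: N_1 / N.
    unseen_mass = freq_of_freq[1] / total if total else 0.0
    return adjusted, unseen_mass
```

The reserved mass is what gets spread over unseen n-grams; for higher-order models, the back-off step then decides which lower-order estimate fills in when a context is too sparse.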
Kneser–Ney smoothing improves probability estimates by focusing on how words appear in different contexts.
- Uses absolute discounting: subtracts a fixed value from each observed n-gram count
- The removed probability mass is redistributed to lower-order models
- Lower-order probabilities are based on continuation counts (how many different contexts a word appears in); see the sketch after this list
- This makes common but context-specific words less dominant
➡️ Especially effective for higher-order n-grams and realistic language modeling
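Below is a minimal sketch of interpolated Kneser–Ney for the bigram case (an illustration with a hypothetical function name, not this project's exact code): the observed count is discounted by a fixed amount, and the freed mass is handed to the continuation probability.

```python
from collections import defaultdict

def kneser_ney_bigram_prob(bigram_counts, prev, word, discount=0.75):
    """Sketch of interpolated Kneser-Ney probability P(word | prev) for bigrams.

    bigram_counts: dict mapping (w_prev, w) tuples to raw counts.
    """
    # Continuation sets: which distinct left contexts does each word follow?
    left_contexts = defaultdict(set)
    for (w_prev, w) in bigram_counts:
        left_contexts[w].add(w_prev)
    total_bigram_types = len(bigram_counts)

    # Lower-order continuation probability: fraction of bigram types ending in `word`.
    p_continuation = len(left_contexts.get(word, ())) / total_bigram_types

    # Count of the context and the number of distinct words that follow it.
    context_count = sum(c for (w_prev, _), c in bigram_counts.items() if w_prev == prev)
    if context_count == 0:
        # Unseen context: fall back to the continuation probability alone.
        return p_continuation
    follower_types = sum(1 for (w_prev, _) in bigram_counts if w_prev == prev)

    # Absolute discounting on the higher-order term.
    higher_order = max(bigram_counts.get((prev, word), 0) - discount, 0) / context_count

    # Back-off weight: redistributes the discounted mass to the lower-order model.
    backoff_weight = discount * follower_types / context_count

    return higher_order + backoff_weight * p_continuation
```

With this interpolation, the probabilities for a fixed seen context sum to one over the training vocabulary, which is why the fixed discount can be subtracted safely.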
corpus/
├── file1.text.txt
├── file2.text.txt
- Python 3.8+
- No external libraries
Train Good-Turing model: python main.py corpus/ gt -n 3 -o gt_model.pkl
Train Kneser–Ney model: python main.py corpus/ kn -n 3 -o kn_model.pkl
Score a sentence: python main.py corpus/ gt -n 3 --sentence "This is a test sentence."
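Since models are saved with `-o` as pickle files, a trained model can be reloaded without retraining. A minimal sketch, assuming the pickle holds a model object with a sentence-scoring method; the method name `sentence_logprob` is hypothetical and the real interface depends on this project's code:

```python
import pickle

# Load a model previously trained with: python main.py corpus/ gt -n 3 -o gt_model.pkl
with open("gt_model.pkl", "rb") as f:
    model = pickle.load(f)

# Hypothetical scoring call; the actual method name depends on the implementation.
print(model.sentence_logprob("This is a test sentence."))
```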
- Model quality is measured using perplexity
- Lower perplexity means a better language model (see the sketch below)
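A minimal sketch of how perplexity relates to log-probabilities (assuming natural-log sentence scores; the project's log base and token accounting may differ):

```python
import math

def perplexity(total_logprob, token_count):
    """Perplexity from the summed natural-log probability of a test set.

    total_logprob: sum of log P(sentence) over all test sentences (natural log).
    token_count: total number of tokens scored (including end-of-sentence
    markers, if the model uses them).
    """
    return math.exp(-total_logprob / token_count)
```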
Educational project for learning:
- NLP fundamentals
- Classical language modeling
- Smoothing techniques in language models
- Good (1953) — Good-Turing estimation
- Kneser & Ney (1995) — Improved back-off language models
- Jurafsky & Martin — Speech and Language Processing
For educational and research use.