
📘 N-Gram Language Models (Good-Turing & Kneser–Ney)

Simple Python implementation of statistical n-gram language models built from scratch using Good-Turing and Kneser–Ney smoothing.

✨ Features

  • Text corpus preprocessing and tokenization
  • Flexible n-gram order (n ≥ 1)
  • Good-Turing smoothing with back-off
  • Kneser–Ney smoothing
  • Sentence probability and log-probability scoring
  • Perplexity-based evaluation
  • Model saving/loading and n-gram probability export

🧠 How the models work

Good-Turing smoothing

Good-Turing smoothing redistributes probability mass from seen n-grams to unseen n-grams.

  • Instead of trusting raw counts, it adjusts them using the frequency of frequencies: an n-gram seen r times gets the adjusted count r* = (r + 1) · N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times
  • Unseen n-grams receive the probability mass N(1) / N, i.e. the share of n-grams that were seen exactly once (see the sketch below)
  • For higher-order n-grams, the model backs off to lower-order models when data is sparse

➡️ Best for handling rare and unseen n-grams in small or sparse datasets
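
As a concrete illustration of the adjusted-count formula above, here is a minimal sketch (the function name and interface are made up for this example, not the repository's API). Real implementations additionally smooth the frequency-of-frequencies curve, since N(r+1) is often zero for large r:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing adjusted counts: r* = (r + 1) * N(r+1) / N(r).

    `ngram_counts` maps each n-gram (a tuple of tokens) to its raw count r.
    Returns the adjusted counts and the probability mass reserved for
    unseen n-grams, P(unseen) = N(1) / N.
    """
    freq_of_freq = Counter(ngram_counts.values())  # N(r): n-gram types seen exactly r times
    total = sum(ngram_counts.values())             # N: total n-gram tokens observed

    adjusted = {}
    for ngram, r in ngram_counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[ngram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            # N(r+1) = 0 happens for large r; fall back to the raw count here.
            adjusted[ngram] = r

    unseen_mass = freq_of_freq[1] / total          # singleton share estimates P(unseen)
    return adjusted, unseen_mass
```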


Kneser–Ney smoothing

Kneser–Ney smoothing improves probability estimates by focusing on how words appear in different contexts.

  • Uses absolute discounting: subtracts a fixed discount D (commonly around 0.75) from each observed n-gram count
  • The freed probability mass is redistributed to lower-order models via an interpolation weight
  • Lower-order probabilities are based on continuation counts
    (how many distinct contexts a word appears in) rather than raw frequency
  • This keeps frequent but context-bound words from dominating: "Francisco" is common, yet it almost always follows "San", so its continuation probability is low (see the sketch below)

➡️ Especially effective for higher-order n-grams and realistic language modeling
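
Below is a minimal sketch of interpolated Kneser–Ney for bigrams, assuming a `bigram_counts` mapping from (prev, word) pairs to counts; the names and interface are illustrative, not the repository's API. Real implementations apply the same recursion at every order:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return a function computing P_KN(w | prev) for an interpolated bigram model."""
    context_total = Counter()         # c(prev, *): total count of each context
    continuations = defaultdict(set)  # distinct contexts each word follows
    followers = defaultdict(set)      # distinct words following each context
    for (prev, w), c in bigram_counts.items():
        context_total[prev] += c
        continuations[w].add(prev)
        followers[prev].add(w)

    num_bigram_types = len(bigram_counts)

    def prob(prev, w):
        # Continuation probability: fraction of bigram types that end in w.
        p_cont = len(continuations[w]) / num_bigram_types
        total = context_total[prev]
        if total == 0:
            return p_cont  # unseen context: back off to continuation probability
        # Discounted bigram estimate plus interpolation weight lambda(prev).
        c = bigram_counts.get((prev, w), 0)
        lam = discount * len(followers[prev]) / total
        return max(c - discount, 0) / total + lam * p_cont

    return prob
```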


📂 Corpus format

corpus/
├── file1.text.txt
├── file2.text.txt

⚙️ Requirements

  • Python 3.8+
  • No external libraries

🚀 Usage

Train a Good-Turing model: python main.py corpus/ gt -n 3 -o gt_model.pkl

Train a Kneser–Ney model: python main.py corpus/ kn -n 3 -o kn_model.pkl

Score a sentence: python main.py corpus/ gt -n 3 --sentence "This is a test sentence."
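
Under the hood, sentence scoring is just the chain rule over n-gram probabilities. A minimal sketch, assuming a hypothetical `prob(context, word)` callable that returns P(word | context), such as the one produced by the Kneser–Ney sketch above; the padding symbols are illustrative:

```python
import math

def sentence_log_prob(prob, tokens, n=3):
    """Chain rule: log P(w_1..w_T) = sum over i of log P(w_i | previous n-1 words)."""
    # Pad with start/end markers; these must match what the model was trained with.
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    logp = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        logp += math.log(prob(context, padded[i]))  # smoothing keeps every term > 0
    return logp
```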

📊 Evaluation

  • Model quality is measured using perplexity on held-out text
  • Lower perplexity means a better language model (see the sketch below)
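
Perplexity is the exponentiated average negative log-probability per token: PP = exp(−(1/N) Σ log P). A minimal sketch, assuming per-sentence natural log-probabilities such as those from the hypothetical `sentence_log_prob` above:

```python
import math

def perplexity(log_probs, token_counts):
    """exp(-(1/N) * sum of log-probabilities), where N is the total tokens scored.

    `log_probs`: per-sentence natural log-probabilities;
    `token_counts`: the matching number of scored tokens per sentence.
    """
    n_tokens = sum(token_counts)
    return math.exp(-sum(log_probs) / n_tokens)
```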

🎓 Purpose

Educational project for learning:

  • NLP fundamentals
  • Classical language modeling
  • Smoothing techniques in language models

📚 References

  • Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika.
  • Kneser, R., & Ney, H. (1995). Improved Backing-off for M-gram Language Modeling. Proc. ICASSP.
  • Jurafsky, D., & Martin, J. H. Speech and Language Processing.

✅ License

For educational and research use.
