A practical notebook for binary sentiment analysis on the classic IMDB 50K Reviews dataset.
Clean EDA → strong classical baselines (NB / LogReg / Linear SVM + calibration) → F1-based threshold tuning → explainability → optional BiLSTM baseline.
- Kaggle-friendly: path-flexible loading, deterministic seeds, artifacts saved.
- Clear EDA: class distribution, text lengths, top n-grams.
- Strong baselines: TF-IDF + MultinomialNB / Logistic Regression / Linear SVM (calibrated).
- Robust evaluation: stratified CV, ROC/PR curves, F1-optimized threshold, calibration plot, Brier score.
- Explainability: top weighted terms from Logistic Regression (no leakage).
- Error analysis: quick FP/FN peek.
- Deep learning (optional): compact BiLSTM baseline with tokenization, embedding, learning curves, confusion matrix.
- Source file: IMDB Dataset.csv (50,000 rows)
- Columns:
  - review: raw movie review text
  - sentiment: positive / negative

The dataset file is not included in this repo.
For local runs, place it under: data/raw/IMDB Dataset.csv
Data loading supports local data/raw/ and Kaggle /kaggle/input/ via repo_utils/pathing.py.
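The actual helpers in repo_utils/pathing.py are not shown here; as an illustration only, a resolver with the behavior described above (DATA_PATH override, then local data/raw/, then a Kaggle input mount) might look like this sketch. The function name `resolve_data_path` is hypothetical.

```python
import os
from pathlib import Path

def resolve_data_path(filename: str = "IMDB Dataset.csv") -> Path:
    """Locate the dataset: DATA_PATH env var first, then local
    data/raw/, then a Kaggle /kaggle/input/ mount."""
    env_path = os.environ.get("DATA_PATH")
    if env_path and Path(env_path).exists():
        return Path(env_path)
    local = Path("data/raw") / filename
    if local.exists():
        return local
    kaggle_root = Path("/kaggle/input")
    if kaggle_root.exists():
        matches = list(kaggle_root.glob(f"**/{filename}"))
        if matches:
            return matches[0]
    raise FileNotFoundError(
        f"{filename} not found; set DATA_PATH or place it under data/raw/"
    )
```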
```
.
├── text-sentiment-classification.ipynb
├── data/
│   └── raw/            # put IMDB Dataset.csv here (local runs)
├── artifacts/          # saved models / vectorizer / tables
├── repo_utils/
│   └── pathing.py      # local + Kaggle path helpers
├── CASE_STUDY.md
├── requirements.txt
├── requirements-dev.txt
└── .gitignore
```
- Setup & Imports
- Load & Peek
- Light Cleaning (HTML strip, lowercasing, punctuation/digits removal; optional stopwords & lemmatization)
- EDA (distributions, text lengths, n-grams)
- Vectorization (TF-IDF)
- Classical Models (NB / LogReg / LinearSVM with calibration, stratified CV)
- Holdout Evaluation (metrics, ROC/PR curves, confusion matrix)
- Calibration & Brier score
- Threshold tuning (F1)
- Explainability (LogReg coefficients)
- Error analysis (FP/FN)
- BiLSTM Baseline (2 epochs)
- Artifacts saved (vectorizer, best model, summary CSV)
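The classical steps above (vectorization through holdout prediction) can be sketched as a single scikit-learn pipeline. The hyperparameters and the toy corpus below are illustrative, not the notebook's actual settings; the key point is that keeping the vectorizer inside the pipeline means TF-IDF is fit on training data only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vectorizer + classifier in one Pipeline: TF-IDF is fit on the
# training texts only, which is what prevents leakage in CV.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Tiny illustrative corpus (1 = positive, 0 = negative).
train_texts = ["great great film", "bad boring film",
               "great acting", "boring and bad"]
train_labels = [1, 0, 1, 0]
clf.fit(train_texts, train_labels)
pred = clf.predict(["great film", "boring bad"])
```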
- Python: 3.10–3.12
- Core: pandas, numpy, scikit-learn, matplotlib, seaborn, joblib
- NLP (optional): contractions, nltk
- DL (optional): tensorflow>=2.15

Install:

```bash
pip install -r requirements.txt
```

Notes:
- For classical models only, requirements.txt is enough.
- To run the BiLSTM section, install TensorFlow separately:

```bash
pip install "tensorflow>=2.15"
```
```bash
git clone https://github.com/tarekmasryo/text-sentiment-analysis.git
cd text-sentiment-analysis
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook text-sentiment-classification.ipynb
```

- Place IMDB Dataset.csv under data/raw/ if not running on Kaggle.
- Alternatively, set a full path with DATA_PATH:
  - Windows (PowerShell): $env:DATA_PATH="C:\path\IMDB Dataset.csv"
  - macOS/Linux: export DATA_PATH="/path/IMDB Dataset.csv"
These checks are lightweight and do not run the notebook (no data required):

```bash
pip install -r requirements.txt -r requirements-dev.txt
ruff check .
```

Notes:
- Ruff is configured to exclude .ipynb files (CI stays stable).
- Auto-fix import order and simple issues:

```bash
ruff check . --fix
```
- CV table: mean ± std for Accuracy / F1 / ROC-AUC across folds.
- Curves: ROC, Precision-Recall, Calibration.
- Confusion matrices: default 0.5 and F1-optimized threshold.
- Explainability: top +/− terms from Logistic Regression.
- Saved artifacts (in artifacts/): vectorizer, best model, metrics tables.
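Artifact saving presumably uses joblib, since it is listed in the core dependencies. A minimal sketch with an illustrative file name (not necessarily the notebook's):

```python
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

ARTIFACTS = Path("artifacts")
ARTIFACTS.mkdir(exist_ok=True)

# Persist the fitted vectorizer so inference can skip refitting.
vectorizer = TfidfVectorizer().fit(["good movie", "bad movie"])
joblib.dump(vectorizer, ARTIFACTS / "tfidf_vectorizer.joblib")

# Later (or in another process): reload and transform new text.
restored = joblib.load(ARTIFACTS / "tfidf_vectorizer.joblib")
features = restored.transform(["good movie"])
```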
Example results (from one run):
- LinearSVM (calibrated): Acc ≈ 0.90 · F1 ≈ 0.90
- BiLSTM (2 epochs): Acc ≈ 0.85
- No leakage: vectorizer fit on train only.
- Calibration: SVM calibrated; Brier score reported.
- Thresholding: F1-optimal threshold from PR curve.
- Reproducibility: SEED=42, stratified splits.
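The F1-optimal threshold mentioned above can be derived directly from the precision-recall curve; the helper below is an illustrative sketch, and the notebook's implementation may differ in detail.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, y_prob):
    """Return the probability cutoff that maximizes F1 on the PR curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final (precision=1, recall=0) point has no threshold; drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return float(thresholds[int(np.argmax(f1))])

# Toy example: five held-out probabilities with true labels.
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
best_t = f1_optimal_threshold(y_true, y_prob)  # 0.35 here
```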
See CASE_STUDY.md for the project story, decisions, and takeaways.
Dataset: IMDB 50K Reviews (Kaggle).
Author: Tarek Masryo · GitHub / Kaggle / HuggingFace