A practical notebook for binary sentiment analysis on the classic IMDB 50K Reviews dataset.
Clean EDA → strong classical baselines (NB / LogReg / Linear SVM + calibration) → F1-based threshold tuning → explainability → optional BiLSTM baseline.
- Kaggle-friendly: path-flexible loading, deterministic seeds, artifacts saved.
- Clear EDA: class distribution, text lengths, top n-grams.
- Strong baselines: TF-IDF + MultinomialNB / Logistic Regression / Linear SVM (calibrated).
- Robust evaluation: stratified CV, ROC/PR curves, F1-optimized threshold, calibration plot, Brier score.
- Explainability: top weighted terms from Logistic Regression (no leakage).
- Error analysis: quick FP/FN peek.
- Deep learning (optional): compact BiLSTM baseline with tokenization, embedding, learning curves, confusion matrix.
- Source file: IMDB Dataset.csv (50,000 rows)
- Columns:
  - review: raw movie review text
  - sentiment: positive / negative

The dataset file is not included in this repo.
For local runs, place it under: data/raw/IMDB Dataset.csv
Data loading supports local data/raw/ and Kaggle /kaggle/input/ via repo_utils/pathing.py.
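The actual helpers in repo_utils/pathing.py are not shown here; as an illustration only, a resolver with the behavior described above (DATA_PATH override, then local data/raw/, then a Kaggle input mount) might look like this sketch. The function name `resolve_data_path` is hypothetical.

```python
import os
from pathlib import Path

def resolve_data_path(filename: str = "IMDB Dataset.csv") -> Path:
    """Locate the dataset: DATA_PATH env var first, then local
    data/raw/, then a Kaggle /kaggle/input/ mount."""
    env_path = os.environ.get("DATA_PATH")
    if env_path and Path(env_path).exists():
        return Path(env_path)
    local = Path("data/raw") / filename
    if local.exists():
        return local
    kaggle_root = Path("/kaggle/input")
    if kaggle_root.exists():
        matches = list(kaggle_root.glob(f"**/{filename}"))
        if matches:
            return matches[0]
    raise FileNotFoundError(
        f"{filename} not found; set DATA_PATH or place it under data/raw/"
    )
```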
```
.
├── text-sentiment-classification.ipynb
├── data/
│   └── raw/            # put IMDB Dataset.csv here (local runs)
├── artifacts/          # saved models / vectorizer / tables
├── repo_utils/
│   └── pathing.py      # local + Kaggle path helpers
├── CASE_STUDY.md
├── requirements.txt
├── requirements-dev.txt
└── .gitignore
```
- Setup & Imports
- Load & Peek
- Light Cleaning (HTML strip, lowercasing, punctuation/digits removal; optional stopwords & lemmatization)
- EDA (distributions, text lengths, n-grams)
- Vectorization (TF-IDF)
- Classical Models (NB / LogReg / LinearSVM with calibration, stratified CV)
- Holdout Evaluation (metrics, ROC/PR curves, confusion matrix)
- Calibration & Brier score
- Threshold tuning (F1)
- Explainability (LogReg coefficients)
- Error analysis (FP/FN)
- BiLSTM Baseline (2 epochs)
- Artifacts saved (vectorizer, best model, summary CSV)
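The classical steps above (vectorization through holdout prediction) can be sketched as a single scikit-learn pipeline. The hyperparameters and the toy corpus below are illustrative, not the notebook's actual settings; the key point is that keeping the vectorizer inside the pipeline means TF-IDF is fit on training data only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vectorizer + classifier in one Pipeline: TF-IDF is fit on the
# training texts only, which is what prevents leakage in CV.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Tiny illustrative corpus (1 = positive, 0 = negative).
train_texts = ["great great film", "bad boring film",
               "great acting", "boring and bad"]
train_labels = [1, 0, 1, 0]
clf.fit(train_texts, train_labels)
pred = clf.predict(["great film", "boring bad"])
```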
- Python: 3.10–3.12
- Core: pandas, numpy, scikit-learn, matplotlib, seaborn, joblib
- NLP (optional): contractions, nltk
- DL (optional): tensorflow>=2.15

Install:

```bash
pip install -r requirements.txt
```

Notes:
- For classical models only, requirements.txt is enough.
- To run the BiLSTM section, install TensorFlow separately:

```bash
pip install "tensorflow>=2.15"
```
```bash
git clone https://github.com/tarekmasryo/text-sentiment-analysis.git
cd text-sentiment-analysis
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook text-sentiment-classification.ipynb
```

- Place IMDB Dataset.csv under data/raw/ if not running on Kaggle.
- Alternatively, set a full path with DATA_PATH:
  - Windows (PowerShell): $env:DATA_PATH="C:\path\IMDB Dataset.csv"
  - macOS/Linux: export DATA_PATH="/path/IMDB Dataset.csv"
These checks are lightweight and do not run the notebook (no data required):

```bash
pip install -r requirements.txt -r requirements-dev.txt
ruff check .
```

Notes:
- Ruff is configured to exclude .ipynb files (CI stays stable).
- Auto-fix import order and simple issues:

```bash
ruff check . --fix
```
- CV table: mean ± std for Accuracy / F1 / ROC-AUC across folds.
- Curves: ROC, Precision-Recall, Calibration.
- Confusion matrices: default 0.5 and F1-optimized threshold.
- Explainability: top +/− terms from Logistic Regression.
- Saved artifacts (in artifacts/): vectorizer, best model, metrics tables.
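Artifact saving presumably uses joblib, since it is listed in the core dependencies. A minimal sketch with an illustrative file name (not necessarily the notebook's):

```python
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

ARTIFACTS = Path("artifacts")
ARTIFACTS.mkdir(exist_ok=True)

# Persist the fitted vectorizer so inference can skip refitting.
vectorizer = TfidfVectorizer().fit(["good movie", "bad movie"])
joblib.dump(vectorizer, ARTIFACTS / "tfidf_vectorizer.joblib")

# Later (or in another process): reload and transform new text.
restored = joblib.load(ARTIFACTS / "tfidf_vectorizer.joblib")
features = restored.transform(["good movie"])
```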
Example results (from one run):
- LinearSVM (calibrated): Acc ≈ 0.90 · F1 ≈ 0.90
- BiLSTM (2 epochs): Acc ≈ 0.85
- No leakage: vectorizer fit on train only.
- Calibration: SVM calibrated; Brier score reported.
- Thresholding: F1-optimal threshold from PR curve.
- Reproducibility: SEED=42, stratified splits.
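The F1-optimal threshold mentioned above can be derived directly from the precision-recall curve; the helper below is an illustrative sketch, and the notebook's implementation may differ in detail.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, y_prob):
    """Return the probability cutoff that maximizes F1 on the PR curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final (precision=1, recall=0) point has no threshold; drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return float(thresholds[int(np.argmax(f1))])

# Toy example: five held-out probabilities with true labels.
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
best_t = f1_optimal_threshold(y_true, y_prob)  # 0.35 here
```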
See CASE_STUDY.md for the project story, decisions, and takeaways.
Dataset: IMDB 50K Reviews (Kaggle).
Author: Tarek Masryo · GitHub / Kaggle / HuggingFace