This repository contains the implementation and analysis of a machine learning-based spam detection system. The project evaluates different models and feature extraction techniques to classify messages as spam or ham (not spam). It provides insights into model performance, feature representation, and error analysis.
The main objective of this project is to build and evaluate machine learning models to classify text messages into two categories: spam and ham. The project explores multiple techniques, including TF-IDF, Bag of Words (BOW), and Multinomial Naive Bayes (MNB), while focusing on key evaluation metrics like precision, recall, and accuracy.
- Preprocessing of raw text messages.
- Feature extraction using TF-IDF and Bag of Words (BOW).
- Model comparison: Logistic Regression (TF-IDF and BOW) and Multinomial Naive Bayes (MNB).
- Evaluation using confusion matrices, classification reports, and cross-validation.
- Error analysis for misclassified messages.
The dataset used for this project contains labeled text messages categorized as either spam or ham, with an imbalanced mix of the two categories.
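For reference, a minimal sketch of the loading and feature-extraction steps might look like the following; the file name `spam.csv` and the `label`/`message` column names are assumptions for illustration, not part of this repository:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the labeled messages (file and column names are hypothetical).
df = pd.read_csv("spam.csv")  # assumed columns: "label" ("spam"/"ham"), "message"

# A stratified split preserves the spam/ham ratio in both sets,
# which matters because the classes are imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# Bag of Words: raw token counts (lowercasing is applied by default).
bow = CountVectorizer(stop_words="english")
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

# TF-IDF: token counts re-weighted by inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english")
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```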
- **Logistic Regression with TF-IDF:**
  - Balanced performance: 94% precision and 91% recall for spam.
  - Suitable for general-purpose spam detection.
- **Logistic Regression with Bag of Words (BOW):**
  - High precision (98%) but lower recall (90%) for spam.
  - Ideal for applications where avoiding false positives is critical.
- **Multinomial Naive Bayes (MNB):**
  - High recall (95%) for spam but slightly lower precision (91%).
  - Best when catching as much spam as possible matters more than avoiding false positives.
| Model | Precision (Spam) | Recall (Spam) | Accuracy |
|---|---|---|---|
| Logistic Regression (TF-IDF) | 94% | 91% | 98% |
| Logistic Regression (BOW) | 98% | 90% | 98% |
| Multinomial Naive Bayes | 91% | 95% | 98% |
**Key insight:** The choice of model depends on whether precision (avoiding false positives) or recall (catching more spam) should be prioritized.
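As a sketch of how the three models could be trained and compared, building on the variables above (the hyperparameters are illustrative defaults, not the tuned values behind the table, and pairing MNB with count features is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

models = {
    "Logistic Regression (TF-IDF)": (LogisticRegression(max_iter=1000), X_train_tfidf, X_test_tfidf),
    "Logistic Regression (BOW)": (LogisticRegression(max_iter=1000), X_train_bow, X_test_bow),
    "Multinomial Naive Bayes": (MultinomialNB(), X_train_bow, X_test_bow),
}

# Fit each model on its feature set and report per-class metrics.
for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    preds = model.predict(X_te)
    print(f"--- {name} ---")
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
```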
An error analysis of misclassified messages revealed:
- Some spam messages were misclassified as ham because they lacked explicit spam keywords or used ambiguous phrasing.
- Some ham messages were misclassified as spam because they contained heavy numeric content or terms that resemble common spam patterns.
This analysis highlights potential improvements in:
- Text preprocessing (e.g., removing numeric sequences or certain stopwords).
- Adjusting the model to better handle edge cases.
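One way to pull out the misclassified messages for inspection, continuing from the sketches above (the numeric-stripping line is one possible preprocessing tweak, shown commented out):

```python
# Refit the TF-IDF logistic regression and collect its test-set errors.
clf = LogisticRegression(max_iter=1000).fit(X_train_tfidf, y_train)

results = pd.DataFrame({"message": X_test, "true": y_test})
results["predicted"] = clf.predict(X_test_tfidf)  # positional, same order as X_test
errors = results[results["true"] != results["predicted"]]

# Missed spam (spam labeled ham) vs. false positives (ham labeled spam).
print(errors[errors["true"] == "spam"].head())
print(errors[errors["true"] == "ham"].head())

# A possible preprocessing tweak: strip long digit runs before vectorizing.
# df["message"] = df["message"].str.replace(r"\d{4,}", " ", regex=True)
```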
Cross-validation was used during hyperparameter tuning of the logistic regression models. The variation in accuracy across folds was visualized to assess:
- Stability of the models across different parameter combinations.
- Signs of overfitting or underfitting; the minimal variation observed indicates robust generalization.
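A sketch of the kind of tuning loop this describes, assuming a grid over the regularization strength `C` (the actual grid used is not documented here):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

Cs = [0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": Cs},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train_tfidf, y_train)

# Mean cross-validated accuracy (+/- one std) for each parameter setting.
plt.errorbar(Cs, grid.cv_results_["mean_test_score"],
             yerr=grid.cv_results_["std_test_score"], marker="o")
plt.xscale("log")
plt.xlabel("C (inverse regularization strength)")
plt.ylabel("Cross-validated accuracy")
plt.show()
```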
- Python 3.8+
- Libraries: `scikit-learn`, `numpy`, `pandas`, `matplotlib`
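The dependencies can be installed with pip:

```bash
pip install scikit-learn numpy pandas matplotlib
```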