This repository contains the implementation and analysis of a machine learning-based spam detection system. The project evaluates different models and feature extraction techniques to classify messages as spam or ham (not spam). It provides insights into model performance, feature representation, and error analysis.
The main objective of this project is to build and evaluate machine learning models to classify text messages into two categories: spam and ham. The project explores multiple techniques, including TF-IDF, Bag of Words (BOW), and Multinomial Naive Bayes (MNB), while focusing on key evaluation metrics like precision, recall, and accuracy.
- Preprocessing of raw text messages.
- Feature extraction using TF-IDF and Bag of Words (BOW).
- Model comparison: Logistic Regression (TF-IDF and BOW) and Multinomial Naive Bayes (MNB).
- Evaluation using confusion matrices, classification reports, and cross-validation.
- Error analysis for misclassified messages.
The dataset used for this project contains labeled text messages categorized as either spam or ham, with an imbalanced mix of the two categories.
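For reference, a minimal sketch of the loading and feature-extraction steps might look like the following; the file name `spam.csv` and the `label`/`message` column names are assumptions for illustration, not part of this repository:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the labeled messages (file and column names are hypothetical).
df = pd.read_csv("spam.csv")  # assumed columns: "label" ("spam"/"ham"), "message"

# A stratified split preserves the spam/ham ratio in both sets,
# which matters because the classes are imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# Bag of Words: raw token counts (lowercasing is applied by default).
bow = CountVectorizer(stop_words="english")
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

# TF-IDF: token counts re-weighted by inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english")
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```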
- **Logistic Regression with TF-IDF:**
  - Balanced performance: 94% precision and 91% recall for spam.
  - Suitable for general-purpose spam detection.
- **Logistic Regression with Bag of Words (BOW):**
  - High precision (98%) but lower recall (90%) for spam.
  - Ideal for applications where avoiding false positives is critical.
- **Multinomial Naive Bayes (MNB):**
  - High recall (95%) for spam but slightly lower precision (91%).
  - Best when catching as much spam as possible matters more than avoiding false positives.
| Model | Precision (Spam) | Recall (Spam) | Accuracy |
|---|---|---|---|
| Logistic Regression (TF-IDF) | 94% | 91% | 98% |
| Logistic Regression (BOW) | 98% | 90% | 98% |
| Multinomial Naive Bayes | 91% | 95% | 98% |
**Key insight:** The choice of model depends on whether precision (avoiding false positives) or recall (catching more spam) should be prioritized.
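As a sketch of how the three models could be trained and compared, building on the variables above (the hyperparameters are illustrative defaults, not the tuned values behind the table, and pairing MNB with count features is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

models = {
    "Logistic Regression (TF-IDF)": (LogisticRegression(max_iter=1000), X_train_tfidf, X_test_tfidf),
    "Logistic Regression (BOW)": (LogisticRegression(max_iter=1000), X_train_bow, X_test_bow),
    "Multinomial Naive Bayes": (MultinomialNB(), X_train_bow, X_test_bow),
}

# Fit each model on its feature set and report per-class metrics.
for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    preds = model.predict(X_te)
    print(f"--- {name} ---")
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
```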
An error analysis of misclassified messages revealed:
- Some spam messages were misclassified as ham because they lacked explicit spam keywords or used ambiguous phrasing.
- Some ham messages were misclassified as spam because they contained heavy numeric content or terms that resemble common spam patterns.
This analysis highlights potential improvements in:
- Text preprocessing (e.g., removing numeric sequences or certain stopwords).
- Adjusting the model to better handle edge cases.
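One way to pull out the misclassified messages for inspection, continuing from the sketches above (the numeric-stripping line is one possible preprocessing tweak, shown commented out):

```python
# Refit the TF-IDF logistic regression and collect its test-set errors.
clf = LogisticRegression(max_iter=1000).fit(X_train_tfidf, y_train)

results = pd.DataFrame({"message": X_test, "true": y_test})
results["predicted"] = clf.predict(X_test_tfidf)  # positional, same order as X_test
errors = results[results["true"] != results["predicted"]]

# Missed spam (spam labeled ham) vs. false positives (ham labeled spam).
print(errors[errors["true"] == "spam"].head())
print(errors[errors["true"] == "ham"].head())

# A possible preprocessing tweak: strip long digit runs before vectorizing.
# df["message"] = df["message"].str.replace(r"\d{4,}", " ", regex=True)
```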
Cross-validation was used during hyperparameter tuning of the logistic regression models. The variation in accuracy across folds was visualized to assess:
- Stability of the models across different parameter combinations.
- Signs of overfitting or underfitting; the minimal variation observed indicates robust generalization.
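A sketch of the kind of tuning loop this describes, assuming a grid over the regularization strength `C` (the actual grid used is not documented here):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

Cs = [0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": Cs},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train_tfidf, y_train)

# Mean cross-validated accuracy (+/- one std) for each parameter setting.
plt.errorbar(Cs, grid.cv_results_["mean_test_score"],
             yerr=grid.cv_results_["std_test_score"], marker="o")
plt.xscale("log")
plt.xlabel("C (inverse regularization strength)")
plt.ylabel("Cross-validated accuracy")
plt.show()
```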
- Python 3.8+
- Libraries: `scikit-learn`, `numpy`, `pandas`, `matplotlib`
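The dependencies can be installed with pip:

```bash
pip install scikit-learn numpy pandas matplotlib
```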