A Machine Learning–Based Spam and Phishing Email Detection System
Author: Janice Sakina Akiza Nizigama
University: Aalborg University Copenhagen
Supervisor: Yan Kyaw Tun
Project Period: Spring 2025
Repository: JSanizi/no-phishing-zone
No Phishing Zone is a research-based engineering project exploring how various machine learning algorithms can be trained and compared to detect spam and phishing emails effectively.
The project implements a complete spam-filtering pipeline, from dataset preprocessing and vectorization to model training, tuning, and evaluation, culminating in a functional spam filter application that simulates email classification.
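The vectorize, train, and tune steps of the pipeline described above can be sketched in a few lines of scikit-learn. This is a minimal illustration and not the project's actual code: the toy corpus, the choice of Logistic Regression as the example model, the `param_grid` values, and the `f1_weighted` scoring are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cleaned email text (1 = spam, 0 = non-spam).
# The real project loads SpamAssassin.csv and CEAS_08.csv instead.
texts = [
    "win a free prize now",
    "claim your free money today",
    "urgent verify your account password",
    "free lottery winner click here",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow at noon",
    "project status update attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorization and classification chained so that the TF-IDF vocabulary
# is fitted only on the training part of each cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical search grid; the values are illustrative only.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(texts, labels)

# best_estimator_ is the pipeline refitted on all data with the best parameters.
best_model = search.best_estimator_
prediction = best_model.predict(["free prize click now"])
```

Keeping the vectorizer inside the `Pipeline` avoids leaking vocabulary statistics from the validation folds into training, which matters when models are compared on cross-validated scores.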
The study compares seven models:
- Autoencoder (Anomaly Detection)
- Naive Bayes
- Logistic Regression
- Random Forest
- AdaBoost
- Gradient Boost
- K-Nearest Neighbors
After comprehensive testing, Random Forest and Logistic Regression achieved the highest performance, with weighted F1-scores of 0.86 and 0.80, respectively.
- Data Preprocessing & Cleaning
  - Loads datasets (SpamAssassin.csv, CEAS_08.csv)
  - Cleans text and removes noise
  - Converts text into TF-IDF vectors
- Model Training & Hyperparameter Tuning
  - Trains seven ML models using scikit-learn
  - Performs parameter optimization with GridSearchCV
  - Saves the best-performing models for deployment
- Spam Filter Application
  - Connects to an email inbox (via IMAP)
  - Sends test emails to simulate spam delivery
  - Classifies incoming emails using trained models
  - Displays confusion matrices and evaluation reports
- Autoencoder (Anomaly Detection)
  - Trained only on non-spam samples
  - Detects spam by comparing each email's reconstruction error against a threshold
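The anomaly-detection idea above, training on non-spam only and flagging emails whose reconstruction error exceeds a threshold, can be sketched as follows. The function names, the mean-squared-error metric, and the 95th-percentile cutoff are assumptions made for illustration; the source does not state how the project sets its threshold. The sketch covers only the thresholding step and takes the autoencoder's reconstructions as given.

```python
import numpy as np


def reconstruction_errors(original, reconstructed):
    """Mean squared error per sample (row) between input and reconstruction."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.mean((original - reconstructed) ** 2, axis=1)


def fit_threshold(ham_errors, percentile=95.0):
    """Pick a cutoff from errors measured on non-spam emails only.

    The percentile is a hypothetical tuning knob: higher values flag
    fewer emails as spam.
    """
    return float(np.percentile(ham_errors, percentile))


def flag_spam(errors, threshold):
    """Return a boolean array: True where the error exceeds the threshold."""
    return np.asarray(errors) > threshold


# Errors the (hypothetical) autoencoder produced on held-out non-spam emails.
ham_errors = np.array([0.010, 0.012, 0.015, 0.011, 0.020])
threshold = fit_threshold(ham_errors, percentile=95.0)

# Errors for two new emails: one reconstructs well, one does not.
new_errors = np.array([0.011, 0.250])
flags = flag_spam(new_errors, threshold)
```

Because the model only ever sees non-spam during training, spam tends to reconstruct poorly, so a large error is treated as a sign that the email is spam.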
Requirements:
- Python 3.10 or newer
- pip package manager

Install the dependencies:
pip install -r requirements.txt

Step-by-step on how to run the scripts:
# 1️⃣ Preprocess the datasets
python training_and_tunning_models/data_preprocessing.py
# 2️⃣ Train and tune models
python training_and_tunning_models/train_and_parametertuning.py
# 3️⃣ Run the spam filter application
python application/ac_filter.py

Models are evaluated with accuracy, precision, recall, weighted F1-score, and a confusion matrix.
| Model | Weighted F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| Random Forest | 0.86 | 0.87 | 0.84 | 0.86 |
| Logistic Regression | 0.80 | 0.82 | 0.78 | 0.81 |
| AdaBoost | 0.78 | 0.76 | 0.79 | 0.78 |
| Gradient Boost | 0.77 | 0.74 | 0.78 | 0.77 |
| Naive Bayes | 0.75 | 0.73 | 0.76 | 0.75 |
| KNN | 0.71 | 0.69 | 0.70 | 0.71 |
| Autoencoder | 0.68 | 0.66 | 0.70 | 0.68 |
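
The metrics in the table can be computed with scikit-learn as in the sketch below. The labels are made up for illustration and do not reproduce any row of the table. Weighted averaging for precision and recall is an assumption, since the source only states it for the F1-score.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions (1 = spam, 0 = non-spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" averages the per-class scores, weighted by class support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} weighted_f1={f1:.2f}")
print(matrix)
```
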
no-phishing-zone/
├── application/
│ ├── ac_filter.py
│ ├── connect_to_email.py
│ └── sending_mail.py
│
├── tests/
│ ├── test_integration.py
│ └── test_unit.py
│
├── training_and_tunning_models/
│ ├── ac_parameter_tuning.py
│ ├── autoencoder.py
│ ├── data_preprocessing.py
│ ├── parametertuning_table.py
│ ├── split_data.py
│ └── train_and_parametertuning.py
│
├── spam_filter.py
├── start_parametertuning.py
├── .gitattributes
├── .gitignore
└── README.md
Janice Sakina Akiza Nizigama
🔗 LinkedIn: https://www.linkedin.com/in/janice-nizigama
💻 GitHub: https://github.com/JSanizi
© 2025 Aalborg University — Academic / Non-Commercial use permitted with citation. For other uses, please contact the author.