This repository contains the full workflow for the ICS5110 Applied Machine Learning group project focused on Malta traffic accidents. The work combines unstructured police press releases and local news articles, engineers a rich set of geographic, temporal, weather, text-based, and severity-related features, and exports curated tabular datasets ready for downstream modelling and visualisation.
- `0. Datasets/Inputs/` – raw accident narratives sourced from police press releases and local news articles.
- `0. Datasets/Output/` – curated crash dataset exports (e.g., `crash_final.csv`).
- `1. Setup/` – environment definitions (`environment.yml`, `requirements.txt`), API key config (`openrouter_key.env`), and caches.
- `1. Setup/Cache/` – persistent caches used during data preparation (e.g., weather lookups).
- `1. Setup/Localities/` – locality metadata used for geocoding Maltese / Gozitan locations.
- `1. Setup/Localities/0_locality_viewer.ipynb` – optional notebook for exploring locality metadata.
- `1. Setup/weather_code.json` – weather code mapping used for semantic labelling.
- `2. Jupyter Notebooks/1_data_preparation.ipynb` – main notebook that cleans data, performs feature engineering, and exports the curated datasets.
- `2. Jupyter Notebooks/TEMP/` – intermediate artefacts from data prep (LLM extraction outputs, severity fusion exports, heatmaps, etc.). The notebooks create and update files here.
- `2. Jupyter Notebooks/Models/` – persisted model bundles (`SVM.pkl`, `RF.pkl`, `LogR.pkl`, `GB.pkl`).
- `2. Jupyter Notebooks/4_results_comparison.ipynb` – single-pipeline comparison of all models with metrics, ROC curves, significance tests, and ethical analysis.
- `3. Documentation/Notebook PDFs/` – rendered PDF exports of the notebooks.
- `3. Documentation/` – project documentation and submission artefacts (e.g., plagiarism form).
- `githubAssets/` – README images and figures.
- `2. Jupyter Notebooks/1_data_preparation.ipynb` – cleans raw narratives, runs LLM-assisted extraction (with caching + validation), resolves localities, enriches weather, and fuses severity signals; writes `0. Datasets/Output/crash_final.csv` plus temporary artefacts under `2. Jupyter Notebooks/TEMP/` (e.g., `extracted_*_features.csv`, `data_featured.csv`, `severity_fused_scores.csv`, `malta_heatmap.html`).
- `2. Jupyter Notebooks/2_exploratory_data_analysis.ipynb` – EDA on `crash_final.csv` covering severity distributions, temporal patterns, weather effects, spatial risk, PCA, and research-question-aligned diagnostics.
- Modelling notebooks (all expect `../0. Datasets/Output/crash_final.csv`):
  - `3a_svm_DavidFarrugia.ipynb` – SVM classifiers with Optuna tuning, class imbalance handling, and model persistence to `2. Jupyter Notebooks/Models/SVM.pkl`.
  - `3b_rf_AndreaFilibertoLucas.ipynb` – Random Forest classification + regression, temporal feature engineering, Optuna tuning, optional SHAP, and model persistence to `2. Jupyter Notebooks/Models/RF.pkl`.
  - `3c_logr_CharlonCurmi.ipynb` – Logistic Regression with cyclic encoding, interaction terms, Optuna tuning, SHAP analysis, spatial hotspot mapping, and model persistence to `2. Jupyter Notebooks/Models/LogR.pkl`.
  - `3d_gb_AntonioGaldes.ipynb` – Gradient Boosting with imputation/encoding, Optuna TPE tuning, and model persistence to `2. Jupyter Notebooks/Models/GB.pkl`.
- `2. Jupyter Notebooks/4_results_comparison.ipynb` – unified preprocessing + hold-out test evaluation for SVM, RF, LogR, and GB with metrics tables, significance testing, trade-off analysis, and ethical/fairness review.
Run the preparation notebook first so the modelling notebooks can load `crash_final.csv`.
*Figure: severity class counts and aggregated severity score distributions from the EDA notebook.*
| File | Description |
|---|---|
| `0. Datasets/Inputs/police_press_releases.csv` | 111 police-issued traffic press releases with publication/modified dates and free-text descriptions. |
| `0. Datasets/Inputs/local_news_articles.csv` | 321 local news articles with metadata (URL, outlet, author, publish date, tags, etc.) and corresponding narrative content. |
Both sources describe traffic collisions across Malta and Gozo. The `1_data_preparation.ipynb` notebook orchestrates the pipeline, unifies the schemas, aligns timestamps, extracts localities, and creates structured representations of the incident context.
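For orientation, here is a minimal sketch of the load-and-unify step. The column names (`published`, `description`, `publish_date`, `content`) are illustrative assumptions, not the actual schemas; the notebook's harmonisation is richer than this:

```python
import pandas as pd

press = pd.read_csv("0. Datasets/Inputs/police_press_releases.csv")
news = pd.read_csv("0. Datasets/Inputs/local_news_articles.csv")

# Map each source onto a shared (source, date, text) schema.
press_u = press.rename(columns={"published": "date", "description": "text"}).assign(source="police")
news_u = news.rename(columns={"publish_date": "date", "content": "text"}).assign(source="news")

crashes = pd.concat(
    [press_u[["source", "date", "text"]], news_u[["source", "date", "text"]]],
    ignore_index=True,
)
crashes["date"] = pd.to_datetime(crashes["date"], errors="coerce")  # align timestamps
crashes = crashes.drop_duplicates(subset=["text"])                  # drop duplicate narratives
```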
The notebook `2. Jupyter Notebooks/1_data_preparation.ipynb` proceeds in the following stages:
- Data inspection: schema/type checks, descriptive stats, and missing-data heatmaps.
- Cleaning & harmonisation: drop duplicates, standardise date formats, and tidy narrative text.
- LLM feature extraction: cached, deterministic prompts with an OpenRouter model selector, plus low-confidence reprocessing and optional crash-presence validation.
- Geographic enrichment: locality extraction, LLM-assisted locality resolution for unmatched cases, and Malta heatmap rendering.
- Weather enrichment: deterministic cache + Open-Meteo API calls keyed by timestamp/coordinates, including fallback to Malta-wide defaults (sketched after this list).
- Severity scoring (tri-mode): rule-based cues, spaCy contextual scoring, and LLM-derived severity fused into an aggregated severity score.
- Final exports: merged dataset written to `0. Datasets/Output/crash_final.csv`, with temporary artefacts saved under `2. Jupyter Notebooks/TEMP/`.
- Weather code mapping: numeric codes mapped to descriptive labels for the final dataset.
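To make the weather step concrete, here is a minimal cache-first sketch against the Open-Meteo archive endpoint. The cache filename, the key rounding, and the assumed `{"code": "label"}` layout of `1. Setup/weather_code.json` are illustrative assumptions, not the notebook's exact implementation:

```python
import json
import pathlib
import requests

CACHE_PATH = pathlib.Path("1. Setup/Cache/weather_cache.json")  # hypothetical cache file

def fetch_weather(lat: float, lon: float, date: str) -> dict:
    """Return hourly weather for one day, hitting Open-Meteo only on cache misses."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = f"{round(lat, 2)},{round(lon, 2)},{date}"  # rounding lets nearby points share an entry
    if key not in cache:
        resp = requests.get(
            "https://archive-api.open-meteo.com/v1/archive",
            params={
                "latitude": lat, "longitude": lon,
                "start_date": date, "end_date": date,
                "hourly": "weathercode,temperature_2m,precipitation",
            },
            timeout=30,
        )
        resp.raise_for_status()
        cache[key] = resp.json()["hourly"]
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]

# Semantic labels for the final dataset, assuming weather_code.json maps "61" -> "rain", etc.
code_labels = json.loads(pathlib.Path("1. Setup/weather_code.json").read_text())
hourly = fetch_weather(35.9, 14.5, "2024-03-01")  # Malta-wide default coordinates
labels = [code_labels.get(str(c), "unknown") for c in hourly["weathercode"]]
```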
All modelling notebooks use `../0. Datasets/Output/crash_final.csv` as input but apply their own preprocessing (encoding, scaling, and feature filtering) to answer specific research questions. Key tuning details are captured here for quick reference.
- SVM search space: `C` in `[1e-3, 1e2]` (log), `kernel` in `{rbf, linear, poly, sigmoid}`, `gamma` in `[1e-4, 1e-1]` (log, non-linear kernels), `degree` in `[2, 5]` (poly only).
- Objective: maximise balanced accuracy, weighted F1, and macro F1 (multi-objective).
- Best trial (weighted F1): `C=6.9595`, `kernel=rbf`, `gamma=0.01994`, with balanced accuracy `0.392`, weighted F1 `0.678`, macro F1 `0.383`.
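The search above translates almost directly into an Optuna multi-objective study. A sketch under those reported ranges; the synthetic data, CV fold count, and trial count are assumptions standing in for the notebook's actual setup:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training split.
X_train, y_train = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)

def objective(trial):
    kernel = trial.suggest_categorical("kernel", ["rbf", "linear", "poly", "sigmoid"])
    params = {"C": trial.suggest_float("C", 1e-3, 1e2, log=True), "kernel": kernel}
    if kernel != "linear":
        params["gamma"] = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)
    if kernel == "poly":
        params["degree"] = trial.suggest_int("degree", 2, 5)
    scores = cross_validate(
        SVC(**params), X_train, y_train, cv=5,
        scoring=["balanced_accuracy", "f1_weighted", "f1_macro"],
    )
    return (scores["test_balanced_accuracy"].mean(),
            scores["test_f1_weighted"].mean(),
            scores["test_f1_macro"].mean())

# Three maximisation directions: Optuna returns a Pareto front in study.best_trials,
# from which a "best trial by weighted F1" can be selected.
study = optuna.create_study(directions=["maximize"] * 3)
study.optimize(objective, n_trials=30)
```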
- Random Forest classification tuning (Optuna, stratified CV on train only; weighted F1): `n_estimators` 600–2000, `max_depth` in `{None, 8, 12, 16, 20}`, `min_samples_split` 2–40, `min_samples_leaf` 1–20, `max_features` in `{sqrt, log2, 0.3, 0.5, 0.8}`, `max_samples` in `{0.6, 0.8, 1.0}`, `bootstrap=True`, `class_weight=balanced_subsample`.
- Best params: `n_estimators=923`, `max_depth=8`, `min_samples_split=3`, `min_samples_leaf=1`, `max_features=sqrt`, `max_samples=1.0` (weighted F1 `0.6866`, macro F1 `0.3742`, OOB `0.7535`); a reconstruction is sketched after this list.
- Random Forest regression tuning (RandomizedSearchCV): best `n_estimators=600`, `max_depth=None`, `min_samples_split=2`, `min_samples_leaf=2`.
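Rebuilding the reported best classifier is a one-liner; `oob_score=True`, `random_state`, and `n_jobs` are assumptions added here for reproducibility and to surface the OOB figure quoted above:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=923, max_depth=8, min_samples_split=3, min_samples_leaf=1,
    max_features="sqrt", max_samples=1.0, bootstrap=True,
    class_weight="balanced_subsample",
    oob_score=True,   # rf.oob_score_ after fitting; the notebook reports ~0.7535
    random_state=42, n_jobs=-1,
)
# rf.fit(X_train, y_train) on the training split would approximate the persisted
# RF.pkl model (subject to the original seed and data split).
```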
- Logistic Regression tuning (Optuna, weighted F1): `C` in `[0.01, 10.0]` (log), `class_weight=balanced`, `solver=lbfgs`, `max_iter=2000`.
- Best logistic `C=1.225613` with weighted F1 `0.4602` (saved in `2. Jupyter Notebooks/Models/LogR.pkl`).
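The cyclic encoding mentioned in the LogR notebook description maps periodic features onto sine/cosine pairs so the model sees 23:00 and 00:00 as neighbours. A minimal sketch with an illustrative `hour` column (the notebook's actual feature names may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
# In (sin, cos) space, hour 23 sits next to hour 0 instead of 23 units away.
```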
- Gradient Boosting search space: `n_estimators` 50–200, `max_depth` 2–6, `learning_rate` 0.01–0.3 (log), `min_samples_split` 10–50, `min_samples_leaf` 10–30, `subsample` 0.5–1.0, `max_features` in `{sqrt, log2, None}`.
- Best CV balanced accuracy `0.4610` with `n_estimators=72`, `max_depth=4`, `learning_rate=0.13923`, `min_samples_split=36`, `min_samples_leaf=30`, `subsample=0.8417`, `max_features=None`.
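The same Optuna pattern covers the Gradient Boosting space; a sketch with an explicit, seeded TPE sampler (the sampler seed, CV folds, trial count, and synthetic data are assumptions):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training split.
X_train, y_train = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)

def gb_objective(trial):
    model = GradientBoostingClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 200),
        max_depth=trial.suggest_int("max_depth", 2, 6),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        min_samples_split=trial.suggest_int("min_samples_split", 10, 50),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 10, 30),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        max_features=trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
        random_state=42,
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="balanced_accuracy").mean()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(gb_objective, n_trials=50)
```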
All models are evaluated with the same preprocessing pipeline and the same stratified hold-out split (test size = 0.20). The notebook outputs a comprehensive metrics table (accuracy, balanced accuracy, macro/weighted F1, log loss, ROC-AUC), plus model size and prediction latency.
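A condensed sketch of that comparison loop, assuming each `.pkl` under `2. Jupyter Notebooks/Models/` deserialises to a fitted estimator exposing `predict`/`predict_proba` (the actual bundles may wrap more state), and with `X_test`/`y_test` standing for the shared hold-out split:

```python
import pathlib
import pickle
import time

import pandas as pd
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             log_loss, roc_auc_score)

rows = []
for path in sorted(pathlib.Path("2. Jupyter Notebooks/Models").glob("*.pkl")):
    model = pickle.loads(path.read_bytes())
    start = time.perf_counter()
    pred = model.predict(X_test)        # X_test, y_test: shared stratified hold-out split
    latency = time.perf_counter() - start
    proba = model.predict_proba(X_test)
    rows.append({
        "model": path.stem,
        "accuracy": accuracy_score(y_test, pred),
        "balanced_acc": balanced_accuracy_score(y_test, pred),
        "f1_macro": f1_score(y_test, pred, average="macro"),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
        "log_loss": log_loss(y_test, proba),
        "roc_auc_ovr": roc_auc_score(y_test, proba, multi_class="ovr"),
        "size_kb": path.stat().st_size / 1024,
        "latency_s": latency,
    })
print(pd.DataFrame(rows).set_index("model").round(4))
```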
Additional outputs in the notebook:
- Side-by-side ROC curves (micro-average) and metric comparison plots.
- Bootstrap significance testing (with optional McNemar) for pairwise model comparisons (sketched after this list).
- Trade-off analysis (accuracy vs interpretability, model size, and prediction latency).
- Ethical/fairness analysis covering proxy variables, subgroup performance gaps, and deployment risks.
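The pairwise bootstrap test flagged in the list above can be sketched generically as resampling the shared test set and checking how often the accuracy difference between two models crosses zero. This is the standard construction, not necessarily the notebook's exact procedure:

```python
import numpy as np

def bootstrap_compare(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """Bootstrap the accuracy gap between two models scored on the same test set."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test indices with replacement
        diffs[i] = (pred_a[idx] == y_true[idx]).mean() - (pred_b[idx] == y_true[idx]).mean()
    # Two-sided p-value: fraction of bootstrap gaps falling on the "wrong" side of zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), min(p_value, 1.0)
```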
Refer to the notebook for the current best model selection and statistical significance results.
```bash
git clone https://github.com/DavidF-22/ICS5110-AppliedML_Project.git
cd ICS5110-AppliedML_Project
```

Choose either Conda or pip:
```bash
# Conda (recommended)
conda env create -f "1. Setup/environment.yml"
conda activate AML
```

```bash
# or pip / venv
python -m venv .venv
source .venv/bin/activate
pip install -r "1. Setup/requirements.txt"
```

The `1. Setup/requirements.txt` file includes the core dependencies used across the data preparation, EDA, and modelling notebooks (scikit-learn, optuna, shap, seaborn, imbalanced-learn, ipywidgets, etc.).
If you are using the LLM-assisted steps, also set your OpenRouter key in `1. Setup/openrouter_key.env`:
```
OPENROUTER_API_KEY=your_key_here
```

Then launch Jupyter:

```bash
jupyter lab  # or jupyter notebook
```

- Open `2. Jupyter Notebooks/1_data_preparation.ipynb`, adjust any notebook parameters, and run the cells top-to-bottom. Outputs (including `crash_final.csv`) are written to `0. Datasets/Output/`.
- (Optional) Explore `2. Jupyter Notebooks/2_exploratory_data_analysis.ipynb`.
- Open and run the modelling notebooks (`3a_*` through `3d_*`); each loads `../0. Datasets/Output/crash_final.csv` by default.
If you want to enable the optional LLM-based extraction and severity scoring, ensure `OPENROUTER_API_KEY` is set in `1. Setup/openrouter_key.env` before running the data preparation notebook.
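If you need the key inside your own scripts, one option is `python-dotenv` (an assumption; the notebooks may read the env file differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("1. Setup/openrouter_key.env")
api_key = os.environ["OPENROUTER_API_KEY"]
```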
This project was developed as part of the ICS5110 course at the University of Malta.
For any inquiries or feedback, please contact Andrea Filiberto Lucas, Antonio Galdes, David Farrugia & Charlon Curmi.

