This repository contains the full workflow for the ICS5110 Applied Machine Learning group project focused on Malta traffic accidents. The work combines unstructured police press releases and local news articles, engineers a rich set of geographic, temporal, weather, text-based, and severity-related features, and exports curated tabular datasets ready for downstream modelling and visualisation.
- `0. Datasets/Inputs/` – raw accident narratives sourced from police press releases and local news articles.
- `0. Datasets/Output/` – curated crash dataset exports (e.g., `crash_final.csv`).
- `1. Setup/` – environment definitions (`environment.yml`, `requirements.txt`), API key config (`openrouter_key.env`), and caches.
- `1. Setup/Cache/` – persistent caches used during data preparation (e.g., weather lookups).
- `1. Setup/Localities/` – locality metadata used for geocoding Maltese / Gozitan locations.
- `1. Setup/Localities/0_locality_viewer.ipynb` – optional notebook for exploring locality metadata.
- `1. Setup/weather_code.json` – weather code mapping used for semantic labelling.
- `2. Jupyter Notebooks/1_data_preparation.ipynb` – main notebook that cleans data, performs feature engineering, and exports the curated datasets.
- `2. Jupyter Notebooks/TEMP/` – intermediate artefacts from data prep (LLM extraction outputs, severity fusion exports, heatmaps, etc.). The notebooks create and update files here.
- `2. Jupyter Notebooks/Models/` – persisted model bundles (`SVM.pkl`, `RF.pkl`, `LogR.pkl`, `GB.pkl`).
- `2. Jupyter Notebooks/4_results_comparison.ipynb` – single-pipeline comparison of all models with metrics, ROC curves, significance tests, and ethical analysis.
- `3. Documentation/Notebook PDFs/` – rendered PDF exports of the notebooks.
- `3. Documentation/` – project documentation and submission artefacts (e.g., plagiarism form).
- `githubAssets/` – README images and figures.
- `2. Jupyter Notebooks/1_data_preparation.ipynb` – cleans raw narratives, runs LLM-assisted extraction (with caching + validation), resolves localities, enriches weather, and fuses severity signals; writes `0. Datasets/Output/crash_final.csv` plus temporary artefacts under `2. Jupyter Notebooks/TEMP/` (e.g., `extracted_*_features.csv`, `data_featured.csv`, `severity_fused_scores.csv`, `malta_heatmap.html`).
- `2. Jupyter Notebooks/2_exploratory_data_analysis.ipynb` – EDA on `crash_final.csv` covering severity distributions, temporal patterns, weather effects, spatial risk, PCA, and research-question-aligned diagnostics.
- Modelling notebooks (all expect `../0. Datasets/Output/crash_final.csv`):
  - `3a_svm_DavidFarrugia.ipynb` – SVM classifiers with Optuna tuning, class imbalance handling, and model persistence to `2. Jupyter Notebooks/Models/SVM.pkl`.
  - `3b_rf_AndreaFilibertoLucas.ipynb` – Random Forest classification + regression, temporal feature engineering, Optuna tuning, optional SHAP, and model persistence to `2. Jupyter Notebooks/Models/RF.pkl`.
  - `3c_logr_CharlonCurmi.ipynb` – Logistic Regression with cyclic encoding, interaction terms, Optuna tuning, SHAP analysis, spatial hotspot mapping, and model persistence to `2. Jupyter Notebooks/Models/LogR.pkl`.
  - `3d_gb_AntonioGaldes.ipynb` – Gradient Boosting with imputation/encoding, Optuna TPE tuning, and model persistence to `2. Jupyter Notebooks/Models/GB.pkl`.
- `2. Jupyter Notebooks/4_results_comparison.ipynb` – unified preprocessing + hold-out test evaluation for SVM, RF, LogR, and GB with metrics tables, significance testing, trade-off analysis, and ethical/fairness review.
Run the preparation notebook first so the modelling notebooks can load `crash_final.csv`.
*Figure: severity class counts and aggregated severity score distributions from the EDA notebook.*
| File | Description |
|---|---|
| `0. Datasets/Inputs/police_press_releases.csv` | 111 police-issued traffic press releases with publication/modified dates and free-text descriptions. |
| `0. Datasets/Inputs/local_news_articles.csv` | 321 local news articles with metadata (URL, outlet, author, publish date, tags, etc.) and corresponding narrative content. |
Both sources describe traffic collisions across Malta and Gozo. The `1_data_preparation.ipynb` notebook orchestrates the pipeline, unifies the schemas, aligns timestamps, extracts localities, and creates structured representations of the incident context.
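For orientation, here is a minimal sketch of the load-and-unify step. The column names (`published`, `description`, `publish_date`, `content`) are illustrative assumptions, not the actual schemas; the notebook's harmonisation is richer than this:

```python
import pandas as pd

press = pd.read_csv("0. Datasets/Inputs/police_press_releases.csv")
news = pd.read_csv("0. Datasets/Inputs/local_news_articles.csv")

# Map each source onto a shared (source, date, text) schema.
press_u = press.rename(columns={"published": "date", "description": "text"}).assign(source="police")
news_u = news.rename(columns={"publish_date": "date", "content": "text"}).assign(source="news")

crashes = pd.concat(
    [press_u[["source", "date", "text"]], news_u[["source", "date", "text"]]],
    ignore_index=True,
)
crashes["date"] = pd.to_datetime(crashes["date"], errors="coerce")  # align timestamps
crashes = crashes.drop_duplicates(subset=["text"])                  # drop duplicate narratives
```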
The notebook `2. Jupyter Notebooks/1_data_preparation.ipynb` proceeds in the following stages:
- Data inspection: schema/type checks, descriptive stats, and missing-data heatmaps.
- Cleaning & harmonisation: drop duplicates, standardise date formats, and tidy narrative text.
- LLM feature extraction: cached, deterministic prompts with an OpenRouter model selector, plus low-confidence reprocessing and optional crash-presence validation.
- Geographic enrichment: locality extraction, LLM-assisted locality resolution for unmatched cases, and Malta heatmap rendering.
- Weather enrichment: deterministic cache + Open-Meteo API calls keyed by timestamp/coordinates, including fallback to Malta-wide defaults (sketched after this list).
- Severity scoring (tri-mode): rule-based cues, spaCy contextual scoring, and LLM-derived severity fused into an aggregated severity score.
- Final exports: merged dataset written to `0. Datasets/Output/crash_final.csv`, with temporary artefacts saved under `2. Jupyter Notebooks/TEMP/`.
- Weather code mapping: numeric codes mapped to descriptive labels for the final dataset.
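To make the weather step concrete, here is a minimal cache-first sketch against the Open-Meteo archive endpoint. The cache filename, the key rounding, and the assumed `{"code": "label"}` layout of `1. Setup/weather_code.json` are illustrative assumptions, not the notebook's exact implementation:

```python
import json
import pathlib
import requests

CACHE_PATH = pathlib.Path("1. Setup/Cache/weather_cache.json")  # hypothetical cache file

def fetch_weather(lat: float, lon: float, date: str) -> dict:
    """Return hourly weather for one day, hitting Open-Meteo only on cache misses."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = f"{round(lat, 2)},{round(lon, 2)},{date}"  # rounding lets nearby points share an entry
    if key not in cache:
        resp = requests.get(
            "https://archive-api.open-meteo.com/v1/archive",
            params={
                "latitude": lat, "longitude": lon,
                "start_date": date, "end_date": date,
                "hourly": "weathercode,temperature_2m,precipitation",
            },
            timeout=30,
        )
        resp.raise_for_status()
        cache[key] = resp.json()["hourly"]
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]

# Semantic labels for the final dataset, assuming weather_code.json maps "61" -> "rain", etc.
code_labels = json.loads(pathlib.Path("1. Setup/weather_code.json").read_text())
hourly = fetch_weather(35.9, 14.5, "2024-03-01")  # Malta-wide default coordinates
labels = [code_labels.get(str(c), "unknown") for c in hourly["weathercode"]]
```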
All modelling notebooks use `../0. Datasets/Output/crash_final.csv` as input but apply their own preprocessing (encoding, scaling, and feature filtering) to answer specific research questions. Key tuning details are captured here for quick reference.
- SVM search space: `C` in `[1e-3, 1e2]` (log), `kernel` in `{rbf, linear, poly, sigmoid}`, `gamma` in `[1e-4, 1e-1]` (log, non-linear kernels), `degree` in `[2, 5]` (poly only).
- Objective: maximise balanced accuracy, weighted F1, and macro F1 (multi-objective).
- Best trial (weighted F1): `C=6.9595`, `kernel=rbf`, `gamma=0.01994`, with balanced accuracy `0.392`, weighted F1 `0.678`, macro F1 `0.383`.
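The search above translates almost directly into an Optuna multi-objective study. A sketch under those reported ranges; the synthetic data, CV fold count, and trial count are assumptions standing in for the notebook's actual setup:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training split.
X_train, y_train = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)

def objective(trial):
    kernel = trial.suggest_categorical("kernel", ["rbf", "linear", "poly", "sigmoid"])
    params = {"C": trial.suggest_float("C", 1e-3, 1e2, log=True), "kernel": kernel}
    if kernel != "linear":
        params["gamma"] = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)
    if kernel == "poly":
        params["degree"] = trial.suggest_int("degree", 2, 5)
    scores = cross_validate(
        SVC(**params), X_train, y_train, cv=5,
        scoring=["balanced_accuracy", "f1_weighted", "f1_macro"],
    )
    return (scores["test_balanced_accuracy"].mean(),
            scores["test_f1_weighted"].mean(),
            scores["test_f1_macro"].mean())

# Three maximisation directions: Optuna returns a Pareto front in study.best_trials,
# from which a "best trial by weighted F1" can be selected.
study = optuna.create_study(directions=["maximize"] * 3)
study.optimize(objective, n_trials=30)
```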
- Random Forest classification tuning (Optuna, stratified CV on train only; weighted F1): `n_estimators` 600–2000, `max_depth` in `{None, 8, 12, 16, 20}`, `min_samples_split` 2–40, `min_samples_leaf` 1–20, `max_features` in `{sqrt, log2, 0.3, 0.5, 0.8}`, `max_samples` in `{0.6, 0.8, 1.0}`, `bootstrap=True`, `class_weight=balanced_subsample`.
- Best params: `n_estimators=923`, `max_depth=8`, `min_samples_split=3`, `min_samples_leaf=1`, `max_features=sqrt`, `max_samples=1.0` (weighted F1 `0.6866`, macro F1 `0.3742`, OOB `0.7535`); a reconstruction is sketched after this list.
- Random Forest regression tuning (RandomizedSearchCV): best `n_estimators=600`, `max_depth=None`, `min_samples_split=2`, `min_samples_leaf=2`.
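Rebuilding the reported best classifier is a one-liner; `oob_score=True`, `random_state`, and `n_jobs` are assumptions added here for reproducibility and to surface the OOB figure quoted above:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=923, max_depth=8, min_samples_split=3, min_samples_leaf=1,
    max_features="sqrt", max_samples=1.0, bootstrap=True,
    class_weight="balanced_subsample",
    oob_score=True,   # rf.oob_score_ after fitting; the notebook reports ~0.7535
    random_state=42, n_jobs=-1,
)
# rf.fit(X_train, y_train) on the training split would approximate the persisted
# RF.pkl model (subject to the original seed and data split).
```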
- Logistic Regression tuning (Optuna, weighted F1): `C` in `[0.01, 10.0]` (log), `class_weight=balanced`, `solver=lbfgs`, `max_iter=2000`.
- Best logistic `C=1.225613` with weighted F1 `0.4602` (saved in `2. Jupyter Notebooks/Models/LogR.pkl`).
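The cyclic encoding mentioned in the LogR notebook description maps periodic features onto sine/cosine pairs so the model sees 23:00 and 00:00 as neighbours. A minimal sketch with an illustrative `hour` column (the notebook's actual feature names may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
# In (sin, cos) space, hour 23 sits next to hour 0 instead of 23 units away.
```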
- Gradient Boosting search space: `n_estimators` 50–200, `max_depth` 2–6, `learning_rate` 0.01–0.3 (log), `min_samples_split` 10–50, `min_samples_leaf` 10–30, `subsample` 0.5–1.0, `max_features` in `{sqrt, log2, None}`.
- Best CV balanced accuracy `0.4610` with `n_estimators=72`, `max_depth=4`, `learning_rate=0.13923`, `min_samples_split=36`, `min_samples_leaf=30`, `subsample=0.8417`, `max_features=None`.
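The same Optuna pattern covers the Gradient Boosting space; a sketch with an explicit, seeded TPE sampler (the sampler seed, CV folds, trial count, and synthetic data are assumptions):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training split.
X_train, y_train = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)

def gb_objective(trial):
    model = GradientBoostingClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 200),
        max_depth=trial.suggest_int("max_depth", 2, 6),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        min_samples_split=trial.suggest_int("min_samples_split", 10, 50),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 10, 30),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        max_features=trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
        random_state=42,
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="balanced_accuracy").mean()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(gb_objective, n_trials=50)
```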
All models are evaluated with the same preprocessing pipeline and the same stratified hold-out split (test size = 0.20). The notebook outputs a comprehensive metrics table (accuracy, balanced accuracy, macro/weighted F1, log loss, ROC-AUC), plus model size and prediction latency.
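A condensed sketch of that comparison loop, assuming each `.pkl` under `2. Jupyter Notebooks/Models/` deserialises to a fitted estimator exposing `predict`/`predict_proba` (the actual bundles may wrap more state), and with `X_test`/`y_test` standing for the shared hold-out split:

```python
import pathlib
import pickle
import time

import pandas as pd
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             log_loss, roc_auc_score)

rows = []
for path in sorted(pathlib.Path("2. Jupyter Notebooks/Models").glob("*.pkl")):
    model = pickle.loads(path.read_bytes())
    start = time.perf_counter()
    pred = model.predict(X_test)        # X_test, y_test: shared stratified hold-out split
    latency = time.perf_counter() - start
    proba = model.predict_proba(X_test)
    rows.append({
        "model": path.stem,
        "accuracy": accuracy_score(y_test, pred),
        "balanced_acc": balanced_accuracy_score(y_test, pred),
        "f1_macro": f1_score(y_test, pred, average="macro"),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
        "log_loss": log_loss(y_test, proba),
        "roc_auc_ovr": roc_auc_score(y_test, proba, multi_class="ovr"),
        "size_kb": path.stat().st_size / 1024,
        "latency_s": latency,
    })
print(pd.DataFrame(rows).set_index("model").round(4))
```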
Additional outputs in the notebook:
- Side-by-side ROC curves (micro-average) and metric comparison plots.
- Bootstrap significance testing (with optional McNemar) for pairwise model comparisons (sketched after this list).
- Trade-off analysis (accuracy vs interpretability, model size, and prediction latency).
- Ethical/fairness analysis covering proxy variables, subgroup performance gaps, and deployment risks.
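The pairwise bootstrap test flagged in the list above can be sketched generically as resampling the shared test set and checking how often the accuracy difference between two models crosses zero. This is the standard construction, not necessarily the notebook's exact procedure:

```python
import numpy as np

def bootstrap_compare(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """Bootstrap the accuracy gap between two models scored on the same test set."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test indices with replacement
        diffs[i] = (pred_a[idx] == y_true[idx]).mean() - (pred_b[idx] == y_true[idx]).mean()
    # Two-sided p-value: fraction of bootstrap gaps falling on the "wrong" side of zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), min(p_value, 1.0)
```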
Refer to the notebook for the current best model selection and statistical significance results.
```bash
git clone https://github.com/DavidF-22/ICS5110-AppliedML_Project.git
cd ICS5110-AppliedML_Project
```

Choose either Conda or pip:
```bash
# Conda (recommended)
conda env create -f "1. Setup/environment.yml"
conda activate AML
```

```bash
# or pip / venv
python -m venv .venv
source .venv/bin/activate
pip install -r "1. Setup/requirements.txt"
```

The `1. Setup/requirements.txt` file includes the core dependencies used across the data preparation, EDA, and modelling notebooks (scikit-learn, optuna, shap, seaborn, imbalanced-learn, ipywidgets, etc.).
If you are using the LLM-assisted steps, also set your OpenRouter key in `1. Setup/openrouter_key.env`:
```
OPENROUTER_API_KEY=your_key_here
```

Then launch Jupyter:

```bash
jupyter lab  # or jupyter notebook
```

- Open `2. Jupyter Notebooks/1_data_preparation.ipynb`, adjust any notebook parameters, and run the cells top-to-bottom. Outputs (including `crash_final.csv`) are written to `0. Datasets/Output/`.
- (Optional) Explore `2. Jupyter Notebooks/2_exploratory_data_analysis.ipynb`.
- Open and run the modelling notebooks (`3a_*` through `3d_*`); each loads `../0. Datasets/Output/crash_final.csv` by default.
If you want to enable the optional LLM-based extraction and severity scoring, ensure `OPENROUTER_API_KEY` is set in `1. Setup/openrouter_key.env` before running the data preparation notebook.
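If you need the key inside your own scripts, one option is `python-dotenv` (an assumption; the notebooks may read the env file differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("1. Setup/openrouter_key.env")
api_key = os.environ["OPENROUTER_API_KEY"]
```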
This project was developed as part of the ICS5110 course at the University of Malta.
For any inquiries or feedback, please contact Andrea Filiberto Lucas, Antonio Galdes, David Farrugia & Charlon Curmi.

