Detecting Anomalous Activity in Ship Engines Using Machine Learning
This project develops a systematic approach to anomaly detection for ship engine performance monitoring. In the shipping industry, abnormal behavior in engine parameters can lead to increased fuel consumption, safety risks, and operational downtime. Detecting such anomalies early enables proactive preventative maintenance, reducing both costs and risks.
This analysis applies unsupervised machine learning techniques to identify early signs of potential engine malfunctions.
This analysis uses six critical engine performance metrics from ship operations:
- Engine RPM: Rotational speed variations indicating potential performance issues.
- Lubrication Oil Pressure: Abnormal values may indicate lubrication deficiencies or blockages.
- Fuel Pressure: Variations can imply issues related to combustion efficiency and fuel delivery.
- Coolant Pressure: Deviations may point to leaks or cooling system faults.
- Lubrication Oil Temperature: Abnormal temperatures affect the oil’s lubricating efficacy.
- Coolant Temperature: Elevated or reduced temperatures can indicate cooling system failures.
Data Source: Devabrat, M., 2022. Predictive Maintenance on Ship's Main Engine using AI. https://dx.doi.org/10.21227/g3za-v415.
Dataset Characteristics: Anomalies constitute approximately 1-5% of data points, presenting a realistic class-imbalanced scenario typical of real-world anomaly detection challenges.
Assessing data quality, distribution, missing values and anomalies in the input features.
EDA revealed that the feature distributions were generally non-normal, with many extreme outliers and possible underlying structure in the dataset. Methods that assume a normal distribution (such as standard-deviation or Z-score thresholds) were therefore NOT suitable for this analysis, whereas the interquartile range (IQR) could be considered as a distribution-free alternative.
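A minimal sketch of the kind of distribution check used at this stage (the file path and the choice of normality test are illustrative assumptions, not taken from the notebook):

```python
import pandas as pd
from scipy import stats

# Illustrative sketch only: the file path is an assumption based on the
# repository layout described below.
df = pd.read_csv("data/ship_engine_data.csv")

for col in df.select_dtypes("number").columns:
    skew = df[col].skew()
    # D'Agostino-Pearson test: a small p-value suggests non-normality
    _, p_value = stats.normaltest(df[col].dropna())
    print(f"{col}: skew = {skew:.2f}, normaltest p = {p_value:.3g}")
```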
Implementing statistical methods for anomaly detection using Interquartile Range
Statistical analysis with the IQR gave a plausible percentage of anomalies within the dataset (2.16%). However, further investigation revealed potential relationships between features that influence which data points are anomalous. This suggested that IQR may not be the most appropriate method, as it is a univariate technique generally unsuited to anomaly detection in non-linear, multivariate data. Unsupervised machine learning was therefore considered as an alternative capable of capturing multidimensional relationships.
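A minimal sketch of per-feature IQR flagging with 1.5 × IQR fences; the multiplier and the rule for combining per-feature flags into a per-row label are assumptions and may differ from the notebook:

```python
import pandas as pd

# Sketch of IQR-based flagging with 1.5 * IQR fences (multiplier assumed).
df = pd.read_csv("data/ship_engine_data.csv")
numeric = df.select_dtypes("number")

q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
out_of_range = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

# Here a row is flagged if any single feature falls outside its fences;
# the notebook may combine per-feature flags differently.
flagged = out_of_range.any(axis=1)
print(f"IQR-flagged rows: {flagged.mean():.2%}")
```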
Implementing unsupervised machine learning techniques (One-Class SVM, Isolation Forest) to detect anomalous behavior.
- One-Class SVM: Assessed first as a highly tunable method for detecting outliers. The model was optimised to minimise the chance of missed anomalies (false negatives). However, it is a distance-based method and therefore required scaling of the dataset; if done improperly, scaling can mask outliers by inappropriately compressing the data range.
- Isolation Forest: A second method, which does not require scaling, was applied and its results compared. Isolation Forest is a tree-based unsupervised ML method effective at identifying anomalies that are only detectable in a multidimensional, multi-feature context. A minimal sketch of both models is shown after this list.
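The sketch below shows how the two models might be fitted with scikit-learn; the hyperparameters (`nu`, `contamination`) and the choice of `RobustScaler` are illustrative assumptions, not the tuned values from the notebook:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

df = pd.read_csv("data/ship_engine_data.csv")
X = df.select_dtypes("number")

# One-Class SVM is distance-based, so scale first; RobustScaler limits
# the influence of the extreme outliers seen in the EDA.
X_scaled = RobustScaler().fit_transform(X)
svm_labels = OneClassSVM(kernel="rbf", nu=0.05).fit_predict(X_scaled)

# Isolation Forest works on the raw feature values (no scaling needed).
iso_labels = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)

# Both models return -1 for anomalies and 1 for normal points.
print("SVM anomalies:", (svm_labels == -1).sum())
print("Isolation Forest anomalies:", (iso_labels == -1).sum())
```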
Assessing and comparing model performance.
Isolation Forest corroborated some results of One-Class SVM, i.e. the potential protective effects of certain features against anomalies, but it failed to identify an obvious visual outlier that the SVM detected. This suggested that, while multidimensional effects exist in this dataset, individual feature effects remain important. One-Class SVM was therefore proposed as the most suitable anomaly detection model for this dataset.
The two methods differed considerably in which samples they flagged as anomalous: points identified by both comprised only 53.7% of Isolation Forest anomalies and 53.8% of One-Class SVM anomalies, indicating that each method flagged a substantially different set of observations.
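The overlap between the two anomaly sets can be quantified along these lines (a sketch, assuming the -1/1 label convention returned by both scikit-learn models in the sketch above):

```python
import numpy as np

def anomaly_overlap(svm_labels, iso_labels):
    """Compare two scikit-learn label arrays where -1 marks an anomaly."""
    svm_anoms = set(np.where(np.asarray(svm_labels) == -1)[0])
    iso_anoms = set(np.where(np.asarray(iso_labels) == -1)[0])
    shared = svm_anoms & iso_anoms
    return {
        "shared": len(shared),
        "share_of_svm_anomalies": len(shared) / len(svm_anoms),
        "share_of_if_anomalies": len(shared) / len(iso_anoms),
    }

# Example, using the labels from the model sketch above:
# print(anomaly_overlap(svm_labels, iso_labels))
```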
- Feature Interactions: Evidence of protective effects when certain parameters remain within optimal ranges
- Method Complementarity: Different ML approaches identify distinct anomaly patterns
- Multidimensional Effects: Anomalies are often only detectable when considering multiple features simultaneously
- Individual Feature Importance: Single-feature effects remain significant despite multidimensional relationships
- Investigate nature and relative importance of shared vs. non-shared anomalies identified across different ML methods
- Analyse feature relationships and protective effects in optimal parameter ranges
- Develop ensemble approaches combining multiple detection methods
- Validate findings with domain expert knowledge and operational data
anomaly_detection_svm_if/
├── data/
│ └── ship_engine_data.csv # Raw engine performance dataset
├── notebooks/
│ └── anomaly_detection_with_machine_learning.ipynb # Notebook with full analysis & workflow
└── docs/
└── anomaly_detection_report.pdf # Detailed technical report
- Clone the repository
- Install required dependencies (see notebook for package requirements)
- Run the Jupyter notebook for step-by-step analysis
- Refer to the technical report for detailed methodology and results
- Python 3
- Scikit-learn (PCA, One-Class SVM, Isolation Forest)
- Pandas, NumPy, SciPy (Data manipulation)
- Matplotlib, Seaborn, NetworkX (Visualisation)
- Jupyter Notebook / Google Colab
This project was completed as part of the Data Science Career Accelerator at the University of Cambridge (2024).
This project demonstrates practical application of unsupervised machine learning for industrial anomaly detection, with potential applications across maritime and other heavy industry sectors.