A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.
With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.
| Description | Value |
|---|---|
| Source | Microsoft Malware Prediction (Kaggle) |
| Training Set Size | 8,920,441 rows × 83 features |
| Test Set Size | 7,653,424 rows × 83 features |
| File Size | Approx. 8 GB for train.csv |
| Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
| Data Type | Tabular, mixed categorical & numerical |
| Class Imbalance | Negligible (~50:50 ratio); careful validation is still needed |
| Category | Tools/Libraries | Reason |
|---|---|---|
| Language | Python 3.11 | Versatile and widely used for ML workflows |
| Data Handling | pandas, dask, numpy | Efficient large dataset processing |
| Visualization | seaborn, matplotlib, plotly | EDA and visual storytelling |
| Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
| Evaluation Metrics | scikit-learn | Classification reports, confusion matrices |
- Used Dask to handle large CSV files without exceeding system memory.
- Loaded over 8.9 million records with 83 features in a distributed manner.
- Dropped columns with over 40% missing values.
- Removed high-cardinality columns (>500 unique values) to avoid sparse matrices.
- Label-encoded categorical columns.
- Dropped identifier columns like `MachineIdentifier`.
- Cleaned and transformed data while optimizing memory usage.
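
As a rough illustration of the preprocessing steps listed above, the sketch below applies them to a pandas DataFrame (for example, one partition at a time). The function name `preprocess` and its parameters are hypothetical, and two details are assumptions: the cardinality filter is applied only to non-numeric columns, and missing categories are encoded as `-1`.

```python
import pandas as pd


def preprocess(df, id_cols=("MachineIdentifier",), max_missing=0.40,
               max_cardinality=500):
    """Drop unusable columns and label-encode the remaining categoricals."""
    # Remove identifier columns, which carry no predictive signal.
    df = df.drop(columns=[c for c in id_cols if c in df.columns])

    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.drop(columns=missing_share[missing_share > max_missing].index)

    # Drop non-numeric columns with more distinct values than allowed
    # (assumption: numeric columns are not subject to this filter).
    cat_cols = df.select_dtypes(exclude=["number", "bool"]).columns
    high_card = [c for c in cat_cols if df[c].nunique() > max_cardinality]
    df = df.drop(columns=high_card)

    # Label-encode what is left; pd.factorize maps missing values to -1.
    for c in df.select_dtypes(exclude=["number", "bool"]).columns:
        codes, _ = pd.factorize(df[c])
        df[c] = codes.astype("int32")
    return df
```

Storing the codes as `int32` instead of Python strings is one simple way to reduce memory; downcasting numeric columns would be a natural next step.
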
- Visualized missing values using heatmaps.
- Explored target variable distribution.
- Plotted feature distributions and their relationship with malware detection.
- Analyzed cardinality of categorical variables.
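
The numbers behind these plots can be gathered in a single summary table. The following is a small sketch with a hypothetical helper, `summarize_columns`; it computes the missing share and cardinality per column, plus the class shares of the target, and leaves the plotting itself out.

```python
import pandas as pd


def summarize_columns(df, target="HasDetections"):
    """Return (per-column summary, target class shares) for a DataFrame."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),      # fraction of missing values
        "n_unique": df.nunique(dropna=True),    # cardinality, ignoring NaN
    }).sort_values("missing_share", ascending=False)

    # Relative frequency of each class in the target column.
    target_share = df[target].value_counts(normalize=True)
    return summary, target_share
```

The `missing_share` column is the quantity a missing-value heatmap visualizes, and `n_unique` is what a cardinality threshold would be checked against.
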
- Implemented a LightGBM Classifier with tuned hyperparameters:
  - `num_leaves = 64`
  - `learning_rate = 0.1`
  - `feature_fraction = 0.8`
  - `bagging_fraction = 0.8`
  - `max_depth = 8`
- Split dataset into 85% train and 15% validation.
- Evaluated using:
  - Classification Reports (Precision, Recall, F1-score)
  - AUC-ROC Curve
  - Normalized Confusion Matrices
  - Feature Importance Plot
- Processed test set in memory-efficient batches.
- Generated malware detection probability predictions.
- Saved results to `result.csv`.
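
A batch-wise prediction loop of this kind could be sketched as below. The function name `predict_in_batches`, the chunk size, and the output column names are assumptions; `model` is any fitted classifier exposing `predict_proba`, and each chunk is assumed to need the same preprocessing as the training data (omitted here).

```python
import pandas as pd


def predict_in_batches(model, test_path, feature_cols, out_path="result.csv",
                       chunksize=500_000, id_col="MachineIdentifier"):
    """Score a large CSV chunk by chunk and append the results to out_path."""
    first = True
    for chunk in pd.read_csv(test_path, chunksize=chunksize):
        # Apply the training-time preprocessing to `chunk` here (not shown).
        proba = model.predict_proba(chunk[feature_cols])[:, 1]
        result = pd.DataFrame({
            id_col: chunk[id_col].to_numpy(),
            "HasDetections": proba,  # predicted probability of a detection
        })
        # Write the header once, then append subsequent chunks.
        result.to_csv(out_path, mode="w" if first else "a",
                      header=first, index=False)
        first = False
    return out_path
```

Only one chunk is held in memory at a time, so peak usage is governed by `chunksize` rather than by the size of the test file.
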
| Metric | Validation Set Value |
|---|---|
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |
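
Metrics like those in the table can be computed with scikit-learn along the following lines. This is a sketch, not the project's evaluation code: the helper name `evaluate` and the 0.5 decision threshold are assumptions, and the F1 score here is the binary score for the positive class, which may differ from the averaging used for the reported value.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)


def evaluate(y_true, y_proba, threshold=0.5):
    """Compute accuracy, AUC, F1, and a row-normalized confusion matrix."""
    # Turn probabilities into hard labels at the chosen threshold.
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_proba),  # uses probabilities, not labels
        "f1": f1_score(y_true, y_pred),
        # normalize="true" divides each row by the number of true samples.
        "confusion": confusion_matrix(y_true, y_pred, normalize="true"),
    }
```

Note that AUC is threshold-independent, while accuracy and F1 change with `threshold`.
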
- The LightGBM model separated infected from clean machines reasonably well (validation AUC of about 0.79).
- The feature importance plot highlighted features such as `SmartScreen`, `AVProductStatesIdentifier`, and `Platform`.
- Implement cross-validation for more robust performance estimation.
- Integrate hyperparameter tuning using `Optuna` or `GridSearchCV`.
- Apply advanced missing value imputation instead of row removal.
- Try additional algorithms (XGBoost, CatBoost) for benchmarking.
- Deploy a scalable API service to accept telemetry data and predict malware probability in real-time.
- Clone the repository

      git clone https://github.com/yourusername/malware-prediction-ml.git
      cd malware-prediction-ml