Comparative machine learning study evaluating XGBoost, KNN, Logistic Regression, and Random Forest for malware detection, with XGBoost performing best at malware identification and KNN excelling at benign-file recognition to minimize false positives.
- Implements and compares four ML algorithms: XGBoost, Random Forest, Logistic Regression, and KNN
- Analyzes 19,243 malware samples with 79 distinct features per file
- Python - Primary programming language
- Jupyter Notebook - Interactive development environment
- Pandas - Data manipulation and preprocessing
- NumPy - Numerical computing and array operations
- scikit-learn - Binomial Logistic Regression, Random Forest, KNN
- XGBoost - Extreme Gradient Boosting Classifier implementation
- Matplotlib / Seaborn - Data visualization
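A quick way to confirm the stack above is available in the notebook environment (assuming the standard PyPI package names: pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn) is to print the installed versions:

```python
# Environment check for the libraries listed above; package names are the
# standard PyPI/import names and may differ from a project-specific setup.
import pandas, numpy, sklearn, xgboost, matplotlib, seaborn

for pkg in (pandas, numpy, sklearn, xgboost, matplotlib, seaborn):
    print(f"{pkg.__name__:12s} {pkg.__version__}")
```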
- Utilized a publicly available malware dataset.
- Collected 19,243 file samples (malware and benign files).
- Extracted 79 distinct features per file for comprehensive analysis.
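A minimal loading sketch along these lines, assuming the dataset is available as a CSV export; the file name `malware_dataset.csv` and the label column `legitimate` are placeholders, not the project's actual identifiers:

```python
# Load the dataset and confirm its dimensions; file and column names below
# are illustrative placeholders, not the repository's actual identifiers.
import pandas as pd

df = pd.read_csv("malware_dataset.csv")

print(df.shape)                          # roughly (19243, 79) plus any label/ID columns
print(df["legitimate"].value_counts())   # malware vs. benign class balance
```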
- Cleaned and normalized the extracted file features using Pandas.
- Handled missing values and outliers.
- Applied feature engineering and data splitting.
- Implemented cross-validation techniques.
- Trained four different machine learning algorithms (XGBoost, Random Forest, Binomial Logistic Regression, KNN).
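The sketch below illustrates this workflow end to end: imputation and scaling, a stratified train/test split, 5-fold cross-validation, and the four classifiers. Column names, the fold count, and hyperparameters are illustrative assumptions rather than the project's exact configuration.

```python
# Training workflow sketch: preprocessing, train/test split, cross-validation,
# and the four classifiers compared in this study. File name, label column,
# and hyperparameters are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

df = pd.read_csv("malware_dataset.csv")   # placeholder file name
X = df.drop(columns=["legitimate"])       # placeholder label column
y = df["legitimate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        SimpleImputer(), StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(SimpleImputer(), StandardScaler(), KNeighborsClassifier()),
    "Random Forest": make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42)),
    "XGBoost": make_pipeline(SimpleImputer(), XGBClassifier(eval_metric="logloss")),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    model.fit(X_train, y_train)
    print(f"{name}: mean CV F1 = {scores.mean():.3f}")
```

Scaling is applied only to the distance- and coefficient-based models (KNN, Logistic Regression) inside their pipelines, while tree-based models are left unscaled; this is a common convention and an assumption about the original setup.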
- Generates a binary classification (Malware vs. Benign) for each file sample.
- Computed multiple metrics: Precision, Recall, F1 Score, and Accuracy.
- Confusion matrices: Visual representation of true vs predicted classifications.
- Cross-algorithm comparison for optimal model selection.
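Continuing from the training sketch above (it reuses the hypothetical `models`, `X_test`, and `y_test` objects), the following snippet shows one way to compute these metrics and plot a confusion matrix for a fitted model:

```python
# Evaluation sketch: per-model Precision, Recall, F1, and Accuracy, plus a
# confusion matrix. Assumes `models`, `X_test`, and `y_test` from the training
# sketch above, with labels encoded as 0 = benign and 1 = malware (assumption).
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay,
)

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(
        f"{name:20s} "
        f"acc={accuracy_score(y_test, y_pred):.3f} "
        f"prec={precision_score(y_test, y_pred):.3f} "
        f"rec={recall_score(y_test, y_pred):.3f} "
        f"f1={f1_score(y_test, y_pred):.3f}"
    )

# Confusion matrix for one model (display label order follows the assumed encoding).
ConfusionMatrixDisplay.from_predictions(
    y_test, models["XGBoost"].predict(X_test), display_labels=["Benign", "Malware"]
)
plt.show()
```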
The XGBoost Classifier achieved the best overall performance: