💻 Comparative machine learning study evaluating XGBoost, KNN, Logistic Regression, and Random Forest for malware detection, with XGBoost strongest at identifying malware and KNN excelling at benign file recognition to minimize false positives.

KayteKatelyn/Detecting-Malware

💻 Detecting-Malware

Key Features

  • Implements and compares four ML algorithms: XGBoost, Random Forest, Logistic Regression and KNN
  • Analyzes 19,243 file samples (malware and benign) with 79 distinct features per file

Tech Stack

  • Python - Primary programming language
  • Jupyter Notebook
  • Pandas - Data manipulation and preprocessing
  • NumPy - Numerical computing and array operations
  • scikit-learn - Binomial Logistic Regression, Random Forest, KNN
  • XGBoost - Extreme Gradient Boosting Classifier implementation
  • Matplotlib / Seaborn - Data visualization

How It Works

Data Collection

  • Used a publicly available malware dataset.
  • Collected 19,243 file samples (malware and benign files)
  • Extracted 79 distinct features per file for comprehensive analysis

Data Preprocessing

  • Cleaned and normalized file features using Pandas.
  • Handled missing values and outliers.
  • Applied feature engineering and split the data into training and test sets.
  • Implemented cross-validation techniques.
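
The preprocessing steps above could look roughly like the following. This is a minimal sketch, not the repository's actual notebook code: the data is synthetic (the real dataset is not included), and the column names, the `label` column, the median imputation, the 80/20 split, and the 5-fold setup are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset, which is not shipped with the repository.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 79
df = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=[f"feature_{i}" for i in range(n_features)],  # hypothetical names
)
df["label"] = rng.integers(0, 2, size=n_samples)  # assumed: 1 = malware, 0 = benign
df.iloc[::17, 3] = np.nan  # inject a few missing values to exercise imputation

X = df.drop(columns="label")
y = df["label"]

# Hold out a stratified test set so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute and scale in one pipeline so statistics are learned from training data only.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

# Stratified folds for cross-validation on the training portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

Fitting the imputer and scaler on the training split alone avoids leaking test-set statistics into the model, which matters when comparing several algorithms on the same hold-out data.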

Model Comparison and Training

  • Trained four different machine learning algorithms (XGBoost, Random Forest, Binomial Logistic Regression, KNN).
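
As a rough illustration of training the four algorithms side by side, the sketch below scores each with cross-validation on synthetic data. The hyperparameters, the F1 scoring choice, and the `make_classification` data are assumptions, not the study's settings. XGBoost is imported optionally so the example still runs where the package is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier
except ImportError:  # xgboost is an optional dependency for this sketch
    XGBClassifier = None

# Synthetic binary data with 79 features, standing in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=79, n_informative=10, random_state=0
)

# Distance- and gradient-based models get scaling; tree ensembles do not need it.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
if XGBClassifier is not None:
    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)

# Mean 5-fold F1 score per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean F1 = {score:.3f}")
```

Scores on synthetic data say nothing about the study's results; the point is only the shape of a like-for-like comparison loop.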

Prediction & Classification

  • Generated a binary classification (malware vs. benign) for each file sample.

Performance Evaluation and Visualization

  • Computed multiple metrics: Precision, Recall, F1 Score, and Accuracy.
  • Confusion matrices: Visual representation of true vs predicted classifications.
  • Cross-algorithm comparison for optimal model selection.
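
To make the metrics above concrete, here is a small worked sketch that derives them from confusion-matrix counts in plain Python. The function name `binary_metrics`, the label encoding (1 = malware, 0 = benign), and the toy labels are assumptions for illustration. Specificity is added because it measures how well benign files are recognized, which is what limits false positives.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "specificity": specificity, "f1": f1, "accuracy": accuracy,
    }


# Toy example: 1 = malware, 0 = benign (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so precision, recall, specificity, F1, and accuracy all come out to 0.75.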

Key Findings

The XGBoost classifier achieved the best overall performance.

Note

Data Access: Due to licensing and privacy considerations, datasets are not included in this repository.
