Predict weather precipitation (Rain vs. Snow) using Machine Learning classifiers: Random Forest, KNN, Naive Bayes, and Logistic Regression. Includes comprehensive evaluation using ROC-AUC and Learning Curves.
The objective of this project is to perform Weather Classification to predict whether a specific weather condition will result in Rain or Snow.
This analysis compares the performance of four popular Supervised Learning algorithms:
- K-Nearest Neighbors (KNN)
- Random Forest Classifier
- Naive Bayes (GaussianNB)
- Logistic Regression
The goal is to identify the model with the highest Accuracy and AUC score while keeping the False Positive Rate low, using ROC curve evaluation.
The dataset contains historical weather data including atmospheric physical features.
- Target Variable: `Precip Type` (Binary: Rain / Snow)
- Features: Temperature, Humidity, Wind Speed, Pressure, Visibility, etc.
- Source: Kaggle Dataset
- Cleaning: Removed rows with missing values in the target variable.
- Feature Selection: Selected relevant numerical features (Temperature, Humidity, Pressure, Wind Speed).
- Splitting: Split the data into 80% Training and 20% Testing sets.
- Scaling: Applied `StandardScaler` to normalize features, ensuring optimal performance for distance-based algorithms like KNN.
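The preprocessing steps above can be sketched as follows. The column names and values here are hypothetical placeholders standing in for the actual Kaggle data, not the notebook's exact code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical sample standing in for the Kaggle weather dataset
df = pd.DataFrame({
    "Temperature (C)":      [10.0, -3.0, 25.0, -1.0, 15.0, 0.5, 20.0, -5.0],
    "Humidity":             [0.60, 0.90, 0.40, 0.85, 0.50, 0.95, 0.45, 0.90],
    "Pressure (millibars)": [1012, 1020, 1008, 1018, 1010, 1022, 1009, 1025],
    "Wind Speed (km/h)":    [10.0, 5.0, 12.0, 4.0, 9.0, 3.0, 11.0, 6.0],
    "Precip Type":          ["rain", "snow", "rain", "snow", "rain", "snow", "rain", None],
})

# Cleaning: drop rows where the target is missing
df = df.dropna(subset=["Precip Type"])

# Feature selection and binary target encoding (snow = 1, rain = 0)
X = df.drop(columns=["Precip Type"])
y = (df["Precip Type"] == "snow").astype(int)

# 80% / 20% train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the transformation, which matters for distance-based models like KNN.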
- KNN (K-Nearest Neighbors): An instance-based algorithm that classifies data based on the majority class of its nearest neighbors.
- Random Forest: An ensemble method that utilizes multiple decision trees to achieve high accuracy and prevent overfitting.
- Naive Bayes (GaussianNB): A probabilistic algorithm suitable for normally distributed numerical data (e.g., temperature).
- Logistic Regression: A linear baseline model used for binary classification tasks.
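The four classifiers above can be instantiated in one place for comparison. The hyperparameters shown are illustrative defaults, not necessarily the notebook's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Illustrative default hyperparameters
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Smoke test on synthetic data standing in for the scaled weather features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
for name, model in models.items():
    model.fit(X_demo, y_demo)
```

A dict of named models makes it easy to loop over all four when computing Accuracy and AUC.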
Model evaluation was conducted using Accuracy and ROC-AUC Score. Below is the performance summary:
| Model | Accuracy | AUC Score | Performance Analysis |
|---|---|---|---|
| Random Forest | 1.00 | 1.00 | Best model. Extremely robust in capturing non-linear relationships between features. |
| KNN | 0.98 | 0.984 | Performed reasonably well but is computationally expensive on large datasets. |
| Naive Bayes | 0.94 | 0.987 | Fast and efficient; serves as a strong baseline model. |
| Logistic Regression | 0.99 | 1.00 | Provided solid results for simple linear relationships. |
Note: Random Forest typically outperforms other models in this dataset due to its ability to handle complex interactions (e.g., Low Temperature + High Humidity = Snow).
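A comparison loop like the sketch below produces the Accuracy and ROC-AUC metrics in the table. It uses synthetic stand-in data and only two of the four models for brevity; the project's notebook applies the same pattern to the real weather features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled weather features
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, y_prob),
    }
```

Note that AUC is computed from predicted probabilities (`predict_proba`), not hard class labels, so it reflects how well the model ranks snow above rain across all thresholds.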
The chart above illustrates the ROC Curve comparison. The closer the curve is to the top-left corner (AUC close to 1.0), the better the model distinguishes between Rain and Snow.
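A ROC curve like the one in the chart can be reproduced with `sklearn.metrics.roc_curve`. This sketch uses synthetic stand-in data and a single model; the notebook overlays all four:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in for the weather features
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_curve.png")
```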
The Learning Curve is used to diagnose whether the model is suffering from Overfitting or Underfitting as the training size increases.
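The overfitting/underfitting diagnosis can be sketched with `sklearn.model_selection.learning_curve`, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the weather features
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

# Cross-validated scores at 5 increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="accuracy",
)

# A persistent large gap between the two curves indicates overfitting;
# low scores on both curves indicate underfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```

Plotting `train_sizes` against the two mean-score curves gives the learning-curve chart: if the gap narrows as training size grows, collecting more data helps; if both curves plateau low, a more expressive model is needed.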
- Clone this repository:

```shell
git clone https://github.com/nicolausprima/weather-classification.git
```

- Install the required libraries:

```shell
pip install pandas numpy matplotlib seaborn scikit-learn
```

- Run the notebook:

```shell
jupyter notebook Classification_Weather.ipynb
```
This experiment concludes that the Random Forest Classifier is the most accurate model for predicting precipitation type. It effectively handles non-linear feature interactions and shows greater resistance to noise compared to linear models like Logistic Regression.
Created by Nicolaus Prima Dharma