A machine learning project 🚀 that predicts wine quality from physicochemical properties. Quality scores (on a 0-10 scale) are binarized into good/bad classes, and the best model, an XGBoost classifier, reaches 82% accuracy. 🍇
- 🔍 Comprehensive EDA with histograms, correlation heatmaps, and feature analysis
- 🤖 Multiple ML models comparison (XGBoost, SVM, Logistic Regression)
- ⚙️ Advanced preprocessing with missing value imputation and feature scaling (see the sketch after this list)
- 📊 Model evaluation using ROC-AUC scores and classification reports
- 🧹 Clean codebase with PEP8 compliance and modular structure
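A minimal preprocessing sketch: the file name `winequality.csv`, mean imputation, and min-max scaling are illustrative assumptions, not the project's only possible choices; `quality` is the target column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('winequality.csv')  # assumed file name

# Mean-impute any columns with missing values
df = df.fillna(df.mean())

# Scale the physicochemical features to [0, 1]; 'quality' is the target
features = df.drop('quality', axis=1)
scaled = MinMaxScaler().fit_transform(features)
```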
- 🐼 Pandas: Data handling
- 🔢 NumPy: Array operations
- 📊 Seaborn/Matplotlib: Data visualization
- 🤖 scikit-learn (sklearn): Machine learning tasks
- 🚀 XGBoost: Advanced boosting algorithm
The dataset contains 11 physicochemical features, plus a quality score that serves as the prediction target:
- 🍋 Fixed acidity
- 🌬️ Volatile acidity
- 🍊 Citric acid
- 🍬 Residual sugar
- 🧂 Chlorides
- 🫧 Free sulfur dioxide
- 🫧 Total sulfur dioxide
- ⚖️ Density
- 🧪 pH
- 🧪 Sulphates
- 🍷 Alcohol
- 🏆 Quality (target)
Each feature provides unique insight into the chemistry and characteristics of the wine, ultimately influencing its quality.
Explore key statistics such as mean, standard deviation, min, max, and quartiles for each wine feature. These insights help you understand data distribution, variability, and potential outliers in your dataset. 🧮
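With pandas, these statistics come from a single call (assuming the dataset is loaded into a DataFrame named `df`):

```python
# count, mean, std, min, quartiles, and max for every numeric column
print(df.describe().T)
```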
EDA is an approach to analyzing data with visual techniques: it helps uncover trends and patterns and check assumptions through statistical summaries and graphical representations. 🕵️‍♂️ Let's start by counting the null values in each column to gauge data quality and completeness. 🧐
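Checking for nulls is one line (again assuming the DataFrame is named `df`):

```python
# Number of missing values in each column
print(df.isnull().sum())
```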
Datasets sometimes contain redundant features that do not improve model performance, so we remove them before training.
The heatmap above shows that 'total sulfur dioxide' and 'free sulfur dioxide' are highly correlated, so we drop one of the pair ('total sulfur dioxide') before training.
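The heatmap and the drop step might look like this (a sketch; the column name follows the UCI dataset, and the 0.7 threshold is an illustrative choice):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Highlight feature pairs whose correlation exceeds 0.7
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr() > 0.7, annot=True, cbar=False)
plt.show()

# The two sulfur dioxide features are highly correlated, so keep only one
df = df.drop('total sulfur dioxide', axis=1)
```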
| Model | 🏋️ Training Accuracy | 🧪 Validation Accuracy |
|---|---|---|
| Logistic Regression | 0.698 | 0.686 |
| XGBoost Classifier | 0.976 | 0.805 |
| SVC (RBF Kernel) | 0.720 | 0.707 |
- XGBoost Classifier delivered the highest validation accuracy! 🚀
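A sketch of the comparison loop, assuming `df` is the preprocessed DataFrame; the binarization threshold (quality > 5), split ratio, and random seed are assumptions and may differ from the project's notebook:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Binarize the target: 1 = good wine, 0 = otherwise (assumed threshold)
y = (df['quality'] > 5).astype(int)
X = df.drop('quality', axis=1)

xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = [LogisticRegression(max_iter=1000),
          XGBClassifier(),
          SVC(kernel='rbf')]

# Fit each model and report train/validation accuracy
for model in models:
    model.fit(xtrain, ytrain)
    print(type(model).__name__)
    print('  training accuracy  :', accuracy_score(ytrain, model.predict(xtrain)))
    print('  validation accuracy:', accuracy_score(ytest, model.predict(xtest)))
```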
| Model | Training AUC | Validation AUC |
|---|---|---|
| Logistic Regression | 0.70 | 0.69 |
| XGBoost | 0.98 | 0.80 |
| SVC (RBF Kernel) | 0.72 | 0.71 |
Best Model (XGBoost) Classification Report:

```
              precision    recall  f1-score   support

           0       0.76      0.74      0.75       474
           1       0.86      0.86      0.86       826
```
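The AUC table and the report above can be reproduced along these lines (a sketch; `xgb` stands for the fitted XGBoost model from the comparison loop, and the variable names are assumptions):

```python
from sklearn.metrics import roc_auc_score, classification_report

xgb = models[1]  # the fitted XGBClassifier from the loop above

# ROC-AUC is computed from predicted probabilities of the positive class
print('Training AUC  :', roc_auc_score(ytrain, xgb.predict_proba(xtrain)[:, 1]))
print('Validation AUC:', roc_auc_score(ytest, xgb.predict_proba(xtest)[:, 1]))

# Per-class precision, recall, and F1 for the best model
print(classification_report(ytest, xgb.predict(xtest)))
```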
- 🍴 Fork the repository
- 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`)
- 💾 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- 🚀 Push to the branch (`git push origin feature/AmazingFeature`)
- 🔄 Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
This project has been created as my final submission for Stanford’s Code in Place 2025! 🚀 The project applies the foundational Python and data science skills learned in Code in Place to a real-world machine learning challenge: predicting wine quality based on physicochemical features. The program uses a well-known dataset to train and evaluate several machine learning models, focusing on clean code, data analysis, and model comparison.
- Adityabaan Tripathy - Initial work
- Wine Quality Dataset - UCI Machine Learning Repository