This project was developed as part of the Kaggle competition House Prices - Advanced Regression Techniques. The goal was to predict house sale prices using a dataset with 79 explanatory variables describing nearly every aspect of residential homes in Ames, Iowa.
Build a robust regression model capable of accurately predicting house prices using advanced data preprocessing, feature engineering, and ensemble modeling techniques.
- Programming Language: Python;
- Data Manipulation: pandas, numpy;
- Visualization: matplotlib, seaborn;
- Modeling: XGBoost, LightGBM, CatBoost, GradientBoostingRegressor;
- Ensemble Modeling: StackingRegressor;
- Hyperparameter Tuning: Optuna;
- Preprocessing: MaxAbsScaler, K-Fold Target Encoding, VIF.
- Missing values were treated case-by-case, with contextual understanding of each feature;
- No blind imputation or mass dropping of rows, so useful information was preserved (see the sketch below).
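A minimal sketch of this case-by-case approach, assuming the competition's standard `train.csv` and the feature semantics from `data_description.txt` (the exact per-feature decisions in the notebooks may differ):

```python
import pandas as pd

df = pd.read_csv("data/train.csv")

# For several features, NaN means the house simply lacks the item
# (per data_description.txt), so an explicit "None" category is more
# faithful than dropping rows or imputing a statistic.
for col in ["PoolQC", "Fence", "FireplaceQu", "GarageType"]:
    df[col] = df[col].fillna("None")

# LotFrontage depends heavily on location, so impute with the median
# of each house's neighborhood rather than a global median.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```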
- Outliers were also evaluated individually to avoid discarding valuable data;
- Visual and statistical analysis was used to detect them and decide on their treatment, as illustrated below.
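For instance, a sketch of the visual check (the thresholds below are illustrative, not necessarily the cutoffs used in this project):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/train.csv")

# A scatter plot makes the handful of very large, unusually cheap
# houses easy to spot before deciding whether to drop them.
plt.scatter(df["GrLivArea"], df["SalePrice"], alpha=0.5)
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

# Remove only the points judged anomalous after inspection.
df = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000))]
```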
- Target variable distribution analysis and normalization;
- K-Fold Target Encoding for categorical features, followed by correlation analysis with the target (a sketch follows this list);
- Feature importance and redundancy evaluation using correlation matrix and Variance Inflation Factor (VIF);
- Feature Engineering to uncover new patterns and relationships between features and the target.
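A minimal sketch of the K-Fold Target Encoding step (the function name and fold count are illustrative); VIF values can be computed afterwards with statsmodels' `variance_inflation_factor`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("data/train.csv")
y = np.log1p(df["SalePrice"])  # log-transform the skewed target

def kfold_target_encode(frame, col, target, n_splits=5, seed=42):
    """Replace a categorical column with out-of-fold target means,
    so no row's encoding ever sees its own target value."""
    encoded = pd.Series(np.nan, index=frame.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(frame):
        means = target.iloc[fit_idx].groupby(frame[col].iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = frame[col].iloc[enc_idx].map(means).to_numpy()
    return encoded.fillna(target.mean())  # unseen categories -> global mean

df["Neighborhood_te"] = kfold_target_encode(df, "Neighborhood", y)
```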
- Used MaxAbsScaler to scale the data while preserving feature sparsity.
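A short sketch (the numeric-only feature selection here is a simplification of the full pipeline):

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.read_csv("data/train.csv")
X = df.select_dtypes(include="number").drop(columns=["Id", "SalePrice"]).fillna(0)

# MaxAbsScaler divides each feature by its maximum absolute value,
# mapping values into [-1, 1] without centering, so zero entries
# stay zero and sparse columns keep their sparsity.
X_scaled = MaxAbsScaler().fit_transform(X)
```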
- Trained four regression models: LightGBM, XGBoost, CatBoost, and GradientBoosting;
- Used Optuna for hyperparameter tuning to optimize model performance;
- Developed a final stacking ensemble, using XGBoost Regressor as the meta-model, achieving an RMSE of 0.10.
Four base models were trained and tuned:
- LightGBM Regressor;
- XGBoost Regressor;
- CatBoost Regressor;
- GradientBoosting Regressor.
All models underwent hyperparameter tuning using Optuna, a powerful optimization framework.
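A minimal tuning sketch for one of the models (the search space shown is illustrative, not the exact ranges used in this project):

```python
import numpy as np
import optuna
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Same simplified setup as in the earlier sketches.
df = pd.read_csv("data/train.csv")
X = df.select_dtypes(include="number").drop(columns=["Id", "SalePrice"]).fillna(0)
y = np.log1p(df["SalePrice"])

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBRegressor(**params, random_state=42)
    # Cross-validated RMSE on the log target, which Optuna minimizes.
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```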
- A Stacking Ensemble was built using the four models above, with XGBoost Regressor as the meta-model (see the sketch below);
- Achieved a final score of RMSE = 0.10 on the test set.
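A sketch of the ensemble wiring (base learners are shown with defaults to keep it short; in practice each would carry its Optuna-tuned hyperparameters):

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from xgboost import XGBRegressor

# Same simplified setup as in the earlier sketches.
df = pd.read_csv("data/train.csv")
X = df.select_dtypes(include="number").drop(columns=["Id", "SalePrice"]).fillna(0)
y = np.log1p(df["SalePrice"])

stack = StackingRegressor(
    estimators=[
        ("lgbm", LGBMRegressor(random_state=42)),
        ("xgb", XGBRegressor(random_state=42)),
        ("cat", CatBoostRegressor(verbose=0, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42)),
    ],
    # XGBoost as the meta-model, fit on the out-of-fold predictions
    # that StackingRegressor generates internally via CV.
    final_estimator=XGBRegressor(random_state=42),
    cv=5,
)
stack.fit(X, y)
```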
├── data/ # Raw and processed data
├── notebooks/ # Jupyter Notebooks (EDA, modeling, etc.)
├── models/ # Saved model files and evaluation results
├── visuals/ # Plots and visualizations
├── README.md # Project documentation
└── requirements.txt # Required Python packages

- Final RMSE: 0.10;
- Strong performance thanks to:
  - Feature engineering and encoding;
  - Individualized outlier/missing data treatment;
  - Model stacking and hyperparameter tuning.
- The importance of understanding data context before applying preprocessing steps;
- The value of advanced encoding techniques such as K-Fold Target Encoding;
- The gains from combining multiple models through stacking;
- The effectiveness of Optuna for hyperparameter optimization.
- Add cross-validation visual analysis for each model;
- Deploy the model using a simple API;
- Automate the preprocessing pipeline with custom transformers (a minimal sketch follows).
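For that last item, a sketch of what such a transformer could look like (`ContextualImputer` is a hypothetical name wrapping the LotFrontage rule from the missing-value step):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ContextualImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learns per-neighborhood LotFrontage
    medians in fit() and applies them in transform(), so the same
    contextual imputation runs identically on train and test data."""

    def fit(self, X, y=None):
        self.medians_ = X.groupby("Neighborhood")["LotFrontage"].median()
        return self

    def transform(self, X):
        X = X.copy()
        X["LotFrontage"] = X["LotFrontage"].fillna(
            X["Neighborhood"].map(self.medians_)
        )
        return X

pipe = Pipeline([
    ("impute", ContextualImputer()),
    # ...encoding and scaling steps would slot in here.
])
```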
Fábio Galdino