This project focuses on the investigation of house prices, specializing in
Data Cleaning, Feature Selection, Hyperparameter Tuning, Model Development, and Model Production.
It is the second most popular competition on Kaggle. The submission file scores within the top 5% of the global leaderboard.
- 🔍 Dataset - Data source
- 🧹 Data Cleaning & Engineering - Outliers, missing values handling
- 📊 Feature Selection - Univariate approaches, correlation elimination, feature importances
- 🧩 Hyperparameter Tuning - Bayesian optimization methods
- ⚙️ Model Development - Extreme Gradient Boosting
- 🛒 Model Production - joblib files for production
- 🚀 Future Enhancements - Next steps and improvements
- 🔁 Reproducibility - Install dependencies
I used the Kaggle API (Program > kaggle_data_download.py) to download the dataset automatically.
See the Reproducibility section for more details regarding the Kaggle API.
More information about the data is available at the following link.
- Ordinal features were encoded according to their natural order, instead of being treated as purely categorical.
- Categorical features with low variance or limited predictive value (small RMSE after univariate association with the target) were removed to reduce noise.
- Missing values were imputed using business logic relevant to the domain.
- Outliers were identified using standardized numerical features.
A new feature (ColX) was created by computing the mean of all numeric values across each row.
Observations with ColX values outside the range mean ± 4×sd were excluded from the analysis (~1% removed).
- Building age and garage age were calculated, along with a flag for previous house remodeling.
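The row-mean outlier rule described above can be sketched as follows (a minimal illustration, not the project's exact code; the helper name `flag_outliers` and the default `k = 4` cut-off are taken from the description):

```python
import numpy as np
import pandas as pd

def flag_outliers(df: pd.DataFrame, k: float = 4.0) -> pd.Series:
    """Flag rows whose mean standardized value (ColX) falls outside mean ± k*sd."""
    num = df.select_dtypes(include="number")
    z = (num - num.mean()) / num.std(ddof=0)   # standardize each numeric column
    colx = z.mean(axis=1)                       # row-wise mean -> the ColX feature
    lo = colx.mean() - k * colx.std(ddof=0)
    hi = colx.mean() + k * colx.std(ddof=0)
    return (colx < lo) | (colx > hi)            # True = candidate outlier
```

Rows where the mask is True would then be dropped before modeling.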
Two complementary approaches were performed:
- Univariate analysis:
The relationship between the target and each feature was assessed using linear regression.
Performance metrics such as Entropy (for categorical features), RMSE, and R² were recorded to guide feature selection.
- Categorical features with low entropy (less than one-seventh of the maximum possible) were removed.
- For numerical features, any pair with correlation above a predefined cut-off (0.6) was considered redundant.
In such cases, the feature with the higher RMSE in the univariate analysis was removed.
- Multivariate ranking: A tuned XGBoost model with regularization was trained on the remaining features to assess their importance.
Features with zero importance were excluded from the final set.
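The entropy cut-off and the correlation-based elimination above might look roughly like this (a sketch with illustrative helper names; the 1/7 entropy fraction and the 0.6 correlation cut-off come from the text, while the RMSE dictionary stands in for the univariate results):

```python
import numpy as np
import pandas as pd

def low_entropy(col: pd.Series, frac: float = 1 / 7) -> bool:
    """True if the feature's Shannon entropy is below `frac` of its maximum
    (log2 of the number of levels), i.e. one category dominates."""
    p = col.value_counts(normalize=True).to_numpy()
    if len(p) < 2:
        return True
    h = -(p * np.log2(p)).sum()
    return h < frac * np.log2(len(p))

def drop_correlated(df: pd.DataFrame, rmse: dict, cutoff: float = 0.6) -> list:
    """For each pair correlated above `cutoff`, drop the feature with the
    higher univariate RMSE; return the surviving column names."""
    corr = df.corr().abs()
    dropped: set = set()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > cutoff:
                dropped.add(a if rmse[a] > rmse[b] else b)
    return [c for c in cols if c not in dropped]
```

The surviving features would then be passed to the XGBoost importance ranking.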
Hyperparameter tuning was performed using Bayesian optimization with the Tree-structured Parzen Estimator (TPE), implemented via the treeparzen module.
Unlike traditional grid/random search, Bayesian optimization constructs a probabilistic surrogate model of the loss function, which it uses to intelligently select the most promising hyperparameter configurations.
TPE, in particular, models the distribution of “good” versus “bad” hyperparameter configurations and chooses new candidates that maximize the expected improvement based on the given hyperparameter ranges.
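To make the good/bad-density idea concrete, here is a minimal, dependency-free TPE-style loop on a toy 1-D objective. This illustrates the principle only, not the treeparzen module's actual API; the bandwidth, the γ = 0.25 quantile, and the search range are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # toy objective standing in for validation RMSE; minimum at x = 2
    return (x - 2.0) ** 2

def tpe_step(xs, ys, gamma=0.25, n_candidates=50, bandwidth=0.5):
    """One TPE iteration: split observations into 'good'/'bad' by a loss
    quantile, then pick the candidate maximizing density(good)/density(bad)."""
    cut = np.quantile(ys, gamma)
    good, bad = xs[ys <= cut], xs[ys > cut]

    def kde(points, q):
        # simple Gaussian kernel density estimate over observed configs
        d = (q[:, None] - points[None, :]) / bandwidth
        return np.mean(np.exp(-0.5 * d ** 2), axis=1) + 1e-12

    cand = rng.uniform(-5, 5, n_candidates)
    score = kde(good, cand) / kde(bad, cand)   # expected-improvement proxy
    return cand[np.argmax(score)]

# warm start with random configurations, then refine with TPE steps
xs = rng.uniform(-5, 5, 20)
ys = loss(xs)
for _ in range(30):
    x_new = tpe_step(xs, ys)
    xs, ys = np.append(xs, x_new), np.append(ys, loss(x_new))
best = xs[np.argmin(ys)]
```

In the real pipeline the 1-D variable would be replaced by the XGBoost hyperparameter vector and `loss` by a cross-validated error.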
Splitting the data into train, test (20%), and validation (15%) sets, a tuned XGBoost model was trained over multiple trials with subjective selections among the final features. Finally, a model with 45 features was selected for production.
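The split proportions can be reproduced with scikit-learn's `train_test_split` (a sketch on dummy data; the second call uses 0.15/0.80 so that the validation share is 15% of the *full* dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # dummy feature matrix
y = np.arange(100)                   # dummy target

# hold out 20% of the full data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# take 15% of the full data for validation: 0.15 / 0.80 of the remainder
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.80, random_state=42
)
# X_train / X_valid / X_test now hold 65 / 15 / 20 rows
```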
- The selected features were saved in an .xlsx file.
- The trained XGBoost model was serialized and stored as a .joblib file for production use.
- The model_in_production.py script implements the prediction procedure, generating an output file containing the ID and the predicted value, with a timestamped filename.
- Temporal features could substantially improve model performance by reducing seasonal variance.
- Monitoring & Model Maintenance - model drift detection to track performance over time as market trends change.
- Schedule periodic retraining using the latest data to maintain model accuracy.
- Integration with Web/BI Tools for real-time scoring API to allow users to input house features and receive predictions instantly.
git clone https://github.com/Papagiannopoulos/house-price-investigation
cd house-price-investigation
2. Create a fresh virtual env
uv venv
Note: If uv is not already installed, run the appropriate command for your platform.
- On macOS and Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
- On Windows:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
uv sync
Note: Microsoft Visual C++ is required at this step. If uv sync crashes, follow the provided steps.
- Create a Kaggle account
- Go to Account Settings and click "Create New API Token" to download the kaggle.json file
- Navigate to C:\Users\<your_user_name> on your computer
- Create a new folder named .kaggle
- Move the downloaded kaggle.json file into the .kaggle folder