The Sustainability Impact Predictor is a machine learning project that predicts the environmental impact of business activities, focusing specifically on CO2 emissions. It uses data from the EPA's Greenhouse Gas Reporting Program (GHGRP) to train models that forecast CO2 emissions.
sustainability-impact-predictor/
│
├── data/
│   ├── raw/
│   │   └── ghgrp_data_2022.csv
│   └── processed/
│       └── feature_engineered_data.csv
│
├── models/
│   ├── best_model.joblib
│   ├── preprocessor.joblib
│   ├── random_forest_feature_importance.csv
│   ├── gradient_boosting_feature_importance.csv
│   ├── random_forest_feature_importance.png
│   ├── gradient_boosting_feature_importance.png
│   └── residual_plot.png
│
├── src/
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   └── train_models.py
│
├── notebooks/
│   └── exploratory_data_analysis.ipynb
│
├── requirements.txt
├── README.md
└── .gitignore
- Clone this repository:
  git clone https://github.com/yourusername/sustainability-impact-predictor.git
  cd sustainability-impact-predictor
- Create a virtual environment and activate it:
  python -m venv venv
  source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install the required packages:
  pip install -r requirements.txt
- Data Preprocessing:
  python src/data_preprocessing.py
- Feature Engineering:
  python src/feature_engineering.py
- Train Models:
  python src/train_models.py
- For exploratory data analysis, open the Jupyter notebook:
  jupyter notebook notebooks/exploratory_data_analysis.ipynb
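The three pipeline stages above have to run in order, since each one consumes the previous stage's output. A minimal sketch of a runner that chains them might look like the following. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of the repository; only the script paths come from the project layout.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in the order described above (paths taken from the project layout).
STAGES = [
    "src/data_preprocessing.py",
    "src/feature_engineering.py",
    "src/train_models.py",
]


def build_commands(python=sys.executable):
    """Return one command (argument list) per pipeline stage, in run order."""
    return [[python, script] for script in STAGES]


def run_pipeline(project_root="."):
    """Run each stage from the project root, stopping at the first failure."""
    root = Path(project_root)
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if a stage exits with a non-zero code.
        subprocess.run(cmd, cwd=root, check=True)
```

Calling `run_pipeline()` from the repository root, with the virtual environment active, would run preprocessing, feature engineering, and training in sequence and stop as soon as one stage fails.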
This project uses data from the EPA's Greenhouse Gas Reporting Program (GHGRP). The raw data can be found in data/raw/ghgrp_data_2022.csv. After preprocessing and feature engineering, the processed data is stored in data/processed/feature_engineered_data.csv.
To obtain the raw data:
- Visit https://www.epa.gov/ghgreporting/ghg-reporting-program-data-sets
- Navigate to the "2022 Data" section
- Download the "2022 Data Summary Spreadsheets (zip)" file
- Extract the contents and place the main CSV file in the data/raw/ directory
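Once the file is in place, a quick sanity check can confirm that it is readable before the pipeline runs. The sketch below uses only the standard library. The function name `summarize_csv` is hypothetical, and the UTF-8 encoding is an assumption about the downloaded file rather than something the source states.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, data_row_count) for the CSV file at `path`."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected raw data at {path}")
    # Encoding is assumed; errors="replace" avoids a crash on unexpected bytes.
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count
```

For example, `summarize_csv("data/raw/ghgrp_data_2022.csv")` would return the column names and the number of data rows, or raise `FileNotFoundError` if the file has not been placed there yet.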
We train and compare two models:
- Random Forest Regressor
- Gradient Boosting Regressor
The best-performing model is saved as models/best_model.joblib, and the data preprocessor is saved as models/preprocessor.joblib.
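As an illustration of the compare-and-save step, the sketch below fits both regressors on synthetic data, keeps the one with the higher held-out R2, and persists it with joblib. This is not the code in src/train_models.py: the synthetic dataset, the hyperparameters, and the use of R2 as the selection criterion are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered GHGRP features (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each candidate and score it on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Keep the model with the highest held-out R2 (selection criterion assumed).
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]

# Saved to a temporary directory here; the project writes models/best_model.joblib.
output_path = Path(tempfile.mkdtemp()) / "best_model.joblib"
joblib.dump(best_model, output_path)

# Reloading gives back a fitted estimator that can predict immediately.
reloaded = joblib.load(output_path)
print(best_name, scores)
```

A saved model can later be restored with `joblib.load("models/best_model.joblib")`. Inputs would presumably need to pass through the saved preprocessor first, although the exact interface of that object is not described here.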
After training, the following results are generated:
- Feature importance plots: models/random_forest_feature_importance.png and models/gradient_boosting_feature_importance.png
- Feature importance data: models/random_forest_feature_importance.csv and models/gradient_boosting_feature_importance.csv
- Residual plot: models/residual_plot.png
Model performance metrics, including R2 score, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), are printed to the console during training.
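For reference, the three metrics can be computed from their standard definitions as in the sketch below. It is written in plain Python to show the arithmetic. The training script may well use library implementations instead, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, and RMSE for two equal-length sequences of numbers."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R2 is undefined when every target value is identical (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {"r2": r2, "mae": mae, "rmse": rmse}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

MAE and RMSE are expressed in the same units as the target, so they read directly as typical emission errors, while R2 is unitless and describes the share of variance in the target that the model explains.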
Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.
This project is licensed under the MIT License - see the LICENSE file for details.