GitHub - job28/Autompg-predictor-regression: Jupyter notebook for end-to-end MPG prediction: EDA, data cleaning (missing/categorical), feature scaling, train/test split, and scikit-learn regression with metrics (R², MAE, RMSE).

MIT license

Repository files: LICENSE, README.txt, Regression_AutoMpg.ipynb, auto-mpg.csv, gitignore.txt, requirements.txt

AUTO-MPG — LINEAR REGRESSION (JUPYTER NOTEBOOK)
================================================

Project Summary
---------------
This project builds a simple regression model to predict a car’s fuel efficiency (MPG) using the classic Auto MPG dataset. The workflow is implemented in a single notebook:

    Regression_AutoMpg.ipynb

It covers:
1) Data loading and basic cleaning
2) Exploratory analysis (correlations and scatter-matrix)
3) Feature engineering (adding squared terms)
4) Train/test split
5) Linear Regression model training and evaluation (R² and RMSE)
6) Plot export for quick visualization

Repository Structure (expected)
-------------------------------
.
├─ Regression_AutoMpg.ipynb       ← Main analysis notebook
├─ data/
│  └─ auto-mpg.csv                ← Dataset file (you add this)
└─ plots/
   └─ Regression_autompg_Scatter.png  ← Generated by the notebook

Dataset
-------
Name: Auto MPG  
Source: UCI Machine Learning Repository (originally from StatLib)  
Target column: mpg

The notebook expects a CSV at:

    data/auto-mpg.csv

with these columns in this exact order. A header row is not required, because the notebook assigns the column names itself:

    mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, car_name

Note: The original UCI data marks missing `horsepower` values with “?”. Make sure every numeric column in your CSV is actually numeric (convert the “?” entries or drop those rows) before running. The notebook drops `origin` (an integer-coded category in the UCI data) and the text column `car_name`, then adds squared terms for a few numeric fields.
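One way to satisfy the note above is to let pandas treat “?” as missing while reading the file. This is a minimal sketch, not the notebook's own loading code: the helper name `load_auto_mpg` is hypothetical, and it assumes a comma-separated file with no header row (pass `header=0` instead if your file has one). The two sample rows are only there to make the snippet runnable.

```python
import io

import pandas as pd

COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]


def load_auto_mpg(source):
    """Read a headerless Auto MPG CSV and drop rows with missing horsepower.

    Hypothetical helper: `source` is a path such as "data/auto-mpg.csv"
    or any file-like object.
    """
    # na_values="?" turns the "?" placeholders into NaN, so horsepower
    # is parsed as a float column instead of a string column.
    df = pd.read_csv(source, names=COLUMNS, header=None, na_values="?")
    return df.dropna(subset=["horsepower"]).reset_index(drop=True)


# Two illustrative rows in the expected column order; the second one has "?".
sample = io.StringIO(
    '18.0,8,307.0,130.0,3504,12.0,70,1,"chevrolet chevelle malibu"\n'
    '25.0,4,98.0,?,2046,19.0,71,1,"ford pinto"\n'
)
df = load_auto_mpg(sample)
print(df.dtypes)
```

With the real file, the call would be `load_auto_mpg("data/auto-mpg.csv")`.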

Environment & Requirements
--------------------------
Python 3.9+ recommended.

Core libraries used in the notebook:
- pandas
- numpy
- scikit-learn
- matplotlib

Quick Setup (virtual environment)
---------------------------------
Linux / macOS
1) python -m venv .venv
2) source .venv/bin/activate
3) pip install pandas numpy scikit-learn matplotlib

Windows (PowerShell)
1) python -m venv .venv
2) .venv\Scripts\Activate.ps1
3) pip install pandas numpy scikit-learn matplotlib

Preparing Folders & Data
------------------------
1) Create folders if missing:

    mkdir -p data plots

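2) Place your dataset at:

    data/auto-mpg.csv

If you use the `auto-mpg.csv` shipped at the top level of the repository, copy it into `data/` first. Make sure the columns match the list in the Dataset section and that non-numeric entries (e.g., “?”) are handled.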
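The `mkdir -p` command above is a Unix shell idiom. As a platform-independent alternative, the same preparation can be sketched in Python with the standard library only. The folder and file names come from this README; the check and its message are illustrative.

```python
from pathlib import Path

# Create the folders the notebook expects; exist_ok avoids an error
# when they are already present.
for folder in ("data", "plots"):
    Path(folder).mkdir(parents=True, exist_ok=True)

# Warn early if the dataset has not been copied into place yet.
csv_path = Path("data") / "auto-mpg.csv"
if not csv_path.exists():
    print(f"Dataset not found at {csv_path}; copy it there before running the notebook.")
```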

How to Run
----------
Option A: Jupyter
1) jupyter notebook   (if the command is missing, install it with: pip install notebook)
2) Open `Regression_AutoMpg.ipynb`
3) Run all cells (Kernel → Restart & Run All)

Option B: VS Code / other IDE
- Open the notebook and run all cells from the UI.

What the Notebook Does
----------------------
1) Reads the dataset and assigns column names
2) Drops the columns that are not used as features: `origin`, `car_name`
3) Exploratory analysis:
   - Correlation matrix
   - Scatter-matrix (saved to `plots/Regression_autompg_Scatter.png`)
4) Feature engineering:
   - Adds squared terms for selected numeric features (e.g., horsepower, displacement, weight)
5) Train/test split (scikit-learn `train_test_split`, random_state=1)
6) Fits `LinearRegression`
7) Reports:
   - R² on the full dataset (training and test rows together, so it is not a purely held-out score)
   - RMSE on the test set
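Steps 2 and 4 to 7 can be sketched as one function. This is an approximation of the workflow described above, not a copy of the notebook: the function name `fit_and_score` is hypothetical, the list of squared features follows the examples given in step 4, and `test_size=0.2` is an assumption because the README does not state the split ratio.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed selection, taken from the examples named in step 4.
SQUARED = ["horsepower", "displacement", "weight"]


def fit_and_score(df, test_size=0.2):
    """Fit LinearRegression on mpg and return (model, r2_full, rmse_test)."""
    # Step 2: drop the columns that are not used as features.
    numeric = df.drop(columns=["origin", "car_name"], errors="ignore")
    X = numeric.drop(columns=["mpg"]).copy()
    y = numeric["mpg"]

    # Step 4: add a squared term for each selected feature.
    for col in SQUARED:
        X[f"{col}_sq"] = X[col] ** 2

    # Step 5: split with the seed the README names (random_state=1).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=1
    )

    # Step 6: fit on the training rows only.
    model = LinearRegression().fit(X_train, y_train)

    # Step 7: R² over all rows, RMSE over the held-out rows.
    r2_full = r2_score(y, model.predict(X))
    rmse_test = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    return model, r2_full, rmse_test
```

Calling `fit_and_score(df)` on a cleaned DataFrame returns the fitted model and both metrics. Scoring `r2_score(y_test, model.predict(X_test))` as well would give an R² that is directly comparable with the test RMSE.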

Outputs You Should See
----------------------
- A scatter-matrix plot saved to:

    plots/Regression_autompg_Scatter.png

- Printed metrics in the notebook output, including:
   - R squared: <value>
   - RMSE: <value>

(Exact values depend on your cleaned dataset.)
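For reference, a scatter-matrix file like the one listed above could be produced along these lines. This is a sketch under assumptions: the helper name `save_scatter_matrix`, the figure size, and the DPI are illustrative choices, not values taken from the notebook.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix


def save_scatter_matrix(df, out_path="plots/Regression_autompg_Scatter.png"):
    """Draw a scatter-matrix of the numeric columns and save it as a PNG."""
    out_path = Path(out_path)
    # Create the target folder so the save call cannot fail on a missing directory.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    axes = scatter_matrix(df.select_dtypes("number"), figsize=(12, 12), diagonal="hist")
    fig = axes[0, 0].get_figure()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory
    return out_path
```

`df.select_dtypes("number").corr()` gives the matching correlation matrix for the same columns.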

Reproducing Results
-------------------
- Ensure your CSV is clean (no “?” or other non-numeric values in numeric columns).
- Run all cells in order.
- The plot and metrics will be generated automatically.

Common Pitfalls & Tips
----------------------
- If you get parsing or dtype errors, check for non-numeric values in `horsepower` and other numeric columns. Convert them with pandas (e.g., `pd.to_numeric(..., errors="coerce")`) and drop rows with NaNs if necessary.
- If `plots/` does not exist, create it before running, or the save call will fail.
- Results will change if you alter the random seed, features, or data cleaning choices.
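The first tip can be sketched as a small cleaning helper built on `pd.to_numeric(..., errors="coerce")`. The function name `coerce_numeric` and the toy rows are illustrative; the column list is the numeric subset of the columns named in the Dataset section.

```python
import pandas as pd

NUMERIC_COLS = ["mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "model_year"]


def coerce_numeric(df, columns=NUMERIC_COLS):
    """Convert the given columns to numbers and drop rows that fail to convert."""
    cleaned = df.copy()
    present = [c for c in columns if c in cleaned.columns]
    for col in present:
        # errors="coerce" replaces unparseable entries such as "?" with NaN.
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

    n_bad = int(cleaned[present].isna().any(axis=1).sum())
    print(f"Dropping {n_bad} row(s) with non-numeric or missing values")
    return cleaned.dropna(subset=present).reset_index(drop=True)


# Toy rows for illustration only: horsepower arrives as text, one entry is "?".
raw = pd.DataFrame({"mpg": [18.0, 25.0, 21.0], "horsepower": ["130.0", "?", "95"]})
clean = coerce_numeric(raw)
print(clean.dtypes)
```

Dropping rows is the simplest choice. Imputing the missing values (for example with the column median) is an alternative, and it will change the reported metrics.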

Extending the Project
---------------------
- Try adding more polynomial/interaction terms and compare RMSE.
- Standardize/normalize features and see if it helps (especially for regularized models).
- Evaluate alternative models (Ridge, Lasso, RandomForestRegressor, Gradient Boosting).
- Cross-validate with KFold and compare performance.
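As a starting point for the last three ideas, the sketch below compares plain linear regression with Ridge under K-fold cross-validation, standardizing features inside each fold. The function name `compare_models`, `alpha=1.0`, and the five folds are assumptions chosen for illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, n_splits=5):
    """Return the mean cross-validated RMSE for each candidate model."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    candidates = {
        # Putting the scaler in the pipeline refits it on each training fold,
        # so no information leaks from the validation fold.
        "linear": make_pipeline(StandardScaler(), LinearRegression()),
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    }
    results = {}
    for name, model in candidates.items():
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        # scikit-learn reports negated errors; flip the sign to get RMSE.
        results[name] = float(-scores.mean())
    return results
```

Here `X` is the feature matrix (with or without the squared terms) and `y` is the `mpg` column. Lower RMSE is better, and other estimators such as `Lasso` or `RandomForestRegressor` can be added to `candidates` in the same way.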

Credits & Attribution
---------------------
- Dataset: Auto MPG, UCI Machine Learning Repository.
- Libraries: pandas, numpy, scikit-learn, matplotlib.

License
-------
This project is provided for educational purposes under the MIT license. See the LICENSE file in the repository for the full terms.

Contact
-------
For questions or issues, please open an issue in the repository or contact the maintainer.
