Jupyter notebook for end-to-end MPG prediction: EDA, data cleaning (missing/categorical), feature scaling, train/test split, and scikit-learn regression with metrics (R², MAE, RMSE).
job28/Autompg-predictor-regression
AUTO-MPG — LINEAR REGRESSION (JUPYTER NOTEBOOK)
================================================
Project Summary
---------------
This project builds a simple regression model to predict a car’s fuel efficiency (MPG) using the classic Auto MPG dataset. The workflow is implemented in a single notebook:
Regression_AutoMpg.ipynb
It covers:
1) Data loading and basic cleaning
2) Exploratory analysis (correlations and scatter-matrix)
3) Feature engineering (adding squared terms)
4) Train/test split
5) Linear Regression model training and evaluation (R² and RMSE)
6) Plot export for quick visualization
Repository Structure (expected)
-------------------------------
.
├─ Regression_AutoMpg.ipynb              ← Main analysis notebook
├─ data/
│  └─ auto-mpg.csv                       ← Dataset file (you add this)
└─ plots/
   └─ Regression_autompg_Scatter.png     ← Generated by the notebook
Dataset
-------
Name: Auto MPG
Source: UCI Machine Learning Repository (originally from StatLib)
Target column: mpg
The notebook expects a CSV at:
data/auto-mpg.csv
with these columns in this exact order (a header row is not required, because the notebook assigns the names itself):
mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, car_name
Note: The original UCI data marks missing horsepower values with “?”. Make sure the column is fully numeric (convert or drop the rows containing “?”) before running. The notebook drops `origin` (a categorical region code, stored as an integer in the UCI file) and the text column `car_name`, then adds squared terms for a few numeric fields.
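A minimal sketch of the loading step described above, assuming a comma-separated file with no header row. The helper name `load_auto_mpg` is hypothetical and is not taken from the notebook, and a few sample rows are inlined so the snippet runs without the dataset file. It treats “?” as missing at read time and drops the affected rows, which is one of the two options mentioned in the note.

```python
import io

import pandas as pd

# Column names in the order listed above.
COLUMNS = [
    "mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "car_name",
]

# Small inline stand-in for data/auto-mpg.csv (headerless, comma-separated).
SAMPLE_CSV = """\
18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
25.0,4,98.0,?,2046,19.0,71,1,ford pinto
24.0,4,113.0,95.0,2372,15.0,70,3,toyota corona mark ii
"""


def load_auto_mpg(source):
    """Read a headerless Auto MPG CSV, treating "?" as a missing value."""
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # Drop rows with any missing field, then the two columns the model does not use.
    df = df.dropna().drop(columns=["origin", "car_name"])
    return df.reset_index(drop=True)


df = load_auto_mpg(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

To use it on the real file, pass the path instead of the in-memory buffer, for example `load_auto_mpg("data/auto-mpg.csv")`.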
Environment & Requirements
--------------------------
Python 3.9+ recommended.
Core libraries used in the notebook:
- pandas
- numpy
- scikit-learn
- matplotlib
Quick Setup (virtual environment)
---------------------------------
Linux / macOS
1) python -m venv .venv
2) source .venv/bin/activate
3) pip install pandas numpy scikit-learn matplotlib
Windows (PowerShell)
1) python -m venv .venv
2) .venv\Scripts\Activate.ps1
3) pip install pandas numpy scikit-learn matplotlib
Preparing Folders & Data
------------------------
1) Create folders if missing:
   mkdir -p data plots
2) Place your dataset at:
   data/auto-mpg.csv
Make sure the columns match the list given above and non-numeric entries (e.g., “?”) are handled.
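If you would rather create the folders from Python (for example in the first notebook cell, so the step behaves the same on every platform), something along these lines should work. The function name `ensure_project_dirs` is an illustrative assumption and does not come from the notebook.

```python
from pathlib import Path


def ensure_project_dirs(root="."):
    """Create data/ and plots/ under root if they are missing."""
    created = []
    for name in ("data", "plots"):
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat on every run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


dirs = ensure_project_dirs()
print([str(d) for d in dirs])
```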
How to Run
----------
Option A: Jupyter
1) jupyter notebook
2) Open `Regression_AutoMpg.ipynb`
3) Run all cells (Kernel → Restart & Run All)
Option B: VS Code / other IDE
- Open the notebook and run all cells from the UI.
What the Notebook Does
----------------------
1) Reads the dataset and assigns column names
2) Drops the columns it does not model: `origin` (categorical code) and `car_name` (text)
3) Exploratory analysis:
   - Correlation matrix
   - Scatter-matrix (saved to `plots/Regression_autompg_Scatter.png`)
4) Feature engineering:
   - Adds squared terms for selected numeric features (e.g., horsepower, displacement, weight)
5) Train/test split (scikit-learn `train_test_split`, random_state=1)
6) Fits `LinearRegression`
7) Reports:
   - R² on the full dataset (training and test rows together, so it is not a hold-out score)
   - RMSE on the test set
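The modeling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the data is synthetic (generated only so the snippet runs on its own), the 80/20 split size is a guess since the README only states `random_state=1`, and the variable names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Auto MPG frame (NOT real data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "horsepower": rng.uniform(50, 220, n),
    "displacement": rng.uniform(70, 450, n),
    "weight": rng.uniform(1600, 5000, n),
})
df["mpg"] = (
    45 - 0.006 * df["weight"] - 0.04 * df["horsepower"] + rng.normal(0, 2, n)
)

# Feature engineering: add a squared term for each selected column.
for col in ["horsepower", "displacement", "weight"]:
    df[f"{col}_sq"] = df[col] ** 2

X = df.drop(columns="mpg")
y = df["mpg"]

# test_size=0.2 is an assumption; random_state=1 matches the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

# R² over all rows and RMSE over the held-out rows, mirroring the report list.
r2_full = r2_score(y, model.predict(X))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"R squared: {r2_full:.3f}")
print(f"RMSE: {rmse_test:.3f}")
```

Computing R² on `X_test` and `y_test` instead would give a hold-out score that is directly comparable to the RMSE.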
Outputs You Should See
----------------------
- A scatter-matrix plot saved to:
  plots/Regression_autompg_Scatter.png
- Printed metrics in the notebook output, including:
  - R squared: <value>
  - RMSE: <value>
  (Exact values depend on your cleaned dataset.)
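For reference, a correlation matrix and a saved scatter-matrix can be produced along these lines. It is a sketch on synthetic columns rather than the notebook's own plotting cell; the figure size and DPI are arbitrary choices. The `Agg` backend line is only needed when running as a script without a display and can be dropped inside Jupyter.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for a few numeric columns (NOT real data).
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, 100)
df = pd.DataFrame({
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(0, 10, 100),
    "mpg": 45 - 0.006 * weight + rng.normal(0, 2, 100),
})

# Pairwise Pearson correlations between the numeric columns.
print(df.corr().round(2))

out_path = Path("plots") / "Regression_autompg_Scatter.png"
out_path.parent.mkdir(parents=True, exist_ok=True)  # avoids a failed save

# scatter_matrix returns a 2-D array of Axes; save the figure they belong to.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
fig = axes[0, 0].get_figure()
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```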
Reproducing Results
-------------------
- Ensure your CSV is clean (no “?” / non-numeric in numeric columns).
- Run all cells in order.
- The plot and metrics will be generated automatically.
Common Pitfalls & Tips
----------------------
- If you get parsing or dtype errors, check for non-numeric values in `horsepower` and other numeric columns. Convert them with pandas (e.g., `pd.to_numeric(..., errors="coerce")`) and drop rows with NaNs if necessary.
- If `plots/` does not exist, create it before running, or the save call will fail.
- Results will change if you alter the random seed, features, or data cleaning choices.
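The first tip can be illustrated on a tiny hand-made frame. The column values here are made up for the example; the point is only how `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN so they can be inspected and then dropped.

```python
import pandas as pd

# A column read as strings because one entry is the "?" marker.
raw = pd.DataFrame({
    "mpg": [18.0, 25.0, 24.0],
    "horsepower": ["130.0", "?", "95.0"],
})

# Unparseable entries become NaN instead of raising an error.
numeric_hp = pd.to_numeric(raw["horsepower"], errors="coerce")

# Inspect which rows failed to parse before discarding them.
bad_rows = raw[numeric_hp.isna()]
print(bad_rows)

clean = (
    raw.assign(horsepower=numeric_hp)
    .dropna(subset=["horsepower"])
    .reset_index(drop=True)
)
print(clean.dtypes)
```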
Extending the Project
---------------------
- Try adding more polynomial/interaction terms and compare RMSE.
- Standardize/normalize features and see if it helps (especially for regularized models).
- Evaluate alternative models (Ridge, Lasso, RandomForestRegressor, Gradient Boosting).
- Cross-validate with KFold and compare performance.
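As one possible starting point for these extensions, the sketch below compares a plain linear model with a degree-2 Ridge pipeline using 5-fold cross-validation. Everything in it is an assumption made for illustration: the data is synthetic, and the candidate names, `alpha=1.0`, and the fold count are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the cleaned features and target (NOT real data).
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1600, 5000, n)
horsepower = rng.uniform(50, 220, n)
X = np.column_stack([weight, horsepower])
y = 45 - 0.006 * weight - 0.04 * horsepower + rng.normal(0, 2, n)

cv = KFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    # Degree-2 expansion adds squared and interaction terms before scaling.
    "ridge_poly2": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

results = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated RMSE so that larger is always better.
    scores = cross_val_score(
        estimator, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores.mean()

for name, rmse in results.items():
    print(f"{name}: mean CV RMSE = {rmse:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the preprocessing.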
Credits & Attribution
---------------------
- Dataset: Auto MPG, UCI Machine Learning Repository.
- Libraries: pandas, numpy, scikit-learn, matplotlib.
License
-------
This project is provided for educational purposes. If you plan to distribute, consider adding a LICENSE file (e.g., MIT) to clarify usage terms.
Contact
-------
For questions or issues, please open an issue in the repository or contact the maintainer.