📖 About the Project

This project builds a regression model that predicts the fare amount of Uber rides from factors such as pickup/drop-off coordinates, passenger count, and trip distance. The dataset is derived from NYC Uber trips, and the project demonstrates practical applications of data cleaning, feature engineering, and model evaluation.
✨ Features
- Data cleaning and preprocessing of real-world trip records
- Feature extraction from timestamps and geolocation data
- Visualization of data distributions and correlations
- Distance calculation using the Haversine formula
- Model training with Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost
- Performance comparison using RMSE and R² metrics
🛠️ Tech Stack
- Language: Python
- Environment: Jupyter Notebook
- Libraries Used (a sample requirements file is sketched below):
  - pandas for data manipulation
  - numpy for numerical operations
  - matplotlib and seaborn for data visualization
  - scikit-learn for machine learning models and metrics
  - xgboost, lightgbm, and catboost for the gradient boosting models
  - math for the Haversine distance calculation
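A plausible `requirements.txt` matching the stack above (a sketch; versions are deliberately left unpinned):

```text
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
lightgbm
catboost
```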
🧹 Data Preprocessing

Dataset: Uber NYC fare data
- Removed missing values
- Dropped rows with negative or zero distances/fare amounts
- Filtered out unrealistic coordinates
- Extracted hour, weekday, and month from the pickup datetime
- Calculated the distance between pickup and drop-off points using the Haversine formula (a short feature-engineering sketch follows this list)
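A minimal sketch of these two feature-engineering steps, assuming a pandas DataFrame with `pickup_datetime`, `pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, and `dropoff_longitude` columns (the column names are assumptions; the distance is vectorised with numpy here rather than the `math` module):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Datetime features: hour, weekday, and month of the pickup.
    df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
    df["hour"] = df["pickup_datetime"].dt.hour
    df["weekday"] = df["pickup_datetime"].dt.weekday
    df["month"] = df["pickup_datetime"].dt.month

    # Trip distance from the Haversine formula.
    df["distance_km"] = haversine_km(
        df["pickup_latitude"], df["pickup_longitude"],
        df["dropoff_latitude"], df["dropoff_longitude"],
    )
    return df
```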
🧠 Models

Multiple regression models were trained and evaluated to predict Uber fare amounts (a minimal training sketch follows this list):
- Linear Regression: A baseline model used as the point of comparison. It used all numeric and engineered features but was limited in capturing complex, non-linear relationships.
- Random Forest Regressor: An ensemble-based model that improved prediction accuracy by capturing feature interactions and reducing overfitting through averaging.
- XGBoost: A gradient boosting model known for its speed and performance, especially on structured/tabular data.
- LightGBM: A high-performance boosting framework that is faster and more efficient with large datasets. It delivered the best overall results in this project.
- CatBoost: A gradient boosting model optimized for categorical features. It performed competitively and required minimal preprocessing.
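A minimal sketch of the training step, continuing from the engineered DataFrame above. The feature list and the `fare_amount` target name are assumptions about the notebook; all five libraries expose scikit-learn-style estimators:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Assumed feature/target names; adjust to the actual dataset schema.
FEATURES = ["passenger_count", "hour", "weekday", "month", "distance_km"]
X = df[FEATURES]
y = df["fare_amount"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
    "CatBoost": CatBoostRegressor(random_state=42, verbose=0),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the 80% split
    predictions[name] = model.predict(X_test)   # predict held-out fares
```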
📏 Evaluation

Each model was evaluated using two metrics (computed as in the sketch below):
- RMSE (Root Mean Square Error): measures the average prediction error in the same units as the fare.
- R² Score (Coefficient of Determination): quantifies the proportion of fare variance explained by the model.
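A sketch of the metric computation with scikit-learn, continuing from the hypothetical `predictions` dict above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

for name, y_pred in predictions.items():
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # penalises large fare errors
    r2 = r2_score(y_test, y_pred)                       # fraction of fare variance explained
    print(f"{name}: RMSE = {rmse:.4f}, R² = {r2:.4f}")
```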
📊 Results

Model performance across successive experiment runs:

| Model | RMSE | R² Score |
|---|---|---|
| Random Forest | 3.24 | 0.65 |
| XGBoost | 3.07 | 0.69 |
| LightGBM | 2.99 | 0.70 |

| Model | RMSE | R² Score |
|---|---|---|
| Linear Regression | 5.563649 | -0.026717 |
| XGBoost | 2.777773 | 0.744068 |
| LightGBM | 2.992365 | 0.702997 |

| Model | RMSE | R² Score |
|---|---|---|
| XGBoost | 3.1918 | 0.7744 |
| LGBM | 3.1142 | 0.7852 |

| Model | RMSE | R² Score |
|---|---|---|
| LightGBM | 2.8719 | 0.8173 |
| Final Model | 2.8007 | 0.8263 |
Interpreting the scores:
- R² score close to 1: the model explains most of the variance and makes accurate predictions.
- R² score close to 0 or negative: poor predictive performance (a negative score, as for the Linear Regression baseline above, means the model does worse than simply predicting the mean fare).
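For reference, the two metrics are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ is the actual fare, $\hat{y}_i$ the predicted fare, and $\bar{y}$ the mean actual fare over the $n$ test trips.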
The best model is the one that:
- Minimizes RMSE
- Shows consistent and stable predictions
- Gives predicted fares close to actual fares
🚀 How to Run

- Clone the repository:
  `git clone https://github.com/your-username/your-repo-name.git && cd your-repo-name`
- Install dependencies:
  `pip install -r requirements.txt`
- Train the model:
  `python train_model.py`
- Make predictions:
  `python predict.py --input data/sample_input.csv`
- Evaluate the model:
  `python evaluate.py`

Ensure the required datasets are placed in the `data/` folder before execution.
👥 Authors
- Md Altamash Alam
- Amreen Perween
📄 License

This project is protected under copyright © Md Altamash Alam, 2025. All rights reserved. Unauthorized copying, distribution, modification, or use of any part of this project without explicit permission is strictly prohibited.

If you wish to use or reference any part of this project for academic, personal, or commercial purposes, please contact the author for permission.

© Md Altamash Alam, 2025 – All Rights Reserved.