Machine learning model to predict NYC taxi fares with rush hour comparison and inflation tracking (2016-2025). Random Forest model achieving $1.79 RMSE.

fardinhossain007/nyc-taxi-fare-prediction

🚕 NYC Taxi Fare Prediction with Rush Hour Analysis

A complete machine learning project that predicts NYC taxi fares with realistic pricing, including airport fare handling, rush hour comparison, and inflation tracking from 2016 to 2025.

📊 Project Results

Model Performance (1M training samples):

  • RMSE: $1.79
  • R² Score: 0.853
  • Training Time: 45-60 minutes
  • Unreasonable Predictions: 0.00% ✅

Key Features:

  • ✅ Realistic airport fares (JFK: $52 → $77 with inflation)
  • ✅ Rush hour vs regular pricing comparison
  • ✅ Inflation tracking: +48% cumulative (2016-2025)
  • ✅ 20+ engineered features
  • ✅ 0% unreasonable predictions

🎯 Live Demo Output

Training Pipeline

$ python3 main.py

NYC TAXI FARE PREDICTION - COMPLETE PIPELINE

STEP 1: DATA PREPROCESSING
Loading data from data/train.csv...
Loaded 1,000,000 records
Removing outliers (STRICT MODE with airport exceptions)...
  Kept 50,849 airport trips (relaxed: $1-15/mile)
  Kept 781,413 regular trips (strict: $1.50-10/mile)
  Removed 225,748 outliers (22.58%)
Final dataset shape: (774,242, 15)
Final fare range: $2.50 - $26.00
Median fare: $8.50

STEP 2: FEATURE ENGINEERING
Creating distance features...
Creating location features...
Final feature count: 28
Distance range: 0.20 - 14.71 miles
Median distance: 1.44 miles
Fare per mile statistics:
  Mean: $6.05/mile
  Median: $5.87/mile
  Range: $1.00 - $14.95/mile

STEP 3: MODEL TRAINING & EVALUATION
Feature matrix shape: (774,242, 20)
Train size: 619,393, Test size: 154,849

Training Random Forest...
Random Forest training complete

=== Model Evaluation ===
Random Forest:
  RMSE: $1.79
  MAE: $1.25
  R² Score: 0.8530
  Unreasonable predictions: 0.00%

🏆 Best Model: Random Forest

Prediction Examples

Regular Trip: Times Square → Central Park (2.02 miles)
  2016: $9.69 ($4.81/mile)
  2025: $14.35 ($7.12/mile)
  Inflation: +48%

Airport Trip: Manhattan → JFK Airport (13.34 miles)
  2016: $52.00 ($3.90/mile) ← JFK flat rate applied
  2025: $76.96 ($5.77/mile)
  Inflation: +48%

Rush Hour Comparison: Times Square → Central Park
  Regular Hours: $14.35
  Rush Hour: $15.44 (+$1.10, +7.6%)

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • 8GB+ RAM (16GB recommended for 1M rows)
  • ~10GB disk space

Installation

# 1. Clone the repository
git clone https://github.com/fardinhossain007/nyc-taxi-fare-prediction.git
cd nyc-taxi-fare-prediction

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip3 install -r requirements.txt

# 4. Create directories
mkdir -p data models results

Download Dataset
-------------------------------------------------
1. Visit the Kaggle NYC Taxi Fare competition page.
2. Download train.csv (5.7 GB).
3. Place it at data/train.csv.

Run the Project
-------------------------------------------------
# Train the model (45-60 minutes)
python3 main.py

# Make predictions with rush hour comparison
python3 predict.py

# Generate inflation visualizations
python3 visualize_inflation.py

📁 Project Structure
--------------------------------------------------
nyc-taxi-fare-prediction/
│
├── src/
│   ├── __init__.py
│   ├── data_preprocessing.py      # Dual-tier outlier removal
│   ├── feature_engineering.py     # 20+ feature creation
│   └── model_training.py          # 4 model comparison
│
├── data/
│   ├── train.csv                  # Download from Kaggle
│   ├── preprocessed_data.csv      # Generated by pipeline
│   └── featured_data.csv          # Generated by pipeline
│
├── models/
│   └── best_model.pkl             # Trained Random Forest
│
├── results/
│   ├── model_metrics.csv          # Performance comparison
│   ├── model_predictions.png      # Scatter plots
│   ├── feature_importance.png     # Top features
│   ├── fare_vs_distance.png       # Distance analysis
│   └── fare_inflation_analysis.png # Inflation charts
│
├── main.py                        # Complete pipeline
├── predict.py                     # Rush hour comparison
├── visualize_inflation.py         # Inflation charts
├── requirements.txt               # Python packages
└── README.md                      # This file

🎓 What Makes This Project Unique
1. Realistic Airport Fare Handling
Most models underpredict JFK fares. We implemented:
- Dual-tier data cleaning (relaxed for airports, strict for regular trips)
- JFK flat rate baseline ($52 in 2016)
- Minimum fare enforcement based on NYC TLC rules

Result: JFK predictions match real-world pricing ($52-60 vs typical ML models showing $20-30)
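
The dual-tier cleaning rule can be sketched as a per-trip filter. This is a minimal illustration, not the code in src/data_preprocessing.py: the thresholds come from the pipeline output above, while the function name `keep_trip` and the sample trips are hypothetical.

```python
# Fare-per-mile bounds taken from the pipeline log; names are illustrative.
AIRPORT_RANGE = (1.00, 15.00)  # $/mile, relaxed tier for airport trips
REGULAR_RANGE = (1.50, 10.00)  # $/mile, strict tier for all other trips


def keep_trip(fare, miles, is_airport):
    """Return True if the trip's fare per mile falls inside its tier's range."""
    if miles <= 0:
        return False  # a per-mile rate cannot be computed
    low, high = AIRPORT_RANGE if is_airport else REGULAR_RANGE
    return low <= fare / miles <= high


# Hypothetical sample rows (not taken from the dataset).
trips = [
    {"fare": 52.00, "miles": 13.34, "is_airport": True},   # about $3.90/mile
    {"fare": 9.69, "miles": 2.02, "is_airport": False},    # about $4.80/mile
    {"fare": 30.00, "miles": 2.50, "is_airport": False},   # $12/mile, too high for the strict tier
    {"fare": 30.00, "miles": 2.50, "is_airport": True},    # $12/mile, allowed by the relaxed tier
]
kept = [t for t in trips if keep_trip(**t)]
print(f"Kept {len(kept)} of {len(trips)} trips")
```

The same $12/mile trip is dropped as a regular trip but kept as an airport trip, which is the point of using two tiers.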

2. Rush Hour Comparison Feature
Compares same trip during rush hour vs regular hours:
- Morning rush: 7-9 AM weekdays
- Evening rush: 5-7 PM weekdays
- Shows premium: Typically +7-15%
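
A minimal sketch of the rush hour flag, assuming the windows are half-open (7:00-8:59 and 17:00-18:59) and apply Monday through Friday only. The function name mirrors the `is_rush_hour` feature but is not necessarily how the project computes it.

```python
from datetime import datetime


def is_rush_hour(ts):
    """Weekday morning (7-9 AM) or evening (5-7 PM) window, end hour excluded."""
    is_weekday = ts.weekday() < 5  # Monday=0 ... Friday=4
    in_morning = 7 <= ts.hour < 9
    in_evening = 17 <= ts.hour < 19
    return is_weekday and (in_morning or in_evening)


print(is_rush_hour(datetime(2024, 6, 14, 18, 30)))  # Friday evening -> True
print(is_rush_hour(datetime(2024, 6, 15, 18, 30)))  # Saturday evening -> False
```

Under this definition a weekend timestamp never counts as rush hour, so a comparison needs a weekday pickup time to show the premium.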

3. Temporal Inflation Analysis
Tracks fare evolution across 9 years:
- 2016-2020: +11% (steady growth)
- 2020-2021: +5% (COVID period)
- 2021-2022: +9% (high inflation)
- 2022-2025: +16% (congestion pricing)
- Total: +48% cumulative
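
The period rates compound rather than add. The sketch below multiplies the four listed rates; the result is about +47%, close to the stated +48% (the small gap is presumably rounding in the per-period figures, which is an assumption). The dictionary and variable names are illustrative.

```python
# Per-period increases as listed above.
period_rates = {
    "2016-2020": 0.11,
    "2020-2021": 0.05,
    "2021-2022": 0.09,
    "2022-2025": 0.16,
}

cumulative = 1.0
for rate in period_rates.values():
    cumulative *= 1.0 + rate  # each period builds on the previous level

summed = sum(period_rates.values())

print(f"Compounded: +{(cumulative - 1) * 100:.1f}%")  # +47.4%
print(f"Simply summed: +{summed * 100:.0f}%")         # +41%
```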

4. Zero Unreasonable Predictions
Our validation ensures:
- No fares below NYC minimum ($2.50)
- No fare-per-mile ratios outside $1-15/mile range
- Airport trips respect flat rate structures
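
The first two checks can be expressed as a small validator. This is a sketch under the thresholds stated above; `is_reasonable` and `unreasonable_share` are hypothetical names, and the flat-rate check for airport trips is not covered here.

```python
MIN_FARE = 2.50                 # NYC minimum fare used by the checks above
PER_MILE_RANGE = (1.00, 15.00)  # allowed fare-per-mile range in $/mile


def is_reasonable(fare, miles):
    """True if the fare meets the minimum and its per-mile rate is in range."""
    if fare < MIN_FARE or miles <= 0:
        return False
    low, high = PER_MILE_RANGE
    return low <= fare / miles <= high


def unreasonable_share(predictions):
    """Fraction of (fare, miles) pairs that fail the checks."""
    if not predictions:
        return 0.0
    bad = sum(1 for fare, miles in predictions if not is_reasonable(fare, miles))
    return bad / len(predictions)


# Two fares from the examples above plus two deliberately bad ones.
sample = [(14.35, 2.02), (76.96, 13.34), (2.00, 1.0), (40.00, 2.0)]
print(f"Unreasonable predictions: {unreasonable_share(sample):.0%}")
```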

📊 Model Performance Details

Model Comparison
--------------------------------------------------------------------
Model              | RMSE  | MAE   | R²    | Training Time
Linear Regression  | $2.05 | $1.46 | 0.807 | <1 min
Ridge Regression   | $2.05 | $1.46 | 0.807 | <1 min
Random Forest      | $1.79 | $1.25 | 0.853 | 45-60 min
Gradient Boosting  | $1.82 | $1.27 | 0.847 | 35-50 min

Winner: Random Forest (best accuracy with reasonable training time)

Feature Importance (Top 10)
--------------------------------------------------------------------
1. distance_miles - 64.2%
2. distance_squared - 9.1%
3. log_distance - 6.8%
4. hour - 4.7%
5. manhattan_distance - 3.9%
6. distance_to_center - 2.8%
7. pickup_distance_to_jfk - 2.1%
8. is_rush_hour - 1.9%
9. dropoff_distance_to_jfk - 1.6%
10. is_airport_trip - 1.4%

Insight: Distance features account for 80%+ of total feature importance

🔧 Usage Examples
--------------------------------------------------------------------
Basic Prediction (python)

from predict import create_trip_features, predict_fare_by_year
import joblib

# Load model
model = joblib.load('models/best_model.pkl')

# Create features
features = create_trip_features(
    pickup_lat=40.7580,
    pickup_lon=-73.9855,
    dropoff_lat=40.7829,
    dropoff_lon=-73.9654,
    pickup_datetime="2024-06-15 18:30:00",
    passenger_count=2
)

# Predict
fares = predict_fare_by_year(model, features)
print(f"2025 Fare: ${fares[2025]:.2f}")

Rush Hour Comparison (python)

from predict import compare_rush_hour
import joblib

model = joblib.load('models/best_model.pkl')

# Compare same trip at different times
compare_rush_hour(
    model,
    pickup_lat=40.7580,
    pickup_lon=-73.9855,
    dropoff_lat=40.7829,
    dropoff_lon=-73.9654,
    pickup_datetime="2024-06-15 18:30:00",
    passenger_count=2,
    trip_name="My Custom Trip"
)

📈 Data Processing Pipeline
-------------------------------------
Step 1: Data Preprocessing (5-10 min)

- Load 1M rows from Kaggle CSV
- Remove missing values (10 rows)
- Dual-tier outlier removal:
   - Airport trips: $1-15/mile (relaxed)
   - Regular trips: $1.50-10/mile (strict)

- Extract datetime features
- Result: 774k clean rows (77% retained)

Step 2: Feature Engineering (3-5 min)

- Calculate 8 distance metrics
- Compute 7 temporal features
- Create 6 location features
- Validate all features (no NaN/inf)
- Result: 20 engineered features
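
Two of the distance metrics can be sketched with the haversine formula. This assumes `distance_miles` is a great-circle distance and `manhattan_distance` is the sum of the north-south and east-west legs; the project's actual definitions in src/feature_engineering.py may differ, and the function names here are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def manhattan_miles(lat1, lon1, lat2, lon2):
    """Grid-style distance: north-south leg plus east-west leg."""
    north_south = haversine_miles(lat1, lon1, lat2, lon1)
    east_west = haversine_miles(lat2, lon1, lat2, lon2)
    return north_south + east_west


# Times Square -> Central Park, the coordinates used in the usage examples.
d = haversine_miles(40.7580, -73.9855, 40.7829, -73.9654)
m = manhattan_miles(40.7580, -73.9855, 40.7829, -73.9654)
print(f"haversine: {d:.2f} mi, manhattan: {m:.2f} mi")
```

For this pair the haversine distance comes out at roughly 2.0 miles, in line with the 2.02 miles reported for the Times Square → Central Park example.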

Step 3: Model Training (45-60 min)

- Split 80/20 train/test
- Train 4 models in parallel
- Evaluate with RMSE, MAE, R²
- Save best model (Random Forest)
- Result: $1.79 RMSE, R² of 0.853
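
For reference, the three evaluation metrics can be computed as follows. This is a dependency-free sketch of the standard definitions on made-up fares; the project presumably uses scikit-learn's metric functions, and `regression_metrics` is a hypothetical name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R²) for paired lists of actual and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Illustrative fares in dollars, not project data.
y_true = [8.5, 12.0, 52.0, 6.0]
y_pred = [9.0, 11.0, 50.0, 6.5]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE: ${rmse:.2f}  MAE: ${mae:.2f}  R²: {r2:.3f}")
```

RMSE penalizes large misses more heavily than MAE, which is why it is the larger of the two in the results above ($1.79 vs $1.25).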

🎯 Key Findings
-----------------------------------
Distance Patterns

- Short trips (<2 mi): $4-6/mile (high base fare impact)
- Medium trips (2-8 mi): $3-5/mile (optimal range)
- Long trips (8-15 mi): $3-4/mile (volume discount)
- Airport (>10 mi): $3.50-4/mile + flat rate

Temporal Patterns

- Rush hour: +7-15% premium (varies by route)
- Late night (12-5 AM): +10-15% premium
- Weekends: 5-10% lower than weekdays
- Hour of day: Peak at 8 AM and 6 PM

Geographic Patterns

- Manhattan center trips: 15-20% premium
- Airport proximity: +$5-10 base premium
- Bridge/tunnel routes: +$3-7 (model doesn't detect tolls)

🐛 Troubleshooting
-----------------------------------
Issue: "Module not found"

pip3 install -r requirements.txt

Issue: "Data file not found"

Download train.csv from Kaggle and place in data/ folder

Issue: Airport fares too low

# In main.py, use more data:
df = load_data(data_path, nrows=1500000)  # 1.5M rows

Issue: Training takes too long

# In main.py, reduce data:
df = load_data(data_path, nrows=500000)  # 500k rows

Issue: Memory error

# In src/model_training.py, reduce trees:
'Random Forest': RandomForestRegressor(
    n_estimators=100,  # Reduce from 150
    n_jobs=2           # Use fewer cores
)

👤 Author

LinkedIn: www.linkedin.com/in/fardin-hossain-tanmoy
Email: fardintonu@gmail.com
GitHub: @fardinhossain007

🙏 Acknowledgments

Kaggle - NYC Taxi Fare dataset
NYC TLC - Fare regulations and historical data
Scikit-learn - Machine learning framework
Python Community - Open source tools


⭐ If this project helped you, please star it on GitHub!
📧 Questions? Open an issue or reach out on LinkedIn!