🚕 NYC Taxi Fare Prediction with Rush Hour Analysis
A complete machine learning project that predicts NYC taxi fares with realistic pricing, including airport fare handling, rush hour comparison, and inflation tracking from 2016 to 2025.
Model Performance (1M training samples):
- RMSE: $1.79
- R² Score: 0.853
- Training Time: 45-60 minutes
- Unreasonable Predictions: 0.00% ✅
Key Features:
- ✅ Realistic airport fares (JFK: $52 → $77 with inflation)
- ✅ Rush hour vs regular pricing comparison
- ✅ 48% inflation tracking (2016-2025)
- ✅ 20+ engineered features
- ✅ 0% unreasonable predictions
NYC TAXI FARE PREDICTION - COMPLETE PIPELINE
STEP 1: DATA PREPROCESSING
Loading data from data/train.csv...
Loaded 1,000,000 records
Removing outliers (STRICT MODE with airport exceptions)...
  Kept 50,849 airport trips (relaxed: $1-15/mile)
  Kept 781,413 regular trips (strict: $1.50-10/mile)
  Removed 225,748 outliers (22.58%)
Final dataset shape: (774,242, 15)
Final fare range: $2.50 - $26.00
Median fare: $8.50

STEP 2: FEATURE ENGINEERING
Creating distance features...
Creating location features...
Final feature count: 28
Distance range: 0.20 - 14.71 miles
Median distance: 1.44 miles
Fare per mile statistics:
  Mean: $6.05/mile
  Median: $5.87/mile
  Range: $1.00 - $14.95/mile

STEP 3: MODEL TRAINING & EVALUATION
Feature matrix shape: (774,242, 20)
Train size: 619,393, Test size: 154,849

Training Random Forest...
Random Forest training complete

=== Model Evaluation ===
Random Forest:
  RMSE: $1.79
  MAE: $1.25
  R² Score: 0.8530
  Unreasonable predictions: 0.00%
🏆 Best Model: Random Forest
Regular Trip: Times Square → Central Park (2.02 miles)
  2016: $9.69 ($4.81/mile)
  2025: $14.35 ($7.12/mile)
  Inflation: +48%

Airport Trip: Manhattan → JFK Airport (13.34 miles)
  2016: $52.00 ($3.90/mile) ← JFK flat rate applied
  2025: $76.96 ($5.77/mile)
  Inflation: +48%

Rush Hour Comparison: Times Square → Central Park
  Regular Hours: $14.35
  Rush Hour: $15.44 (+$1.10, +7.6%)
Requirements:
- Python 3.8+
- 8GB+ RAM (16GB recommended for 1M rows)
- ~10GB disk space
# 1. Clone the repository
git clone https://github.com/yourusername/nyc-taxi-fare-prediction.git
cd nyc-taxi-fare-prediction
# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip3 install -r requirements.txt
# 4. Create directories
mkdir -p data models results
Download Dataset
-------------------------------------------------
1. Visit the Kaggle NYC Taxi Fare competition page
2. Download train.csv (5.7 GB)
3. Place it at data/train.csv
Run the Project
-------------------------------------------------
# Train the model (45-60 minutes)
python3 main.py
# Make predictions with rush hour comparison
python3 predict.py
# Generate inflation visualizations
python3 visualize_inflation.py
📁 Project Structure
--------------------------------------------------
nyc-taxi-fare-prediction/
│
├── src/
│ ├── __init__.py
│ ├── data_preprocessing.py # Dual-tier outlier removal
│ ├── feature_engineering.py # 20+ feature creation
│ └── model_training.py # 4 model comparison
│
├── data/
│ ├── train.csv # Download from Kaggle
│ ├── preprocessed_data.csv # Generated by pipeline
│ └── featured_data.csv # Generated by pipeline
│
├── models/
│ └── best_model.pkl # Trained Random Forest
│
├── results/
│ ├── model_metrics.csv # Performance comparison
│ ├── model_predictions.png # Scatter plots
│ ├── feature_importance.png # Top features
│ ├── fare_vs_distance.png # Distance analysis
│ └── fare_inflation_analysis.png # Inflation charts
│
├── main.py # Complete pipeline
├── predict.py # Rush hour comparison
├── visualize_inflation.py # Inflation charts
├── requirements.txt # Python packages
└── README.md # This file
🎓 What Makes This Project Unique
1. Realistic Airport Fare Handling
Most models underpredict JFK fares. We implemented:
- Dual-tier data cleaning (relaxed for airports, strict for regular trips)
- JFK flat rate baseline ($52 in 2016)
- Minimum fare enforcement based on NYC TLC rules
Result: JFK predictions match real-world pricing ($52-60 vs typical ML models showing $20-30)
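The flat-rate baseline and minimum fare enforcement can be pictured as a floor applied to the raw model output. The sketch below is only an illustration of that idea: `enforce_fare_floor` and its `is_jfk_manhattan_trip` flag are hypothetical names, not functions from this repository, and the two constants are the figures quoted in this README.

```python
NYC_MINIMUM_FARE = 2.50     # minimum fare quoted in this README
JFK_FLAT_RATE_2016 = 52.00  # JFK flat-rate baseline quoted in this README


def enforce_fare_floor(predicted_fare: float, is_jfk_manhattan_trip: bool) -> float:
    """Raise a raw prediction to the applicable minimum (hypothetical helper)."""
    floor = JFK_FLAT_RATE_2016 if is_jfk_manhattan_trip else NYC_MINIMUM_FARE
    return max(predicted_fare, floor)


# A low airport prediction is lifted to the flat rate; a regular fare is untouched.
print(enforce_fare_floor(27.40, is_jfk_manhattan_trip=True))
print(enforce_fare_floor(9.69, is_jfk_manhattan_trip=False))
```

A floor of this kind only corrects underpredictions; it does not cap fares that the model predicts above the flat rate.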
2. Rush Hour Comparison Feature
Compares the same trip during rush hour and regular hours:
- Morning rush: 7-9 AM weekdays
- Evening rush: 5-7 PM weekdays
- Shows premium: Typically +7-15%
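The rush hour windows above translate into a simple weekday and hour check. This is a minimal sketch, assuming the windows are end-exclusive (7:00-8:59 and 17:00-18:59); the function body is an illustration and may differ from how the project derives its `is_rush_hour` feature.

```python
from datetime import datetime


def is_rush_hour(pickup: datetime) -> bool:
    """True for weekday pickups in the 7-9 AM or 5-7 PM windows."""
    is_weekday = pickup.weekday() < 5     # Monday=0 ... Friday=4
    in_morning = 7 <= pickup.hour < 9     # end-exclusive window is an assumption
    in_evening = 17 <= pickup.hour < 19   # end-exclusive window is an assumption
    return is_weekday and (in_morning or in_evening)


monday_evening = datetime(2024, 6, 17, 18, 30)
saturday_evening = datetime(2024, 6, 15, 18, 30)
print(is_rush_hour(monday_evening))    # weekday inside the evening window
print(is_rush_hour(saturday_evening))  # weekend, so outside both windows
```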
3. Temporal Inflation Analysis
Tracks fare evolution across 9 years:
- 2016-2020: +11% (steady growth)
- 2020-2021: +5% (COVID period)
- 2021-2022: +9% (high inflation)
- 2022-2025: +16% (congestion pricing)
- Total: +48% cumulative
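The cumulative figure comes from compounding the per-period increases rather than adding them. The short calculation below uses only the percentages listed above; with these rounded inputs the compounded total lands near +47%, which is in line with the reported +48% once rounding of the per-period figures is taken into account. The `cumulative_increase` helper is illustrative, not part of the project code.

```python
# Per-period increases as listed in this README.
period_increases = {
    "2016-2020": 0.11,
    "2020-2021": 0.05,
    "2021-2022": 0.09,
    "2022-2025": 0.16,
}


def cumulative_increase(rates):
    """Compound a sequence of fractional increases into one overall increase."""
    factor = 1.0
    for rate in rates:
        factor *= 1.0 + rate
    return factor - 1.0


total = cumulative_increase(period_increases.values())
print(f"Compounded increase: {total:+.1%}")

# Applying the reported +48% to the 2016 JFK flat rate gives the 2025 figure.
print(f"2025 JFK fare: ${52.00 * 1.48:.2f}")
```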
4. Zero Unreasonable Predictions
Our validation ensures:
- No fares below NYC minimum ($2.50)
- No fare-per-mile ratios outside $1-15/mile range
- Airport trips respect flat rate structures
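The first two rules can be expressed as a per-prediction check plus a failure rate over a batch. This is a sketch under the thresholds stated above; `is_reasonable` and `unreasonable_rate` are hypothetical names, and the airport flat-rate rule is left out because its exact form is not described here.

```python
MIN_FARE = 2.50       # NYC minimum fare quoted in this README
MIN_PER_MILE = 1.00   # lower bound of the accepted fare-per-mile range
MAX_PER_MILE = 15.00  # upper bound of the accepted fare-per-mile range


def is_reasonable(fare: float, distance_miles: float) -> bool:
    """Check one prediction against the minimum fare and the per-mile range."""
    if fare < MIN_FARE or distance_miles <= 0:
        return False
    per_mile = fare / distance_miles
    return MIN_PER_MILE <= per_mile <= MAX_PER_MILE


def unreasonable_rate(fares, distances):
    """Fraction of predictions that fail the check (0.0 means none failed)."""
    pairs = list(zip(fares, distances))
    failures = sum(not is_reasonable(f, d) for f, d in pairs)
    return failures / len(pairs) if pairs else 0.0


print(is_reasonable(14.35, 2.02))                       # inside both bounds
print(unreasonable_rate([14.35, 40.00], [2.02, 2.00]))  # second trip is $20/mile
```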
📊 Model Performance Details
Model Comparison
--------------------------------------------------------------------
Model             | RMSE  | MAE   | R²    | Training Time
Linear Regression | $2.05 | $1.46 | 0.807 | <1 min
Ridge Regression  | $2.05 | $1.46 | 0.807 | <1 min
Random Forest     | $1.79 | $1.25 | 0.853 | 45-60 min
Gradient Boosting | $1.82 | $1.27 | 0.847 | 35-50 min
Winner: Random Forest (best accuracy with reasonable training time)
Feature Importance (Top 10)
--------------------------------------------------------------------
1. distance_miles - 64.2%
2. distance_squared - 9.1%
3. log_distance - 6.8%
4. hour - 4.7%
5. manhattan_distance - 3.9%
6. distance_to_center - 2.8%
7. pickup_distance_to_jfk - 2.1%
8. is_rush_hour - 1.9%
9. dropoff_distance_to_jfk - 1.6%
10. is_airport_trip - 1.4%
Insight: Distance-based features account for over 80% of total feature importance
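The arithmetic behind this insight can be checked directly from the top-10 list. The grouping below is an assumption: it counts only the four trip-length features as "distance features" and leaves out the location-based distances (to the city center and to JFK).

```python
# Importance shares (percent) copied from the top-10 list.
importances = {
    "distance_miles": 64.2,
    "distance_squared": 9.1,
    "log_distance": 6.8,
    "hour": 4.7,
    "manhattan_distance": 3.9,
    "distance_to_center": 2.8,
    "pickup_distance_to_jfk": 2.1,
    "is_rush_hour": 1.9,
    "dropoff_distance_to_jfk": 1.6,
    "is_airport_trip": 1.4,
}

# Assumption: "distance features" means the four trip-length features.
trip_length_features = [
    "distance_miles",
    "distance_squared",
    "log_distance",
    "manhattan_distance",
]
trip_length_share = sum(importances[name] for name in trip_length_features)
top10_share = sum(importances.values())

print(f"Trip-length features: {trip_length_share:.1f}%")
print(f"Top-10 total: {top10_share:.1f}%")
```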
🔧 Usage Examples
--------------------------------------------------------------------
Basic Prediction (python)
from predict import create_trip_features, predict_fare_by_year
import joblib
# Load model
model = joblib.load('models/best_model.pkl')
# Create features
features = create_trip_features(
    pickup_lat=40.7580,
    pickup_lon=-73.9855,
    dropoff_lat=40.7829,
    dropoff_lon=-73.9654,
    pickup_datetime="2024-06-15 18:30:00",
    passenger_count=2
)
# Predict
fares = predict_fare_by_year(model, features)
print(f"2025 Fare: ${fares[2025]:.2f}")
Rush Hour Comparison (python)
from predict import compare_rush_hour
import joblib
model = joblib.load('models/best_model.pkl')
# Compare same trip at different times
compare_rush_hour(
    model,
    pickup_lat=40.7580,
    pickup_lon=-73.9855,
    dropoff_lat=40.7829,
    dropoff_lon=-73.9654,
    pickup_datetime="2024-06-15 18:30:00",
    passenger_count=2,
    trip_name="My Custom Trip"
)
📈 Data Processing Pipeline
-------------------------------------
Step 1: Data Preprocessing (5-10 min)
- Load 1M rows from Kaggle CSV
- Remove missing values (10 rows)
- Dual-tier outlier removal:
- Airport trips: $1-15/mile (relaxed)
- Regular trips: $1.50-10/mile (strict)
- Extract datetime features
- Result: 774k clean rows (77% retained)
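The dual-tier rule amounts to choosing the fare-per-mile bounds based on whether a trip is an airport trip. The sketch below uses plain Python and made-up sample rows to show the logic; `keep_trip` is a hypothetical helper, and the real pipeline in src/data_preprocessing.py may be structured differently.

```python
def keep_trip(fare: float, distance_miles: float, is_airport: bool) -> bool:
    """Apply the relaxed bounds to airport trips and the strict bounds otherwise."""
    if distance_miles <= 0:
        return False
    per_mile = fare / distance_miles
    low, high = (1.00, 15.00) if is_airport else (1.50, 10.00)
    return low <= per_mile <= high


# Hypothetical sample rows: (fare, distance in miles, airport flag).
trips = [
    (52.00, 13.34, True),  # about $3.90/mile, inside the airport range
    (9.69, 2.02, False),   # about $4.80/mile, inside the regular range
    (30.00, 2.00, False),  # $15.00/mile, above the strict upper bound
    (12.00, 1.00, True),   # $12.00/mile, kept only because it is an airport trip
]
kept = [trip for trip in trips if keep_trip(*trip)]
print(f"Kept {len(kept)} of {len(trips)} trips")
```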
Step 2: Feature Engineering (3-5 min)
- Calculate 8 distance metrics
- Compute 7 temporal features
- Create 6 location features
- Validate all features (no NaN/inf)
- Result: 20 engineered features
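As an illustration of the distance metrics, the sketch below derives four of the features named in the importance list from raw coordinates. It assumes a haversine (great-circle) distance and a log(1 + d) transform; the project's actual formulas are not shown in this README, so treat both as assumptions. For the Times Square to Central Park coordinates used elsewhere in this README, the haversine result comes out close to the 2.02 miles quoted in the sample output.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_features(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Build a few distance features; names mirror the feature-importance list."""
    d = haversine_miles(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
    # Manhattan-style distance: north-south leg plus east-west leg.
    ns = haversine_miles(pickup_lat, pickup_lon, dropoff_lat, pickup_lon)
    ew = haversine_miles(dropoff_lat, pickup_lon, dropoff_lat, dropoff_lon)
    return {
        "distance_miles": d,
        "distance_squared": d ** 2,
        "log_distance": math.log1p(d),  # log(1 + d) stays finite at d = 0
        "manhattan_distance": ns + ew,
    }


features = distance_features(40.7580, -73.9855, 40.7829, -73.9654)
print({name: round(value, 2) for name, value in features.items()})
```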
Step 3: Model Training (45-60 min)
- Split 80/20 train/test
- Train 4 models in parallel
- Evaluate with RMSE, MAE, R²
- Save best model (Random Forest)
- Result: RMSE $1.79, R² 0.853
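For reference, the three evaluation metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the project's evaluation code (which presumably relies on scikit-learn's metrics); the fares in the example are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average miss in dollars."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of fare variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted fares, in dollars.
actual = [8.50, 12.00, 6.00, 20.00]
predicted = [9.00, 11.00, 6.50, 18.00]
print(f"RMSE ${rmse(actual, predicted):.2f}, "
      f"MAE ${mae(actual, predicted):.2f}, "
      f"R² {r_squared(actual, predicted):.3f}")
```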
🎯 Key Findings
-----------------------------------
Distance Patterns
- Short trips (<2 mi): $4-6/mile (high base fare impact)
- Medium trips (2-8 mi): $3-5/mile (optimal range)
- Long trips (8-15 mi): $3-4/mile (volume discount)
- Airport (>10 mi): $3.50-4/mile + flat rate
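The falling cost per mile on longer trips follows from spreading a fixed base charge over more distance. The toy calculation below illustrates this with a fare of the form base + rate × distance; the two parameters are illustrative assumptions, not values fitted from this project's data, and real fares also include time-based charges and surcharges.

```python
# Illustrative meter parameters, NOT taken from this project's code or data.
BASE_FARE = 2.50      # assumed initial charge in dollars
RATE_PER_MILE = 2.50  # assumed metered rate in dollars per mile


def fare_per_mile(distance_miles: float) -> float:
    """Effective $/mile for a fare of BASE_FARE + RATE_PER_MILE * distance."""
    total = BASE_FARE + RATE_PER_MILE * distance_miles
    return total / distance_miles


# The base charge dominates short trips and fades on long ones.
for miles in (1, 2, 5, 10):
    print(f"{miles:>2} mi -> ${fare_per_mile(miles):.2f}/mile")
```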
Temporal Patterns
- Rush hour: +7-15% premium (varies by route)
- Late night (12-5 AM): +10-15% premium
- Weekends: 5-10% lower than weekdays
- Hour of day: Peak at 8 AM and 6 PM
Geographic Patterns
- Manhattan center trips: 15-20% premium
- Airport proximity: +$5-10 base premium
- Bridge/tunnel routes: +$3-7 (model doesn't detect tolls)
🐛 Troubleshooting
-----------------------------------
Issue: "Module not found"
pip3 install -r requirements.txt
Issue: "Data file not found"
Download train.csv from Kaggle and place in data/ folder
Issue: Airport fares too low
# In main.py, use more data:
df = load_data(data_path, nrows=1500000) # 1.5M rows
Issue: Training takes too long
# In main.py, reduce data:
df = load_data(data_path, nrows=500000) # 500k rows
Issue: Memory error
# In src/model_training.py, reduce trees:
'Random Forest': RandomForestRegressor(
    n_estimators=100,  # Reduce from 150
    n_jobs=2           # Use fewer cores
)
👤 Author
LinkedIn: www.linkedin.com/in/fardin-hossain-tanmoy
Email: fardintonu@gmail.com
GitHub: @fardinhossain007
🙏 Acknowledgments
Kaggle - NYC Taxi Fare dataset
NYC TLC - Fare regulations and historical data
Scikit-learn - Machine learning framework
Python Community - Open source tools
⭐ If this project helped you, please star it on GitHub!
📧 Questions? Open an issue or reach out on LinkedIn!