Production-ready ML model achieving an 11.04-minute MAE and 57.39% accuracy (predictions within ±10 minutes) for delivery times
An end-to-end machine learning solution that predicts DoorDash delivery duration from order creation to customer delivery, optimized for real-time customer-facing applications.
| Metric | Value | Status |
|---|---|---|
| Mean Absolute Error | 11.04 minutes (662.43 sec) | ✅ |
| R² Score | 0.307 | ✅ |
| Accuracy (±10 min) | 57.39% | ✅ |
| Extreme Error Rate | 4.71% | ✅ (<5% guardrail) |
| Statistical Validation | p-value < 0.001 (McNemar's test) | ✅ |
DoorDash needs accurate real-time predictions of total delivery duration (seconds between order creation and delivery) to:
- Provide reliable ETAs to customers
- Optimize dasher allocation
- Enable proactive operations management
- Reduce customer complaints and churn
Challenge: Complex marketplace dynamics with non-linear relationships between features including order details, restaurant characteristics, and real-time marketplace load.
```
Data Pipeline  →  Feature Engineering  →  Model Training  →  Validation  →  Deployment Ready
     ↓                   ↓                     ↓                ↓                ↓
197K records      Temporal features         XGBoost          SHAP +          API-ready
99.3% clean       Cyclic encoding           Tuned            McNemar         <200ms latency
```
### Simple & Effective Feature Engineering
- Cyclic encoding for temporal features (hour, day of week)
- No complex transformations (no log transforms, no scaling; tree models don't need them)
- Label encoding for categorical features
- Focus on interpretability and production simplicity
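The label-encoding step can be sketched as follows; the toy DataFrame and the fitted encoder below are illustrative, not the project's exact code:

```python
# Minimal sketch of label encoding for a categorical column; casting to str
# first turns missing values into an ordinary category of their own.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"store_primary_category": ["Pizza", "Sushi", "Pizza", None]})

le = LabelEncoder()
df["store_primary_category_enc"] = le.fit_transform(
    df["store_primary_category"].astype(str)
)
```

Tree models consume the integer codes directly, which keeps the pipeline simple and the encoder trivially serializable alongside the model.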
### Iterative Model Development
- 5 model iterations: Drop Missing → Simple Imputation → XGBoost Native → Temporal Features → Hyperparameter Tuning
- XGBoost selected for native missing-value handling and non-linear pattern capture
- RandomizedSearchCV with TimeSeriesSplit for robust hyperparameter tuning
### Rigorous Validation
- Temporal train-test split (80/20) to prevent data leakage
- Business-aligned metrics (MAE in minutes, accuracy rate, extreme error rate)
- SHAP analysis for model explainability
- McNemar's test for statistical significance validation
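The leakage-free 80/20 temporal split can be sketched as below (toy data; the real split sorts the full dataset by order creation time):

```python
# Sort chronologically, then cut at the 80th percentile of time so that the
# test set is strictly "future" relative to training - no leakage.
import pandas as pd

df = pd.DataFrame({
    "created_at": pd.date_range("2015-01-01", periods=10, freq="D"),
    "delivery_duration_seconds": range(2400, 3400, 100),
})
df = df.sort_values("created_at").reset_index(drop=True)

cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every training order precedes every test order
assert train["created_at"].max() < test["created_at"].min()
```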
```
doordash-delivery-time-prediction/
│
├── notebooks/
│   └── doordash_delivery_prediction.ipynb   # Complete analysis & modeling
│
├── data/
│   └── historical_data.csv                  # Training dataset (197K records)
│
├── models/
│   └── doordash_eta_xgb_artifacts.joblib    # Saved model + encoders + feature schema
│
├── README.md                                # This file
└── requirements.txt                         # Python dependencies
```
Prerequisites:

- Python 3.8+
- pip or conda

Setup:

1. Clone the repository

```bash
git clone https://github.com/MohammedMoseena/doordash-delivery-time-prediction.git
cd doordash-delivery-time-prediction
```

2. Install dependencies

```bash
pip install -r requirements.txt
```

3. Run the notebook

```bash
jupyter notebook notebooks/doordash_delivery_prediction.ipynb
```

Quick inference example:

```python
import joblib
import pandas as pd
import numpy as np

# Load saved artifacts
artifacts = joblib.load("models/doordash_eta_xgb_artifacts.joblib")
model = artifacts["model"]
le_store_cat = artifacts["store_category_encoder"]
feature_columns = artifacts["feature_columns"]

# Example order
new_order = {
    "created_at": "2025-11-23 12:34:56",
    "market_id": 3,
    "store_id": 12345,
    "store_primary_category": "Pizza",
    "order_protocol": 2,
    "total_items": 5,
    "num_distinct_items": 3,
    "subtotal": 2400,
    "min_item_price": 200,
    "max_item_price": 800,
    "total_onshift_dashers": 25,
    "total_busy_dashers": 15,
    "total_outstanding_orders": 30,
    "estimated_order_place_duration": 600,
    "estimated_store_to_consumer_driving_duration": 900,
}

# Preprocess and predict
df_new = pd.DataFrame([new_order])
X_new = preprocess_for_inference(df_new)  # See notebook for function
pred_seconds = model.predict(X_new)[0]
print(f"Predicted ETA: {pred_seconds/60:.2f} minutes")
```

Dataset:

- Records: 197,428 → 196,076 after cleaning (99.3% retained)
- Features: 18 (14 original + 4 engineered temporal features)
- Target: `delivery_duration_seconds = actual_delivery_time - created_at`
- Time Period: 2015 data (US/Pacific timezone)
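A minimal sketch of constructing the target from the two timestamp columns (the sample timestamps below are illustrative, not rows from the dataset):

```python
# Target = seconds between order creation and actual delivery.
import pandas as pd

df = pd.DataFrame({
    "created_at": ["2015-02-06 22:24:17"],
    "actual_delivery_time": ["2015-02-06 23:27:16"],
})
df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])
df["delivery_duration_seconds"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()
```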
| Category | Count | Example Features |
|---|---|---|
| Location | 1 | market_id (city/region) |
| Store | 3 | store_id, store_primary_category, order_protocol |
| Order Details | 5 | total_items, subtotal, num_distinct_items, min/max_item_price |
| Marketplace | 3 | total_onshift_dashers, total_busy_dashers, total_outstanding_orders |
| Model Predictions | 2 | estimated_order_place_duration, estimated_store_to_consumer_driving_duration |
| Temporal (Engineered) | 4 | hour_sin/cos, dayofweek_sin/cos |
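The four engineered temporal features start from two raw fields derived from `created_at`; a small sketch (the timestamp is illustrative):

```python
# Extract the raw hour and day-of-week that feed the cyclic encodings.
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime(["2015-02-06 22:24:17"])})
df["hour"] = df["created_at"].dt.hour            # 0-23
df["dayofweek"] = df["created_at"].dt.dayofweek  # Monday=0 .. Sunday=6
```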
- Analyzed 197,428 historical delivery records
- Identified right-skewed distributions in all numeric features except driving duration
- Discovered low linear correlations with the target (<0.30), confirming non-linear relationships
- Found that marketplace "inconsistencies" (`busy_dashers > onshift_dashers`) are real operational signals, not data errors
Key cleaning steps:

- ✅ No duplicates found
- ✅ Removed rows with min_item_price > max_item_price (0.4% of data)
- ✅ Removed rows with item prices > subtotal (0.3% of data)
- ✅ Removed extreme outliers (total_items = 411, prices = 14700)
- ✅ Removed impossible deliveries >12 hours (only 7 records)
- ✅ Kept marketplace inconsistencies (real operational signal)
- ✅ Final dataset: 196,076 records (99.3% retention)

| Model | Approach | Pros | Cons |
|---|---|---|---|
| Model 1 | Drop missing rows | Clean data, no noise | Loses ~10% data |
| Model 2 | Simple imputation (median/mode) | Fast, keeps all data | Adds mild noise |
| Model 3 | XGBoost native handling | Best for marketplace data | Slightly complex |
Winner: Model 3 (XGBoost handles missing values optimally)
Temporal Features (Model 4 improvement):

```python
# Cyclic encoding preserves temporal continuity. Assumes integer `hour` and
# `dayofweek` columns have already been extracted from created_at.
import numpy as np

df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
df["dayofweek_sin"] = np.sin(2 * np.pi * df["dayofweek"] / 7)
df["dayofweek_cos"] = np.cos(2 * np.pi * df["dayofweek"] / 7)
```

Why cyclic? 11 PM is closer to 12 AM than to 3 PM; raw integer hours put them 23 apart, while the (sin, cos) pairs keep them adjacent on the circle.
| Model | Strategy | MAE (sec) | R² | Accuracy ≤10 min | Key Insight |
|---|---|---|---|---|---|
| Model 1 | Drop missing + RF | 705.60 | 0.223 | 54.02% | Baseline |
| Model 2 | Simple imputation + RF | 714.56 | 0.227 | 53.26% | Keeps data but adds noise |
| Model 3 | XGBoost native missing | 684.43 | 0.274 | 55.83% | Best missing handling |
| Model 4 | + Temporal features | 668.21 | 0.296 | 57.24% | Time patterns matter! |
| Model 5 | + Hyperparameter tuning | 662.43 | 0.307 | 57.39% | Production model |
Best parameters:

```
├── n_estimators: 300
├── max_depth: 5
├── learning_rate: 0.1
├── subsample: 0.9
└── colsample_bytree: 0.9
```

Method: RandomizedSearchCV (20 iterations)
CV Strategy: TimeSeriesSplit (3 folds)
Result: +6 seconds MAE improvement, meets all guardrails

| Metric | Value | Interpretation |
|---|---|---|
| MAE | 662.43 sec (11.04 min) | Average prediction error |
| RMSE | 967.79 sec (16.13 min) | Penalizes large errors |
| R² | 0.307 | Explains 30.7% of variance |
| Accuracy ≤10 min | 57.39% | Over half of predictions highly accurate |
| Extreme errors >30 min | 4.71% | Meets <5% safety threshold |
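These business-aligned metrics fall out of a few lines of NumPy; a minimal sketch (the helper name and the explicit thresholds are illustrative):

```python
# Business metrics from paired true/predicted durations, both in seconds.
import numpy as np

def business_metrics(y_true_sec, y_pred_sec):
    err = np.abs(np.asarray(y_true_sec, dtype=float)
                 - np.asarray(y_pred_sec, dtype=float))
    return {
        "mae_minutes": err.mean() / 60,             # average error, minutes
        "accuracy_10min": (err <= 600).mean(),      # share within +/-10 min
        "extreme_error_rate": (err > 1800).mean(),  # share off by >30 min
    }
```

For example, `business_metrics([600, 1200], [0, 1200])` yields an MAE of 5 minutes with no extreme errors.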
| Rank | Feature | Impact | Business Interpretation |
|---|---|---|---|
| 1 | total_outstanding_orders | Very strong (+) | High demand → congestion → delays |
| 2 | total_onshift_dashers | Very strong (−) | More dashers → faster pickups |
| 3 | estimated_store_to_consumer_driving_duration | Strong (+) | Distance = delivery time |
| 4 | hour_sin | Moderate (+) | Peak hours cause delays |
| 5 | subtotal | Moderate (+) | Large orders = longer prep |
Key Insight: Supply-demand imbalance (#1 and #2) is the primary driver of delivery delays.
Question: Is Model 5 significantly better than Model 1?
Test: McNemar's test (paired predictions)
Result: p-value = 2×10⁻⁶ (≪ 0.05)
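Since `statsmodels` is already in the stack, the paired test can be sketched as below. The 2×2 counts are made up for illustration; the real table pairs each test order's "within ±10 min" hit/miss under Model 1 against its hit/miss under Model 5.

```python
# McNemar's test on paired hit/miss outcomes from two models: only the
# discordant cells (one model hit where the other missed) drive the statistic.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: Model 1 hit / miss; columns: Model 5 hit / miss (illustrative counts).
table = [[9000, 500],   # both hit      | only Model 1 hit
         [900,  600]]   # only Model 5 hit | both miss
result = mcnemar(table, exact=False, correction=True)
print(result.pvalue)
```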
Conclusion: the improvement is statistically significant, not random chance.

**1. Optimize Dasher Deployment**
- Deploy more dashers when `total_outstanding_orders` is high
- Target markets with consistently high marketplace load
- Expected Impact: reduced average delivery time

**2. Dynamic ETA Buffers**
- Add safety margins during peak hours (`hour_sin` patterns)
- Personalize buffers based on `subtotal` (order complexity)
- Expected Impact: fewer customer complaints

**3. Proactive Monitoring**
- Auto-flag orders predicted to take >45 min for ops review
- Real-time alerts when marketplace saturation is detected
- Expected Impact: faster response to delays

**4. Restaurant Partnerships**
- Share `estimated_order_place_duration` insights with restaurant partners
- Collaborate on reducing order receiving time variability
- Expected Impact: improved prep time accuracy
- ✅ Customer Satisfaction: fewer missed ETAs → higher trust
- ✅ Operational Efficiency: better dasher allocation → lower costs
- ✅ Revenue Growth: improved retention → reduced churn
- ✅ Competitive Advantage: industry-leading ETA accuracy
- Static 2015 dataset: the marketplace has evolved since
- No real-time traffic data: weather, accidents, and road closures are unobserved
- Missing restaurant capacity signals: kitchen busyness, staff levels
- Accuracy ceiling at ~57%: inherent data limitations
Future enhancements:

```
├── Integrate real-time traffic APIs (Google Maps, Waze)
├── Add weather conditions (rain, snow → slower deliveries)
├── Include restaurant historical prep time patterns
├── Build market-specific models (different cities, different patterns)
├── Weekly retraining pipeline (capture seasonality, drift)
├── A/B testing framework (5% traffic rollout before full deployment)
└── MLOps monitoring (Prometheus/Grafana for drift detection)
```

| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| ML/DL | XGBoost, Scikit-learn |
| Data Processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Explainability | SHAP |
| Validation | TimeSeriesSplit, RandomizedSearchCV, McNemar's Test |
| Deployment | Joblib (model serialization) |
```text
# Core Data Science Libraries
numpy==2.0.2
pandas==2.2.2

# Visualization
matplotlib==3.10.0
seaborn==0.13.2

# Machine Learning
scikit-learn==1.6.1
xgboost==3.1.1

# Model Explainability
shap==0.50.0

# Model Serialization
joblib==1.5.2

# Statistical Testing
statsmodels==0.14.5
```
- Jupyter Notebook: complete analysis with step-by-step explanations
- Medium Article: detailed writeup with business insights

Md Moseena
- LinkedIn: linkedin.com/in/mdmoseena
- GitHub: github.com/MohammedMoseena
- Medium: medium.com/@mdmoseena22
✅ Production-Ready
- Model meets all business guardrails (<5% extreme errors)
- Statistically validated improvement (p < 0.001)
- Complete deployment artifacts saved (model + encoders + schema)
- Ready for A/B testing and gradual rollout