Predicting Yellow Cab Ride Tips

Predicting tips for NYC yellow cab rides using trip data provided by the Taxi & Limousine Commission (TLC) and weather data pulled from the Dark Sky API.

Methodology

Pulled October 2018 data from the Dark Sky API and the NYC TLC Trip Record Data
Dependent variable: tip amount (in USD)
Independent variables tested: trip distance, fare amount, toll amount, trip time, temperature, inclement weather, passenger count, rate code (e.g. standard fare or trip from airport), payment type, day of week, time of day, neighborhoods and boroughs.
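As a rough illustration of how time-based variables such as trip time, day of week and time of day could be derived from the raw trip records, here is a minimal sketch. The keys follow TLC trip record column names, but the rows themselves are made up, the helper `add_time_features` is hypothetical, and the weather join is left out. It is not the notebook code from this project.

```python
from datetime import datetime

# Made-up sample rows; keys mirror TLC trip record column names.
trips = [
    {"tpep_pickup_datetime": "2018-10-05 08:15:00",
     "tpep_dropoff_datetime": "2018-10-05 08:39:30",
     "trip_distance": 3.2, "fare_amount": 16.0, "tip_amount": 3.0},
    {"tpep_pickup_datetime": "2018-10-06 23:50:00",
     "tpep_dropoff_datetime": "2018-10-07 00:05:00",
     "trip_distance": 2.1, "fare_amount": 10.5, "tip_amount": 0.0},
]

FMT = "%Y-%m-%d %H:%M:%S"

def add_time_features(trip):
    """Return a copy of the trip with trip time, day of week and pickup hour."""
    pickup = datetime.strptime(trip["tpep_pickup_datetime"], FMT)
    dropoff = datetime.strptime(trip["tpep_dropoff_datetime"], FMT)
    enriched = dict(trip)
    # Trip time in minutes, from the two timestamps.
    enriched["trip_minutes"] = (dropoff - pickup).total_seconds() / 60
    # Day of week as an integer: 0 = Monday ... 6 = Sunday.
    enriched["day_of_week"] = pickup.weekday()
    # Hour of day (0-23) at pickup.
    enriched["pickup_hour"] = pickup.hour
    return enriched

features = [add_time_features(t) for t in trips]
for row in features:
    print(row["trip_minutes"], row["day_of_week"], row["pickup_hour"])
```

Keeping day of week and hour as integers makes it easy to bucket them later (weekday vs. weekend, rush hour vs. late night) before turning them into dummy variables.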

Challenges

Large data set: almost 9 million records, with an initial sample of 250,000 later reduced to 50,000
- Caused certain algorithms to take a long time to run
Tips were not normally distributed
Many tip amounts were exactly zero, as seen above
Most engineered features and all inserted variables did not correlate with tip amount
Transformations of the data did nothing to improve the models
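The sampling step and the two distribution problems (many zeros, heavy skew) can be checked with a few lines. The sketch below uses synthetic tip amounts rather than the TLC data, and the sizes, seed and `skewness` helper are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sample is reproducible

# Synthetic stand-in for tip amounts: many exact zeros plus a right-skewed tail.
population = [
    0.0 if random.random() < 0.3 else round(random.expovariate(0.4), 2)
    for _ in range(100_000)
]

# Draw a smaller random sample without replacement to keep run times manageable.
sample = random.sample(population, 5_000)

def skewness(values):
    """Population skewness (mean cubed z-score); about 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

zero_share = sum(1 for v in sample if v == 0) / len(sample)
print(f"share of zero tips: {zero_share:.2%}")
print(f"skewness: {skewness(sample):.2f}")
```

A skewness well above zero together with a large share of exact zeros is a sign that the target is far from normally distributed.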

Correlation Heat Map

Light green means little to no correlation (approaching 0)
Dark green, white and red mean at least some correlation
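The numbers behind a heat map like this are pairwise Pearson correlations. A minimal sketch of how such a matrix could be computed follows; the column values are invented and the `pearson` helper is written out by hand for illustration, so this is not the code that produced the chart.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented values; keys mirror TLC column names.
columns = {
    "fare_amount": [6.5, 9.0, 12.5, 20.0, 52.0, 7.5],
    "trip_distance": [1.0, 1.8, 2.9, 5.6, 17.2, 1.3],
    "passenger_count": [1, 2, 1, 1, 3, 1],
    "tip_amount": [1.0, 2.0, 2.5, 4.0, 11.0, 0.0],
}

names = list(columns)
# Full symmetric matrix: corr[a][b] is the correlation between columns a and b.
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(f"{a:>16}", "  ".join(f"{corr[a][b]:+.2f}" for b in names))
```

Values near 0 correspond to the "little to no correlation" cells, while values approaching +1 or -1 are the ones worth carrying into a model.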

Final Models

Model 1	Model 2

Chose two models to account for zero tip values: one with them included and one without
Despite all the variables tested, the only ones that helped build the model were fare amount, toll amount, trip type (rate code) and payment type
- All coefficients were positive, with the intercepts of both models starting below zero, one significantly more than the other
- Rate Code 4, which covers trips from Nassau and Westchester, had the most positive effect on tips in both models, while Rate Code 3, trips from Newark, had the lowest
In Model 1, Payment Type 1 (credit card) had a much higher effect than Payment Type 2 (cash). This is explained by all of the zero values, and was the reason these variables were included in the first place.
Naturally, the model assuming at least some tip performed much better
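The two-model setup described above (one fit on all rides, one fit only on rides with a nonzero tip) can be sketched with ordinary least squares. Everything below is an assumption for illustration: the data is synthetic, cash rides are simply given a recorded tip of zero, rate code dummies are omitted for brevity, and `fit_ols` is a hypothetical helper built on NumPy. It does not reproduce the coefficients reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Synthetic rides, not TLC records.
fare = rng.uniform(5, 60, n)
tolls = rng.choice([0.0, 5.76], size=n, p=[0.9, 0.1])
is_card = rng.random(n) < 0.7  # stand-in for payment type 1 (credit card)

# Assumption: card rides tip about 20% of the fare, cash rides record no tip.
tip = np.where(is_card, 0.2 * fare + 0.1 * tolls + rng.normal(0, 0.5, n), 0.0)
tip = np.clip(tip, 0, None)

def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Model 1: all rides, zero tips included, with a payment-type dummy.
coef_all = fit_ols([fare, tolls, is_card.astype(float)], tip)

# Model 2: only rides with a nonzero tip.
tipped = tip > 0
coef_tipped = fit_ols([fare[tipped], tolls[tipped]], tip[tipped])

print("Model 1 [intercept, fare, tolls, card]:", np.round(coef_all, 3))
print("Model 2 [intercept, fare, tolls]:      ", np.round(coef_tipped, 3))
```

In this toy setup the payment dummy in the first fit has to absorb the zero tips, which is the same mechanism the README gives for the large credit card effect in Model 1.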

Analyzing Performance

Model 1 QQ Plot	Model 2 QQ Plot

Average Error: $0.83	Average Error: $0.41

Both models do a decent job of predicting values that fall within the middle quantiles, but tail off at both ends.
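For reference, here is a small sketch of how an average error and the points of a QQ plot could be computed from residuals. It assumes "average error" means mean absolute error, which the README does not state, and the actual and predicted values are invented. `qq_points` is a hypothetical helper that pairs standardized residuals with standard normal quantiles.

```python
from statistics import NormalDist, fmean, pstdev

# Invented values for illustration only.
actual = [0.00, 1.00, 1.50, 2.00, 2.50, 3.00, 4.00, 10.00]
predicted = [0.80, 1.20, 1.60, 1.90, 2.40, 2.80, 3.50, 6.00]

residuals = [a - p for a, p in zip(actual, predicted)]

# Assumption: "average error" is taken to be the mean absolute error in USD.
mae = fmean([abs(r) for r in residuals])
print(f"mean absolute error: ${mae:.2f}")

def qq_points(values):
    """Pair each sorted, standardized value with its theoretical normal quantile."""
    n = len(values)
    mean, sd = fmean(values), pstdev(values)
    standardized = sorted((v - mean) / sd for v in values)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))

for theo, obs in qq_points(residuals):
    print(f"{theo:+.2f}  {obs:+.2f}")
```

Points that stay close to the diagonal in the middle but pull away at the ends are what "tailing off" looks like numerically: the largest residuals are much further from the mean than a normal distribution would predict.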

Takeaways

The taxi dataset isn't perfect: it is quite possible that a number of cab drivers didn't report their tips in order to avoid being taxed on them
Trying to predict instantaneous human decision-making, which tipping often is, is incredibly difficult!
Because tip amounts were heavily skewed, linear regression is probably not the best method for making a prediction.
The core variable of the model, fare amount, is arguably not independent of tip amount
With more time I would dive into the different neighborhoods more. Creating dummy variables for each pickup and dropoff location would have created more than 500 to play with. Not ideal for a 3-day project.
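To show why location dummies multiply so quickly, here is a minimal one-hot encoding sketch. The rows are invented, the keys follow the TLC pickup and dropoff location ID columns, and `one_hot` is a hypothetical helper written for illustration.

```python
# Invented rows; keys mirror the TLC pickup/dropoff location ID columns.
trips = [
    {"PULocationID": 161, "DOLocationID": 236},
    {"PULocationID": 132, "DOLocationID": 161},
    {"PULocationID": 161, "DOLocationID": 132},
]

def one_hot(rows, column):
    """Return one dict of 0/1 dummy columns per row for the given column."""
    levels = sorted({row[column] for row in rows})
    return [
        {f"{column}_{level}": int(row[column] == level) for level in levels}
        for row in rows
    ]

pickup_dummies = one_hot(trips, "PULocationID")
dropoff_dummies = one_hot(trips, "DOLocationID")

# Each distinct ID on each side becomes its own column.
n_columns = len(pickup_dummies[0]) + len(dropoff_dummies[0])
print(pickup_dummies[0], dropoff_dummies[0], n_columns)
```

Even three toy rides produce five dummy columns here; with every pickup and dropoff zone in the city encoded this way, the column count grows into the hundreds.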
