This project aimed at building a predictive algorithm that decides whether a driver, who is registered to provide rides for customers of a mobile ride-hailing app and who receives a ride offer, will eventually accept that offer. Assuming nearby drivers are available, the backend of the app sends a booking request to a driver, who can accept or decline the ride. If the driver declines, the app can query one or more additional drivers (under certain conditions), thereby issuing more booking requests for the same ride request.
Due to the time constraint (10 hours), the aim was to create a model of reasonable prediction quality (minimum viable product, MVP, approach) driven by a few engineered features. If this were a real industry data science project, a subsequent step would be to acquire deeper domain knowledge about the ride-hailing sector, especially regarding the dynamics and influencing factors of the time- and location-dependent supply and demand of drivers and ride-hailing customers alike, which would support creating features of even better predictive quality.
image source: link
The dataset contained 3 simplified log files representing 24 hours' worth of various data. The raw data must remain confidential.
Each line represents a user requesting a ride from a pickup location to an arrival location.
- `ride_id`: each ride request is assigned a unique identifier
- `created_at`: the epoch when this log entry was appended
- `origin_lat` and `origin_lon`: the pickup location
- `destination_lat` and `destination_lon`: the arrival location
Each line represents an attempt to offer a given ride to a given driver.
- `request_id`: each booking request is assigned a unique identifier
- `logged_at`: the epoch when this log entry was appended
- `ride_id`: each booking request is linked to a user-initiated ride request (see 1.)
- `driver_id`: each booking request is sent to a given driver (see 3.)
- `driver_accepted`: the driver's response (boolean)
- `driver_lat` and `driver_lon`: the location of the driver at the time the booking request was dispatched
Each line represents a driver state change.
- `driver_id`: each driver is assigned a unique identifier
- `logged_at`: the epoch when this log entry was appended
- `new_state`: the driver's state, which can be one of `{connected, disconnected, began_ride, ended_ride}`
Due to the time constraints of this project, I aimed to strike a balance between data exploration, feature engineering, modeling and interpretation of the results. During the project, several ideas came up for expanding and branching off the project within these analysis steps, as well as for adding further ones.
After inspecting the data and convincing myself of its cleanliness, the decision was made (with respect to the goal of the project) to left join the `bookingRequest` table with the `rideRequest` table on the `ride_id` column; the resulting DataFrame `df_rides` underwent data inspection and EDA in Pandas.
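As a minimal sketch of this preparation step (the parsing of the confidential raw logs is not shown; `read_csv` is only a placeholder, and file and column names follow chapter 1. Data):

```python
import pandas as pd

# Assumption: the three logs can be parsed into DataFrames with the columns
# described above; read_csv is only a stand-in for the actual parsing step.
ride_requests = pd.read_csv("rideRequests.log")        # ride_id, created_at, origin_*, destination_*
booking_requests = pd.read_csv("bookingRequests.log")  # request_id, logged_at, ride_id, driver_id, ...

# Left join: every booking request keeps its row and is enriched with the
# pickup and destination coordinates of the ride it belongs to.
df_rides = booking_requests.merge(ride_requests, on="ride_id", how="left")
```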
The first task consisted of inspecting the three log files and getting an overview of the joined DataFrame `df_rides` and of the meaning of the data (see chapter 1. Data). Several things were apparent:
- the three log files contained data roughly spanning the same 24-hour timeframe:

| log file name | timestamp start | timestamp finish |
| --- | --- | --- |
| rideRequests.log | 2018-10-29 12:04:50.650768640 | 2018-10-30 11:53:58.996917504 |
| bookingRequests.log | 2018-10-29 12:04:49.561193728 | 2018-10-30 11:53:58.996917504 |
| drivers.log | 2018-10-29 12:00:01.116698112 | 2018-10-30 11:54:29.890972416 |
- I gained a rough overview of the distribution of the numerical data. It could be concluded that the geographical coordinates are almost perfectly normally distributed (most likely artificially generated), roughly around the city center of a big European city. A quick plot with Folium (not contained in the notebook because of limited memory on GitHub) confirmed this. Furthermore, the target variable `driver_accepted` is imbalanced, which made a sampling technique necessary before model training (see the quick check after this list).
- the cleanliness of the data was assessed: there were no missing values, so no imputation was necessary, and statistical outliers are hardly present and would not impact the prediction here.
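Both findings can be confirmed quickly on `df_rides` (a minimal sketch, continuing from the join above):

```python
# Check for missing values per column and for the class imbalance of the target.
print(df_rides.isna().sum())
print(df_rides["driver_accepted"].value_counts(normalize=True))
```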
Several features were engineered, and more can be engineered in the future, that contribute to the predictive power of the model with respect to whether a driver accepts a ride. For each, a function was written that can conveniently be used in a preprocessing pipeline to build up the ML-ready DataFrame.
A driver might be influenced in his/her decision to accept a request by one or a combination of these distances (a sketch of how they could be computed follows the list):
- his/her distance to the rider (or origin), called `driver_origin_distance`
- the rider's (or origin's) distance to the destination, called `origin_destination_distance`
- the driver's distance to the destination, called `driver_destination_distance`
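A minimal sketch of how these three distances could be derived from `df_rides` using the haversine great-circle distance (the concrete implementation in the project may differ; column names follow the data description in chapter 1):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two sets of coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def add_distance_features(df):
    """Append the three distance features to the booking-request DataFrame."""
    df = df.copy()
    df["driver_origin_distance"] = haversine_km(
        df["driver_lat"], df["driver_lon"], df["origin_lat"], df["origin_lon"])
    df["origin_destination_distance"] = haversine_km(
        df["origin_lat"], df["origin_lon"], df["destination_lat"], df["destination_lon"])
    df["driver_destination_distance"] = haversine_km(
        df["driver_lat"], df["driver_lon"], df["destination_lat"], df["destination_lon"])
    return df

df_rides = add_distance_features(df_rides)
```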
The following three scatterplots contain every booking request issued and show how they are distributed over the three above-mentioned distances as well as over time. They are additionally hued by `driver_accepted`.
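Such plots could be produced roughly as follows (a sketch with Seaborn, continuing from the sketches above; the notebook's plotting code may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# One scatterplot per engineered distance: booking requests over time,
# hued by the target variable.
df_rides["logged_at_dt"] = pd.to_datetime(df_rides["logged_at"])
distance_cols = ["driver_origin_distance", "origin_destination_distance",
                 "driver_destination_distance"]

fig, axes = plt.subplots(len(distance_cols), 1, figsize=(12, 12), sharex=True)
for ax, col in zip(axes, distance_cols):
    sns.scatterplot(data=df_rides, x="logged_at_dt", y=col,
                    hue="driver_accepted", s=10, ax=ax)
plt.tight_layout()
plt.show()
```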
One can identify an overweight of `driver_accepted=False` in certain time periods (e.g. around 3 am) for all three kinds of distances, as well as a distinct overweight of `driver_accepted=False` towards higher driver-origin distances (first plot). This already gives an idea that at least `logged_at` and `driver_origin_distance` are important predictors for the target variable.
From the geographical normal distribution of the request data one can conclude that most rides are requested within the city center. A driver's preference regarding which rides to accept might be influenced by his/her assumption about where most rides are requested, since being there allegedly increases the chance of catching the next request in quick succession after dropping off the customer/rider, increasing his/her daily profit.
Since we saw that all geographic coordinates are almost perfectly normally distributed, we assume that their mean corresponds to the city center. In the case of a city that shows a high degree of point symmetry, like Paris, one could also choose the city center as a reference point for this feature.
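A minimal sketch of this feature, reusing the `haversine_km` helper from above (the feature name `origin_center_distance` is illustrative):

```python
# Estimate the city center as the mean of all pickup coordinates and add
# the rider's distance to it as a feature.
center_lat = df_rides["origin_lat"].mean()
center_lon = df_rides["origin_lon"].mean()
df_rides["origin_center_distance"] = haversine_km(
    df_rides["origin_lat"], df_rides["origin_lon"], center_lat, center_lon)
```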
A more sophisticated approach for real-world data would probably be to cluster the multitude of geographic coordinates of all past ride requests with an unsupervised algorithm, e.g. K-Means, build not one but multiple cluster centers (which would correspond roughly to livelier or more densely populated areas), and then calculate for every ride request the distance between the rider and the nearest cluster center. The idea is that the closer to one of these lively areas a rider requests a ride, the more the requested driver hopes to also finish the ride within that area and pick up the next rider in quicker succession, increasing his/her acceptance level. An assumed behavior of drivers could therefore be to stick around in the vicinity of venues, places with active nightlife, etc.
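A sketch of how such a refinement could look (not implemented in this project; the number of clusters and the feature name are illustrative, and clustering raw lat/lon coordinates with K-Means is only an approximation):

```python
from sklearn.cluster import KMeans

# Cluster historical pickup locations into a handful of "hotspots" and use
# the distance to the nearest hotspot as a feature.
coords = df_rides[["origin_lat", "origin_lon"]].to_numpy()
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(coords)

# Distance from each pickup point to its nearest cluster center.
nearest = kmeans.cluster_centers_[kmeans.predict(coords)]
df_rides["origin_hotspot_distance"] = haversine_km(
    coords[:, 0], coords[:, 1], nearest[:, 0], nearest[:, 1])
```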
2.2.3. Time difference between ride request made and booking request made (implemented, doesn't play a role)
A driver might expect a more displeased (pleased) rider if too much (only a little) time has passed since the request was made, risking a negative (hoping for a positive) review. Therefore, as was the initial thought, this time difference could be of importance. Only after analyzing the feature importances of the trained model was it discovered that this feature hardly plays a role, since the system seems to issue booking requests almost immediately after ride requests, and only in very rare cases do a couple of seconds pass between these events.
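The feature itself is straightforward to derive (a sketch; the column name `request_booking_delay` is illustrative, and both timestamps are assumed to be epochs in the same unit):

```python
# Seconds between the user's ride request (created_at) and the moment the
# booking request was dispatched to the driver (logged_at).
df_rides["request_booking_delay"] = (
    pd.to_datetime(df_rides["logged_at"]) - pd.to_datetime(df_rides["created_at"])
).dt.total_seconds()
```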
2.2.4. How many rides a driver has completed since last 'connected' when the booking request is issued (implemented)
A driver might be more inclined to accept a booking request having completed only a few rides, and less inclined having subjectively completed enough already. This theory is also motivated by some visualizations done in the EDA process. To engineer this feature, the joined `bookingRequest` and `rideRequest` table was combined with the `drivers` table: for each individual row of the joined table `df_rides`, the `driver_id` together with the `logged_at` epoch was used to group the `drivers` table by that `driver_id` and then sum up how many `ended_ride` states occurred up to that `logged_at` epoch. This sum was appended as another column `no_of_accumulated_rides` to the main `df_rides` DataFrame. A readable (though unoptimized) sketch follows.
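Assuming the parsed drivers log is available as a DataFrame `drivers_log` (the name is illustrative), the feature could be computed per row like this:

```python
def accumulated_rides(row, drivers_log):
    """Count 'ended_ride' events of this driver up to the booking-request epoch.
    (Restricting the count to events after the driver's most recent 'connected'
    event, as the section title suggests, would add one more filter here.)"""
    events = drivers_log[(drivers_log["driver_id"] == row["driver_id"])
                         & (drivers_log["logged_at"] <= row["logged_at"])]
    return int((events["new_state"] == "ended_ride").sum())

# drivers_log contains the columns driver_id, logged_at and new_state.
df_rides["no_of_accumulated_rides"] = df_rides.apply(
    accumulated_rides, axis=1, drivers_log=drivers_log)
```

This per-row scan is readable but not optimized; for larger data a `merge_asof` or groupby-based implementation would be preferable.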
In the scatterplots below (fig. 1 - 3) all booking requests are plotted against the number of rides the specific driver has accumulated up to the point when the booking request is issued (x-axis) and the distances (rider-driver, rider-destination, driver-destination). It is clearly visible that drivers are more inclined to accept when they have not accumulated many rides yet. Additionally, fig. 3 in particular shows that drivers who get requested between roughly their 2nd and 5th ride of their "shift" accept a broad range of distances to the customer, while above the 5th ride, and in particular for their first ride, they tend to reject long distances.
Analyzing these plots in more depth could be quite beneficial and provide promising leads for even more features.
2.2.5. Accumulated distance that a driver has travelled since last 'connected' when booking request is issued (not yet implemented)
A driver might be less inclined to accept a new booking the longer the accumulated distance is on his/her workday.
Another crucial feature for the target variable could be whether a driver finds himself/herself in a ride with a customer when a new booking request arrives.
It was found and confirmed, though, that booking requests are only issued to drivers who are connected and free. Therefore this feature would not have any predictive power and was not implemented.
A baseline model (logistic regression) on `df_rides` was quickly built with no preprocessing or feature engineering whatsoever. Expectedly, it performed poorly in predicting the underrepresented class (`driver_accepted=True`).
I therefore switched to training a Random Forest model, after applying SMOTE (synthetic minority oversampling) to the imbalanced target variable and after adding the aforementioned engineered features to the already existing ones.
I only optimized the Random Forest with regard to the number of trees (result: ca. n=150), using a grid search with 10-fold cross-validation and accuracy as the scoring parameter.
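A sketch of this training setup (the feature selection, split and parameter grid are illustrative; SMOTE is applied to the training portion only):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Use the numeric columns of df_rides as features, the driver's response as target.
X = df_rides.drop(columns=["driver_accepted"]).select_dtypes("number")
y = df_rides["driver_accepted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority class (driver_accepted=True) on the training set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Tune only the number of trees via 10-fold cross-validation on accuracy.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 150, 200]},
    scoring="accuracy",
    cv=10,
)
grid.fit(X_res, y_res)
print(grid.best_params_)  # the project run ended up at roughly 150 trees
```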
Looking at the metrics, the confusion matrices, the ROC curve and the AUC, one can say that the model does a decent job of minimizing the false positives and false negatives.
When assessing model quality according to several metrics, one should always keep the big picture in mind, in this case the costs of misclassifying the target variable in certain directions, e.g.:
- goal 1: minimizing the chance of missing `driver_accepted=True` cases (accepting that some drivers will be wrongly classified as accepting). In this case our model performs reasonably well, with a recall of 0.84. In practical terms, recommendations derived from these classification results could be sent out to drivers to deploy to certain areas; this could lead to the system recommending that more drivers move to certain locations "for nothing" (because the system decided they might accept specific rides there). If that cost is acceptable to the business, this trained model would already be a decent one.
- goal 2: minimizing the number of drivers who are wrongly predicted as accepting a ride (false positives). This benefit would be bought by allowing more drivers to be classified as not accepting a ride although they actually would. In this case our model's precision of 0.65 performs relatively badly compared to the other metrics.
| class | precision | recall | f1-score |
| --- | --- | --- | --- |
| `driver_accepted=False` | 0.93 | 0.82 | 0.87 |
| `driver_accepted=True` | 0.65 | 0.84 | 0.73 |
Figures: confusion matrix with absolute values of the classification, and confusion matrix with values normed to the true label.
Feature importance
Below, the feature importances are shown which were calculated during model training and can be accessed via the `feature_importances_` property of the model. Interestingly (but also not surprisingly), the engineered feature of the distance between driver and customer at the time the booking request is issued is the most important one. The other engineered features `origin_destination_distance`, `no_of_accumulated_rides` and `driver_destination_distance` also rank rather high and provide predictive power. The engineered time difference between ride request and booking request hardly plays a role; no wonder, since after inspecting the histogram of these values it is clear that the system issues booking requests almost immediately after ride requests.
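A minimal sketch of how these importances can be read off the fitted model from the grid search sketch above:

```python
import pandas as pd

# Rank the fitted Random Forest's feature importances, highest first.
best_rf = grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```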