Li Deng 08.08.2019
GitHub: https://github.com/dengl11/Doordash-DataSicence-Project
- From there you can see my step-by-step implementation
- I will make it private after it has been reviewed by DoorDash engineers
- Model: a simple linear regression model in sklearn
- Model Evaluation: train-test split the historical data, and compute MSE
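A minimal sketch of the model and evaluation steps above, using synthetic data rather than the project's dataset. The feature matrix, the 80/20 split, and the `random_state` values are assumptions made for illustration; the repository linked above holds the actual implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the historical data (assumption: the real
# features and target come from the preprocessed delivery dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for evaluation (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training split, then score on the unseen test split.
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {mse:.4f}")
```

Computing the MSE on rows the model never saw during fitting gives an estimate of how it would do on new orders, which is the point of the split.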
- Data Preprocessing
  - compute the generated column `delivery_seconds` from `actual_delivery_time` and `created_at`
  - remove the entries where `actual_delivery_time` or `created_at` is `nan`
  - label-encode the categorical column `store_primary_category` (a better way is to use one-hot encoding)
  - choose the features used for training, and remove other columns
  - fill `nan` with the mode in categorical columns, median in numerical columns
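The preprocessing steps above can be sketched with pandas as follows. This is an illustration on a tiny made-up sample, not the repository's code: the rows, the reduced feature list, and the use of pandas category codes for label encoding are assumptions. The sketch also fills missing values before encoding (the reverse of the order listed above) so that the encoder never sees a missing category.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative only.
df = pd.DataFrame({
    "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25", None,
                   "2015-02-03 21:21:45", "2015-02-15 02:40:36"],
    "actual_delivery_time": ["2015-02-06 23:27:16", "2015-02-10 22:56:29",
                             "2015-01-22 21:09:09", None, "2015-02-15 03:20:26"],
    "store_primary_category": ["american", None, "mexican", "american", "mexican"],
    "subtotal": [3441.0, 1900.0, 1900.0, 6900.0, None],
})

# Parse the timestamps and drop rows where either one is missing.
for col in ["created_at", "actual_delivery_time"]:
    df[col] = pd.to_datetime(df[col])
df = df.dropna(subset=["created_at", "actual_delivery_time"]).copy()

# Generated target column: delivery duration in seconds.
df["delivery_seconds"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()

# Keep only the training features (a reduced, assumed list) and the target.
features = ["store_primary_category", "subtotal"]
df = df[features + ["delivery_seconds"]].copy()

# Fill missing values: mode for categorical, median for numerical columns.
category_mode = df["store_primary_category"].mode()[0]
df["store_primary_category"] = df["store_primary_category"].fillna(category_mode)
df["subtotal"] = df["subtotal"].fillna(df["subtotal"].median())

# Label-encode the categorical column as integer codes.
df["store_primary_category"] = df["store_primary_category"].astype("category").cat.codes

print(df)
```

For the one-hot alternative mentioned in the list, `pd.get_dummies(df, columns=["store_primary_category"])` could replace the last step; it avoids implying an ordering between categories, which integer codes do in a linear model.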
- Finding: from the coefficients of the linear regression model:
  - `num_distinct_items`: 15.03
  - `total_items`: -4.89
  - `total_onshift_dashers`: -2.47
  - `total_outstanding_orders`: 2.33
  - `total_busy_dashers`: -1.93
  - `estimated_order_place_duration`: 0.96
  - `store_primary_category`: 0.76
  - `max_item_price`: 0.11
  - `estimated_store_to_consumer_driving_duration`: 0.10
  - `min_item_price`: 0.02
  - `subtotal`: 0.01
- `num_distinct_items` is the dominant factor for delivery time; the more `num_distinct_items`, the longer the delivery time, which makes sense
- `total_onshift_dashers` is also important; the more dashers on shift, the shorter the delivery time, which also makes sense
- `total_outstanding_orders` is also important; the more outstanding orders to deal with, the longer the delivery time
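When ranking features by the coefficients above, it helps to keep in mind that a raw linear-regression coefficient is expressed per unit of its feature, so features measured in large units tend to get small coefficients. The sketch below, on synthetic data with two hypothetical features (not the project's dataset), shows how standardizing the features before fitting puts the coefficients on a comparable scale. Whether the project's features were standardized is not stated, so this is only an illustration of the technique.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Two hypothetical features on very different scales.
small_scale = rng.normal(loc=3, scale=1, size=n)        # e.g. a count
large_scale = rng.normal(loc=3000, scale=1000, size=n)  # e.g. a price in cents
X = np.column_stack([small_scale, large_scale])

# Both features shift the target by about 10 per standard deviation.
y = 10.0 * small_scale + 0.01 * large_scale + rng.normal(scale=1.0, size=n)

# Raw fit: coefficients are per unit of each feature.
raw = LinearRegression().fit(X, y)

# Standardized fit: coefficients are per standard deviation of each feature.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled.coef_)
```

In this synthetic case the raw coefficients differ by roughly three orders of magnitude even though both features matter about equally, while the standardized coefficients come out close to each other.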
- Output: dataframe written to `./output/predictions.tsv`
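Writing the predictions to a tab-separated file can be sketched as below. The column names and values are hypothetical placeholders; only the output path comes from the text above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical predictions; the real columns depend on the project's data.
predictions = pd.DataFrame({
    "delivery_id": ["a1", "b2", "c3"],
    "predicted_delivery_seconds": [2810.5, 3342.0, 2975.25],
})

# Create the output directory if needed, then write a TSV without the index.
out_path = Path("output") / "predictions.tsv"
out_path.parent.mkdir(parents=True, exist_ok=True)
predictions.to_csv(out_path, sep="\t", index=False)

# Read the file back to confirm it parses as expected.
round_trip = pd.read_csv(out_path, sep="\t")
print(round_trip)
```

Passing `index=False` keeps the pandas row index out of the file, so the TSV contains only the named columns.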
- Further debugging to catch potential bugs