Technical task for the Data Scientist role at the Marketing Tech dept., Delivery Hero SE

The exercise is based on a simplified version of a real model built by our team. Its objective is, to predict whether a customer is going to return (order again from us) in the upcoming 6 months or not.
So the main idea is, to create an algorithm to correctly predict the is_returning_customer label (see details here) based on the provided data.

You can solve this exercise using the Python data ecosystem with the usual well-known tools and libraries.

Data dictionary

There are two data samples provided as (gunzipped) CSV files.

Order data

The order dataset contains the history of orders placed by customers acquired between 2015-03-01 and 2017-02-28. The data points were synthetically generated to reflect patterns of real data for the purpose of the exercise.

The dataset columns definition:

Column	Description
customer_id	Unique customer ID.
order_date	Local date of the order.
order_hour	Local hour of the order.
customer_order_rank	Number of a successful order counted in chronological order starting with 1 (an empty value would correspond to a failed order).
is_failed	0 if the order succeeded. 1 if the order failed.
voucher_amount	The discounted amount if a voucher (discount) was used at order's checkout.
delivery_fee	Fee charged for the delivery of the order (if applicable).
amount_paid	Total amount paid by the customer (the voucher_amount is already deducted and the delivery_fee is already added).
restaurant_id	Unique restaurant ID.
city_id	Unique city ID.
payment_id	Identifies the payment method the customer has chosen (such as cash, credit card, PayPal, ...).
platform_id	Identifies the platform the customer used to place the order (web, mobile app, mobile web, …).
transmission_id	Identifies the method used to place the order to the restaurant (fax, email, phone, and different kinds of proprietary devices or point-of-sale systems).

The data rows are ordered by customer_id and order_date: all orders of one customer appear in chronological order on consecutive rows.

It is not necessary to know the exact meaning of the IDs, therefore we do not provide the mapping from the IDs to actual restaurants / cities / payment methods / platforms / transmission methods.

Labeled data

The labeled dataset flags whether the customers placed at least one order within 6 months after 2017-02-28 or not.

The dataset columns definition:

Column	Description
customer_id	Unique customer ID.
is_returning_customer	0 if the customer did not return (did not order again) in the 6 months after 2017-02-28. 1 if the customer returned (ordered again) at least once after 2017-02-28.

The data rows are ordered by customer_id.

Solution submission criteria

Please send us back:

All the code and files you used (both, for exploration and the final ones) as a GitHub repository (see below for further details)
A summary of the data transformations that you tried, and the ones you kept (we should be able to see this in the commit history)
A description of the ML models you tried, and the one you kept (we should be able to see this in the commit history)
A measurement of how good your models are (at least for the best one)
Any other findings on the data that you would like to share
Ideas about any further models or data transformations that you would try if you had more time

Please try to make commits as often and concise as possible to allow us to follow your working process.

How to submit the code

Please clone this repo to a private Github repo. Add the recruiting manager as a collaborator on the repo once finished.

The GitHub repo should contain:

All the code (both for exploration and the models) and the commits history showing the process you followed
The instructions on how to train and evaluate the model you would build

Please share the full codebase you developed while working on the task.

How we evaluate the task

Please take into account the solution quality points that we value the most:

The solution is working, i.e. codebase can be executed by us in our environment by following your instructions
The structure of the repository is clear
The codebase is human-readable
Python best practices are being followed, i.e. good development practices and Python principles and idioms are in place
The commits are atomic
Unit/integration/smoke tests are implemented

Regarding the modeling task, we value:

Clarity of the reasoning behind assumptions and modelling approach you are going to select
EDA
Feature engineering
Algorithm(s) chosen
Models predictive performance evaluation and monitoring

Final notes

We value a working solution and the reasoning of choices behind the solution the most. With your solution, please try to express yourself in a way you would do, while working on a real business problem.

Please do not hesitate to write us in case of further questions. We would be happy to address them as soon as possible to clarify the task.

We hope that you find the exercise interesting, and it is worth your time.

Good luck, and looking forward to meeting you at the next interview round!

behroozazarkhalili/mkt-casestudy-ds