pip install -U pip
pip install -r requirements.txt
The CSV file technical_test_data.csv should be in the folder data/.
The file exploration.ipynb shows some statistics and how the solution was developed. The file solution.ipynb uses the functions in functions.py to display a scatter plot of each user's events together with their home and work places.
The idea comes from a simple assumption: people go from home to work on weekdays and stay at home more often on weekends.
The first step is to extract each user's points of interest using the DBSCAN clustering algorithm with the haversine distance metric, which is well suited to geospatial data.
We look for groups of points that are not too far from each other (around 10 meters, ignoring the horizontal precision) and that contain enough data (at least 100 events must have occurred).
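As a rough illustration of this clustering step, here is a minimal sketch assuming NumPy and scikit-learn's DBSCAN. The function name cluster_points and its parameters are hypothetical and are not taken from functions.py; only the 10 meter radius and the 100 event minimum come from the description above. scikit-learn's haversine metric expects (latitude, longitude) pairs in radians, so the radius in meters is divided by the Earth's radius.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def cluster_points(lat_deg, lon_deg, eps_m=10.0, min_samples=100):
    """Label each (lat, lon) event with a DBSCAN cluster id; -1 means noise.

    Hypothetical helper: eps_m and min_samples mirror the values quoted in
    the text, they are not read from the project's code.
    """
    # The haversine metric works on [latitude, longitude] in radians.
    coords_rad = np.radians(np.column_stack([lat_deg, lon_deg]))
    model = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,  # convert meters to radians on the sphere
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)
```

Each distinct non-negative label is a candidate point of interest for the user.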
Then we take the two most visited places and match them against weekdays and weekend days to separate home from work.
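One possible way to express this matching step, using only the standard library, is sketched below. The function label_home_work and the rule it applies (the place with the larger share of weekend visits is taken to be home) are assumptions made for illustration; the actual rule in functions.py may differ.

```python
from collections import Counter
from datetime import datetime


def label_home_work(events):
    """Guess home and work from (cluster_id, datetime) events.

    Hypothetical helper. cluster_id -1 is treated as noise and ignored.
    Returns {"home": id, "work": id}, or None with fewer than two clusters.
    """
    events = [(c, t) for c, t in events if c != -1]
    visits = Counter(c for c, _ in events)
    top_two = [c for c, _ in visits.most_common(2)]
    if len(top_two) < 2:
        return None
    # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = Counter(c for c, t in events if t.weekday() >= 5)
    share = {c: weekend[c] / visits[c] for c in top_two}
    # Assumption: home has the larger weekend share; on a tie, the most
    # visited place wins because max() keeps the first maximum.
    home = max(top_two, key=share.get)
    work = next(c for c in top_two if c != home)
    return {"home": home, "work": work}
```

For a user seen at place 0 only from Monday to Friday and at place 1 every day of the week, this returns place 1 as home and place 0 as work.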
- Usage of other features:
  - time series: we did not use the motion of the users at all
    - When do they start to move from A to B?
    - How long did they stay at a given place?
    - The clustering step groups events that did not necessarily happen at the same time. We should consider points that are grouped temporally as well as spatially.
    - Instead of counting the occurrences of each point of interest, analyze their order: for instance, people are more likely to follow the pattern HOME/WORK/HOME during the week, whereas on weekends the number of points of interest can increase.
  - crc32: home and work wifi spots should be constant and stable (also giving good horizontal precision)
  - speed: the speed feature was barely used or analyzed. Some negative values and some of the statistics made it difficult to use properly.
- Tweak and explore DBSCAN possibilities
  - Take more time to find a proper haversine version that accounts for horizontal precision
  - Fine-tune the value of eps
- Use latitude and longitude values to map points of interest to actual places (using an external API)
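The time series ideas in the list above (when users move, how long they stay, and in which order they visit places) could start from something like the following sketch. It is not part of the current solution: the helper name extract_stays and the (cluster_id, datetime) event format are assumptions for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby


def extract_stays(events):
    """Collapse (cluster_id, datetime) events into consecutive stays.

    Hypothetical helper. Returns a list of (cluster_id, arrival, departure)
    tuples in chronological order. Noise events (cluster_id -1) are kept,
    since they may correspond to the user moving between places.
    """
    ordered = sorted(events, key=lambda e: e[1])
    stays = []
    # groupby merges runs of consecutive events that share a cluster id.
    for cluster_id, group in groupby(ordered, key=lambda e: e[0]):
        times = [t for _, t in group]
        stays.append((cluster_id, times[0], times[-1]))
    return stays
```

The order of visits is then the list of cluster ids in the result, and the length of each stay is departure minus arrival. A stay built from a single event has a duration of zero, so a real implementation would probably need a rule for isolated points.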