Locations clustering proof-of-concept test task.
Given that we have a dataset which has around 60k of location addresses, including city, road, house_number, postcode and state: Download the Locations dataset
The original locations data look like this:
city | road | house_number | postcode | state |
---|---|---|---|---|
fort worth | sycamore school rd | 5421 | 76123 | texas |
gautier | highway 90 | 2701 | 39553 | mississippi |
farmington | e main | 4750 | 87401 | nm |
richland | queensgate drive | 2751 | 99352 | washington |
davie | s state hwy seven | 1600 | 33317 | fl |
crestview | w james lee blvd w | 302 | 32536 | fl |
eufaula | s eufaula ave | 1250 | 36027 | al |
mccalla | eastern valley rd | 4746 | 35111 | alabama |
hugoton | e 11th st | 612 | 67951 | ks |
The candidate is expected to create a solution - a proof of concept, using Python, to cluster (or simply group) those addresses where they belong to the same place (same address). Candidate is free to implement any machine learning model, or hand-crafting solutions to show how the candidate would approach to solve this problem.
For example, these records below could be scattered around the orignal locations CSV, and we want to "group" them together into their own single address cluster, and don't remove the duplicates:
city | road | house_number | postcode | state | cluster_id | record_idx |
---|---|---|---|---|---|---|
simpsonville | fairview rd | 692 | 29680-6708 | south carolina | 1 | 23 |
simpsonville | fairview rd | 692 | 29680 | sc | 1 | 434 |
simpsonville | fairview road | 692 | 29680 | sc | 1 | 543 |
simpsonville | fairview rd | 692 | 29680 | sc | 1 | 25436 |
simpsonville | fairview rd | 692 | 29680 | south carolina | 1 | 78 |
hialeah | e 10th ave | 1021 | 33010 | florida | 2 | 653 |
hialeah | e 10th ave | 1021 | 33010 | fl | 2 | 24 |
hialeah | e 10th avenue | 1021 | 33010 | fl | 2 | 87 |
hialeah | e 10th ave | 1021 | 33010 | florida | 2 | 4576 |
maumee | arrowhead dr | 1401 | 43537 | ohio | 3 | 987 |
maumee | arrowhead drive | 1401 | 43537 | ohio | 3 | 2476 |
maumee | arrowhead rd | 1401 | 43537 | oh | 3 | 2356 |
maumee | arrowhead rd | 1401 | 43537 | oh | 3 | 453 |
maumee | arrowhead rd | 1401 | 43537 | ohio | 3 | 819 |
- Fork this repo or create a new public GitHub repo for the solution.
- The solution will open/read the location data CSV file and write the grouped/clustered result in an output CSV file.
- There's no restriction on how the candidate would choose to approach the problem.
Please email to: long@shake.io if you have any questions.
Good luck!