Setup

This system requires Docker Compose to start the infrastructure.

Versions Required:

Docker Engine: 18.02.0+

Docker Compose: 1.21.1

Both can be obtained from Docker.

Execute

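The commands below set DOCKER_KAFKA_HOST to the host's IP address with ipconfig getifaddr en0, which is specific to macOS; on other systems, substitute the equivalent command for your network interface.

To run the system with full logging: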

export DOCKER_KAFKA_HOST=$(ipconfig getifaddr en0)
docker-compose up --scale processor=8

To run the system in the background:

export DOCKER_KAFKA_HOST=$(ipconfig getifaddr en0)
docker-compose up --scale processor=8 -d
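
If ipconfig getifaddr en0 is not available on your machine, a small Python script can supply the address instead. The following is a minimal sketch, not part of this repository: the function name detect_host_ip and the file name host_ip.py are hypothetical, and it assumes the interface used for outbound traffic is the one Kafka should advertise, which may not match en0 on a multi-homed machine.

```python
import socket


def detect_host_ip(fallback: str = "127.0.0.1") -> str:
    """Return the IPv4 address of the interface the OS would use for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only asks the OS
        # which local address it would route from. 192.0.2.1 is a reserved
        # documentation address (RFC 5737) and is never actually contacted.
        sock.connect(("192.0.2.1", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, an offline machine): use the fallback.
        return fallback
    finally:
        sock.close()


if __name__ == "__main__":
    print(detect_host_ip())
```

Saved as host_ip.py (a hypothetical name), it could replace the macOS command: export DOCKER_KAFKA_HOST=$(python3 host_ip.py)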

Accessing background process logs

You can access logs for the processor using the following:

docker-compose logs -f processor

You can access all logs with the following:

docker-compose logs -f

To submit files and run searches, use the helper scripts. They require some dependencies to be installed on the machine you submit from:

pip3 install -r processor/requirements.txt

Now you can submit files using the helper tool. The argument can be a single file, a directory, or a file that lists paths to other files:

python3 process.py /path/to/file/or/directory/or/file/including/paths/to/other/files

You can query the database using the helper tool.

python3 search.py --name name

python3 search.py --lat 37.7749 --lon -122.419 --radius 5  # radius in miles
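
To sanity-check that a returned place really lies within the requested radius, the great-circle distance can be computed by hand. The sketch below uses the haversine formula with a mean Earth radius of 3958.8 miles. It is an independent check and not how search.py computes distances (that code is not shown here); the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(center, point, radius_miles):
    """True if point lies within radius_miles of center; both are (lat, lon) tuples."""
    return haversine_miles(*center, *point) <= radius_miles


if __name__ == "__main__":
    center = (37.7749, -122.419)        # search center from the location example
    chinatown = (37.7966, -122.40858)   # "Chinatown" from the sample output
    print(round(haversine_miles(*center, *chinatown), 2))
    print(within_radius(center, chinatown, 5))
```

With the coordinates from the sample location search in the Examples section, Chinatown comes out at roughly 1.6 miles from the search center, comfortably inside the 5-mile radius.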

Search results can be paginated with the optional --start and --size options, which default to 0 and 20, respectively.
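
When stepping through results page by page, it can help to derive the flags from a page number. The sketch below assumes --start is a zero-based result offset, which is what the sample output ("Starting at 50, displaying 5 of 114") suggests. The helpers page_args and page_count are hypothetical and not part of search.py.

```python
def page_args(page, size=20):
    """Return search.py pagination flags for a zero-based page number."""
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")
    return ["--start", str(page * size), "--size", str(size)]


def page_count(total, size=20):
    """Number of pages needed to show `total` results at `size` per page."""
    if total < 0 or size <= 0:
        raise ValueError("total must be >= 0 and size must be > 0")
    return -(-total // size)  # ceiling division


if __name__ == "__main__":
    # Page 10 at 5 results per page starts at offset 50.
    print(" ".join(["python3", "search.py", "--name", "'san francisco'"] + page_args(10, 5)))
    # 114 matches at 5 per page need 23 pages.
    print(page_count(114, 5))
```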

Examples

$ python3 search.py --name 'san francisco' --start 50 --size 5
[            root] INFO               search.py:135   2018-05-21 12:18:19 Name Search:
[            root] INFO               search.py:49    2018-05-21 12:18:19 [
  {
    "latitude": 19.34564,
    "country": "MX",
    "longitude": -98.86034,
    "shape": "AQAAAH8w8Nx7WDNAM9yAzw+3WMA=",
    "name": "San Francisco Acuautla",
    "admin_2": "039",
    "admin_1": "15",
    "search_location": {
      "lat": 19.34564,
      "lon": -98.86034
    }
  },
  {
    "latitude": 20.55254,
    "country": "MX",
    "longitude": -98.00209,
    "shape": "AQAAAFq77UJzjTRAg2kYPiKAWMA=",
    "name": "San Francisco",
    "admin_2": "083",
    "admin_1": "30",
    "search_location": {
      "lat": 20.55254,
      "lon": -98.00209
    }
  },
  {
    "latitude": 20.65082,
    "country": "MX",
    "longitude": -98.57522,
    "shape": "AQAAAC2VtyOcpjRAVACMZ9CkWMA=",
    "name": "Tlahuelompa (San Francisco Tlahuelompa)",
    "admin_2": "081",
    "admin_1": "13",
    "search_location": {
      "lat": 20.65082,
      "lon": -98.57522
    }
  },
  {
    "latitude": 19.44279,
    "country": "MX",
    "longitude": -99.34398,
    "shape": "AQAAAO/+eK9acTNAmZ6wxAPWWMA=",
    "name": "San Francisco Chimalpa",
    "admin_2": "",
    "admin_1": "17",
    "search_location": {
      "lat": 19.44279,
      "lon": -99.34398
    }
  },
  {
    "latitude": 19.28333,
    "country": "MX",
    "longitude": -99.80917,
    "shape": "AQAAAMb5m1CISDNA4Ln3cMnzWMA=",
    "name": "Loma de San Francisco",
    "admin_2": "118",
    "admin_1": "15",
    "search_location": {
      "lat": 19.28333,
      "lon": -99.80917
    }
  }
]
[            root] INFO               search.py:55    2018-05-21 12:18:19 Starting at 50, displaying 5 of 114
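
The shape field in these results is base64 text. Decoding the first sample value yields 20 bytes that read cleanly as a little-endian 32-bit integer followed by two 64-bit floats, and the floats match the latitude and longitude of the same record. The sketch below relies only on that observation; the format is not documented here, the meaning of the leading integer is a guess, and decode_shape is a hypothetical name.

```python
import base64
import struct


def decode_shape(shape_b64):
    """Decode a shape value into (tag, latitude, longitude).

    Assumed layout, inferred from the sample output: a little-endian
    unsigned 32-bit integer followed by two little-endian doubles.
    """
    raw = base64.b64decode(shape_b64)
    # "<Idd" = no padding, uint32 + float64 + float64 = 20 bytes.
    tag, lat, lon = struct.unpack("<Idd", raw)
    return tag, lat, lon


if __name__ == "__main__":
    # Value taken from the "San Francisco Acuautla" record above.
    tag, lat, lon = decode_shape("AQAAAH8w8Nx7WDNAM9yAzw+3WMA=")
    print(tag, round(lat, 5), round(lon, 5))
```

For that record, the decoded pair agrees with the latitude 19.34564 and longitude -98.86034 fields, and the leading integer is 1.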

$ python3 search.py --start 0 --size 10 --lon -122.419 --lat 37.7749 --radius 5
[            root] INFO               search.py:141   2018-05-21 12:18:57 Location Search:
[            root] INFO               search.py:49    2018-05-21 12:18:57 [
  {
    "latitude": 37.7966,
    "country": "US",
    "longitude": -122.40858,
    "shape": "AQAAAC7/If325UJALnO6LCaaXsA=",
    "name": "Chinatown",
    "admin_2": "075",
    "admin_1": "CA",
    "search_location": {
      "lat": 37.7966,
      "lon": -122.40858
    }
  },
  {
    "latitude": 37.71715,
    "country": "US",
    "longitude": -122.40433,
    "shape": "AQAAAMcpOpLL20JAq7LviuCZXsA=",
    "name": "Visitacion Valley",
    "admin_2": "075",
    "admin_1": "CA",
    "search_location": {
      "lat": 37.71715,
      "lon": -122.40433
    }
  },
  {
    "latitude": 37.75018,
    "country": "US",
    "longitude": -122.43369,
    "shape": "AQAAAIAO8+UF4EJAi6azk8GbXsA=",
    "name": "Noe Valley",
    "admin_2": "075",
    "admin_1": "CA",
    "search_location": {
      "lat": 37.75018,
      "lon": -122.43369
    }
  },
  {
    "latitude": 37.77493,
    "country": "US",
    "longitude": -122.41942,
    "shape": "AQAAADpY/+cw40JAdNL7xteaXsA=",
    "name": "San Francisco",
    "admin_2": "075",
    "admin_1": "CA",
    "search_location": {
      "lat": 37.77493,
      "lon": -122.41942
    }
  },
  {
    "latitude": 37.75993,
    "country": "US",
    "longitude": -122.41914,
    "shape": "AQAAAOif4GJF4UJAghyUMNOaXsA=",
    "name": "Mission District",
    "admin_2": "075",
    "admin_1": "CA",
    "search_location": {
      "lat": 37.75993,
      "lon": -122.41914
    }
  }
]
[            root] INFO               search.py:55    2018-05-21 12:18:57 Starting at 0, displaying 5 of 5
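
Each result carries a search_location object with lat and lon, which resembles an Elasticsearch geo_point field. Assuming the index is mapped that way (the query that search.py actually sends is not shown here), a radius search with pagination could be expressed as a request body along these lines. The function build_location_query is hypothetical; geo_distance, from, and size are standard parts of the Elasticsearch query DSL.

```python
import json


def build_location_query(lat, lon, radius_miles, start=0, size=20):
    """Build an Elasticsearch request body for a paginated radius search.

    Assumes `search_location` is mapped as a geo_point field.
    """
    return {
        "from": start,   # result offset, like --start
        "size": size,    # page size, like --size
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_miles}mi",  # "mi" = miles
                        "search_location": {"lat": lat, "lon": lon},
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    body = build_location_query(37.7749, -122.419, 5, start=0, size=10)
    print(json.dumps(body, indent=2))
```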

Troubleshooting

If you have an issue with Kafka or Elasticsearch when restarting, run the following:

docker-compose rm -fs # to remove and kill all containers
rm -rf volumes  # to remove saved data

Then rerun the docker-compose up command.

Discussion

Passing --scale processor=8 to docker-compose spawns 8 processor instances in a single consumer group, which share the load of reading from Kafka. As the docker-compose file is set up right now, the bottleneck is Kafka and Elasticsearch, both of which run as single-node instances.
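
How much the 8 processors actually help depends on how many partitions the topic has, because within a consumer group each partition is read by only one consumer at a time. The sketch below is a plain-Python simulation of a simple round-robin assignment, not Kafka client code; Kafka's real assignors differ in detail, the partition counts are made-up inputs (the topic's partition count is not stated here), and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Distribute partitions round-robin; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partitions and therefore read nothing."""
    return [consumer for consumer, parts in assignment.items() if not parts]


if __name__ == "__main__":
    consumers = [f"processor-{i}" for i in range(1, 9)]  # as with --scale processor=8

    # With 8 partitions, every processor owns one partition.
    print(assign_partitions(8, consumers))

    # With a single partition, seven of the eight processors sit idle.
    print(idle_consumers(assign_partitions(1, consumers)))
```

Under these assumptions, scaling the processor count beyond the number of partitions adds no read parallelism.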

docker-compose isn't the best platform for scaling Kafka and Elasticsearch, however, so the system's wall-clock time remains linear even with more processors. A more robust solution would move to Kubernetes to better manage the scaling of Kafka and Elasticsearch.

docker-compose does give a clear picture of how the system might work in production, and it is a quick way to set up the environment for developers.

Without scaling both Elasticsearch and Kafka, we are also susceptible to data loss during an outage, which isn't ideal.

All of this being said, given sufficiently well-provisioned Kafka and Elasticsearch clusters, I'm pretty confident that this solution would work well.