data_engineering_assignment

Take home assignment for data engineering candidates at SpotOn


Dusty Shapiro's SpotOn Data Engineering Assignment

Instructions

  • Run ./script/setup
  • Input your Kaggle username and API key at the prompts
    • This script writes the Kaggle credentials to docker.env, along with hardcoded values for the Minio user/password and the local Postgres URL.
    • Alternatively, open the docker.env.sample file and paste in your Kaggle credentials manually. If you go this route, be sure to rename the file to docker.env before moving on to the next step.
  • Run ./script/spoton to spin up the Docker containers and run the ingestion.
  • This script runs the entire ingestion: retrieving data from Kaggle, staging it in Minio (S3-compatible object storage), and finally loading it into the local Postgres DB.
  • Once the above script finishes (the ingestion may take 30-60 seconds), feel free to run ./script/db_shell to enter the local DB container, where you can query the data just ingested.
    • The ingestion loads the tables into the public schema. An example query: select count(*) from olist_customers_dataset; (a Python sketch of the same check, runnable from the host, follows this list)
    • Note: ./script/spoton runs docker-compose up -d, so keep in mind that the Docker containers are still running in detached mode after the script finishes.
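
For reference, the same row-count check can also be run from the host instead of from inside the DB container. This is a minimal sketch using psycopg2; the database name, user, and password below are assumptions (typical docker-compose defaults), not necessarily the values written to docker.env, so substitute your own.

    # verify_ingestion.py -- minimal sketch; connection values are assumptions,
    # substitute the Postgres credentials from your docker.env.
    import psycopg2

    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="postgres",    # assumed database name
        user="postgres",      # assumed user
        password="postgres",  # assumed password
    )

    # Run the example row-count query against an ingested table.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM olist_customers_dataset;")
        print("olist_customers_dataset rows:", cur.fetchone()[0])

    conn.close()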

See nextsteps.md for more details.

Assignment

Your task is to automate the download and ingestion of the Brazilian e-commerce dataset using the Kaggle API.

  1. Fork this repo to your account
  2. Create an account with Kaggle if you don't already have one. You will need this to access their API to retrieve data for the assignment.
  3. Create a script to retrieve Brazilian eCommerce data from the Kaggle API and place the files listed below in object storage.
  • Minio is provided in docker compose, or feel free to use your choice of object storage (e.g. Google Cloud Storage, AWS S3).

  • Load only the following datasets:

    olist_customers_dataset.csv
    olist_order_items_dataset.csv
    olist_orders_dataset.csv
    olist_products_dataset.csv
    
  4. Ingest files from object storage into Postgres using Singer, Python, or your programming language of choice. Provided is a base Python image in the Dockerfile, along with a Postgres instance that can be created using docker compose. Your ingestion process should create a table for each file. Here are some helpful links if you are using Singer (hint, we link Singer ;), but use whatever you are most comfortable with. An illustrative end-to-end sketch of this flow appears after this list.

  5. Create a nextsteps.md in the repo and cover deployment considerations along with thoughts on further optimization for scale.
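
The scripts in this repo already implement steps 3 and 4; the sketch below is an illustrative, simplified version of that flow, not the repo's actual code. It assumes the kaggle, boto3, pandas, and sqlalchemy packages; the Minio endpoint and credentials, the bucket name, and the Postgres URL are placeholders to adjust to your docker.env.

    # ingest_sketch.py -- illustrative only; the Minio endpoint/credentials,
    # bucket name, and Postgres URL are assumptions, not the values used by
    # ./script/spoton.
    import io

    import boto3
    import pandas as pd
    from kaggle.api.kaggle_api_extended import KaggleApi
    from sqlalchemy import create_engine

    FILES = [
        "olist_customers_dataset.csv",
        "olist_order_items_dataset.csv",
        "olist_orders_dataset.csv",
        "olist_products_dataset.csv",
    ]

    # 1. Download the Brazilian e-commerce dataset from Kaggle (reads
    #    KAGGLE_USERNAME / KAGGLE_KEY from the environment or ~/.kaggle).
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("olistbr/brazilian-ecommerce", path="/tmp/olist", unzip=True)

    # 2. Upload the four CSVs to Minio's S3-compatible API.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",  # assumed Minio endpoint
        aws_access_key_id="minioadmin",        # assumed credentials
        aws_secret_access_key="minioadmin",
    )
    bucket = "olist"                           # hypothetical bucket name
    s3.create_bucket(Bucket=bucket)            # errors if the bucket already exists
    for name in FILES:
        s3.upload_file(f"/tmp/olist/{name}", bucket, name)

    # 3. Read each file back from object storage and load it into Postgres,
    #    creating one table per CSV, named after the file.
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
    for name in FILES:
        body = s3.get_object(Bucket=bucket, Key=name)["Body"].read()
        df = pd.read_csv(io.BytesIO(body))
        df.to_sql(name.removesuffix(".csv"), engine, if_exists="replace", index=False)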

Acceptance Criteria

  • We will be pulling from the master/main branch of your provided repo link.
  • We should be able to run your code by running the following command: docker-compose up.
  • We should be able to access the generated tables in the Postgres DB at localhost:5432.
  • Note: feel free to use patterns that you might otherwise avoid if they save a significant amount of time. However, be sure to call these out in your nextsteps.md and be prepared to discuss how you might implement them given additional time.