- Run `./script/setup`
  - Input your Kaggle username and API key at the prompts.
  - This script will write the Kaggle credentials to `docker.env`, along with some hardcoded values for the Minio user/password and the local Postgres URL.
  - You can also open the `docker.env.sample` file and paste in your Kaggle credentials manually. If going this route, be sure to rename the file to `docker.env` before going to the next step.
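  For reference, the generated `docker.env` might look roughly like the sketch below. The variable names and values are assumptions for illustration; check `docker.env.sample` for the exact keys this repo uses.

  ```
  # Illustrative only -- actual key names come from docker.env.sample
  KAGGLE_USERNAME=your-kaggle-username
  KAGGLE_KEY=your-kaggle-api-key
  MINIO_ROOT_USER=minio
  MINIO_ROOT_PASSWORD=minio123
  POSTGRES_URL=postgresql://postgres:postgres@localhost:5432/postgres
  ```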
- Run `./script/spoton` to spin up the Docker containers and run the ingestion.
  - This script will run the entire ingestion, from retrieving data from Kaggle, through Minio-S3, and finally into the local Postgres DB.
- Once the above script finishes (it may take 30-60 seconds to complete the ingestion), feel free to run `./script/db_shell` to enter the local DB container, where you can query the data just ingested.
  - The ingestion dumps into the public schema. An example query:
    `select count(*) from olist_customers_dataset;`
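  If you would rather not use the helper script, the same shell can likely be reached directly with `docker-compose exec`; the service and user names below are assumptions about the compose file, not something taken from this repo:

  ```bash
  # Assumed service name "postgres" and default "postgres" user -- adjust to
  # whatever the docker-compose file actually defines.
  docker-compose exec postgres psql -U postgres
  ```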
- Note: `./script/spoton` runs `docker-compose up -d`, so keep in mind that once the script finishes, the Docker containers are still running in detached mode.
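  When you are done, the detached containers can be inspected and torn down with the standard Compose commands, for example:

  ```bash
  # Tail logs from the detached stack, then stop and remove it when finished
  docker-compose logs -f
  docker-compose down
  ```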
Your task is to automate the download and ingestion of the Brazilian ecommerce data set using the Kaggle API.
- Fork this repo to your account
- Create an account with Kaggle if you don't already have one. You will need this to access their API to retrieve data for the assignment.
- Create a script to retrieve the Brazilian eCommerce data from the Kaggle API and place the files listed below in object storage.
  - Minio is provided in docker compose, or feel free to use your choice of object storage (e.g. Google Cloud Storage, AWS S3).
  - Load only the following datasets:
    - `olist_customers_dataset.csv`
    - `olist_order_items_dataset.csv`
    - `olist_orders_dataset.csv`
    - `olist_products_dataset.csv`
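  As a rough illustration of this step, retrieval can be done with the Kaggle CLI and any S3-compatible client pointed at Minio. The bucket name, local paths, and endpoint below are assumptions; the dataset slug is the public Olist dataset on Kaggle:

  ```bash
  # Download and unzip the Olist dataset from Kaggle (requires KAGGLE_USERNAME/KAGGLE_KEY)
  kaggle datasets download -d olistbr/brazilian-ecommerce -p /tmp/olist --unzip

  # Copy only the four required CSVs into a Minio bucket; the AWS CLI works
  # against Minio when pointed at its endpoint and given the Minio credentials
  # via the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables.
  aws --endpoint-url http://localhost:9000 s3 mb s3://olist
  for f in olist_customers_dataset olist_order_items_dataset olist_orders_dataset olist_products_dataset; do
    aws --endpoint-url http://localhost:9000 s3 cp "/tmp/olist/${f}.csv" "s3://olist/${f}.csv"
  done
  ```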
- Ingest files from object storage into Postgres using Singer, Python, or your programming language of choice. Provided is a base Python image in the `Dockerfile`, along with a Postgres instance that can be created using docker compose. Your ingestion process should create a table for each file. Here are some helpful links if you are using Singer (hint, we link Singer ;), but use whatever you are most comfortable with:
  - CSV Singer tap: https://github.com/singer-io/tap-s3-csv
  - Postgres Singer target: https://github.com/datamill-co/target-postgres
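  If you take the Singer route, the pipeline conceptually reduces to piping the tap into the target, along the lines of the sketch below. The config file names are placeholders, the exact config fields should be taken from each linked project's README, and discovery-based taps such as tap-s3-csv may also require generating and passing a catalog first:

  ```bash
  # Placeholder config file names; see the tap/target READMEs for required fields.
  # Taps and targets are normally installed in separate virtualenvs to avoid
  # dependency conflicts between them.
  tap-s3-csv --config tap_s3_csv_config.json | target-postgres --config target_postgres_config.json
  ```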
- Create a `nextsteps.md` in the repo and cover deployment considerations along with thoughts on further optimization for scale.
- We will be pulling from the master/main branch of your provided repo link.
- We should be able to run your code by running the following command: `docker-compose up`.
- We should be able to access the generated tables in the Postgres DB at `localhost:5432` (see the example below).
- Note: feel free to use patterns that you might otherwise avoid if they save a significant amount of time. However, be sure to call these out in your `nextsteps.md` and be prepared to discuss how you might implement them given additional time.
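A quick way to check that the tables are reachable from the host, assuming the Postgres container exposes the default `postgres` user and database (adjust to whatever the docker-compose file configures):

```bash
# Connect from the host and run the example count query against an ingested table
psql -h localhost -p 5432 -U postgres -d postgres \
  -c "select count(*) from olist_customers_dataset;"
```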