
Takehome Challenge for Data Engineering


Bungalow Take Home Project for Data Engineer Role (V2. 2021-11-02)

Welcome to the Bungalow Takehome Challenge for Data Engineering! This is a barebones repo to get you started.

What to build

A common task for data engineers at Bungalow involves integrating third-party data, modelling it, storing it, and making it available to downstream teams such as analytics, data science, and ultimately the entire organization.

For this test, we will collect the current weather data (https://openweathermap.org/current) from OpenWeatherMap. The free API is sufficient for this assignment; you shouldn't pay for the API.

For this challenge, we'd like to give you a brief snapshot of what a common workload may entail. Of course, this could become a big task. To save you time, we did some of the heavy lifting, such as the setup and some scaffolding of the environment.

Please install Docker Desktop on your laptop. It will provide the environment needed for the next steps.

The Docker Compose file defines two applications, along with the simple setup they require:

  • Airflow: To run your additions to the boilerplate DAGs.

  • Postgres: To maintain your tables. (You can swap it for any other database of your choice, e.g. SQLite or MySQL.)

Below are the steps in the data flow diagram:

  • The fetcher.py script, which represents the Fetcher DAG, retrieves the data from the current weather API.

  • The fetcher script processes and cleans the data, then stores it in the Postgres database, taking relationships, integrity, performance, and extendability into account.

  • The transformer.py script, which represents the Transformer DAG, transforms the data from the previous step to prepare some derived dataset tables. You can implement the transformations in either Python or SQL.

  • The Transformer writes the datasets back to Postgres.

  • The downstream customer(s) read both the original and the derived tables. They will execute historical queries to run analytics and data science models.
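The fetch-and-clean step above could be sketched roughly as follows. This is a minimal illustration, not the repository's fetcher.py: the function names (`build_url`, `fetch_current_weather`, `clean`) and the output column names are hypothetical, and the sketch assumes the request parameters `lat`, `lon`, `appid` and `units` and the response fields `coord`, `main`, `wind`, `weather`, `dt`, `id` and `name` of the current weather API.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the OpenWeatherMap current weather API.
API_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_url(lat: float, lon: float, api_key: str) -> str:
    """Build the request URL; units=metric asks for Celsius and m/s."""
    query = urlencode({"lat": lat, "lon": lon, "appid": api_key, "units": "metric"})
    return f"{API_URL}?{query}"


def fetch_current_weather(lat: float, lon: float, api_key: str, timeout: int = 10) -> dict:
    """Call the API and return the decoded JSON payload (needs network access)."""
    with urlopen(build_url(lat, lon, api_key), timeout=timeout) as response:
        return json.load(response)


def clean(payload: dict) -> dict:
    """Flatten the nested payload into one row (hypothetical column names)."""
    main = payload.get("main", {})
    wind = payload.get("wind", {})
    conditions = payload.get("weather") or [{}]
    return {
        "city_id": payload["id"],
        "city_name": payload.get("name"),
        "lat": payload["coord"]["lat"],
        "lon": payload["coord"]["lon"],
        # The API reports the observation time as a Unix timestamp in UTC.
        "observed_at": datetime.fromtimestamp(payload["dt"], tz=timezone.utc),
        "temp_c": main.get("temp"),
        "humidity_pct": main.get("humidity"),
        "pressure_hpa": main.get("pressure"),
        "wind_speed_ms": wind.get("speed"),
        "condition": conditions[0].get("main"),
    }
```

Keeping `clean` separate from the network call means the cleaning logic can be checked against a saved payload without an API key; the identifying fields are required (a missing one raises `KeyError`), while the measurements default to `None`.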

This project is meant to be flexible so that you can showcase your decision-making and your overall technical experience. If you are uncomfortable with Docker, Postgres or Airflow, please feel free to remove or replace them. They are meant to be a starting point for you. As long as you can achieve the outcome, feel free to use any additional tooling and approach you see fit. We will ask follow-up questions about your decisions in the follow-up conversation.

We are more interested in seeing your thought process and approach to solving the problem!

Deliverables

We will expect to see the following items in your GitHub pull request:

  • Your Python code for the data fetcher and transformer.

  • The SQL for your data model and the design behind it.

  • A Readme file with your notes.

Evaluation

We will use this project as the basis for evaluating your overall fit for a data engineering role from a technical viewpoint.

To do this, we will review your code with an eye for the following:

  • Readability, scalability and usability

  • Data processing and relational modelling

  • Python and SQL know-how

Time expectations

We know you are busy and likely have other commitments in your life, so we don't want to take too much of your time. We don't expect you to spend more than 2 hours working on this project. That being said, if you choose to put more or less time into it for whatever reason, that is your choice.

Feel free to indicate in your notes below if you worked on this for a different amount of time and we will keep that in mind while evaluating the project. You can also provide us with additional context if you would like to.

Additionally, we have left a spot below for your notes. If you have ideas for pieces that you would have done differently, or additional things you would have implemented if you had more time, you can indicate those in your notes below as well, and we will use them as part of the evaluation.

Public forks

We encourage you to try this project without looking at the solutions others may have posted. This will give the most honest representation of your abilities and skills. However, we also recognize that day-to-day programming often involves looking at solutions others have provided and iterating on them. Being able to pick out the best parts and truly understand them well enough to make good choices about what to copy and what to pass on by is a skill in and of itself. As such, if you do end up referencing someone else's work and building upon it, we ask that you note that as a comment. Provide a link to the source so we can see the original work and any modifications that you chose to make.

Challenge instructions

Fork this repository and clone to your local environment

  • Prepare your environment with Python and any other tools you may need. Docker can do it for you.
    • To start the Docker Compose environment, run the following commands:
      # Initializing the folders and the non-root user for Airflow
      mkdir -p  ./logs ./plugins
      echo -e "AIRFLOW_UID=$(id -u)" > .env
      # Initializing airflow database
      docker-compose up airflow-init
      # Running the docker-compose
      docker-compose up 
      # You can see the Airflow UI at http://localhost:8080 with username/password: airflow
    • If you run into any problems with the environment, please refer to here.
  • Fill in the TODOs in the repository. There are currently 6 TODOs, but you can go above and beyond.
    • Any problems with the DAGs? They are taken from here. Please take a look at the rest of the tutorial if needed.
    • You can read about the Postgres operator here.
    • To keep it simple, let's use the Airflow database to store your dataset.
  • Write down your notes in the Readme.md file.
  • Complete the challenge and push your changes back to the repo.
    • If you have any questions at any step, please reach out to your recruiter. A member of the engineering team will be available to support you, as if you were working for Bungalow.

Your notes (Readme.md)

Assumptions:

  • Location data won't change often, so storing it in a separate table with a foreign key from the weather data should suffice
  • Latitude and longitude are verified to be well formed prior to the API call
  • Metric units are used for storing all parameters
  • A Postgres connection exists in the Airflow admin panel
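As a rough illustration of the first and third assumptions, the sketch below keeps locations in their own table, references them from the weather rows through a foreign key, and stores metric values. It is not the schema submitted with this project: the table names, column names and the `DAILY_SUMMARY` derived dataset are hypothetical, and it uses an in-memory SQLite database (one of the alternatives the brief allows) instead of Postgres so that it runs without any setup.

```python
import sqlite3

SCHEMA = """
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    name        TEXT,
    lat         REAL NOT NULL,
    lon         REAL NOT NULL,
    UNIQUE (lat, lon)
);
CREATE TABLE weather_observation (
    location_id  INTEGER NOT NULL REFERENCES location (location_id),
    observed_at  TEXT    NOT NULL,  -- UTC timestamp, 'YYYY-MM-DD HH:MM:SS'
    temp_c       REAL,              -- metric units throughout
    humidity_pct REAL,
    PRIMARY KEY (location_id, observed_at)
);
"""

# A hypothetical derived dataset: one row per location and day.
DAILY_SUMMARY = """
SELECT location_id,
       date(observed_at) AS day,
       AVG(temp_c)       AS avg_temp_c,
       COUNT(*)          AS readings
FROM weather_observation
GROUP BY location_id, date(observed_at)
ORDER BY location_id, day
"""


def connect() -> sqlite3.Connection:
    """Open an in-memory database with foreign key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def store(conn, name, lat, lon, observed_at, temp_c, humidity_pct) -> int:
    """Insert the location once, then add the observation that points to it."""
    conn.execute(
        "INSERT OR IGNORE INTO location (name, lat, lon) VALUES (?, ?, ?)",
        (name, lat, lon),
    )
    (location_id,) = conn.execute(
        "SELECT location_id FROM location WHERE lat = ? AND lon = ?", (lat, lon)
    ).fetchone()
    # Re-running the same fetch does not duplicate the observation.
    conn.execute(
        "INSERT OR IGNORE INTO weather_observation VALUES (?, ?, ?, ?)",
        (location_id, observed_at, temp_c, humidity_pct),
    )
    return location_id
```

Calling `store` twice for the same coordinates reuses the existing location row, and running `conn.execute(DAILY_SUMMARY)` afterwards aggregates the observations per location and day. In Postgres the same idea would likely use `INSERT ... ON CONFLICT DO NOTHING` and a `TIMESTAMPTZ` column in place of the SQLite-specific pieces.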

Time spent

Give us a rough estimate of the time you spent working on this. If you spent time learning in order to do this project, please feel free to let us know that too. This helps us evaluate your work fairly and in context. It also gives us the opportunity to learn and adjust our process if needed.

Assumptions

Did you find yourself needing to make assumptions to finish this? If so, what were they and how did they impact your design/code?

Next steps

Provide us with some notes about what you would do next if you had more time. Are there additional features that you would want to add? Specific improvements to your code you would make?

Instructions to the evaluator

Provide any end user documentation you think is necessary and useful here