Rearc Data Quest

Q. What is this quest?

It is a fun way to assess your data skills. It is also a good representative sample of the work we do at Rearc.

Q. So what skills should I have?

  • Data management / data engineering concepts.
  • A programming language (Python, Java, Scala, etc.).
  • AWS knowledge (Lambda, SQS, CloudWatch Logs).
  • Infrastructure as code (Terraform, CloudFormation, etc.).

Q. What do I have to do?

This quest consists of four parts; put together, they form a complete data pipeline architecture.

  • Part 1 and Part 2 will showcase your skills with data management, AWS concepts, and your overall data engineering skillset. The goal is to source data from different places and store it in-house.
  • Part 3 will showcase your data analytics skills. The goal is to find some interesting insights with data.
  • Lastly, Part 4 will put all the pieces together. The goal here is to showcase your experience with automation and AWS services.

Part 1: AWS S3 & Sourcing Datasets

  1. Republish this open dataset in Amazon S3 and share a link with us.
    • You may run into 403 Forbidden errors as you test accessing this data. There is a way to comply with the BLS data access policies and regain access to fetch this data programmatically - we have included some hints as to how to do this at the bottom of this README in the Q/A section.
  2. Script this process so the files in the S3 bucket are kept in sync with the source when data on the website is updated, added, or deleted.
    • Don't rely on hard-coded file names - the script should be able to handle files being added or removed.
    • Ensure the script doesn't upload the same file more than once.

Part 2: APIs

  1. Create a script that will fetch data from this API: https://datausa.io/api/data?drilldowns=Nation&measures=Population (the API documentation is available on the Data USA website).
  2. Save the result of this API call as a JSON file in S3.

Part 3: Data Analytics

  1. Load both the CSV file from Part 1 (pr.data.0.Current) and the JSON file from Part 2 as dataframes (Spark, PySpark, Pandas, Koalas, etc.).

  2. Using the dataframe from the population data API (Part 2), generate the mean and the standard deviation of the annual US population across the years [2013, 2018] inclusive.

  3. Using the dataframe from the time-series data (Part 1), find the best year for every series_id: the year with the largest sum of "value" across all quarters in that year. Generate a report with each series_id, its best year, and the summed value for that year. For example, if the table had the following values:

    series_id    year  period  value
    PRS30006011  1995  Q01     1
    PRS30006011  1995  Q02     2
    PRS30006011  1996  Q01     3
    PRS30006011  1996  Q02     4
    PRS30006012  2000  Q01     0
    PRS30006012  2000  Q02     8
    PRS30006012  2001  Q01     2
    PRS30006012  2001  Q02     3

    the report would generate the following table:

    series_id    year  value
    PRS30006011  1996  7
    PRS30006012  2000  8
  4. Using both dataframes from Part 1 and Part 2, generate a report that will provide the value for series_id = PRS30006032 and period = Q01 and the population for that given year (if available in the population dataset). The below table shows an example of one row that might appear in the resulting table:

    series_id    year  period  value  Population
    PRS30006032  2018  Q01     1.9    327167439

    Hint: when working with public datasets you sometimes have to perform some data cleaning first. For example, you might find it useful to trim whitespace before doing any filtering or joins.

  5. Submit your analysis, your queries, and the outcome of the reports as a .ipynb file.

Part 4: Infrastructure as Code & Data Pipeline with AWS CDK

  1. Using AWS CloudFormation, AWS CDK or Terraform, create a data pipeline that will automate the steps above.
  2. The deployment should include a Lambda function that executes Part 1 and Part 2 (you can combine both in one Lambda function). The Lambda function will be scheduled to run daily.
  3. The deployment should include an SQS queue that is populated every time the JSON file is written to S3. (Hint: S3 Notifications)
  4. For every message on the queue, execute a Lambda function that outputs the reports from Part 3 (just logging the results of the queries is enough; no .ipynb is required).

Q. Do I have to do all these?

You can do as many as you like. We suspect though that once you start you won't be able to stop. It's addictive.

Q. What do I have to submit?

  1. Link to the data in S3 and the source code (Part 1)
  2. Source code (Part 2)
  3. Source code in .ipynb file format and the results (Part 3)
  4. Source code of the data pipeline infrastructure (Part 4)

Q. What if I successfully complete all the steps?

We have many more for you to solve as a member of the Rearc team!

Q. What if I fail?

Do. Or do not. There is no fail.

Q. Can I share this quest with others?

No.

Q. How do I get around the 403 error when I try to fetch BLS data?

Hint 1: The BLS data access policies can be found here: https://www.bls.gov/bls/pss.htm
Hint 2: The policy page says:

BLS also reserves the right to block robots that do not contain information that can be used to contact the owner. Blocking may occur in real time.

How could you add information to your programmatic access requests to let BLS contact you?

Hint 3: Adding a User-Agent header to your request with contact information will comply with the BLS data policies and allow you to keep accessing their data programmatically.
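
As an illustration, a programmatic request that identifies the caller might look like the short Python sketch below; the User-Agent string and contact address are placeholders, not values the quest prescribes.

    import requests

    # Identify yourself per the BLS data access policy; the contact address is a placeholder.
    headers = {"User-Agent": "data-quest-sync (contact: your.name@example.com)"}

    resp = requests.get(
        "https://download.bls.gov/pub/time.series/pr/pr.data.0.Current",
        headers=headers,
    )
    resp.raise_for_status()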

==============================================

Data Engineering Challenge Solution

This project addresses the four parts of the data engineering challenge using AWS Lambda functions and Terraform for infrastructure management.

Part 1: Downloading Data

The handle_sync Lambda function in lambda_function.py handles downloading data:

  1. It calls sync_s3_with_source from publish.py to download all files from the BLS website (https://download.bls.gov/pub/time.series/pr/) and store them in an S3 bucket.
  2. The function uses the requests library to fetch the HTML content of the webpage and BeautifulSoup to parse it and extract file links.
  3. Each file is then downloaded and uploaded to the specified S3 bucket.
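
A simplified sketch of what this sync could look like is shown below. The bucket name, the User-Agent contact details, and the ETag-based check for skipping unchanged files are assumptions for illustration, not a verbatim copy of publish.py.

    import hashlib

    import boto3
    import requests
    from bs4 import BeautifulSoup

    BLS_URL = "https://download.bls.gov/pub/time.series/pr/"
    HEADERS = {"User-Agent": "data-quest-sync (contact: your.name@example.com)"}  # placeholder contact
    BUCKET = "rearc-quest-data"  # assumed bucket name

    s3 = boto3.client("s3")

    def sync_s3_with_source():
        # Parse the BLS directory listing and collect the linked file names.
        page = requests.get(BLS_URL, headers=HEADERS)
        page.raise_for_status()
        soup = BeautifulSoup(page.text, "html.parser")
        source_files = {a["href"].rsplit("/", 1)[-1]
                        for a in soup.find_all("a", href=True)
                        if a["href"].rsplit("/", 1)[-1].startswith("pr.")}

        # Objects already in the bucket, keyed by name with their MD5-based ETag.
        existing = {obj["Key"]: obj["ETag"].strip('"')
                    for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", [])}

        for name in source_files:
            body = requests.get(BLS_URL + name, headers=HEADERS).content
            # Skip the upload when the stored copy already has the same content hash.
            if existing.get(name) == hashlib.md5(body).hexdigest():
                continue
            s3.put_object(Bucket=BUCKET, Key=name, Body=body)

        # Remove objects that no longer exist at the source.
        for name in set(existing) - source_files:
            s3.delete_object(Bucket=BUCKET, Key=name)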

Part 2: API Data Retrieval

The handle_sync function also manages API data retrieval:

  1. It calls fetch_api_data from fetch.py to fetch population data from the Data USA API (https://datausa.io/api/data?drilldowns=Nation&measures=Population).
  2. The function uses the requests library to make the API call and retrieve the JSON data.
  3. The data is then stored in the S3 bucket as 'api_data.json'.
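
A minimal sketch of this step, reusing the assumed bucket name from the Part 1 sketch:

    import json

    import boto3
    import requests

    API_URL = "https://datausa.io/api/data?drilldowns=Nation&measures=Population"
    BUCKET = "rearc-quest-data"  # assumed bucket name, matching the Part 1 sketch

    def fetch_api_data():
        # Retrieve the population data and store the raw JSON payload in S3.
        resp = requests.get(API_URL, timeout=30)
        resp.raise_for_status()
        boto3.client("s3").put_object(
            Bucket=BUCKET,
            Key="api_data.json",
            Body=json.dumps(resp.json()),
            ContentType="application/json",
        )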

Part 3: Data Analysis

The handle_analysis Lambda function in lambda_function.py performs the required data analysis:

  1. It reads the population data from 'api_data.json' in S3 and the BLS data from 'pr.data.0.Current' in S3.
  2. The function calculates the mean and standard deviation of the population between 2013 and 2018 (inclusive).
  3. It determines the best year (highest sum of values) for each series_id in the BLS data.
  4. A report is generated combining BLS data for series 'PRS30006032' in Q1 with the corresponding population data.
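
A condensed pandas sketch of these three queries is shown below. The bucket name and object keys follow the assumptions above, and it assumes the Data USA payload is a JSON object with a "data" list containing "Year" and "Population" fields; the actual handle_analysis code may differ in detail.

    import json

    import boto3
    import pandas as pd

    BUCKET = "rearc-quest-data"  # assumed bucket name

    s3 = boto3.client("s3")

    # pr.data.0.Current is tab-delimited and its fields can carry stray whitespace,
    # so trim string columns before filtering or joining.
    bls = pd.read_csv(s3.get_object(Bucket=BUCKET, Key="pr.data.0.Current")["Body"], sep="\t")
    bls.columns = bls.columns.str.strip()
    bls["series_id"] = bls["series_id"].str.strip()
    bls["period"] = bls["period"].str.strip()
    bls["value"] = pd.to_numeric(bls["value"], errors="coerce")

    pop_json = json.load(s3.get_object(Bucket=BUCKET, Key="api_data.json")["Body"])
    pop = pd.DataFrame(pop_json["data"])[["Year", "Population"]].astype(int)

    # Mean and standard deviation of the annual US population, 2013-2018 inclusive.
    subset = pop[pop["Year"].between(2013, 2018)]
    print(subset["Population"].mean(), subset["Population"].std())

    # Best year per series_id: the year with the largest sum of quarterly values.
    yearly = bls.groupby(["series_id", "year"], as_index=False)["value"].sum()
    best_years = yearly.loc[yearly.groupby("series_id")["value"].idxmax()]
    print(best_years)

    # PRS30006032 / Q01 values joined with the population for the matching year.
    report = (bls[(bls["series_id"] == "PRS30006032") & (bls["period"] == "Q01")]
              [["series_id", "year", "period", "value"]]
              .merge(pop, left_on="year", right_on="Year", how="left")
              .drop(columns="Year"))
    print(report)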

Part 4: Infrastructure as Code

The project uses Terraform to manage AWS infrastructure:

  1. Terraform configuration files in the terraform folder define the required AWS resources.
  2. This includes Lambda functions, an S3 bucket, IAM roles, CloudWatch log groups, and an SQS queue.
  3. To deploy:
    • Navigate to the terraform folder
    • Run terraform init to initialize Terraform
    • Run terraform plan to preview changes
    • Run terraform apply to create or update the infrastructure

Execution Flow

  1. The handle_sync function is triggered periodically (e.g., daily) to update data.
  2. When new data is available in S3, a message is sent to the SQS queue.
  3. The SQS message triggers the handle_analysis function to process the new data.
  4. Results of the analysis are returned as a JSON response.
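
A minimal sketch of how the two handlers in lambda_function.py could be wired to these triggers. The event-shape handling and the run_reports helper are illustrative assumptions; only sync_s3_with_source and fetch_api_data are named in the project description above.

    import json

    from publish import sync_s3_with_source
    from fetch import fetch_api_data
    from analysis import run_reports  # hypothetical helper that runs the Part 3 queries

    def handle_sync(event, context):
        # Scheduled (e.g. daily) entry point: refresh the BLS files and the API JSON.
        sync_s3_with_source()   # Part 1
        fetch_api_data()        # Part 2
        return {"status": "sync complete"}

    def handle_analysis(event, context):
        # SQS-triggered entry point: one invocation may carry several records.
        for record in event.get("Records", []):
            print("Triggered by S3 notification:", record.get("body"))
            results = run_reports()          # Part 3 queries
            print(json.dumps(results, default=str))
        return {"status": "analysis complete"}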

This serverless architecture ensures efficient, event-driven data processing and analysis.