It is a fun way to assess your data skills. It is also a good representative sample of the work we do at Rearc.
- Data management / data engineering concepts.
- Programming languages (Python, Java, Scala, etc.).
- AWS knowledge (Lambda, SQS, CloudWatch logs).
- Infrastructure-as-code (Terraform, CloudFormation, etc.).
This quest consists of 4 parts. Put together, the 4 parts form a data pipeline architecture.
- Part 1 and Part 2 will showcase your skills with data management, AWS concepts, and your overall data engineering skillset. The goal is to source data from different places and store it in-house.
- Part 3 will showcase your data analytics skills. The goal is to find some interesting insights with data.
- Lastly, Part 4 will put all the pieces together. The goal here is to showcase your experience with automation and AWS services.
- Republish this open dataset in Amazon S3 and share with us a link.
- You may run into 403 Forbidden errors as you test accessing this data. There is a way to comply with the BLS data access policies and regain access to fetch this data programmatically. We have included some hints on how to do this at the bottom of this README in the Q/A section.
- Script this process so the files in the S3 bucket are kept in sync with the source when data on the website is updated, added, or deleted.
- Don't rely on hard-coded names. The script should be able to handle added or removed files.
- Ensure the script doesn't upload the same file more than once.
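The sync requirements above (handle added, changed, and removed files without re-uploading unchanged ones) come down to comparing two listings. The following is a minimal sketch, not a prescribed solution: `plan_sync` and the "fingerprint" idea are assumptions introduced for illustration, where a fingerprint is any value that changes when a file changes (a last-modified timestamp or a content hash, for example). The file names other than `pr.data.0.Current` are hypothetical.

```python
def plan_sync(source_files, bucket_files):
    """Compare two {name: fingerprint} mappings and decide what to change.

    Returns (to_upload, to_delete): names that are new or changed at the
    source, and names that exist in the bucket but no longer at the source.
    """
    to_upload = sorted(
        name
        for name, fingerprint in source_files.items()
        if bucket_files.get(name) != fingerprint  # missing or changed
    )
    to_delete = sorted(name for name in bucket_files if name not in source_files)
    return to_upload, to_delete


# Hypothetical listings, for illustration only.
source = {"pr.data.0.Current": "v2", "pr.series": "v1", "pr.txt": "v1"}
bucket = {"pr.data.0.Current": "v1", "pr.series": "v1", "pr.old": "v1"}

uploads, deletes = plan_sync(source, bucket)
print(uploads)  # ['pr.data.0.Current', 'pr.txt']
print(deletes)  # ['pr.old']
```

Because names come from the two listings rather than from constants, nothing is hard-coded, and an unchanged file (`pr.series` here) is never uploaded twice.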
- Create a script that will fetch data from this API. You can read the documentation here
- Save the result of this API call as a JSON file in S3.
- Load both the CSV file from Part 1 (`pr.data.0.Current`) and the JSON file from Part 2 as dataframes (Spark, PySpark, Pandas, Koalas, etc.).
- Using the dataframe from the population data API (Part 2), generate the mean and the standard deviation of the annual US population across the years [2013, 2018] inclusive.
- Using the dataframe from the time-series (Part 1), for every series_id, find the best year: the year with the max/largest sum of "value" for all quarters in that year. Generate a report with each series id, the best year for that series, and the summed value for that year. For example, if the table had the following values:

  | series_id   | year | period | value |
  |-------------|------|--------|-------|
  | PRS30006011 | 1995 | Q01    | 1     |
  | PRS30006011 | 1995 | Q02    | 2     |
  | PRS30006011 | 1996 | Q01    | 3     |
  | PRS30006011 | 1996 | Q02    | 4     |
  | PRS30006012 | 2000 | Q01    | 0     |
  | PRS30006012 | 2000 | Q02    | 8     |
  | PRS30006012 | 2001 | Q01    | 2     |
  | PRS30006012 | 2001 | Q02    | 3     |

  the report would generate the following table:

  | series_id   | year | value |
  |-------------|------|-------|
  | PRS30006011 | 1996 | 7     |
  | PRS30006012 | 2000 | 8     |

- Using both dataframes from Part 1 and Part 2, generate a report that will provide the `value` for `series_id = PRS30006032` and `period = Q01` and the `population` for that given year (if available in the population dataset). The below table shows an example of one row that might appear in the resulting table:

  | series_id   | year | period | value | Population |
  |-------------|------|--------|-------|------------|
  | PRS30006032 | 2018 | Q01    | 1.9   | 327167439  |

  Hints: when working with public datasets you sometimes might have to perform some data cleaning first. For example, you might find it useful to trim whitespace before doing any filtering or joins.

- Submit your analysis, your queries, and the outcome of the reports as a .ipynb file.
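To make the "best year" definition concrete, here is a small sketch that reproduces the worked example from the tables in this section. It uses plain Python rather than a dataframe library so that it runs without dependencies; the function name `best_year_per_series` is hypothetical, and a dataframe version would follow the same two steps (sum per series and year, then pick the largest sum per series). Ties are resolved in favor of the first year seen, which is an assumption, since the task does not specify tie-breaking.

```python
from collections import defaultdict

# The example rows from the task description: (series_id, year, period, value).
rows = [
    ("PRS30006011", 1995, "Q01", 1),
    ("PRS30006011", 1995, "Q02", 2),
    ("PRS30006011", 1996, "Q01", 3),
    ("PRS30006011", 1996, "Q02", 4),
    ("PRS30006012", 2000, "Q01", 0),
    ("PRS30006012", 2000, "Q02", 8),
    ("PRS30006012", 2001, "Q01", 2),
    ("PRS30006012", 2001, "Q02", 3),
]


def best_year_per_series(rows):
    """Return [(series_id, best_year, summed_value)] sorted by series_id."""
    # Step 1: sum the values per (series_id, year).
    totals = defaultdict(float)
    for series_id, year, _period, value in rows:
        totals[(series_id.strip(), year)] += value

    # Step 2: keep the year with the largest sum for each series.
    best = {}
    for (series_id, year), total in totals.items():
        if series_id not in best or total > best[series_id][1]:
            best[series_id] = (year, total)

    return [(sid, year, total) for sid, (year, total) in sorted(best.items())]


report = best_year_per_series(rows)
print(report)
# [('PRS30006011', 1996, 7.0), ('PRS30006012', 2000, 8.0)]
```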
- Using AWS CloudFormation, AWS CDK or Terraform, create a data pipeline that will automate the steps above.
- The deployment should include a Lambda function that executes Part 1 and Part 2 (you can combine both in one Lambda function). The Lambda function will be scheduled to run daily.
- The deployment should include an SQS queue that will be populated every time the JSON file is written to S3. (Hint: S3 - Notifications)
- For every message on the queue, execute a Lambda function that outputs the reports from Part 3 (just logging the results of the queries is enough; no .ipynb is required).
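When S3 notifications are delivered through SQS, the Lambda function receives SQS records whose `body` is the S3 event notification serialized as JSON. The sketch below shows one way to unpack that nesting. The names `extract_s3_objects` and `handler`, the bucket `example-bucket`, and the key `api_data.json` are illustrative assumptions, and the event is trimmed to the fields the code reads.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Return (bucket, key) pairs from an SQS-triggered Lambda event
    whose message bodies are S3 event notifications."""
    objects = []
    for message in sqs_event.get("Records", []):
        body = json.loads(message["body"])
        # Bodies without "Records" (such as the test message S3 sends when
        # a notification is first configured) are skipped.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: log which objects triggered the run."""
    for bucket, key in extract_s3_objects(event):
        # Output written with print ends up in CloudWatch Logs.
        print(f"New object s3://{bucket}/{key}; running Part 3 reports")


# A trimmed, hypothetical event for local experimentation.
s3_notification = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "api_data.json"}}}
    ]
}
sample_event = {"Records": [{"body": json.dumps(s3_notification)}]}

print(extract_s3_objects(sample_event))
# [('example-bucket', 'api_data.json')]
```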
You can do as many as you like. We suspect though that once you start you won't be able to stop. It's addictive.
- Link to data in S3 and source code (Step 1)
- Source code (Step 2)
- Source code in .ipynb file format and results (Step 3)
- Source code of the data pipeline infrastructure (Step 4)
We have many more for you to solve as a member of the Rearc team!
Do. Or do not. There is no fail.
No.
Hint 1
The BLS data access policies can be found here: https://www.bls.gov/bls/pss.htm
Hint 2
The policy page says: "BLS also reserves the right to block robots that do not contain information that can be used to contact the owner. Blocking may occur in real time."
How could you add information to your programmatic access requests to let BLS contact you?
Hint 3
Adding a `User-Agent` header to your request with contact information will comply with the BLS data policies and allow you to keep accessing their data programmatically.
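As a sketch of Hint 3, the snippet below attaches a `User-Agent` header that carries contact information. It uses the standard library's `urllib.request` so it runs without extra dependencies. The script name, version, and e-mail address in the header are placeholders, and the exact header format is an assumption: the policy only asks for information that lets BLS contact the owner.

```python
import urllib.request

BLS_URL = "https://download.bls.gov/pub/time.series/pr/"


def build_request(url, contact):
    """Build a GET request whose User-Agent identifies the script owner."""
    # Placeholder identity; replace with your own tool name and contact.
    headers = {"User-Agent": f"data-sync-script/1.0 ({contact})"}
    return urllib.request.Request(url, headers=headers)


request = build_request(BLS_URL, "you@example.com")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
# data-sync-script/1.0 (you@example.com)

# Sending the request requires network access:
# with urllib.request.urlopen(request, timeout=30) as response:
#     html = response.read().decode("utf-8")
```

With the `requests` library, the same dictionary can be passed as `requests.get(url, headers=headers)`.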
This project addresses the four parts of the data engineering challenge using AWS Lambda functions and Terraform for infrastructure management.
The `handle_sync` Lambda function in `lambda_function.py` handles downloading data:
- It calls `sync_s3_with_source` from `publish.py` to download all files from the BLS website (https://download.bls.gov/pub/time.series/pr/) and store them in an S3 bucket.
- The function uses the `requests` library to fetch the HTML content of the webpage and `BeautifulSoup` to parse it and extract file links.
- Each file is then downloaded and uploaded to the specified S3 bucket.
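The link-extraction step can be illustrated as follows. This is not the code in `publish.py`: the sketch swaps `BeautifulSoup` for the standard library's `html.parser` so that it runs without third-party packages, and the helper names (`LinkCollector`, `list_file_urls`) and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://download.bls.gov/pub/time.series/pr/"


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_file_urls(html, base_url):
    """Return absolute URLs of files listed directly under base_url."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        # Skip the parent-directory link and any sub-directories.
        if url.startswith(base_url) and url != base_url and not url.endswith("/"):
            urls.append(url)
    return urls


# Hypothetical directory-listing markup.
sample_html = """
<a href="/pub/time.series/">[To Parent Directory]</a><br>
<a href="/pub/time.series/pr/pr.data.0.Current">pr.data.0.Current</a><br>
<a href="/pub/time.series/pr/pr.series">pr.series</a><br>
"""

for url in list_file_urls(sample_html, BASE_URL):
    print(url)
```

Deriving the file list from the page itself is what lets the sync handle files that are added or removed at the source.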
The `handle_sync` function also manages API data retrieval:
- It calls `fetch_api_data` from `fetch.py` to fetch population data from the Data USA API (https://datausa.io/api/data?drilldowns=Nation&measures=Population).
- The function uses the `requests` library to make the API call and retrieve the JSON data.
- The data is then stored in the S3 bucket as 'api_data.json'.
The `handle_analysis` Lambda function in `lambda_function.py` performs the required data analysis:
- It reads the population data from 'api_data.json' in S3 and the BLS data from 'pr.data.0.Current' in S3.
- The function calculates the mean and standard deviation of the population between 2013 and 2018.
- It determines the best year (highest sum of values) for each series_id in the BLS data.
- A report is generated combining BLS data for series 'PRS30006032' in Q1 with the corresponding population data.
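The population statistics and the combined report can be sketched as below. This is an illustration in plain Python, not the code in `lambda_function.py`. The record shape (a list of objects with `Year` and `Population` fields) is an assumption about the API response, all numbers are toy values rather than real data, and the helper names are hypothetical. The sketch uses the sample standard deviation (`statistics.stdev`); use `statistics.pstdev` if the population form is wanted.

```python
import statistics

# Toy records in the assumed shape of the API data; values are made up.
api_records = [
    {"Year": "2012", "Population": 50},
    {"Year": "2013", "Population": 100},
    {"Year": "2014", "Population": 110},
    {"Year": "2015", "Population": 120},
    {"Year": "2016", "Population": 130},
    {"Year": "2017", "Population": 140},
    {"Year": "2018", "Population": 150},
    {"Year": "2019", "Population": 500},
]

# Toy BLS rows; note the trailing whitespace that needs trimming.
bls_rows = [
    {"series_id": "PRS30006032   ", "year": "1995", "period": "Q01", "value": "0.0"},
    {"series_id": "PRS30006032   ", "year": "2018", "period": "Q01", "value": "1.9"},
    {"series_id": "PRS30006032   ", "year": "2018", "period": "Q02", "value": "0.5"},
    {"series_id": "PRS30006011   ", "year": "2018", "period": "Q01", "value": "2.0"},
]


def population_stats(records, first_year=2013, last_year=2018):
    """Mean and sample standard deviation over an inclusive year range."""
    values = [
        r["Population"]
        for r in records
        if first_year <= int(r["Year"]) <= last_year
    ]
    return statistics.mean(values), statistics.stdev(values)


def series_with_population(rows, records, series_id="PRS30006032", period="Q01"):
    """Join filtered BLS rows to population by year (None if unavailable)."""
    population_by_year = {int(r["Year"]): r["Population"] for r in records}
    report = []
    for row in rows:
        # Trim whitespace before comparing, as the task hint suggests.
        if row["series_id"].strip() == series_id and row["period"].strip() == period:
            year = int(row["year"])
            report.append(
                (series_id, year, period, float(row["value"]), population_by_year.get(year))
            )
    return report


mean, stdev = population_stats(api_records)
print(mean, stdev)
print(series_with_population(bls_rows, api_records))
```

Rows whose year has no population record are kept with `None`, which matches a left join; dropping them instead would be an equally defensible reading of "if available".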
The project uses Terraform to manage AWS infrastructure:
- Terraform configuration files in the `terraform` folder define the required AWS resources.
- This includes Lambda functions, S3 bucket, IAM roles, CloudWatch log groups, and an SQS queue.
- To deploy:
  - Navigate to the `terraform` folder
  - Run `terraform init` to initialize Terraform
  - Run `terraform plan` to preview changes
  - Run `terraform apply` to create or update the infrastructure
- The `handle_sync` function is triggered periodically (e.g., daily) to update data.
- When new data is available in S3, a message is sent to the SQS queue.
- The SQS message triggers the `handle_analysis` function to process the new data.
- Results of the analysis are returned as a JSON response.
This serverless architecture keeps processing event-driven: the sync runs on a schedule, and the analysis runs only when new data lands in S3.