In this challenge, we will develop a simple Apache Airflow data pipeline.
Our data pipeline fetches data from News API, transforms the data into a tabular structure, and stores the transformed data on Amazon S3.
- I recommend starting with a fresh virtualenv using Python 3.6 on a *nix system.
- Our docker versions are docker 17.12.0-ce and docker-compose 1.18.0.
- Run `make init` to download project dependencies.
- Run `make test` to make sure basic smoke tests are passing.
- Run `make run` with Docker running to bring up Airflow.
- The Airflow UI/Admin Console should now be visible on http://localhost:8080.
- There will be a DAG named `sample_dag`. You should be able to view logs from the Task Instance Context Menu.
- Follow the requirements + rules of engagement below.
- Use Airflow to construct a new data pipeline (DAG) named 'tempus_challenge_dag'.
- Data pipeline is scheduled to run once a day.
- Data pipeline will:
- Retrieve all English news sources.
- For each news source, retrieve the top headlines.
- Top headlines must be flattened into a CSV file named `<pipeline_execution_date>_top_headlines.csv`.
- Result CSV must be uploaded to the following S3 location: `<s3_bucket>/<source_name>`.
- Build a separate pipeline that uses the following keywords instead of English news sources: Eric Lefkofsky, Cancer, Immunotherapy. (A minimal sketch of how the main pipeline might be wired up appears after this list.)
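As a starting point, here is a minimal sketch of how the first pipeline could be structured as an Airflow DAG. The helper callables (`fetch_sources`, `fetch_and_flatten_headlines`, `upload_to_s3`) are hypothetical placeholders you would implement yourself, and the import paths assume an Airflow 1.x environment like the one in the provided Docker setup.

```python
# dags/tempus_challenge_dag.py -- sketch only; helper functions are stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def fetch_sources(**context):
    """Retrieve all English news sources from News API (to be implemented)."""


def fetch_and_flatten_headlines(**context):
    """Fetch top headlines per source and flatten them into
    <execution_date>_top_headlines.csv (to be implemented)."""


def upload_to_s3(**context):
    """Upload the resulting CSV to <s3_bucket>/<source_name> (to be implemented)."""


default_args = {
    'owner': 'tempus',
    'start_date': datetime(2018, 1, 1),
}

dag = DAG(
    'tempus_challenge_dag',
    default_args=default_args,
    schedule_interval='@daily',  # run once a day
    catchup=False,
)

get_sources = PythonOperator(
    task_id='retrieve_english_news_sources',
    python_callable=fetch_sources,
    provide_context=True,
    dag=dag,
)

flatten_headlines = PythonOperator(
    task_id='flatten_top_headlines_to_csv',
    python_callable=fetch_and_flatten_headlines,
    provide_context=True,
    dag=dag,
)

upload_csv = PythonOperator(
    task_id='upload_csv_to_s3',
    python_callable=upload_to_s3,
    provide_context=True,
    dag=dag,
)

get_sources >> flatten_headlines >> upload_csv
```

The keyword-based pipeline required above can follow the same shape, swapping the source-listing task for top-headline queries on each keyword.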
From the Apache Airflow documentation:
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
In order to facilitate the use of Airflow, we have included a Dockerfile and a docker-compose.yml that can be used to set up a local airflow development environment. Make sure to have Docker and Docker Compose installed.
From the root folder, you can execute the following command to run airflow:
docker-compose up --build
The Airflow UI/Admin Console should now be visible on http://localhost:8080.
In order to build the data pipeline, it will be necessary to create a DAG. We have provided an example DAG, `dags/sample_dag.py`, that can be used as a reference. Further documentation can be found in the Airflow tutorial and the Airflow concepts pages.
To load a new DAG into Airflow, simply create a new Python file in the `dags` folder that contains an Airflow DAG object.
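For illustration, such a file can be very small. The filename `hello_dag.py` and the DAG id below are made up, and the import path assumes Airflow 1.x; the scheduler picks up any module in `dags/` that defines a module-level `DAG` object.

```python
# dags/hello_dag.py -- hypothetical minimal DAG file for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    'hello_dag',
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,  # only runs when triggered manually from the UI
)

say_hello = BashOperator(
    task_id='say_hello',
    bash_command='echo "hello from airflow"',
    dag=dag,
)
```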
To install additional Python packages (boto3, pandas, requests, etc.), add them to `requirements.txt`.
- https://airflow.apache.org/index.html
- https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
- https://speakerdeck.com/artwr/apache-airflow-dataengconf-sf-2017-workshop
- https://github.com/hgrif/airflow-tutorial
A simple REST API that can be used to retrieve breaking headlines and search for articles. A free News API account is required to obtain an API key.
| Route | Description |
|---|---|
| `/v2/top-headlines` | Returns live top and breaking headlines for a country, specific category in a country, single source, or multiple sources. |
| `/v2/sources` | Returns the subset of news publishers that top headlines are available from. |
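To give a feel for the two routes, here is a sketch that lists English-language sources, fetches the top headlines for one source, and flattens the JSON response into a CSV. The `NEWS_API_KEY` environment variable and the hard-coded date in the filename are assumptions for illustration; in the pipeline, the date would come from Airflow's execution date.

```python
# Sketch of calling the two News API routes and flattening the result.
import os

import pandas as pd
import requests

API_KEY = os.environ['NEWS_API_KEY']  # assumed environment variable
BASE_URL = 'https://newsapi.org/v2'

# /v2/sources -- all English-language news sources
sources_resp = requests.get(
    f'{BASE_URL}/sources',
    params={'language': 'en', 'apiKey': API_KEY},
)
sources_resp.raise_for_status()
sources = sources_resp.json()['sources']

# /v2/top-headlines -- top headlines for a single source
source_id = sources[0]['id']
headlines_resp = requests.get(
    f'{BASE_URL}/top-headlines',
    params={'sources': source_id, 'apiKey': API_KEY},
)
headlines_resp.raise_for_status()
articles = headlines_resp.json()['articles']

# Flatten the nested JSON (e.g. the embedded "source" object) into columns.
# Note: pd.json_normalize is top-level in pandas >= 1.0; older versions
# expose it as pandas.io.json.json_normalize.
df = pd.json_normalize(articles)
df.to_csv('2018-01-01_top_headlines.csv', index=False)
```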
A simple cloud storage service run by Amazon Web Services (AWS). An AWS account is needed to use AWS S3. Furthermore, AWS has a free tier that can be used for this challenge.
Amazon provides a Python SDK, boto3, with an easy-to-use API for interacting with AWS S3.
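For reference, uploading the generated CSV with boto3 might look like the sketch below. The bucket name and source id are hypothetical; credentials are resolved through boto3's normal chain (environment variables, `~/.aws/credentials`, or an instance role), and the key layout follows the `<s3_bucket>/<source_name>` requirement above.

```python
# Sketch of uploading the flattened CSV to S3 with boto3.
import boto3

s3 = boto3.client('s3')

bucket = 'my-tempus-challenge-bucket'   # hypothetical bucket name
source_name = 'bbc-news'                # one News API source id
local_csv = '2018-01-01_top_headlines.csv'

# Key follows the required layout: <s3_bucket>/<source_name>/<csv filename>
s3.upload_file(
    Filename=local_csv,
    Bucket=bucket,
    Key=f'{source_name}/{local_csv}',
)
```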