Building an ETL pipeline from device to cloud
Blog post series on h3xagn.com
In this series of blogs I covered building an ETL data pipeline using Python. The data comes from 80 field devices installed on mobile equipment. Data is collected by an on-premise FastAPI web server which processes, transforms and uploads the data to Azure. The data is ingested into Azure Data Explorer and then made available to end users. A separate Azure Function app also sends data to the device vendor.
Each blog focuses on a specific part of the pipeline that moves data from the field device to the cloud. The code here reflects the latest improvements; see the BONUS blog for more information.
- Part 1. Building a simple, secure web server
- Part 2. Scaling up the web server with FastAPI
- Part 3. Creating the ETL module as a background worker
- Part 4. Moving data to the Azure Blob storage
- Part 5. Creating a function app to send data to the vendor
- Part 6. Ingesting data into Azure Data Explorer
- BONUS. Improving our data pipeline
Repo structure and file descriptions
01 Web Server
This folder contains the FastAPI web server that receives data from the field devices, transforms it and uploads it to the cloud.
- `main.py`: FastAPI web server using uvicorn.
- `utils/process_json.py`: Background task to process the JSON file, transform the data to CSV and upload it to Azure Blob Storage.

The dependencies and virtual environment details are located in the `Pipfile`, which can be used with `pipenv`.
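A minimal sketch of the receive-and-process pattern, assuming a hypothetical `/data` endpoint, an untyped payload and a stand-in `process_json` helper; the real endpoint, validation and transformation logic live in `main.py` and `utils/process_json.py`:

```python
# Sketch only: the endpoint path, payload shape and process_json body are assumptions.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def process_json(payload: dict) -> None:
    # Stand-in for utils/process_json.py: transform the JSON to CSV
    # and upload the result to Azure Blob Storage.
    ...


@app.post("/data")
async def receive_data(payload: dict, background_tasks: BackgroundTasks):
    # Hand the heavy lifting to a background task so the device gets a fast response.
    background_tasks.add_task(process_json, payload)
    return {"status": "accepted"}
```

Run with `uvicorn main:app` as usual.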
02 Function App
This folder contains the Azure Function app code that is triggered by the blob upload and sends data to the vendor.
- `main.py`: The function app that reads the compressed JSON file and sends it to an API endpoint.
- `function.json`: The configuration of the function app which sets the trigger path.
- `requirements.txt`: Contains the dependencies for the function app.
Ensure all other Application Settings are updated in the Azure Portal for the function app.
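For reference, a rough sketch of a blob-triggered function along these lines, assuming a gzip-compressed JSON blob and a vendor URL and API key stored in Application Settings (the setting names and header are made up, not the repo's actual configuration):

```python
import gzip
import json
import os

import azure.functions as func
import requests


def main(myblob: func.InputStream) -> None:
    # Decompress and parse the JSON blob that triggered the function.
    payload = json.loads(gzip.decompress(myblob.read()))

    # Forward the payload to the vendor endpoint. VENDOR_API_URL and
    # VENDOR_API_KEY are assumed Application Settings, not the repo's names.
    response = requests.post(
        os.environ["VENDOR_API_URL"],
        json=payload,
        headers={"x-api-key": os.environ["VENDOR_API_KEY"]},
        timeout=30,
    )
    response.raise_for_status()
```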
03 Azure Data Explorer
This folder contains the Azure Data Explorer commands to set up ADX as well as the queries to test our data model that will be exported to the Data Lake.
- `kusto-commands.kql`: Contains all the commands to set up tables, exports and continuous exports.
- `kusto-data-model-queries.kql`: Contains the queries to test data quality rules and joins with dimension tables.
- `test_data`: Contains three test CSV files to test the data model.
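The `.kql` files are intended to be run in the ADX web UI or Kusto Explorer. If you prefer to script the setup, here is a rough sketch using the `azure-kusto-data` package; the cluster URL and database name are placeholders, and the sample query is illustrative only:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster and database; authenticates via the Azure CLI login.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<cluster>.<region>.kusto.windows.net"
)
client = KustoClient(kcsb)

# Control commands (anything starting with a dot) go through execute_mgmt.
client.execute_mgmt("MyDatabase", ".show tables")

# Plain queries, e.g. the checks in kusto-data-model-queries.kql, use execute.
response = client.execute("MyDatabase", "factOilData | take 10")
for row in response.primary_results[0]:
    print(row)
```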
Note that for `factOilData`, the data quality checks are built into the continuous export query. The quality checks implemented are:
- removing duplicate records based on `TimeStamp` and `TagName`.
- removing invalid data, i.e. where `Value` is below 0 or above 10 000.
- removing records where `TagName` is empty.
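The same three rules can be expressed in pandas to sanity-check the test CSVs locally; the file name below is illustrative:

```python
import pandas as pd

# Illustrative file name; point this at any of the CSVs in test_data.
df = pd.read_csv("test_data/sample.csv")

# 1. Remove duplicate records based on TimeStamp and TagName.
df = df.drop_duplicates(subset=["TimeStamp", "TagName"])

# 2. Remove invalid data: Value below 0 or above 10 000.
df = df[(df["Value"] >= 0) & (df["Value"] <= 10_000)]

# 3. Remove records where TagName is empty.
df = df[df["TagName"].notna() & (df["TagName"].str.strip() != "")]
```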
License
GNU GPL v3