This project involves an ETL (Extract, Transform, Load) process to analyze sleep data exported from Apple Health to iCloud in XML format.
The data is then processed and transformed using AWS services, queried through Amazon Athena, and visualized using a Streamlit dashboard.
- Export Health Data from iPhone to iCloud in XML format.
- Load the data into an Amazon S3 bucket.
- Set up an AWS Lambda function to process the XML data into a CSV file and store it in the S3 bucket.
- Set up another AWS Lambda function to further transform the data using DuckDB and Pandas.
- The transformed data is then passed to another Lambda function, which saves it as Parquet files in an S3 bucket, partitioned by year.
- Set up AWS Glue Crawlers to crawl the Parquet files stored in the S3 bucket and store the data in the AWS Glue Data Catalog table, partitioned by year.
- Finally, a Streamlit dashboard is set up on an Amazon EC2 instance to display sleep analytics over the years.
- AWS Services: S3, Lambda, Glue, Athena, SNS, EC2
- Python Libraries: boto3, lxml, s3fs, awswrangler, pandas, duckdb, streamlit
- Data Processing: DuckDB
- Analytics and Visualization: Athena, Streamlit
The above tech stack and an iCloud account with Apple Health data synced regularly from an Apple Watch are required. If you don't have an account, you can download my Health Dataset.
- The data is exported from Apple Health to iCloud (Download here).
- The data is then transferred from the local machine to an S3 bucket using the AWS CLI.
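The CLI copy is a one-liner along the lines of `aws s3 cp export.xml s3://<your_bucket>/raw/export.xml`. If you prefer to do the upload from Python instead, a minimal boto3 sketch (the bucket name and key below are placeholders, not the project's actual names):

```python
import boto3

# Upload the Apple Health export to S3; the bucket name and key are placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="export.xml",            # local path to the downloaded export
    Bucket="my-health-data-bucket",   # replace with your bucket
    Key="raw/export.xml",             # destination key in the bucket
)
```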
- An AWS Lambda function (Process_XML) is set up to transform the raw XML data and save it as CSV files in an S3 bucket. If the function's execution fails, its error response is redirected to an AWS SNS topic (a rough sketch of such a handler is shown below).
Processed CSV Files:
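For reference, here is what such a handler could look like, assuming the standard Apple Health export layout (a flat list of `<Record>` elements whose data lives in XML attributes); the bucket name, object keys, and topic ARN are placeholders, not the project's actual values:

```python
import csv
import io

import boto3
from lxml import etree

s3 = boto3.client("s3")
sns = boto3.client("sns")

# Placeholder resource names -- replace with your own bucket and topic
BUCKET = "my-health-data-bucket"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-failures"

def lambda_handler(event, context):
    try:
        # Read the raw Apple Health export from S3
        obj = s3.get_object(Bucket=BUCKET, Key="raw/export.xml")
        root = etree.parse(io.BytesIO(obj["Body"].read())).getroot()

        # Each <Record> element in the export stores its data as XML attributes
        rows = [dict(rec.attrib) for rec in root.iter("Record")]
        fieldnames = sorted({key for row in rows for key in row})

        # Write the records out as CSV and put the file back into S3
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        s3.put_object(
            Bucket=BUCKET,
            Key="csv/health_records.csv",
            Body=buf.getvalue().encode("utf-8"),
        )
        return {"status": "ok", "records": len(rows)}
    except Exception as exc:
        # On failure, redirect the error to the SNS topic before re-raising
        sns.publish(TopicArn=SNS_TOPIC_ARN, Message=f"Process_XML failed: {exc}")
        raise
```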
- Another Lambda function (Transform Health) is triggered by an object PUT in the S3 bucket. It further transforms the data using DuckDB and passes the transformed data to another Lambda function (To_Parquet), which saves it as Parquet files, partitioned by year (a rough sketch of this flow is shown further below).
Transformed Parquet Files:
The Data is partitioned by year:
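As a rough illustration of the Transform Health → To_Parquet flow (not the exact project code), the DuckDB transformation and the partitioned Parquet write could look something like this; the column names, the SQL, and the S3 URIs are assumptions:

```python
import awswrangler as wr
import duckdb
import pandas as pd

def transform_and_store(csv_uri: str, out_path: str) -> None:
    # Load the processed CSV straight from S3 into a pandas DataFrame
    df = wr.s3.read_csv(csv_uri)

    # Normalize the timestamp columns (Apple Health exports carry UTC offsets)
    df["startDate"] = pd.to_datetime(df["startDate"], utc=True)
    df["endDate"] = pd.to_datetime(df["endDate"], utc=True)

    # Use DuckDB to aggregate the records per night and derive a year column
    con = duckdb.connect()
    con.register("records", df)
    nightly = con.execute(
        """
        SELECT CAST(startDate AS DATE)                          AS night,
               SUM(epoch(endDate) - epoch(startDate)) / 3600.0  AS hours_asleep,
               CAST(year(startDate) AS VARCHAR)                 AS year
        FROM records
        GROUP BY 1, 3
        """
    ).df()

    # Save the result to S3 as a Parquet dataset partitioned by year
    wr.s3.to_parquet(df=nightly, path=out_path, dataset=True, partition_cols=["year"])

# Example call (placeholder URIs):
# transform_and_store("s3://my-health-data-bucket/csv/health_records.csv",
#                     "s3://my-health-data-bucket/parquet/sleep/")
```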
- The same Lambda function (To_Parquet) triggers AWS Glue Crawlers to crawl the Parquet files stored in the S3 bucket. The crawled data is then stored in AWS Glue Data Catalog tables (the boto3 call is sketched below).
Crawled Data in AWS Glue Data Catalog Tables:
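Triggering the crawler from the To_Parquet function is a single boto3 call; a minimal sketch with a placeholder crawler name:

```python
import boto3

glue = boto3.client("glue")

def start_health_crawler(crawler_name: str = "health-parquet-crawler") -> None:
    # Kick off the Glue crawler that catalogs the Parquet files;
    # the crawler name here is a placeholder.
    glue.start_crawler(Name=crawler_name)
```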
Querying Data using Athena:
Example Query:
```sql
SELECT
    FROM_UNIXTIME(recorded_on / 1000) AS recorded_on,
    avg_heart_rate,
    year
FROM heart_data_parquet
WHERE year = '2022';
```
In the above query, the `WHERE` clause filters the data by year to avoid scanning other partitions. The `FROM_UNIXTIME()` function converts the Unix epoch value (stored in milliseconds, hence the division by 1000) to a TIMESTAMP.
Output:
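The same query can also be run from Python through awswrangler's Athena integration, which returns the result as a pandas DataFrame; the database name below is an assumption:

```python
import awswrangler as wr

# Run the Athena query and get the result back as a pandas DataFrame.
# "health_db" is a placeholder for the Glue database created by the crawler.
df = wr.athena.read_sql_query(
    sql="""
        SELECT FROM_UNIXTIME(recorded_on / 1000) AS recorded_on,
               avg_heart_rate,
               year
        FROM heart_data_parquet
        WHERE year = '2022'
    """,
    database="health_db",
)
print(df.head())
```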
- It also triggers the EC2 instance to start and launch the Streamlit dashboard app (a sketch of the boto3 call follows).
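Starting the dashboard host from the Lambda function is likewise a one-line boto3 call; a minimal sketch with a placeholder instance ID:

```python
import boto3

ec2 = boto3.client("ec2")

def start_dashboard_instance(instance_id: str = "i-0123456789abcdef0") -> None:
    # Start the EC2 instance hosting the Streamlit dashboard;
    # the instance ID is a placeholder.
    ec2.start_instances(InstanceIds=[instance_id])
```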
You can access the app through the following URL: http://<your_instance's_public_ip>:8501
Replace `your_instance's_public_ip` with your EC2 instance's public IPv4 address.
Configuring Auto Launcher:
On the EC2 instance, open the crontab editor:

```bash
crontab -e
```
Add the following line to the editor:
```bash
@reboot /home/ec2-user/.local/bin/streamlit run /home/ec2-user/<path_to_streamlit_app> --server.port 8501
```
Replace `path_to_streamlit_app` with the path to your Streamlit app.
Now, whenever the EC2 instance restarts, the Streamlit app will automatically run on port 8501.
Here's a quick look at the Streamlit dashboard hosted on EC2: