This project involves an ETL (Extract, Transform, Load) process to analyze sleep data exported from Apple Health to iCloud in XML format.
The data is then processed and transformed using AWS services, queried through Amazon Athena, and visualized using a Streamlit dashboard.
- Export Health Data from iPhone to iCloud in XML format.
- Load the data into an Amazon S3 bucket.
- Set up an AWS Lambda function to process the XML data into a CSV file and store it in the S3 bucket.
- Set up another AWS Lambda function to further transform the data using DuckDB and Pandas.
- The transformed data is then passed to another Lambda function, which saves it as Parquet files in an S3 bucket, partitioned by year.
- Set up AWS Glue Crawlers to crawl the Parquet files stored in the S3 bucket and store the data in the AWS Glue Data Catalog table, partitioned by year.
- Finally, a Streamlit dashboard is set up on an Amazon EC2 instance to display sleep analytics over the years.
- AWS Services: S3, Lambda, Glue, Athena, SNS, EC2
- Python Libraries: boto3, lxml, s3fs, awswrangler, pandas, duckdb, streamlit
- Data Processing: DuckDB
- Analytics and Visualization: Athena, Streamlit
The above tech stack and an iCloud account with Apple Health data synced regularly from an Apple Watch are required. If you don't have an account, you can download my Health Dataset.
- The data is exported from Apple Health to iCloud (Download here).
- The data is then transferred from the local machine to an S3 bucket using the AWS CLI.
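The CLI copy is a one-liner along the lines of `aws s3 cp export.xml s3://<your_bucket>/raw/export.xml`. If you prefer to do the upload from Python instead, a minimal boto3 sketch (the bucket name and key below are placeholders, not the project's actual names):

```python
import boto3

# Upload the Apple Health export to S3; the bucket name and key are placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="export.xml",            # local path to the downloaded export
    Bucket="my-health-data-bucket",   # replace with your bucket
    Key="raw/export.xml",             # destination key in the bucket
)
```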
- An AWS Lambda function (Process_XML) is set up to transform the raw XML data and save it as CSV files in an S3 bucket. If the function's execution fails, its error response is redirected to an AWS SNS topic (a rough sketch of such a handler is shown below).
Processed CSV Files:
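For reference, here is what such a handler could look like, assuming the standard Apple Health export layout (a flat list of `<Record>` elements whose data lives in XML attributes); the bucket name, object keys, and topic ARN are placeholders, not the project's actual values:

```python
import csv
import io

import boto3
from lxml import etree

s3 = boto3.client("s3")
sns = boto3.client("sns")

# Placeholder resource names -- replace with your own bucket and topic
BUCKET = "my-health-data-bucket"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-failures"

def lambda_handler(event, context):
    try:
        # Read the raw Apple Health export from S3
        obj = s3.get_object(Bucket=BUCKET, Key="raw/export.xml")
        root = etree.parse(io.BytesIO(obj["Body"].read())).getroot()

        # Each <Record> element in the export stores its data as XML attributes
        rows = [dict(rec.attrib) for rec in root.iter("Record")]
        fieldnames = sorted({key for row in rows for key in row})

        # Write the records out as CSV and put the file back into S3
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        s3.put_object(
            Bucket=BUCKET,
            Key="csv/health_records.csv",
            Body=buf.getvalue().encode("utf-8"),
        )
        return {"status": "ok", "records": len(rows)}
    except Exception as exc:
        # On failure, redirect the error to the SNS topic before re-raising
        sns.publish(TopicArn=SNS_TOPIC_ARN, Message=f"Process_XML failed: {exc}")
        raise
```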
- Another Lambda function (Transform Health) is triggered by an object PUT in the S3 bucket. It further transforms the data using DuckDB and passes the transformed data to another Lambda function (To_Parquet), which saves it as Parquet files, partitioned by year (a rough sketch of this flow is shown further below).
Transformed Parquet Files:
The Data is partitioned by year:
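As a rough illustration of the Transform Health → To_Parquet flow (not the exact project code), the DuckDB transformation and the partitioned Parquet write could look something like this; the column names, the SQL, and the S3 URIs are assumptions:

```python
import awswrangler as wr
import duckdb
import pandas as pd

def transform_and_store(csv_uri: str, out_path: str) -> None:
    # Load the processed CSV straight from S3 into a pandas DataFrame
    df = wr.s3.read_csv(csv_uri)

    # Normalize the timestamp columns (Apple Health exports carry UTC offsets)
    df["startDate"] = pd.to_datetime(df["startDate"], utc=True)
    df["endDate"] = pd.to_datetime(df["endDate"], utc=True)

    # Use DuckDB to aggregate the records per night and derive a year column
    con = duckdb.connect()
    con.register("records", df)
    nightly = con.execute(
        """
        SELECT CAST(startDate AS DATE)                          AS night,
               SUM(epoch(endDate) - epoch(startDate)) / 3600.0  AS hours_asleep,
               CAST(year(startDate) AS VARCHAR)                 AS year
        FROM records
        GROUP BY 1, 3
        """
    ).df()

    # Save the result to S3 as a Parquet dataset partitioned by year
    wr.s3.to_parquet(df=nightly, path=out_path, dataset=True, partition_cols=["year"])

# Example call (placeholder URIs):
# transform_and_store("s3://my-health-data-bucket/csv/health_records.csv",
#                     "s3://my-health-data-bucket/parquet/sleep/")
```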
- The same Lambda function (To_Parquet) triggers AWS Glue Crawlers to crawl the Parquet files stored in the S3 bucket. The crawled data is then stored in AWS Glue Data Catalog tables (the boto3 call is sketched below).
Crawled Data in AWS Glue Data Catalog Tables:
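Triggering the crawler from the To_Parquet function is a single boto3 call; a minimal sketch with a placeholder crawler name:

```python
import boto3

glue = boto3.client("glue")

def start_health_crawler(crawler_name: str = "health-parquet-crawler") -> None:
    # Kick off the Glue crawler that catalogs the Parquet files;
    # the crawler name here is a placeholder.
    glue.start_crawler(Name=crawler_name)
```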
Querying Data using Athena:
Example Query:
```sql
SELECT
    FROM_UNIXTIME(recorded_on / 1000) AS recorded_on,
    avg_heart_rate,
    year
FROM heart_data_parquet
WHERE year = '2022';
```
In the above query, the `WHERE` clause filters the data by year to avoid scanning other partitions. The `FROM_UNIXTIME()` function converts the Unix epoch value (stored in milliseconds, hence the division by 1000) to a TIMESTAMP.
Output:
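The same query can also be run from Python through awswrangler's Athena integration, which returns the result as a pandas DataFrame; the database name below is an assumption:

```python
import awswrangler as wr

# Run the Athena query and get the result back as a pandas DataFrame.
# "health_db" is a placeholder for the Glue database created by the crawler.
df = wr.athena.read_sql_query(
    sql="""
        SELECT FROM_UNIXTIME(recorded_on / 1000) AS recorded_on,
               avg_heart_rate,
               year
        FROM heart_data_parquet
        WHERE year = '2022'
    """,
    database="health_db",
)
print(df.head())
```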
- It also triggers the EC2 instance to start and launch the Streamlit dashboard app (a sketch of the boto3 call follows).
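Starting the dashboard host from the Lambda function is likewise a one-line boto3 call; a minimal sketch with a placeholder instance ID:

```python
import boto3

ec2 = boto3.client("ec2")

def start_dashboard_instance(instance_id: str = "i-0123456789abcdef0") -> None:
    # Start the EC2 instance hosting the Streamlit dashboard;
    # the instance ID is a placeholder.
    ec2.start_instances(InstanceIds=[instance_id])
```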
You can access the app through the following URL: http://<your_instance's_public_ip>:8501
Replace `your_instance's_public_ip` with your EC2 instance's public IPv4 address.
Configuring Auto Launcher:
On the EC2 instance, open the crontab editor:

```bash
crontab -e
```
Add the following line to the editor:
```bash
@reboot /home/ec2-user/.local/bin/streamlit run /home/ec2-user/<path_to_streamlit_app> --server.port 8501
```
Replace `path_to_streamlit_app` with the path to your Streamlit app.
Now, whenever the EC2 instance restarts, the Streamlit app will automatically run on port 8501.
Here's a quick look at the Streamlit dashboard hosted on EC2: