In this project we build the following architecture:

- Uploaded the raw YouTube data to an S3 bucket with the AWS CLI, which gives quick control over partitioning and lets us create the folder prefixes straight from the command line (see the upload sketch after this list).
- The data comes in two kinds of files: JSON and CSV.
- Built a data catalog with an AWS Glue crawler over the CSV and JSON files; the crawler's output is a set of tables in a Glue database that can be queried from Athena.
- While checking the output we found problems with the structure of the JSON files, so they need extra processing, described in the steps below.
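Back to the first step, the upload: the project used the AWS CLI, and the sketch below is only a boto3 equivalent to illustrate the layout. The bucket name, prefixes, and the Hive-style `region=` partitioning are assumptions for illustration, not the project's exact values.

```python
import glob
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-bucket"  # placeholder bucket name

# Reference data (the *_category_id.json files) goes under one prefix.
for path in glob.glob("*.json"):
    key = f"youtube/raw_statistics_reference_data/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)

# The per-region CSVs are laid out as Hive-style region=xx/ prefixes,
# so the Glue crawler can pick up "region" as a partition column.
for path in glob.glob("*videos.csv"):
    region = os.path.basename(path)[:2].lower()  # "CAvideos.csv" -> "ca"
    key = f"youtube/raw_statistics/region={region}/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)
```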
- Created an AWS Lambda function with Python code that cleans the JSON files and converts them to Parquet. A trigger fires the Lambda whenever data is uploaded to the raw S3 bucket; the output lands in a second S3 bucket and a second database queryable from Athena, where we can then check the schema and data types of the table. A sketch of such a handler follows.
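The video linked at the end shows the real function; below is only a minimal sketch of what such a handler can look like, assuming the `awswrangler` library is attached as a Lambda layer and that every environment variable, path, and table name is a placeholder. The key fix is flattening the nested `items` array that makes the raw category JSON hard for the crawler to read:

```python
import os

import awswrangler as wr
import pandas as pd

# All names below are placeholders, set as Lambda environment variables.
OUTPUT_PATH = os.environ["s3_cleansed_layer"]       # e.g. s3://my-cleansed-bucket/youtube/
GLUE_DB = os.environ["glue_catalog_db_name"]        # e.g. db_youtube_cleaned
GLUE_TABLE = os.environ["glue_catalog_table_name"]  # e.g. cleaned_statistics_reference_data


def lambda_handler(event, context):
    # The S3 PUT event carries the bucket and key of the new object.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Read the raw JSON; the category files keep their useful rows
    # inside a nested "items" array, so flatten that into columns.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed bucket and register/update the
    # table in the Glue catalog so Athena can query it right away.
    return wr.s3.to_parquet(
        df=df_flat,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode="append",
    )
```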
- After cleaning the JSON files and converting them to Parquet, we converted the CSV files to Parquet as well and did some of that processing with an AWS Glue ETL job, writing the cleaned output to the second S3 bucket. A sketch of such a job is shown below.
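A minimal sketch of what a Glue ETL script for this step can look like; the database, table, bucket, and partition names are placeholders, and the real job may apply extra transforms:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table that the first crawler catalogued.
raw = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",    # placeholder
    table_name="raw_statistics",  # placeholder
)

# Re-write it as Parquet into the cleansed bucket, keeping the
# region partitioning from the raw layout.
glueContext.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://my-cleansed-bucket/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
)

job.commit()
```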
- After that, created a second AWS Glue crawler for the cleaned version and pointed its output at the second database (a boto3 sketch of that setup follows).
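Setting this up in the console works fine; as an illustration, the same thing through boto3 might look like the sketch below, with a placeholder crawler name, IAM role, database, and path:

```python
import boto3

glue = boto3.client("glue")

# All names and the IAM role ARN below are placeholders.
glue.create_crawler(
    Name="cleaned-youtube-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="db_youtube_cleaned",
    Targets={"S3Targets": [{"Path": "s3://my-cleansed-bucket/youtube/"}]},
)

# Run it once; it infers the Parquet schema and creates or updates
# the tables in the cleaned database.
glue.start_crawler(Name="cleaned-youtube-crawler")
```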
- So for now we have a cleaned table from the JSON files (converted and processed to Parquet by the Lambda) and a cleaned table from the CSV files (converted and processed to Parquet by the Glue ETL job), and both are stored in the same database. A quick way to sanity-check them is shown below.
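To confirm that both tables landed in the catalog with the expected schemas, a quick check with `awswrangler` (the database and table names are placeholders) might look like:

```python
import awswrangler as wr

DB = "db_youtube_cleaned"  # placeholder database name

# Columns and data types as the Glue catalog sees them
# (table names here are placeholders too).
for table in ["cleaned_statistics_reference_data", "raw_statistics"]:
    print(wr.catalog.table(database=DB, table=table))

# And a spot check straight from Athena.
df = wr.athena.read_sql_query(
    "SELECT * FROM raw_statistics LIMIT 5",
    database=DB,
    ctas_approach=False,
)
print(df.head())
```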
- The next step is to build a new ETL job that joins our two tables and stores the output in the final S3 bucket for analytics. This job was built with AWS Glue Studio; a sketch in the style of the script it generates is below.
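Glue Studio generates a PySpark script behind its visual editor; the sketch below shows what the join step can look like, assuming `category_id` in the CSV-derived table matches `id` in the JSON-derived one, and with placeholder database, table, and bucket names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load both cleaned tables from the catalog (placeholder names).
stats = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
categories = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Join video statistics to their category reference data; the key
# names are assumptions based on the dataset's layout.
joined = Join.apply(stats, categories, keys1=["category_id"], keys2=["id"])

# Write the joined result to the final analytics bucket.
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/youtube/"},
    format="parquet",
)

job.commit()
```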
- Our data is now ready to be used for different things, such as dashboard reporting or machine learning models.
- As an example, we used the final data to build a simple dashboard with AWS QuickSight.
- Data link: https://www.kaggle.com/datasets/datasnaek/youtube-new?select=KR_category_id.json
- Video for all the steps: https://www.linkedin.com/posts/mohamed-abohassan-6509641a9_aws-python-analytics-activity-6992862892307951617-ATSM?utm_source=share&utm_medium=member_desktop