In this project we build the following architecture:

- Uploaded the raw YouTube data to an S3 bucket with the AWS CLI, which gives quick control over partitioning and lets us create the folder prefixes straight from the command line (see the upload sketch after this list).
- The data comes in two kinds of files: JSON and CSV.
- Built a data catalog with an AWS Glue crawler over the CSV and JSON files; the crawler's output is a set of tables in a Glue database that can be queried from Athena.
- While checking the output we found problems with the structure of the JSON files, so they need extra processing, described in the steps below.
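Back to the first step, the upload: the project used the AWS CLI, and the sketch below is only a boto3 equivalent to illustrate the layout. The bucket name, prefixes, and the Hive-style `region=` partitioning are assumptions for illustration, not the project's exact values.

```python
import glob
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-bucket"  # placeholder bucket name

# Reference data (the *_category_id.json files) goes under one prefix.
for path in glob.glob("*.json"):
    key = f"youtube/raw_statistics_reference_data/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)

# The per-region CSVs are laid out as Hive-style region=xx/ prefixes,
# so the Glue crawler can pick up "region" as a partition column.
for path in glob.glob("*videos.csv"):
    region = os.path.basename(path)[:2].lower()  # "CAvideos.csv" -> "ca"
    key = f"youtube/raw_statistics/region={region}/{os.path.basename(path)}"
    s3.upload_file(path, BUCKET, key)
```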
- Created an AWS Lambda function with Python code that cleans the JSON files and converts them to Parquet. A trigger fires the Lambda whenever data is uploaded to the raw S3 bucket; the output lands in a second S3 bucket and a second database queryable from Athena, where we can then check the schema and data types of the table. A sketch of such a handler follows.
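The video linked at the end shows the real function; below is only a minimal sketch of what such a handler can look like, assuming the `awswrangler` library is attached as a Lambda layer and that every environment variable, path, and table name is a placeholder. The key fix is flattening the nested `items` array that makes the raw category JSON hard for the crawler to read:

```python
import os

import awswrangler as wr
import pandas as pd

# All names below are placeholders, set as Lambda environment variables.
OUTPUT_PATH = os.environ["s3_cleansed_layer"]       # e.g. s3://my-cleansed-bucket/youtube/
GLUE_DB = os.environ["glue_catalog_db_name"]        # e.g. db_youtube_cleaned
GLUE_TABLE = os.environ["glue_catalog_table_name"]  # e.g. cleaned_statistics_reference_data


def lambda_handler(event, context):
    # The S3 PUT event carries the bucket and key of the new object.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Read the raw JSON; the category files keep their useful rows
    # inside a nested "items" array, so flatten that into columns.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed bucket and register/update the
    # table in the Glue catalog so Athena can query it right away.
    return wr.s3.to_parquet(
        df=df_flat,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode="append",
    )
```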
- After cleaning the JSON files and converting them to Parquet, we converted the CSV files to Parquet as well and did some of that processing with an AWS Glue ETL job, writing the cleaned output to the second S3 bucket. A sketch of such a job is shown below.
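A minimal sketch of what a Glue ETL script for this step can look like; the database, table, bucket, and partition names are placeholders, and the real job may apply extra transforms:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table that the first crawler catalogued.
raw = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",    # placeholder
    table_name="raw_statistics",  # placeholder
)

# Re-write it as Parquet into the cleansed bucket, keeping the
# region partitioning from the raw layout.
glueContext.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://my-cleansed-bucket/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
)

job.commit()
```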
- After that, created a second AWS Glue crawler for the cleaned version and pointed its output at the second database (a boto3 sketch of that setup follows).
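Setting this up in the console works fine; as an illustration, the same thing through boto3 might look like the sketch below, with a placeholder crawler name, IAM role, database, and path:

```python
import boto3

glue = boto3.client("glue")

# All names and the IAM role ARN below are placeholders.
glue.create_crawler(
    Name="cleaned-youtube-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="db_youtube_cleaned",
    Targets={"S3Targets": [{"Path": "s3://my-cleansed-bucket/youtube/"}]},
)

# Run it once; it infers the Parquet schema and creates or updates
# the tables in the cleaned database.
glue.start_crawler(Name="cleaned-youtube-crawler")
```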
- So for now we have a cleaned table from the JSON files (converted and processed to Parquet by the Lambda) and a cleaned table from the CSV files (converted and processed to Parquet by the Glue ETL job), and both are stored in the same database. A quick way to sanity-check them is shown below.
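To confirm that both tables landed in the catalog with the expected schemas, a quick check with `awswrangler` (the database and table names are placeholders) might look like:

```python
import awswrangler as wr

DB = "db_youtube_cleaned"  # placeholder database name

# Columns and data types as the Glue catalog sees them
# (table names here are placeholders too).
for table in ["cleaned_statistics_reference_data", "raw_statistics"]:
    print(wr.catalog.table(database=DB, table=table))

# And a spot check straight from Athena.
df = wr.athena.read_sql_query(
    "SELECT * FROM raw_statistics LIMIT 5",
    database=DB,
    ctas_approach=False,
)
print(df.head())
```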
- The next step is to build a new ETL job that joins our two tables and stores the output in the final S3 bucket for analytics. This job was built with AWS Glue Studio; a sketch in the style of the script it generates is below.
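Glue Studio generates a PySpark script behind its visual editor; the sketch below shows what the join step can look like, assuming `category_id` in the CSV-derived table matches `id` in the JSON-derived one, and with placeholder database, table, and bucket names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load both cleaned tables from the catalog (placeholder names).
stats = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
categories = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Join video statistics to their category reference data; the key
# names are assumptions based on the dataset's layout.
joined = Join.apply(stats, categories, keys1=["category_id"], keys2=["id"])

# Write the joined result to the final analytics bucket.
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/youtube/"},
    format="parquet",
)

job.commit()
```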
- Our data is now ready to be used for different things, such as dashboard reporting or machine learning models.
- As an example, we used the final data to build a simple dashboard with AWS QuickSight.
- Data link: https://www.kaggle.com/datasets/datasnaek/youtube-new?select=KR_category_id.json
- Video for all the steps: https://www.linkedin.com/posts/mohamed-abohassan-6509641a9_aws-python-analytics-activity-6992862892307951617-ATSM?utm_source=share&utm_medium=member_desktop