Amazing Prime is a platform for streaming movies and TV shows. The company is sponsoring a hackathon, providing a clean dataset to participants and asking them to predict popular pictures. To keep the dataset updated, we need an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. This goal was achieved in the following steps.
Task: The dataset comprises three files: Wikipedia data (a JSON file), Kaggle metadata (a CSV file), and MovieLens rating data (a CSV file).
The code to convert these files into DataFrames is linked below:
ETL_function_test.ipynb
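The extraction step can be sketched roughly as follows. This is a simplified illustration, not the notebook's exact code; the file paths are passed in by the caller:

```python
import json
import pandas as pd

def extract_data(wiki_file, kaggle_file, ratings_file):
    # The Wikipedia data arrives as a list of JSON records
    with open(wiki_file, mode='r') as f:
        wiki_movies_raw = json.load(f)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # The Kaggle metadata and MovieLens ratings are flat CSV files;
    # low_memory=False avoids mixed-type warnings on wide metadata
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)

    return wiki_movies_df, kaggle_metadata, ratings
```

Each source ends up in its own DataFrame so the later transformation steps can clean them independently before merging.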
Task: Extracting and transforming the Wikipedia data so we can merge it with the Kaggle metadata.
The code to extract and transform data is linked below:
ETL_clean_wiki_movies.ipynb
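The core of the Wikipedia cleaning can be sketched as below. The filter criteria and column names (`Director`, `imdb_link`, `No. of episodes`) are assumptions based on typical Wikipedia movie data, not a copy of the notebook:

```python
import pandas as pd

def clean_wiki_movies(wiki_movies_raw):
    # wiki_movies_raw: list of dicts parsed from the Wikipedia JSON.
    # Keep only entries that look like films: they name a director,
    # carry an IMDb link, and are not TV series (no episode count)
    movies = [m for m in wiki_movies_raw
              if ('Director' in m or 'Directed by' in m)
              and 'imdb_link' in m
              and 'No. of episodes' not in m]
    df = pd.DataFrame(movies)

    # Extract the IMDb ID (tt followed by 7 digits) so it can serve
    # as the merge key against the Kaggle metadata, then drop
    # duplicate entries for the same film
    df['imdb_id'] = df['imdb_link'].str.extract(r'(tt\d{7})')
    df.drop_duplicates(subset='imdb_id', inplace=True)
    return df
```

The extracted `imdb_id` column is what later makes the merge with the Kaggle metadata possible.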
Task: Extracting and transforming the Kaggle metadata and MovieLens rating data, then converting the transformed data into separate DataFrames. Then, merging the Kaggle metadata DataFrame with the Wikipedia movies DataFrame to create the movies_df DataFrame. Finally, merging the MovieLens rating data DataFrame with the movies_df DataFrame to create the movies_with_ratings_df.
The code to extract and transform the data is linked below:
ETL_clean_kaggle_data.ipynb
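The merging described above can be sketched as follows. The `kaggle_id` join key and the rating-count pivot are assumptions about the intermediate column layout, not the notebook's literal code:

```python
import pandas as pd

def merge_movie_data(wiki_movies_df, kaggle_metadata, ratings):
    # Inner-join the Wikipedia and Kaggle data on the IMDb ID,
    # suffixing overlapping columns so both sources stay inspectable
    movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                         on='imdb_id', suffixes=['_wiki', '_kaggle'])

    # Pivot the MovieLens ratings into one column per star rating,
    # counting how often each movie received each rating
    rating_counts = (ratings.groupby(['movieId', 'rating'], as_index=False)
                            .count()
                            .rename(columns={'userId': 'count'})
                            .pivot(index='movieId', columns='rating',
                                   values='count'))
    rating_counts.columns = ['rating_' + str(col)
                             for col in rating_counts.columns]

    # Left-join so movies without any ratings are kept, then fill
    # their missing rating counts with zero
    movies_with_ratings_df = movies_df.merge(
        rating_counts, left_on='kaggle_id', right_index=True, how='left')
    movies_with_ratings_df[rating_counts.columns] = \
        movies_with_ratings_df[rating_counts.columns].fillna(0)
    return movies_df, movies_with_ratings_df
```

The left join is deliberate: an inner join would silently drop every movie that has no ratings yet.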
Task: Add the movies_df DataFrame and MovieLens rating CSV data to a SQL database.
The code to add the movies_df and ratings data is linked below:
ETL_create_database.ipynb
- The data from the movies_df DataFrame replaces the current data in the movies table in the SQL database, as confirmed by the query results in movies_query.png.
- The data from the MovieLens rating CSV file is added to the ratings table in the SQL database, as confirmed by the query results in ratings_query.png.
- The elapsed time to add the data to the database is displayed in the ETL_create_database.ipynb notebook referenced above.
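A sketch of this load step, consistent with the points above, might look like the following. The connection string, table names, and chunk size are placeholders; the real notebook builds its connection string from stored credentials:

```python
import time
import pandas as pd
from sqlalchemy import create_engine

def load_to_sql(movies_df, ratings_file, db_string):
    engine = create_engine(db_string)

    # Replace the existing movies table wholesale with the new data
    movies_df.to_sql(name='movies', con=engine, if_exists='replace',
                     index=False)

    # The ratings file is large, so stream it in chunks, appending
    # each chunk and printing the elapsed time as it is imported
    start_time = time.time()
    rows_imported = 0
    for chunk in pd.read_csv(ratings_file, chunksize=1_000_000):
        chunk.to_sql(name='ratings', con=engine, if_exists='append',
                     index=False)
        rows_imported += len(chunk)
        print(f'{rows_imported} rows imported; '
              f'{time.time() - start_time:.1f} seconds elapsed')
```

Appending chunk by chunk keeps memory use bounded, and the running print statement is what produces the elapsed-time output mentioned above.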