Retrieving and cleaning movie ratings data from multiple sources and loading it into a PostgreSQL database
Three predominate sources of data were used for this project:
- Wikipedia movie data in JSON format
- Movie metadata derived from kaggle
- MovieLens rating data derived from kaggle
Data from all sources required a sufficient amount of cleaning involving filtering, converting datatypes, renaming/dropping columns, cleaning up string text using regular expression, etc. After cleaning, dataframes were merged to prepare for load
Finally, data was loaded into a postgreSQL database using a single function and timed.
- Files include similarities in code but are meant to show the work through additive stages:
-
Writing a function to read in three data files and testing (See ETL_function_test.ipynb)
-
Wikipedia data read into a Pandas DataFrame and cleaned (See ETL_clean_wiki_movies.ipynb)
-
Cleaning Kaggle data in addition to wikipedia data and merge with ratings data (See ETL_clean_kaggle_data.ipynb)
-
Finish the ETL process by loading all data into a postgreSQL database (See ETL_create_database.ipynb)
-