
Creating a data pipeline with movie-rating data and loading it into a PostgreSQL database.

Primary language: Jupyter Notebook

Movie Data ETL

Purpose

Retrieving and cleaning movie ratings data from multiple sources and loading it into a PostgreSQL database.

Process

Extract

Three primary sources of data were used for this project:

  • Wikipedia movie data in JSON format
  • Movie metadata from Kaggle
  • MovieLens rating data from Kaggle
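
As a rough illustration of the extract step, the sketch below reads one JSON file and two CSV files using only the Python standard library. The file names (`wikipedia-movies.json`, `movies_metadata.csv`, `ratings.csv`) and the function names are assumptions made for this example, not names confirmed by the project; in the notebook, the resulting records would typically be handed to pandas to build dataframes.

```python
import csv
import json
from pathlib import Path


def load_json_records(path):
    """Read a JSON file that holds a list of objects (one per movie)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_csv_records(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def extract(data_dir):
    """Load the three raw sources from data_dir (file names are assumed)."""
    data_dir = Path(data_dir)
    return {
        "wiki": load_json_records(data_dir / "wikipedia-movies.json"),
        "metadata": load_csv_records(data_dir / "movies_metadata.csv"),
        "ratings": load_csv_records(data_dir / "ratings.csv"),
    }
```

Note that `csv.DictReader` returns every value as a string, which is one reason the transform step has to convert data types.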

Transform

Data from all sources required substantial cleaning: filtering rows, converting data types, renaming and dropping columns, and cleaning up string text with regular expressions. After cleaning, the dataframes were merged to prepare for the load.
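
To give a feel for the regular-expression cleaning, here is a minimal sketch that turns free-text dollar amounts into numbers. The input formats (`"$123.4 million"`, `"$123,456,789"`) and the helper name `parse_dollars` are assumptions chosen for illustration; the actual columns and patterns cleaned in the notebook may differ.

```python
import re

# Matches values such as "$123.4 million" or "$1.2 billion" (assumed format).
WORD_AMOUNT = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)
# Matches values such as "$123,456,789" or "$1500" (assumed format).
PLAIN_AMOUNT = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)")

MULTIPLIERS = {"million": 1e6, "billion": 1e9}


def parse_dollars(text):
    """Convert a dollar string to a float, or return None if it cannot be parsed."""
    if not isinstance(text, str):
        return None
    # Try the "number + scale word" form first, since the plain pattern
    # would otherwise match only the leading digits of "$1.2 billion".
    match = WORD_AMOUNT.search(text)
    if match:
        return float(match.group(1)) * MULTIPLIERS[match.group(2).lower()]
    match = PLAIN_AMOUNT.search(text)
    if match:
        return float(match.group(1).replace(",", ""))
    return None
```

With pandas, a helper like this could be applied to a string column via `Series.apply`, leaving unparseable values as missing.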

Load

Finally, the data was loaded into a PostgreSQL database using a single function, and the load was timed.
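
One way to time a load is sketched below. It is not the project's actual function: `timed_load`, `write_chunk`, and `chunk_size` are hypothetical names, and the database write itself is passed in as a callable so the timing logic stays independent of the database driver.

```python
import time


def timed_load(rows, write_chunk, chunk_size=1000):
    """Write rows in chunks and return (rows_written, seconds_elapsed).

    rows must support len() and slicing; write_chunk is any callable
    that persists one chunk (hypothetical, supplied by the caller).
    """
    start = time.perf_counter()
    written = 0
    for begin in range(0, len(rows), chunk_size):
        chunk = rows[begin:begin + chunk_size]
        write_chunk(chunk)
        written += len(chunk)
    return written, time.perf_counter() - start
```

With a pandas dataframe, `write_chunk` could wrap `DataFrame.to_sql` with `if_exists="append"` and a SQLAlchemy engine pointing at the PostgreSQL database; the table name and connection details would be specific to your setup.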

Steps