Movielens-Dataset-Analysis-Azure-Data-Engineering-Project

Created a movie recommendation system on Azure utilizing Spark SQL by analyzing the MovieLens dataset.

Primary language: Jupyter Notebook

Building a Movie Recommender System

This is a data engineering project that produces movie suggestions from the raw MovieLens dataset. It is built using the following Azure services:

  1. Azure Blob Storage
  2. Azure Data Lake Storage Gen2
  3. Azure Data Factory
  4. Azure Databricks
  5. Azure Synapse Analytics

The architecture diagram for this project is shown below:


I used Azure Data Factory (ADF) as the orchestration tool for building and executing the data pipeline. The main tasks are:
  1. Data cleaning using an ADF data flow: removing duplicate rows and null values, then ingesting the cleaned data into Azure Data Lake Storage Gen2 in Parquet format.
  2. Data transformation in Azure Databricks: calculating Bayesian average ratings and the top 5 tags for each movie using Spark SQL.
  3. Data analysis in Azure Synapse Analytics, including finding the best movies by genre or by rating.
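
The Bayesian average rating mentioned in task 2 can be illustrated with a small plain-Python sketch. This is not the project's Spark SQL code. It assumes the common form score = (v * R + m * C) / (v + m), where v is a movie's rating count, R its mean rating, C the mean over all ratings, and m a prior weight. The function name `bayesian_averages`, the parameter `min_votes` (playing the role of m), and the sample rows are hypothetical.

```python
from collections import defaultdict


def bayesian_averages(ratings, min_votes=2.0):
    """Return {movie_id: Bayesian average} for (movie_id, rating) pairs.

    Assumed formula: (sum of a movie's ratings + m * C) / (v + m),
    which is algebraically the same as (v * R + m * C) / (v + m).
    """
    ratings = list(ratings)
    if not ratings:
        return {}

    # C: mean rating across every row in the dataset.
    global_mean = sum(rating for _, rating in ratings) / len(ratings)

    totals = defaultdict(float)  # sum of ratings per movie (v * R)
    counts = defaultdict(int)    # number of ratings per movie (v)
    for movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1

    return {
        movie_id: (totals[movie_id] + min_votes * global_mean)
        / (counts[movie_id] + min_votes)
        for movie_id in counts
    }


# Hypothetical sample: movie 1 has a single 5.0, movie 2 has four ratings.
sample_ratings = [(1, 5.0), (2, 4.0), (2, 4.0), (2, 5.0), (2, 3.0)]
scores = bayesian_averages(sample_ratings, min_votes=2.0)
```

With this weighting, a movie with only one perfect rating is pulled toward the global mean instead of ranking at 5.0, which is the usual reason for preferring a Bayesian average over a raw mean.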

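The "top 5 tags for each movie" step can be sketched in the same way. The version below is a plain-Python illustration under stated assumptions, not the project's Spark SQL query: it assumes tag rows arrive as (movie_id, tag) pairs, that tags should be compared case-insensitively, and that ties are broken alphabetically. The function name `top_tags` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def top_tags(tag_rows, n=5):
    """Return {movie_id: [up to n most frequent tags]}."""
    counts = defaultdict(Counter)
    for movie_id, tag in tag_rows:
        # Assumption: "Pixar" and "pixar" count as the same tag.
        counts[movie_id][tag.strip().lower()] += 1

    result = {}
    for movie_id, counter in counts.items():
        # Sort by descending frequency, then alphabetically for stable ties.
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        result[movie_id] = [tag for tag, _ in ranked[:n]]
    return result


# Hypothetical sample rows.
sample_tags = [
    (1, "pixar"), (1, "Pixar"), (1, "animation"), (1, "funny"),
    (1, "kids"), (1, "toys"), (1, "classic"), (2, "sci-fi"),
]
tags_by_movie = top_tags(sample_tags)
```

In SQL terms this corresponds to a grouped count followed by a per-movie ranking that keeps the first five rows.
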

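The "best movie by genre" analysis can be sketched as follows. This is a plain-Python illustration, not the project's Synapse query. It relies on the MovieLens convention of storing genres as a pipe-separated string and assumes a precomputed score per movie. The function name `best_movie_per_genre`, the titles, and the score values are hypothetical.

```python
def best_movie_per_genre(movies, scores):
    """Return {genre: (title, score)} for the highest-scoring movie per genre.

    movies: iterable of (movie_id, title, genres), genres pipe-separated.
    scores: {movie_id: score}; movies without a score are skipped.
    """
    best = {}
    for movie_id, title, genres in movies:
        if movie_id not in scores:
            continue
        score = scores[movie_id]
        # A movie with several genres competes in each of them.
        for genre in genres.split("|"):
            current = best.get(genre)
            if current is None or score > current[1]:
                best[genre] = (title, score)
    return best


# Hypothetical movies and scores, for illustration only.
sample_movies = [
    (1, "Movie A", "Adventure|Animation|Children"),
    (2, "Movie B", "Adventure|Children|Fantasy"),
    (3, "Movie C", "Action|Crime"),
    (4, "Movie D", "Action"),  # no score available, so it is skipped
]
sample_scores = {1: 4.1, 2: 3.4, 3: 3.9}
best_by_genre = best_movie_per_genre(sample_movies, sample_scores)
```

On a tie, this sketch keeps the movie seen first; a real query would need an explicit tie-break rule.
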

I used the following resources in the Azure portal to build this movie recommender project end to end:

  1. Key vault
  2. Synapse workspace
  3. Azure Databricks Service
  4. Data factory (V2)
  5. Storage account
  6. Storage account