sakethmukkanti/Movielens-Dataset-Analysis-Azure-Data-Engineering-Project

Created a movie recommendation system on Azure utilizing Spark SQL by analyzing the MovieLens dataset.

Building a Movie Recommender System

This is a data engineering project that produces movie suggestions from the raw MovieLens dataset. It is built using the following Azure services:

Azure Blob Storage
Azure Data Lake Storage Gen2
Azure Data Factory
Azure Databricks
Azure Synapse Analytics

The architecture diagram for this project is shown below:

I used Azure Data Factory (ADF) as the orchestration tool for building and executing the data pipeline. The main tasks involved are:

Data cleaning in an ADF data flow: duplicate rows and null values are removed, and the cleaned data is written to Azure Data Lake Storage Gen2 in Parquet format.
Data transformation in Azure Databricks: Bayesian average ratings and the top 5 tags for each movie are calculated using Spark SQL.
Data analysis in Azure Synapse Analytics, including finding the best movies by genre or by rating.
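
To illustrate the transformation step, here is a minimal sketch of one common way to compute a Bayesian average rating and the top tags per movie. It is written in plain Python rather than Spark SQL so that it runs without a cluster. The function names, the `prior_weight` parameter, and the exact formula are assumptions for illustration and are not taken from the project's notebooks.

```python
from collections import Counter, defaultdict


def bayesian_averages(ratings, prior_weight=5.0):
    """Return {movie_id: Bayesian average} for (movie_id, rating) pairs.

    Assumed formula: (sum_of_ratings + m * C) / (count + m), where C is the
    global mean rating and m (prior_weight) is how many "virtual" ratings
    at the global mean every movie starts with.
    """
    ratings = list(ratings)
    if not ratings:
        return {}
    global_mean = sum(rating for _, rating in ratings) / len(ratings)

    totals = defaultdict(float)
    counts = Counter()
    for movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1

    return {
        movie_id: (totals[movie_id] + prior_weight * global_mean)
        / (counts[movie_id] + prior_weight)
        for movie_id in counts
    }


def top_tags(tags, k=5):
    """Return {movie_id: [tag, ...]} with the k most frequent tags per movie.

    Ties are broken alphabetically so the result is deterministic.
    """
    per_movie = defaultdict(Counter)
    for movie_id, tag in tags:
        per_movie[movie_id][tag] += 1
    return {
        movie_id: [
            tag
            for tag, _ in sorted(
                tag_counts.items(), key=lambda item: (-item[1], item[0])
            )[:k]
        ]
        for movie_id, tag_counts in per_movie.items()
    }


if __name__ == "__main__":
    # Hypothetical sample data, not rows from the MovieLens dataset.
    sample_ratings = [(1, 5.0), (2, 4.0), (2, 4.0), (2, 4.0), (2, 5.0)]
    sample_tags = [(1, "pixar"), (1, "pixar"), (1, "funny")]
    print(bayesian_averages(sample_ratings, prior_weight=2.0))
    print(top_tags(sample_tags, k=5))
```

The point of the Bayesian average is that a movie with only a handful of ratings is pulled toward the global mean, so a single 5.0 rating does not outrank a movie with many consistently high ratings as easily as it would under a plain average. A larger `prior_weight` strengthens that pull.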

I used the following resources in the Azure portal to build this movie recommender project end to end:

Key vault
Synapse workspace
Azure Databricks Service
Data factory (V2)
Storage account
Storage account