This is a data engineering project that produces movie suggestions from the raw MovieLens dataset. It is built with the following Azure services:
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Azure Data Factory
- Azure Databricks
- Azure Synapse Analytics
The architecture diagram for this project is shown below:
I used Azure Data Factory (ADF) as the orchestration tool for building and executing the data pipeline. The main tasks are:
- Data cleaning with an ADF data flow: duplicate rows and null values are removed, and the cleaned data is ingested into Azure Data Lake Storage Gen2 in Parquet format.
- Data transformation in Azure Databricks: Bayesian average ratings and the top 5 tags for each movie are calculated with Spark SQL.
- Data analysis in Azure Synapse Analytics, including finding the best movies by genre or rating.
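The transformation step can be illustrated with a minimal plain-Python sketch. This is not the project's Spark SQL code; it only shows the logic on small in-memory lists. It assumes the common Bayesian average form `(C * m + sum of ratings) / (C + n)`, where `m` is the global mean rating, `n` is the number of ratings for a movie, and `C` is a prior weight. The function names, the `prior_weight` parameter, and the sample data are all hypothetical, and the formula actually used in the Databricks notebook may differ.

```python
from collections import Counter, defaultdict


def bayesian_averages(ratings, prior_weight=3.0):
    """Return {movie_id: Bayesian average rating}.

    ratings: iterable of (movie_id, rating) pairs.
    prior_weight: hypothetical constant C; larger values pull movies
    with few ratings more strongly toward the global mean.
    """
    ratings = list(ratings)
    if not ratings:
        return {}
    # Global mean over all ratings acts as the prior.
    global_mean = sum(rating for _, rating in ratings) / len(ratings)
    totals = defaultdict(float)
    counts = Counter()
    for movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1
    return {
        movie_id: (prior_weight * global_mean + totals[movie_id])
        / (prior_weight + counts[movie_id])
        for movie_id in counts
    }


def top_tags(tags, k=5):
    """Return {movie_id: up to k most frequent tags, most frequent first}.

    tags: iterable of (movie_id, tag) pairs. Tags are lower-cased and
    stripped so that "Funny" and "funny" count as the same tag
    (an assumption made for this sketch).
    """
    per_movie = defaultdict(Counter)
    for movie_id, tag in tags:
        per_movie[movie_id][tag.strip().lower()] += 1
    return {
        movie_id: [tag for tag, _ in counter.most_common(k)]
        for movie_id, counter in per_movie.items()
    }


# Made-up sample data for illustration only.
sample_ratings = [
    (1, 4.0), (1, 5.0), (1, 5.0), (1, 4.0),  # four ratings, raw mean 4.5
    (2, 5.0),                                # single rating, raw mean 5.0
    (3, 2.0), (3, 3.0),
]
sample_tags = [(1, "Funny"), (1, "funny"), (1, "pixar")]

scores = bayesian_averages(sample_ratings)
tags_by_movie = top_tags(sample_tags)
```

With this sample data, movie 1 ranks above movie 2 even though movie 2 has the higher raw mean, because a single rating is pulled strongly toward the global mean. This is the usual reason for preferring a Bayesian average over a plain average when ranking movies.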
I created the following resources in the Azure portal to build this movie recommender project end to end:
- Key vault
- Synapse workspace
- Azure Databricks Service
- Data factory (V2)
- Storage account
- Storage account