📺 Matrix Factorization with PySpark for Movie Recommender System


🐍 Code

😆 About the Dataset

📊 Context: The original dataset contains user ratings and free-text tags from MovieLens, a movie recommendation platform. It comprises 20,000,263 ratings and 465,564 tag applications across 27,278 movies.

😆 Shrinking the Dataset: The full dataset is too large to process locally, since the computation scales as O(N^2 M). We therefore shrink it by keeping the users who rated the largest number of movies: we select the top 10,000 users and perform matrix factorization on that subset.
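
A minimal sketch of the shrinking step, written in plain Python rather than PySpark so it runs without a cluster. It assumes the ratings are available as (user_id, movie_id, rating) tuples; the function names `top_active_users` and `shrink` and the sample data are hypothetical, and the project itself would use n=10,000 rather than 2.

```python
from collections import Counter

def top_active_users(ratings, n):
    """Return the set of the n user ids that have the most ratings."""
    counts = Counter(user_id for user_id, _movie_id, _rating in ratings)
    return {user_id for user_id, _count in counts.most_common(n)}

def shrink(ratings, n):
    """Keep only the ratings made by the n most active users."""
    keep = top_active_users(ratings, n)
    return [row for row in ratings if row[0] in keep]

# Hypothetical sample: user 1 rated 3 movies, user 2 rated 2, user 3 rated 1.
sample = [
    (1, 10, 4.0), (1, 11, 3.5), (1, 12, 5.0),
    (2, 10, 2.0), (2, 13, 4.5),
    (3, 11, 1.0),
]
small = shrink(sample, n=2)
print(len(small), "ratings kept")
```

Counting first and filtering second keeps the logic easy to port: the same two passes map naturally onto a count-by-key followed by a filter in Spark.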

Install Apache Spark, Its Dependencies, and Environment Paths on Windows (Local Machine)

  1. Add system variables for JAVA_HOME, Spark, and Hadoop.
  2. Add the Spark and Hadoop locations to the system Path variable.
  3. Add user variables for JAVA_HOME, Spark, and Hadoop so that Apache Spark can be used from other directories.
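
After setting the variables, a quick check from Python can confirm they are visible. This is a sketch under assumptions: `JAVA_HOME`, `SPARK_HOME`, and `HADOOP_HOME` are the conventional variable names, and `missing_env_vars` is a hypothetical helper; adjust the names if your setup differs.

```python
import os

# Assumed names: these are the conventional variables, not confirmed by this README.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Open a new terminal before running the check, because Windows only passes updated variables to processes started after the change.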

😊 Java Download: https://www.java.com/en/download/manual.jsp

  • Install Java.
  • Java Environment: Start Menu Search >> Edit The System Variables >> Environment Variables

😊 Apache Spark Download: https://spark.apache.org/downloads.html

  • Create a folder named Spark on the C: drive and unzip the downloaded version into it.
  • Apache Spark Environment: Start Menu Search >> Edit The System Variables >> Environment Variables

😊 For Hadoop (winutils.exe and hadoop.dll):

😆 Start Spark in the PowerShell terminal of VS Code


sc details: Spark context available as 'sc' (master = local[*], app id = local-1708364284935).

Note: by default, the Spark shell starts in Scala.

😆 Run spark-submit to check the version

spark-submit --version

PySpark setup

Video links: https://www.youtube.com/watch?v=e17s4ul4uTo and https://www.youtube.com/watch?v=Irn7a8U-QxA

Check Python Version

Start >> cmd >> Run as Administrator

Unable to access the WSL terminal (Ubuntu) because of a forgotten password

😟 Note: Forgotten password problem. Solution video link: https://www.youtube.com/watch?v=RCW9PTNS440

  1. Go to the command prompt and run:

        wsl -l
        ubuntu2004 config --default-user root
  2. Go to the Ubuntu terminal and reset the password:

        passwd shibli_nomani
  3. Go back to the command prompt and change the default user back (root to shibli_nomani):

        ubuntu2004 config --default-user shibli_nomani
😉 Install Java on Ubuntu
    java -version
    sudo apt update
    sudo apt install default-jdk
😉 Install Scala

Scala version check and install:

    scala -version
    sudo apt install scala
😉 Install PySpark in PowerShell (venv in VS Code)
    pip install pyspark

😀 Error: a Py4J exception occurs when the installed Apache Spark and PySpark versions do not match.

    spark-submit --version

  • Edit the requirements.txt file so that pyspark is pinned to the same version as the installed Spark, for example pyspark==3.4.2.

  • Install the requirements again in PowerShell.

    List of libraries

    • matplotlib
    • matplotlib-inline
    • numpy
    • keras
    • tensorflow
    • tensorflow-intel
    • scikit-learn
    • seaborn
    • scipy
    • pyspark==3.4.2
    pip install -r requirements.txt
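
A small sketch for checking that the two versions agree before running a job. It works on plain strings so it runs without Spark; `release_tuple` and `same_release` are hypothetical helpers, and the sample strings are illustrative. In a real environment, the PySpark side can be read from `pyspark.__version__`, and the Spark side from the output of `spark-submit --version`.

```python
import re

def release_tuple(version_text):
    """Extract the first 'X.Y.Z' found in the text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return tuple(int(part) for part in match.groups())

def same_release(spark_version, pyspark_version):
    """True when both strings carry the same X.Y.Z release number."""
    return release_tuple(spark_version) == release_tuple(pyspark_version)

# Illustrative strings only; paste in the versions from your own machine.
print(same_release("version 3.4.2", "3.4.2"))   # matching pin
print(same_release("version 3.4.2", "3.5.0"))   # mismatch
```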

Playing with PySpark

Be careful with the file path: use an absolute path to avoid the Py4JJavaError raised by file path issues. 😀 Solution:

Load data using SparkContext

data = sc.textFile("E:/Data Science/recommeder system/Recommender System/data/smallrating.csv")
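
Once the file is loaded as lines of text, each line still has to be split into fields. The sketch below assumes the CSV follows the MovieLens layout `userId,movieId,rating,timestamp` with a header row, which this README does not confirm; `parse_rating` is a hypothetical helper.

```python
def parse_rating(line):
    """Parse 'userId,movieId,rating[,timestamp]' into (int, int, float).

    Returns None for the header row and for malformed lines.
    """
    fields = line.strip().split(",")
    if len(fields) < 3:
        return None
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        return None  # e.g. the header line, whose fields are not numeric

# Illustrative lines in the assumed layout.
lines = ["userId,movieId,rating,timestamp", "1,2,3.5,1112486027", "bad line"]
parsed = [row for row in map(parse_rating, lines) if row is not None]
print(parsed)

# With the RDD loaded above, the same function could be applied as:
#   ratings = data.map(parse_rating).filter(lambda row: row is not None)
```

Returning None instead of raising keeps one bad line from failing the whole distributed job.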


Python version mismatch between worker and driver

Spark reports an error when Python in the worker has a different version (3.10) than Python in the driver (3.8). Install the Python version for the driver that matches the worker; here, 3.10 😆😆

  • For Python version installation:
  • Set the pyenv version in the PowerShell terminal of VS Code before creating any venv:
pyenv shell 3.10.8
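
As a complement to the pyenv step (an addition, not part of the original setup), the driver and workers can be pointed at the same interpreter from inside the script. `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are the environment variables Spark reads for this; using `sys.executable` for both is an assumption that suits a single local machine.

```python
import os
import sys

# Point both the driver and the workers at the interpreter running this script.
# This must run before the SparkContext or SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("Driver and workers will use:", os.environ["PYSPARK_PYTHON"])
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
```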

Apache Spark 🔥

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It supports a wide range of applications, including batch processing, streaming analytics, machine learning, and graph processing.

PySpark 🐍

PySpark is the Python API for Apache Spark, a fast and general-purpose cluster computing system for Big Data processing. It provides a simple and consistent interface for distributed data processing using Python.

MLlib 🧠

MLlib is Apache Spark's scalable machine learning library, offering a set of high-level APIs for scalable machine learning algorithms. It includes common learning algorithms and utilities, such as classification, regression, clustering, collaborative filtering, dimensionality reduction, and feature engineering, suitable for large-scale data processing tasks.

RDD (Resilient Distributed Dataset) 🚀

RDD is a core data structure in PySpark representing an immutable, distributed collection of objects. It enables parallel processing, fault tolerance, and high-level abstractions for distributed data processing.

  1. Purpose: RDDs facilitate distributed data processing across a cluster of machines, ensuring fault tolerance and providing a high-level abstraction for complex operations.
  2. Benefits:
  • Parallel Processing: Enables parallel computation across multiple nodes, improving performance and scalability.
  • Fault Tolerance: Automatic recovery from node failures ensures data integrity and reliability.
  • Data Immutability: Immutable nature simplifies parallel processing and ensures data consistency.
  • Lazy Evaluation: Lazy evaluation reduces unnecessary computation and optimizes performance.
  • Versatility: Supports various operations like map, reduce, filter, and join for diverse data processing tasks.
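
Lazy evaluation can be illustrated without a cluster. The sketch below is an analogy in plain Python, not Spark code: the built-in `map` and `filter` build a pipeline without running it, much as RDD transformations do, and `list(...)` plays the role of an action that triggers the work. The `traced_square` function and its `log` are hypothetical, added only to make the evaluation order visible.

```python
log = []  # records every element that actually gets processed

def traced_square(x):
    log.append(x)
    return x * x

numbers = range(1, 6)

# "Transformations": these only describe the pipeline; nothing runs yet.
squared = map(traced_square, numbers)
even_squares = filter(lambda value: value % 2 == 0, squared)
calls_before_action = len(log)

# "Action": materializing the result forces the whole pipeline to run.
result = list(even_squares)

print("calls before action:", calls_before_action)
print("result:", result)
print("elements processed:", log)
```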

Summary 🌟

PySpark, with its RDDs, empowers developers to efficiently process large-scale datasets in distributed environments. 🚀 It offers fault tolerance, scalability, and a versatile set of operations, making it a go-to choice for Big Data processing with Python. 🐍💻

Recommender System 🎯📊:

A software tool or algorithm that analyzes user preferences and behavior to make personalized recommendations for items or content. Recommender systems aim to assist users in finding relevant items of interest, such as movies, products, or articles, by leveraging techniques like collaborative filtering, content-based filtering, or hybrid methods. These systems are widely used in various domains, including e-commerce, streaming services, social media, and online platforms, to enhance user experience and engagement.

Collaborative Filtering 🤝🔍:

Recommender system technique based on users' past interactions to predict their future preferences without requiring explicit knowledge about users or items.

Matrix Factorization ⚙️🔢:

Mathematical technique that decomposes a matrix into lower-dimensional matrices, often used in recommendation systems to represent users and items as latent factors.
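
To make the idea of latent factors concrete, here is a tiny sketch in plain Python. Each user and each movie is represented by a short vector, and a predicted rating is the dot product of the two. All names and numbers are made up for illustration; real factors are learned from the data.

```python
# Hypothetical 2-dimensional latent factors (not learned, just illustrative).
user_factors = {
    "alice": [1.0, 0.5],
    "bob": [0.2, 1.5],
}
item_factors = {
    "movie_a": [4.0, 0.0],
    "movie_b": [1.0, 2.0],
}

def predict(user, item):
    """Predicted rating = dot product of the user and item factor vectors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))

# Rebuild the full user-by-movie rating table from the two small factor tables.
for user in user_factors:
    row = {item: round(predict(user, item), 2) for item in item_factors}
    print(user, row)
```

With 2 users, 2 movies, and 2 factors nothing is saved, but with thousands of users and movies the two factor tables are far smaller than the full rating matrix.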

Alternating Least Squares (ALS) 🔄➕🔲:

Matrix factorization algorithm commonly used in collaborative filtering recommender systems. It iteratively updates user and item factors to minimize the squared error between observed and predicted ratings.
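
The alternating update can be sketched in plain Python for the simplest case of one latent factor per user and per movie. This is a teaching sketch under stated assumptions, not Spark's ALS implementation: `als_rank1`, the regularization value, and the sample ratings are all hypothetical. With one factor, each least-squares step has a closed form, so no linear algebra library is needed.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Fit ratings[(i, j)] ~ u[i] * v[j] by alternating least squares."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Hold the item factors fixed and solve for each user factor.
        for i in range(n_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = sum(v[j] ** 2 for (ui, j), _r in ratings.items() if ui == i)
            u[i] = num / (den + reg)
        # Hold the user factors fixed and solve for each item factor.
        for j in range(n_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = sum(u[i] ** 2 for (i, ij), _r in ratings.items() if ij == j)
            v[j] = num / (den + reg)
    return u, v

def squared_error(ratings, u, v):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())

# Observed ratings keyed by (user, item); the pair (1, 2) is deliberately missing.
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
u, v = als_rank1(observed, n_users=2, n_items=3)
prediction = u[1] * v[2]
print("error on observed ratings:", squared_error(observed, u, v))
print("predicted rating for user 1, item 2:", prediction)
```

User 1 rated the first two movies twice as high as user 0, so the fitted factors predict a rating near 6 for the movie user 1 has not rated, which is the core of how factorization fills in missing entries.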

Relationship 💡🔄💡:

Collaborative filtering utilizes user-item interactions for recommendations. Matrix factorization techniques like ALS decompose interaction matrices to capture latent factors. ALS is a specific algorithm for collaborative filtering, implementing matrix factorization with alternating least squares optimization. In essence, collaborative filtering leverages user preferences, matrix factorization captures latent factors, and ALS optimizes this process for making recommendations.
