/spark_ml_train_model

This repo contains the files needed to follow the YouTube example that shows how to train a Spark ML model from scratch.

Primary language: Jupyter Notebook

Build and Train ML model with Spark ML

Building and training a Machine Learning (ML) model with Spark is not hard. In this tutorial we will build a simple binary classification model with Spark. We will use Spark's built-in Logistic Regression algorithm and then evaluate the model by computing its performance metrics.

There are some differences from how we do it in Scikit-Learn. Spark provides a built-in SparkML engine with a rich SparkML API, which you can leverage to build your own Machine Learning model.

In this tutorial we are using SparkUI v.3.2.1 with pyspark-shell.

The critical points you should pay attention to are:

  • Datatypes (DTypes)
  • String Indexer and One-Hot-Encoding for categorical features.
  • Vector Assembler.

All these parts are explained and demonstrated in detail in this tutorial. You will also learn what SparkContext and SparkSession are and how they differ. You will then be able to check the data schema, handle data types in a Spark DataFrame, and select features within your data. As required for ML modelling, you will also learn how to split your data into train and test sets.

You will also learn how to set up ML stages with Spark and build a custom ML Pipeline to train your Machine Learning model.

At the end, you will learn how to get model performance metrics, such as Precision, Recall, or ROC curve values.

LINK TO THE FULL YOUTUBE VIDEO TUTORIAL IS HERE !