Spark-for-Machine-Learning-AI

These are my lesson notes and exercises for the LinkedIn course Spark-for-Machine-Learning-AI.

Introduction to Spark and MLlib
Data Preparation and Transformation
- Numeric:
  - MinMaxScaler
  - StandardScaler
  - Bucketizer
- Text:
  - Tokenizer
  - HashingTF
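
As a rough illustration of what the transformers listed above compute, here is a minimal plain-Python sketch. It is not the MLlib API: the function names (`min_max_scale`, `standardize`, `bucketize`, `tokenize`, `hashing_tf`) are hypothetical, the midpoint returned for a constant column is an assumption of this sketch, and `zlib.crc32` stands in for the hash function Spark uses internally, so the bucket indices will differ from Spark's.

```python
import bisect
import statistics
import zlib


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly into [low, high] (the idea behind MinMaxScaler)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return the midpoint (assumption of this sketch).
        return [(low + high) / 2 for _ in values]
    return [low + (v - lo) / (hi - lo) * (high - low) for v in values]


def standardize(values, with_mean=True):
    """Scale to unit sample standard deviation, optionally centering on the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    shift = mean if with_mean else 0.0
    return [(v - shift) / std for v in values]


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) containing value.

    The last bucket also includes its upper bound.
    """
    if not splits[0] <= value <= splits[-1]:
        raise ValueError(f"{value} is outside the splits {splits}")
    return min(bisect.bisect_right(splits, value) - 1, len(splits) - 2)


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector via hashing."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash chosen for this sketch only.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(values)
z_scores = standardize(values)
buckets = [bucketize(v, [0.0, 5.0, 10.0]) for v in values]
tokens = tokenize("Spark makes big data simple spark")
term_counts = hashing_tf(tokens)
print(scaled, z_scores, buckets, tokens, term_counts, sep="\n")
```

Because hashing collapses the vocabulary into a fixed number of slots, two different tokens can land in the same slot; a larger `num_features` makes such collisions less likely.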
Clustering
- K-means
- Hierarchical clustering with Bisecting K-means
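
To make the K-means idea concrete, here is a small plain-Python sketch of the basic assign-and-update loop (Lloyd's algorithm). It is only an illustration, not Spark's implementation: the `k_means` function is hypothetical, and seeding the centroids with the first `k` points is a simplification chosen to keep the example deterministic.

```python
import math


def k_means(points, k, max_iter=20):
    """Cluster points with the basic K-means assign/update loop."""
    # Simplified initialization: use the first k points as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Bisecting K-means builds a hierarchy by repeatedly applying a two-way split like this one to a chosen cluster instead of fitting all `k` centroids at once.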
Classification
- Naive Bayes
- Multilayer perceptron
- Decision trees
Regression
- Linear regression
- Decision tree regression
- Gradient-boosted tree regression (required significant time to build the model)
Recommendations
- Collaborative Filtering
  - In Spark: uses the Alternating Least Squares (ALS) method
- Content-Based Filtering
Tips for using Spark MLlib:
- (1) Processing:
  - Collect, reformat, and transform data
    - Load data into Spark DataFrames
    - Include headers (column names) in the text file
    - Use inferSchema=True
    - Use StringIndexer to map from string to numeric indexes
- (2) Model Building:
  - Apply machine learning algorithms to training data
    - Split data into training and test sets
    - Fit models using training data
    - Create predictions by applying a transform to the test data
- (3) Validation:
  - Assess the quality of models built in step 2
    - Use MLlib evaluators:
      - MulticlassClassificationEvaluator
      - RegressionEvaluator
    - Experiment with multiple algorithms
    - Vary hyperparameters
- Other suggestions:
  - (1) MLlib Docs:
    - Detailed API documentation and examples
  - (2) Kaggle:
    - Data sets and articles
  - (3) AWS Data Sets:
    - Big data and public data sets
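
The three-step workflow in the tips above (processing, model building, validation) can be sketched end to end in plain Python. This is a hedged illustration of the flow rather than MLlib code: the CSV data is made up for the example, and `load_rows`, `string_index`, `fit_line`, and `rmse` are hypothetical helpers standing in for loading with a header and inferred types, `StringIndexer`, a regression fit, and `RegressionEvaluator` respectively.

```python
import csv
import io
import math
import random
from collections import Counter

# Made-up data for this sketch: price is exactly 2 * size + 1.
CSV_TEXT = """city,size,price
paris,1,3
rome,2,5
paris,3,7
oslo,4,9
paris,5,11
rome,6,13
paris,7,15
paris,8,17
"""


def load_rows(text):
    """Read CSV with a header row; convert numeric-looking fields to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        parsed = {}
        for name, value in record.items():
            try:
                parsed[name] = float(value)
            except ValueError:
                parsed[name] = value  # keep non-numeric fields as strings
        rows.append(parsed)
    return rows


def string_index(rows, column):
    """Map each string label to a numeric index, most frequent label first."""
    counts = Counter(row[column] for row in rows)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(predictions, labels):
    """Root mean squared error between predictions and true labels."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


# (1) Processing: load the data and index the string column.
rows = load_rows(CSV_TEXT)
city_index = string_index(rows, "city")
for row in rows:
    row["city_index"] = city_index[row["city"]]

# (2) Model building: split, fit on training data, predict on test data.
shuffled = rows[:]
random.Random(42).shuffle(shuffled)
cut = int(0.75 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]
slope, intercept = fit_line(
    [r["size"] for r in train], [r["price"] for r in train]
)
predictions = [slope * r["size"] + intercept for r in test]

# (3) Validation: score the predictions on the held-out test rows.
error = rmse(predictions, [r["price"] for r in test])
print(city_index, slope, intercept, error)
```

Keeping the test rows out of the fit is what makes the final score a fair estimate; the same split can then be reused to compare other algorithms or hyperparameter settings.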

kevin-chao-com/Spark-for-Machine-Learning-AI