Intro to PySpark MLlib

Introduction

Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. Because Spark supplies the scalability, language compatibility, and speed, data scientists can focus on their data problems and models rather than on the complexities of distributed data, such as infrastructure and configuration. MLlib is built on top of Spark and provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, along with the underlying optimization primitives. MLlib integrates with other Spark components such as Spark SQL, Spark Streaming, and DataFrames, and it is installed in the Databricks runtime.

Machine Learning Library (MLlib) Programming Guide

Data types
Basic statistics
- summary statistics
- correlations
- stratified sampling
- hypothesis testing
- random data generation
Classification and regression
- linear models (SVMs, logistic regression, linear regression)
- naive Bayes
- decision trees
- ensembles of trees (Random Forests and Gradient-Boosted Trees)
Collaborative filtering
- alternating least squares (ALS)
Clustering
- k-means
Dimensionality reduction
- singular value decomposition (SVD)
- principal component analysis (PCA)
Feature extraction and transformation
Optimization (developer)
- stochastic gradient descent
- limited-memory BFGS (L-BFGS)

mukeshmithrakumar/intro-to-pyspark-mllib
