/ScalableML

Teaching materials for module COM6012 Scalable Machine Learning, University of Sheffield, 2024

COM6012 Scalable Machine Learning - University of Sheffield

Spring 2024

by Shuo Zhou and Robert Loftin, with Tahsin Khan and Xianyuan Liu

In this module, we will learn how to do machine learning at large scale using Apache Spark. We will use the University's High Performance Computing (HPC) cluster systems. If you are NOT on the University's network, you must connect through the VPN (Virtual Private Network) to reach the HPC.

This edition uses PySpark 3.5.0, the latest stable release of Spark (Sep 13, 2023), and consists of the 10 sessions listed below. For more information, such as the timetable and assessment details, refer to the overview slides.

  • Session 1: Introduction to Spark and HPC (Shuo Zhou)
  • Session 2: RDD, DataFrame, ML pipeline, & parallelization (Shuo Zhou)
  • Session 3: Scalable logistic regression (Shuo Zhou)
  • Session 4: Scalable generalized linear models (Robert Loftin)
  • Session 5: Scalable decision trees and ensemble models (Tahsin Khan)
  • Session 6: Scalable neural networks (Tahsin Khan)
  • Session 7: Scalable matrix factorisation for collaborative filtering in recommender systems (Robert Loftin)
  • Session 8: Scalable k-means clustering and Spark configuration (Robert Loftin)
  • Session 9: Scalable PCA for dimensionality reduction and Spark data types (Robert Loftin)
  • Session 10: Apache Spark in the Cloud (Xianyuan Liu)

You can also download the Spring 2023 version for preview or reference.

If you do not have a GitHub account yet, we recommend that you sign up for one to learn how to use this popular open-source software development platform.

An Introduction to Transparent Machine Learning

Shuo Zhou developed the course An Introduction to Transparent Machine Learning with Prof. Haiping Lu, as part of the Alan Turing Institute’s online learning courses in responsible AI. If you are interested, this introductory course, which emphasises transparency in machine learning, can support your learning of scalable machine learning.

Acknowledgement

The materials are built with references to the following sources:

Many thanks to

  • Haiping Lu and Mauricio A Álvarez, who developed this module from 2016 to 2023(2). Their contributions are still reflected in the materials.
  • Mike Croucher, Neil Lawrence, Will Furnass, Twin Karmakharm, Mike Smith, Xianyuan Liu, Desmond Ryan, and Vamsi Sai Turlapati for their input and inspiration since 2016.
  • Our teaching assistants and students who have contributed in many ways since 2017.