
Slides, code and more for my class: Data Analytics and Machine Learning on Big Data

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Data Analytics and Machine Learning on Big Data

What's In This Repo

This repo holds the materials for my Data Analytics and Machine Learning on Big Data class:

  • code - Code
  • case_studies - Case studies
  • cluster_setup - Scripts for setting up your cluster on AWS
  • data - Datasets
  • slides - Presentation slides
  • helpful_things - Additional helpful materials
    • Algorithm selection mindmap
    • Setting up a Spark Cluster on AWS tutorial
    • Setting up a Hadoop Cluster on AWS tutorial

Install Tutorials

Python and Spark

If you want to install and run everything locally, here are the best tutorials I've found for getting Python and Spark running on your computer.

Graphviz

To visualize the decision trees in Jupyter, you need to install both Graphviz itself and the graphviz Python package.

For the package install:

On Windows, once you have installed Graphviz, add the full path of the bin directory inside the Graphviz installation folder to your PATH.

To install the Python package on all operating systems: pip install graphviz
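
To confirm that both pieces are in place before opening the notebooks, a quick check like the one below may help. It is a minimal sketch using only the Python standard library; the function name check_graphviz_setup is made up for illustration and is not part of this repo or of the graphviz package. It looks for the dot executable (which ships in Graphviz's bin directory) on your PATH and checks whether the graphviz Python package can be imported.

```python
import importlib.util
import shutil


def check_graphviz_setup():
    """Report whether the Graphviz binaries and the Python package are visible.

    Hypothetical helper for illustration; not part of this repo.
    """
    # shutil.which searches the directories listed in PATH for the `dot`
    # executable and returns its full path, or None if it is not found.
    dot_path = shutil.which("dot")
    # find_spec returns None when the `graphviz` package is not importable
    # from the current Python environment.
    package_found = importlib.util.find_spec("graphviz") is not None
    return {"dot_executable": dot_path, "python_package": package_found}


status = check_graphviz_setup()
print(status)
```

If dot_executable comes back as None on Windows right after installing, the bin directory is probably not on your PATH yet, or the terminal or Jupyter session was started before you changed PATH and needs a restart. If python_package is False, run pip install graphviz in the same environment that Jupyter uses.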