This repo contains the materials for my Data Analytics and Machine Learning on Big Data class. It contains:
code
- Code

case_studies
- Case studies

cluster_setup
- Scripts for setting up your cluster on AWS

data
- Datasets

slides
- Presentation slides

helpful_things
- Additional helpful materials
  - Algorithm selection mindmap
  - Setting up a Spark Cluster on AWS tutorial
  - Setting up a Hadoop Cluster on AWS tutorial

If you want to install and run everything on your own computer, here are the best tutorials I've found for getting Python and Spark running locally.
To visualize the decision trees in Jupyter, you will need to install both Graphviz itself and the graphviz Python package.
For the package install:
On Windows, once you install Graphviz, add the full path to the bin directory of your Graphviz installation to your PATH.
To install the Python package on all operating systems: pip install graphviz
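
If you want to confirm that both pieces are in place before opening a notebook, a quick check along these lines may help. This is only a sketch: the function name `check_graphviz_setup` is made up for illustration, and it assumes that the Graphviz install puts the `dot` executable on your PATH and that the pip package is importable as `graphviz`.

```python
import importlib.util
import shutil


def check_graphviz_setup():
    """Report whether both halves of the Graphviz setup are visible to Python.

    Hypothetical helper, not part of the course code.
    """
    # The Graphviz install provides the `dot` executable; it has to be on PATH.
    dot_path = shutil.which("dot")
    # `pip install graphviz` provides a module importable as `graphviz`.
    has_python_package = importlib.util.find_spec("graphviz") is not None
    return {"dot_executable": dot_path, "python_package": has_python_package}


if __name__ == "__main__":
    status = check_graphviz_setup()
    print("dot executable:", status["dot_executable"] or "not found on PATH")
    print(
        "graphviz Python package:",
        "installed" if status["python_package"] else "not installed",
    )
```

If the `dot` executable is reported as not found on Windows, the PATH entry described above is the first thing to check. You may need to restart your terminal or Jupyter for a PATH change to take effect.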