
w261 Environment

Overview

Welcome to w261 - Machine Learning at Scale. In this class, in addition to learning about the Machine Learning models used in industry, you will be using production-grade technology and infrastructure deployment best practices. For about two thirds of the class, you will be using an environment orchestrated in Google Cloud. For the last third, you will get the opportunity to use Databricks on Azure.

GitHub Repository

Overview

A read-only GitHub repository will be used as a source of code for Homework and Live Session Labs.

Obtain GitHub Personal Token

While authenticated to GitHub, navigate to github/personal_tokens to obtain one. You will need it for the automation script mentioned below. Add a note such as w261 GCP or similar to keep track of this token. Lastly, check the box for the repo scope, which provides full control of private repositories; all the boxes underneath it will be checked automatically. Please be aware that you will only be able to see and copy this token once, so you may want to save a copy of it on your local Windows/Mac machine.
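
As a rough illustration only (the class repository URL is distributed separately, so the path below is just a placeholder), the token can be used in place of a password when cloning the read-only repo:

# Hypothetical example: substitute your token and the actual class repository path
git clone https://<your-token>@github.com/<org>/<repo>.git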

Google Cloud

Overview

Google Cloud is a state-of-the-art platform for Big Data analytics and orchestration. The service used in w261 is Dataproc, Google Cloud's main API for orchestrating Big Data Hadoop and Spark clusters.

Why Dataproc?

Dataproc offers plug-and-play cluster orchestration for Hadoop and Spark. JupyterLab comes out of the box through the GoogleUserContent front end, which is highly secure and keeps us from exposing our VM with an external IP address.

Redeem Credits on GCP

Google offers $300 in credits for new accounts. Log in to the GCP Console to take advantage of this offer by clicking the top banner that shows the promotion. You can create a Gmail account if you don't have an email account that is part of Google Workspace. You must have an account with a Billing Account set up before running the automated orchestration.

Note: Accepting this offer involves providing a credit card, but it will not be charged automatically once your credits are depleted. You will have an opportunity to decide whether to continue before GCP charges your credit card.
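
If you want to double-check that billing is in place before running the orchestration, a quick look from Cloud Shell is roughly:

# List the billing accounts visible to your user (you should see at least one open account)
gcloud billing accounts list

# Check whether a given project is linked to a billing account (the project ID is a placeholder)
gcloud billing projects describe <your-project-id>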

Automated Orchestration using GCP CloudShell

For this class, we will be using a single automation script that helps us navigate some of the complexity of the cloud and compute world.

The first step is to open the GCP Console, and click the terminal icon >_ in the top blue bar.

This will open a panel at the bottom of the screen: this is your Cloud Shell. It is serverless compute with 5 GB of storage allocated to you, and it is a great bridge between all the components we will be using in w261. From here, using the automation script, you will be able to deploy clusters, load data into Buckets, and pull code from the Main Repo. The best part of Cloud Shell is that it's free.

Running the automated script on Cloud Shell guarantees that you have the appropriate dependencies, packages, and environment.
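
Before running the automation script, it can also help to confirm which account and project your Cloud Shell session is using; these are standard gcloud/gsutil commands:

# Show the account you are authenticated as
gcloud auth list

# Show the active project and other configuration
gcloud config list

# List the Buckets visible in the active project
gsutil ls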

GCP Infrastructure Deployment

The script you need to run prepares a Google project with all the artifacts needed to work in a secure environment with Dataproc. Please take a look at the documentation in Create Dataproc Cluster for a look at the orchestration under the covers.

Please follow the prompts:

gsutil cat gs://w261-hw-data/w261_env.sh | bash -euo pipefail

This script will take longer to run the first time, before you have deployed any cluster. Once all the components are deployed, subsequent runs will skip the orchestration and create clusters on demand directly, although the script will always check that all components are installed. To run the script, follow the prompts: after you run the command line above, press Q to exit the Welcome screen and begin running the actual script. You will have to respond y to the first question (Do you want to proceed?) and then answer some of the follow-up questions. Please run the script again until the prompts confirm that a cluster was successfully created.

You can see your clusters in GCP Dataproc. If you don't see your cluster, switch to the w261-student project in the top blue GCP bar. Remember that you will be consuming credits on a per-second basis. The orchestration was put together with this in mind, and if you follow best practices, $300 should be more than enough.

It's up to you whether you delete the cluster directly or let the max-idle feature kick in (you select the idle timeout every time you create a cluster: 1h, 2h, 3h, 6h, 12h, or 24h).
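
If you prefer to manage clusters from Cloud Shell rather than the console, commands along these lines should work; the cluster name and region are placeholders you will need to adjust to whatever the automation script created:

# Make sure the class project is active (adjust if your project ID differs)
gcloud config set project w261-student

# List clusters in your region to find the exact cluster name
gcloud dataproc clusters list --region=<your-region>

# Delete a cluster you no longer need
gcloud dataproc clusters delete <your-cluster-name> --region=<your-region>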

Things to know

  • Once you open JupyterLab, navigate to the root folder, where you will see two folders: GCS and Local Disk. We will work on Local Disk for HW1 and HW2, and for all the first Labs before turning to Spark. The automation script makes sure the files are properly loaded as long as you have run the script at least once.

  • When working on a Notebook, get the full path of the folder where the notebook is located, and then add a new cell at the very top like this one:

%cd /full/path/to/the/notebook
  • To get the data for the HWs, add a new cell, comment out the previous commands that pulled the data (such as !curl, !wget, and similar), and obtain the data from the GCS Data Bucket created by the first automation script:
!mkdir -p data/
!gsutil cp gs://<your-data-bucket>/main/Assignments/HW2/data/* data/

Feel free to explore where the data is for a specific HW with gsutil ls gs://<your-data-bucket>/main/Assignments/HW*. If you don't remember your GCS Data Bucket, run gsutil ls to get a list of the Buckets in your account.

  • For Hadoop, the new location of the JAR_FILE is shown below (a usage sketch appears after this list):
JAR_FILE = '/usr/lib/hadoop/hadoop-streaming-3.2.2.jar'
  • For debugging, go to Dataproc -> Clusters -> Web Interfaces and look for:

    • MapReduce Job History for Hadoop job logs.
    • Spark History Server for Spark job logs.
  • In Jupyter, when running mkdir, use -p to make sure the entire path is created even if intermediate folders don't exist.

    • !hdfs dfs -mkdir -p {HDFS_DIR}
  • Spark UI for Current Notebook

    • The Spark UI for current jobs and Notebook can be accessed via SSH directly into the Master Node.
    • Open the Cloud Shell.
    • Get the zone where your Master node is located. Adjust the name of your instance if it differs. You can also assign the value directly if you already know it.
    ZONE=$(gcloud compute instances list --filter="name~w261" --format "value(zone)")
    • SSH into the VM using your Cloud Shell. This can also be done from your local terminal or the Google Cloud SDK if you are running Windows. Adjust the name of your instance if different.
    gcloud compute ssh w261-m --ssh-flag "-L 8080:localhost:42229" --zone $ZONE
    • Click the Web Preview button at the top right in the Cloud Shell panel. We mapped this port to 8080, which is the default port number that Web Preview uses.
    • By default, Dataproc runs the Spark UI on port 42229. Adjust accordingly if using a different port. To get the port number, open a new cell and run the variable spark (if a SparkSession is already established). You'll see the UI link; hover over it to get the port number (see also the sketch after this list).
    • Keep the Cloud Shell alive by running sleep 1800, or whatever duration you are comfortable keeping the tunnel open for.
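
For the Hadoop JAR_FILE noted above, a typical Hadoop Streaming invocation from a notebook cell looks roughly like the sketch below; the mapper/reducer file names and the HDFS input/output paths are hypothetical placeholders:

# Sketch only: mapper.py, reducer.py, and the HDFS paths are placeholders
!hadoop jar {JAR_FILE} -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input {HDFS_DIR}/input -output {HDFS_DIR}/output

For the Spark UI port, instead of hovering over the link you can print the UI URL from an established SparkSession using PySpark's uiWebUrl attribute; the port in the output is the one to map in the SSH tunnel above:

# Prints something like http://<hostname>:42229 for the current application
spark.sparkContext.uiWebUrl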