Learning Hadoop and Spark

This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.

🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running Hadoop & associated libraries (i.e. Hive, Pig, Spark...) workloads

🌩️ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP DataProc, AWS EMR --or--
- Databricks on AWS

⛈️ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads

Development Environment Setup Information

You have a number of options - although it is possible for you to set up a local Hadoop/Spark cluster, I do NOT recommended this approach as it's needlessly complex for initial study. Rather I do recommend that you use a partially or fully-managed cluster. For learning, I most often use a fully-managed (free tier) cluster.

1. SaaS - Databricks --> MANAGED

Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure or GCP --> announced in 20201 - link. In this course, I use Databricks running on AWS, as the community editor is simple and fast to set up for learning purposes.

Use Databricks Community Edition (managed, hosted Apache Spark), run on AWS. Example notebook shown in screenshot above.
- uses Databricks (Jupyter-style) notebooks to connect to a one or more custom-sized and managed Spark clusters
- creates and manages your data files stored in cloud buckets as part of Databricks service
- uses DFS file system in cluster data operations
- use Databricks AWS community edition (simplest set up - free tier on AWS) - link --OR--
- use Databricks Azure trial edition - Azure may require a pay-as-you-go account to get needed CPU/GPU resources
- try Databricks on GCP beta - announced recently - link

2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

Setup a Hadoop/Spark managed cloud-cluster via GCP Dataproc or AWS EMR
- see setup-hadoop folder in this Repo for instructions/scripts
  - create a GCS (or AWS) bucket for input/output job data
  - see example_datasets folder in this Repo for sample data files
- for GCP use DataProc includes Jupyter notebook interface --OR--
- for AWS use EMR you can use EMR Studio (which includes managed Jupyter instances) - link example screenshot shown above
- for Azure it is possible to use their HDInsight service. I prefer Databricks on Azure because I find it to be more feature complete and performant.

3. IaaS local or cloud --> MANUAL

Setup Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
- NOT RECOMMENDED for learning - too complex to set up
- Cloudera Learning VM - also NOT recommended, changes too often, documentation not aligned

Example Jobs or Scripts

EXAMPLES from org.apache.hadoop_or_spark.examples - link for Spark examples

Run a Hadoop WordCount Job with Java (jar file)
Run a Hadoop and/or Spark CalculatePi (digits) Script with PySpark or other libraries
Run using Cloudera shared demo env
- at https://demo.gethue.com/
- login is user:demo, pwd:demo

maxcrom/learning-hadoop-and-spark