Getting started with Spark, the big data analytics engine
- JDK
- Python (Anaconda)
choco install openjdk -y
choco install 7zip -y
choco install anaconda3 --params="/AddToPath:1" -y
Verify:
java --version
python --version
Spark is written in Scala (a language for the JVM), but you can interact with it using either Scala or Python.
- Read: https://spark.apache.org/
- Download Spark: https://spark.apache.org/downloads.html (e.g., to Downloads folder).
- Use 7zip to extract the downloaded .tgz, then extract the resulting .tar.
- Move the extracted folder so you have C:\spark-3.1.1-bin-hadoop2.7\bin
- Download winutils.exe from https://github.com/cdarlint/winutils/tree/master/hadoop-2.7.7/bin into the Spark bin folder.
- Optional: Create C:\tmp\hive. Then, from your new bin folder, open a Command Window as Admin and run winutils.exe chmod -R 777 C:\tmp\hive
- Set System Environment Variables:
- SPARK_HOME = C:\spark-3.1.1-bin-hadoop2.7
- HADOOP_HOME = C:\spark-3.1.1-bin-hadoop2.7
- Path - add %SPARK_HOME%\bin
- In PowerShell as Admin, run
spark-shell
to launch Spark with Scala (you should get a scala> prompt).
- In a browser, open http://localhost:4040/ to see the Spark shell web UI.
- Exit spark shell with CTRL-D (for "done")
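The shell predefines sc (the SparkContext) and spark (the SparkSession), so you can sanity-check the session right away; for example, at the scala> prompt:
sc.version                 // Spark version of the running shell
spark.range(5).count()     // runs a tiny job; it will show up in the web UI at localhost:4040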
- The Spark /bin folder includes pyspark. If you need a separate installation, follow the instructions at https://anaconda.org/conda-forge/pyspark: open an Anaconda prompt and run:
conda install -c conda-forge pyspark
- In PS as Admin, run
pyspark
to launch Spark with Python (you should get a Python >>> prompt).
- Quit the Python shell with the quit() function.
If you see a WARN about trying to compute the page size, just hit ENTER. That check works on Linux, but not on Windows.
Build custom apps using Java.
See https://github.com/denisecase/spark-maven-java-challenge
java -cp target/spark-challenge-1.0.0-jar-with-dependencies.jar edu.nwmissouri.isl.App "data.txt"
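For orientation, here is a rough sketch of what a small Spark driver like this might look like, written in Scala to match the rest of these notes (the object name and the word-count logic are illustrative; see the repo above for the actual Java challenge code):
// Illustrative word-count driver (names are made up; not the repo's actual class).
import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountApp").getOrCreate()
    val counts = spark.sparkContext
      .textFile(args(0))                 // e.g., "data.txt"
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    spark.stop()
  }
}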
Create a new RDD from an existing file. Use common textFile functions.
Implement word count.
- map - one in to one out
- reduce - all in, one out
- filter - some in, the same or fewer (possibly zero) out
- flatMap - one in to many out
- reduceByKey - many in, one out
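To get a feel for these before working through the file examples below, you can try them on a small in-memory RDD in spark-shell; a quick sketch:
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))   // small RDD to experiment with
nums.map(n => n * 2).collect()                  // map: one in, one out
nums.filter(n => n % 2 == 0).collect()          // filter: some in, fewer (or zero) out
nums.flatMap(n => Seq(n, n * 10)).collect()     // flatMap: one in, many out
nums.reduce((a, b) => a + b)                    // reduce: all in, one out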
val textFile = sc.textFile("README.md")   // create an RDD from a text file
textFile.count()                          // action: number of lines
textFile.first()                          // action: first line
val linesWithSpark = textFile.filter(line => line.contains("Spark"))   // transformation: new, filtered RDD
val ct = textFile.filter(line => line.contains("Spark")).count()       // chain a transformation with an action
Find the most words in a single line. All data in -> one out.
- First, map each line to words and get the size.
- Then, reduce all sizes to one max value.
Which functions should we use? Which version do you prefer?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)   // version 1: if expression
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))        // version 2: Math.max
val maxWords = textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
- First, flatMap each line to words.
- Then, map each word to a count (one).
- Then, reduceByKey to aggregate a total for each key.
- After transformations, use an action (e.g. collect) to collect the results to our shell.
Can you modify this to get the max by key? (One possible approach is sketched after the word-count code below.)
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
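One way to answer the max-by-key question above (reading it as: find the word with the largest count) is to reduce the pair RDD on its count field; a sketch:
// Pair with the highest count; ties are broken arbitrarily.
val maxWord = wordCounts.reduce((a, b) => if (a._2 > b._2) a else b)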
- Resilient Distributed Dataset (RDD) - a distributed collection of items
- RDD actions (return values)
- RDD transformations (return pointers to new RDDs)
- RDD transformations list (link)
- Application - custom driver program + executors
- SparkSession - access to Spark (interactive, or created in an app; see the sketch after this list)
- Job - parallel computation made of tasks, spawned by a Spark action (e.g., save(), collect())
- Stage - set of tasks
- Task - unit of work sent to a Spark executor
- Cluster manager (built-in, YARN, Mesos, or Kubernetes)
- Spark executor (usually one per worker node)
- Deployment modes (local, standalone, YARN, etc.)
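Outside the interactive shells, you create the SparkSession yourself in the driver; a minimal sketch, assuming a local run:
import org.apache.spark.sql.SparkSession

// local[*] = local deployment mode, using all available cores on this machine.
val spark = SparkSession.builder
  .appName("example")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext   // the same sc used in the shell examples above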
- Try Spark Quick Start at https://spark.apache.org/docs/2.1.0/quick-start.html
- Try Spark examples at https://spark.apache.org/examples.html
- Try Spark online with Databricks at https://databricks.com/
- Free Spark Databricks Community Edition at https://community.cloud.databricks.com/login.html
- Databricks Datasets at https://docs.databricks.com/data/databricks-datasets.html
- "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
- Learning Spark v2 (includes Spark 3)
- Intro to SparkNLP
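To try the diamonds dataset listed above from a Databricks notebook, a minimal Scala read sketch (the header and schema-inference options are assumptions about the file):
val diamonds = spark.read
  .option("header", "true")        // assume the CSV has a header row
  .option("inferSchema", "true")   // let Spark guess column types
  .csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
diamonds.show(5)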