spark-project

Learning Spark for big data


Hadoop Ecosystem

Big Data

3V's of big data

  1. Volume
  2. Variety
  3. Velocity
  4. Veracity is sometimes added as a fourth V; it indicates the uncertainty in the data

Hadoop:

  • Runs on commodity hardware
  • Low cost per GB
  • Ability to store Petabytes of data in a single cluster
  • Hadoop was originally developed at Yahoo, inspired by the Google File System and Google MapReduce papers

Hadoop Ecosystem

What is Hadoop?

  • Hadoop is a distributed system
  • The data nodes and worker nodes can scale horizontally (data is stored on the data nodes and processes run over them)
  • In Hadoop we bring the computation to where the data is stored.
  • The NameNode stores the metadata for the files in Hadoop; it also runs a YARN ResourceManager service, which is responsible for allocating the resources our application needs on the cluster.
  • There is a standby NameNode and a standby YARN ResourceManager in case the main ones fail.
  • Hadoop will automatically recover from a node failure. This means that code within the application may get executed more than once, and developers need to handle that.

Hadoop Model

Hadoop Distributions

HDFS and MapReduce

  • HDFS will use blocks of data with a size of 128 MB by default
  • Each of these blocks is replicated 3 times
  • The blocks are distributed in a way to support fault tolerance
  • MapReduce is a way of splitting a computation task across a distributed set of files (a toy sketch follows this list)
  • MapReduce consists of a Job Tracker and multiple Task Trackers
  • The Task Trackers allocate CPU and memory for the tasks and monitor them on the worker nodes
  • The Job Tracker sends the code to run on the Task Trackers
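
To make the map and reduce phases concrete, here is a toy single-machine word count in plain Python (not Hadoop code; the input lines are made up):

from collections import defaultdict

# Toy, single-machine illustration of the MapReduce phases.
# On Hadoop, the map calls would run in parallel on the Task Trackers and
# the framework itself would shuffle the intermediate pairs by key.
lines = ["big data is big", "spark and hadoop"]

# Map phase: turn each input record into (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 1, 'is': 1, 'spark': 1, 'and': 1, 'hadoop': 1}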

Spark

  • We can think of Spark as a flexible alternative to MapReduce.
  • MapReduce requires files to be stored in HDFS, Spark doesn't
  • Spark can also perform operations up to 100x faster than MapReduce. How does it manage that?
    • MapReduce writes most data to disk after every map and reduce operation
    • Spark keeps most of the data in memory after each transformation
    • Spark can spill over to disk if the memory is filled.
  • At the core of Spark is the idea of a Resilient Distributed Dataset (RDD); a minimal sketch follows this list
  • RDD has 4 main features:
    • Distributed Collection of Data
    • Fault tolerant
    • Parallel operation - partitioned
    • Ability to use many data sources
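
A minimal local sketch of an RDD and its partitions (the app name and the commented-out file path are just placeholders):

from pyspark import SparkConf, SparkContext

# Minimal RDD sketch; "rdd-demo" is only a placeholder app name.
conf = SparkConf().setMaster("local[*]").setAppName("rdd-demo")
sc = SparkContext.getOrCreate(conf)

# A distributed collection of data, split into partitions processed in parallel.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())  # 4

# RDDs can be built from many data sources as well, e.g. a text file
# (a local path here; an hdfs:// or s3a:// URI works the same way):
# text = sc.textFile("data/some_file.txt")

# Lost partitions can be recomputed from the lineage, which is what makes RDDs fault tolerant.
print(rdd.map(lambda x: x * x).collect())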


  • RDDs are immutable, lazily evaluated, and cacheable. There are two types of Spark operations:
    • Transformations are basically a recipe to follow.
    • Actions actually perform what the recipe says to do and return something back.
  • This behaviour carries over to the syntax when coding. A lot of the time you will write a method call, but you won't see anything happen until you call an action. This makes sense: with a large dataset, you don't want to compute all the transformations until you are sure you want to perform them. The sketch below illustrates this lazy behaviour.
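
A small sketch of that lazy behaviour (local mode, arbitrary numbers; the app name is a placeholder):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("lazy-eval"))

nums = sc.parallelize(range(1_000_000))

# Transformations only build up the recipe (the RDD lineage); nothing runs here.
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)
doubled.cache()          # ask Spark to keep this RDD in memory once it is computed

# Actions execute the recipe and return a result to the driver.
print(doubled.count())   # first action: runs the whole lineage, then caches the result
print(doubled.take(5))   # second action: served from the cached data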

Setup Spark

  1. Install Java
  2. Install Python
  3. Install Scala
  4. Set environment variables
# Settings for PySpark
export SPARK_HOME='{path where spark package is downloaded}/spark-3.0.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
  5. Activate the Python virtual environment
. ./setup-env.sh
pip install -r requirements-spark.txt
  6. Start Jupyter notebook
jupyter notebook
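
Once the notebook is up, a quick sanity check is to create a local SparkSession in a cell; if the environment variables above were picked up, this should print the installed Spark version (the app name is arbitrary):

from pyspark.sql import SparkSession

# Start (or reuse) a local session; "setup-check" is only a placeholder app name.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("setup-check")
         .getOrCreate())

print(spark.version)   # e.g. 3.0.1
spark.range(5).show()  # tiny DataFrame to confirm the session actually works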

Spark structure

Spark MLlib

Spark’s MLlib is mainly designed for Supervised and Unsupervised Learning tasks, with most of its algorithms falling under those two categories.

  1. Supervised learning algorithms are trained using labeled examples, i.e. inputs where the desired output is known.
  2. Unsupervised learning is used against data that has no historical labels.

One of the main “quirks” of using MLlib is that you need to format your data so that it eventually has just one or two columns (a sketch follows this list):

  • Features, Labels (Supervised)
  • Features (Unsupervised)
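
A minimal sketch of getting data into that shape with MLlib's VectorAssembler (the column names and values below are made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.master("local[*]").appName("mllib-format").getOrCreate()

# Hypothetical raw data: three numeric feature columns plus a label.
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 0.0),
     (4.0, 5.0, 6.0, 1.0)],
    ["x1", "x2", "x3", "label"])

# Collapse the input columns into the single 'features' vector column MLlib expects.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
prepared = assembler.transform(df).select("features", "label")
prepared.show(truncate=False)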

Confusion Matrix
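
One way to compute a confusion matrix in Spark is MulticlassMetrics from pyspark.mllib.evaluation; the (prediction, label) pairs below are made up and would normally come from a fitted classifier:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.evaluation import MulticlassMetrics

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("confusion-matrix"))

# Made-up (prediction, label) pairs; in practice these come from a model's output.
prediction_and_labels = sc.parallelize([
    (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])

metrics = MulticlassMetrics(prediction_and_labels)
print(metrics.confusionMatrix().toArray())  # predicted classes in columns, actual labels in rows
print(metrics.accuracy)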

Natural Language Processing

  1. Tokenization
  2. Stop word removal
  3. Stemming and Lemmatization
  4. Count vectorization - bag of words (BOW)
  5. TF-IDF
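
A rough sketch of these steps with MLlib's feature transformers (the sample sentences are made up; MLlib has no built-in stemmer or lemmatizer, so step 3 typically needs an external library such as NLTK):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

spark = SparkSession.builder.master("local[*]").appName("nlp-demo").getOrCreate()

df = spark.createDataFrame(
    [(0, "Spark makes big data processing simple"),
     (1, "big data needs big tools")],
    ["id", "text"])

# 1. Tokenization
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
# 2. Stop word removal
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
# 4. Count vectorization (bag of words)
tf = CountVectorizer(inputCol="filtered", outputCol="tf").fit(filtered).transform(filtered)
# 5. TF-IDF
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)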

TF-IDF

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.

Spark Streaming

Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Spark Streaming processing

The steps for streaming are:

  1. Create a SparkContext
  2. Create a StreamingContext
  3. Create a Socket Text Stream
  4. Read in the lines as a “DStream”

The steps for working with the data:

  1. Split the input line into a list of words
  2. Map each word to a tuple: (word,1)
  3. Then group (reduce) the tuples by the word (key) and sum up the second argument (the number one)
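
Putting the steps above together, a minimal network word count sketch looks roughly like this (localhost and port 9999 match the nc command below):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# "local[2]": at least two threads, one for the receiver and one for processing.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

# Read the socket as a DStream of lines (pairs with `nc -lk 9999` below).
lines = ssc.socketTextStream("localhost", 9999)

words = lines.flatMap(lambda line: line.split(" "))     # 1. split each line into words
pairs = words.map(lambda word: (word, 1))               # 2. map each word to (word, 1)
word_counts = pairs.reduceByKey(lambda a, b: a + b)     # 3. reduce by key, summing the ones

word_counts.pprint()
ssc.start()
ssc.awaitTermination()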

Command to run on the terminal (Netcat, to send test lines to port 9999):

nc -lk 9999