spark_ML

Machine Learning using Spark MLlib

Spark logo: http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png
Python logo: http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png


On Linux: Ubuntu 14.04.5 LTS (trusty).

Hadoop does batch processing, i.e. processing of blocks of data already stored over a period of time, and initially Hadoop's MapReduce was the leading framework for processing data in batches. Spark is an open-source cluster computing framework that adds real-time processing: it extends the MapReduce model to efficiently support more types of computation, and by keeping intermediate data in memory it can be up to about 100 times faster than Hadoop MapReduce when batch processing large data sets.
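
As a concrete sketch of the MapReduce model that Spark extends, here is a minimal word count written against Spark's RDD API; it assumes the pyspark shell, where the SparkContext sc is predefined:

# Map/reduce-style word count on an in-memory RDD (sc comes from the pyspark shell).
lines = sc.parallelize(["spark does batch", "spark does streaming"])
counts = (lines.flatMap(lambda line: line.split())  # map each line to its words
               .map(lambda word: (word, 1))         # emit a (word, 1) pair per word
               .reduceByKey(lambda a, b: a + b))    # reduce: sum the counts per word
print(counts.collect())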

Spark can create distributed datasets from any file stored in the Hadoop Distributed File System (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
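
As a quick illustration, the same textFile call works for local files and HDFS alike; the paths below are hypothetical placeholders:

# Hypothetical paths -- substitute files that exist on your machine/cluster.
local_rdd = sc.textFile("file:///home/user/data/sample.txt")     # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")   # HDFS
print(local_rdd.count())  # number of lines read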

More differences and Spark details are here: https://www.edureka.co/blog/spark-tutorial/



Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

Once installed, run it as:
/usr/local/hadoop/bin/hadoop


Scikit-Learn ML Examples:
http://scikit-learn.org/stable/auto_examples/index.html#






Spark Examples:
https://spark.apache.org/examples.html
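
One of the examples from that page, estimating Pi by sampling random points, runs as-is in the pyspark shell (the sample count here is arbitrary):

import random

NUM_SAMPLES = 1000000  # arbitrary; more samples give a tighter estimate

def inside(_):
    # Draw a random point in the unit square; keep it if it lands in the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))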

Launching the pyspark shell prints a banner like the following:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version ...
      /_/

A problem like the following may come up while running pyspark:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 52.0

Class file version 52.0 corresponds to Java 8, so JDK 8 (along with Apache Maven) had to be installed. Details here:
https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04

Another problem to keep in mind: some of the Spark MLlib example code on the website assumes a context and session already exist, so add the context and session variables yourself:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')   # run Spark locally in a single JVM
spark = SparkSession(sc)     # entry point for the DataFrame and SQL APIs
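
With sc and spark defined as above, the DataFrame-based MLlib API can be used directly. Below is a minimal sketch in the spirit of the documentation's logistic regression example, assuming Spark 2.x; the toy data is made up:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Toy training set: (label, features) rows; the numbers are made up.
training = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)  # hyperparameters chosen arbitrarily
model = lr.fit(training)
print(model.coefficients)  # learned weights, one per feature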


Big Data with Apache Spark:



HDFS with Spark: https://cbw.sh/spark.html

Setting Up Your Environment - In order to use HDFS and Spark, you first need to configure your environment so that you have access to the required tools. The easiest way to do this is to modify the .bashrc configuration file in your home directory.
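
After editing .bashrc and re-sourcing it, a quick sanity check is to confirm the variables are visible from Python; the variable names below are the ones such tutorials typically export, so adjust them to your setup:

import os

# Typical Hadoop/Spark environment variables -- names are assumptions, adjust as needed.
for var in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))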