Advanced Machine Learning with Spark 2.x [video]

This is the code repository for Advanced Machine Learning with Spark 2.x [video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish.

About the Video Course

The aim of this course is to provide a practical understanding of advanced Machine Learning algorithms in Apache Spark to make predictions and recommendations, and to derive insights from large distributed datasets. This course starts with an introduction to the key concepts and data types that are fundamental to understanding distributed data processing and Machine Learning with Spark.

Further to this, we provide practical recipes that demonstrate some of the most popular algorithms in Spark, leading to the creation of sophisticated Machine Learning pipelines and applications. The final sections are dedicated to more advanced use cases for Machine Learning: streaming, Natural Language Processing, and Deep Learning. In each section, we briefly establish the theoretical basis of the topic under discussion and then cement our understanding with practical use cases.

What You Will Learn

  • Get introduced to Machine Learning libraries and datatypes in Spark: MLlib, ML, vectors, matrices, labeled points, rating datatypes, and more (a short sketch follows this list).
  • Understand the key components of Machine Learning applications.
  • Learn to evaluate, fine-tune, save, and deploy models and pipelines.
  • Deploy Machine Learning models in a typical streaming application.
  • Understand Natural Language Processing in Spark.
  • Understand Deep Learning workflows in Spark.
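
For orientation, the sketch below (not taken from the course files) shows what these datatypes look like in the Spark 2.x API:

    import org.apache.spark.ml.linalg.{Matrices, Vectors}          // DataFrame-based ML API
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}   // RDD-based MLlib API
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.recommendation.Rating

    // Dense and sparse feature vectors; the sparse form lists
    // (size, indices, values).
    val dense  = Vectors.dense(1.0, 0.0, 3.0)
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // A 2x2 dense matrix, stored in column-major order.
    val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))

    // A labeled point pairs a label with a feature vector (MLlib API).
    val lp = LabeledPoint(1.0, OldVectors.dense(1.0, 0.0, 3.0))

    // A (user, product, rating) triple, the input to ALS recommenders.
    val r = Rating(1, 42, 4.5)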

Instructions and Navigation

Assumed Knowledge

To fully benefit from the coverage included in this course, you will need:

  • Prior working knowledge of the Scala language
  • Familiarity with Git and GitHub for source control
  • Basic understanding of Apache Spark

Technical Requirements

This course has the following software requirements:

  • IntelliJ IDEA
  • JDK 8

This course has been tested on the following system configuration:

  • OS: macOS Sierra
  • Processor: Intel i7 2800
  • Memory: 16GB

Apache Spark 2.0.0 application starter template

Features

  • Can use Spark interactively from the console:
    $ sbt
    ...
    [spark2-project1]> console
    ...
    Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45).
    Type in expressions for evaluation. Or try :help.
    
    scala> import org.apache.spark.sql.SparkSession; import org.apache.spark.SparkContext; import org.apache.spark.SparkContext._; import org.apache.spark.SparkConf; val conf = new SparkConf().setAppName("Simple Application").setMaster("local").set("spark.rpc.netty.dispatcher.numThreads","2"); val sc = new SparkContext(conf); 
    16/09/02 14:53:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@6c42f434
    sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1d63a678
    
    scala> val logFile = "src/main/resources/log4j.properties"
    logFile: String = src/main/resources/log4j.properties
    
    scala> val logData = sc.textFile(logFile, 2).cache()
    logData: org.apache.spark.rdd.RDD[String] = src/main/resources/log4j.properties MapPartitionsRDD[1] at textFile at <console>:19
    
    scala> val numAs = logData.filter(line => line.contains("a")).count()
    numAs: Long = 28
    
    scala> val numBs = logData.filter(line => line.contains("b")).count()
    numBs: Long = 7
    
    scala> val spark = SparkSession.builder().appName("financial_data").master("local").getOrCreate()
    16/09/02 14:55:10 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
    spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4f045932
    
    scala> val opts = Map("url" -> "jdbc:postgresql:somedb", "dbtable" -> "sometableinthedb")
    opts: scala.collection.immutable.Map[String,String] = Map(url -> jdbc:postgresql:somedb, dbtable -> sometableinthedb)
    
    scala> val df = spark.read.format("jdbc").options(opts).load
    df: org.apache.spark.sql.DataFrame = ...
    
    scala> df.show(false)
    ...
    
    scala> sc.stop()
    

and then back in the sbt shell (press Ctrl+D to leave the Scala REPL):

    [spark2-project1]> run
    ...
    [info] Running com.example.Hello
    16/07/31 19:30:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Lines with a: 28, Lines with b: 7
    [success] Total time: 4 s, completed Jul 31, 2016 7:30:13 PM
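
The run task executes com.example.Hello. The actual source ships with the repository; as a rough sketch, an entry point that produces the output above could look like the following (the SparkSession wiring is an assumption, reconstructed from the REPL session earlier):

    package com.example

    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch of the com.example.Hello entry point run above:
    // counts lines containing "a" and "b" in the bundled log4j.properties,
    // mirroring the REPL session in this README.
    object Hello {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Simple Application")
          .master("local")
          .getOrCreate()

        val logFile = "src/main/resources/log4j.properties"
        val logData = spark.sparkContext.textFile(logFile, 2).cache()

        val numAs = logData.filter(_.contains("a")).count()
        val numBs = logData.filter(_.contains("b")).count()

        println(s"Lines with a: $numAs, Lines with b: $numBs")
        spark.stop()
      }
    }

For completeness, a template like this is typically wired together with an sbt build along these lines. This is a sketch rather than the repository's actual build file; the Scala and Spark versions come from the transcript above, and the module list, including the PostgreSQL driver used by the JDBC example, is an assumption:

    // Sketch of a build.sbt for a Spark 2.0.0 / Scala 2.11 starter template.
    name := "spark2-project1"

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.0.0",
      "org.apache.spark" %% "spark-sql"   % "2.0.0",
      "org.apache.spark" %% "spark-mllib" % "2.0.0",
      // JDBC driver for the PostgreSQL read shown in the REPL session
      // (artifact and version are assumptions).
      "org.postgresql"    % "postgresql"  % "9.4.1212"
    )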
    

References:

https://stackoverflow.com/questions/31685408/spark-actor-not-found-for-actorselection

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-logging.html

https://spark.apache.org/docs/latest/quick-start.html

Related Products