This repository contains Spark and Scala code samples.
The code is by no means ready to be deployed: it contains lots of extra comments and printouts. The aim is to provide clear feedback for future reference when learning Apache Spark.
Useful references:
- Movies datasets @ GroupLens
- Apache Spark - quick start guide
- How to use SBT behind a proxy @ Stackoverflow (see the sketch below)
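If you build these samples with SBT from behind a corporate proxy, one common approach is to pass the standard JVM proxy properties to the sbt launcher via the `SBT_OPTS` environment variable. A minimal sketch; the host `proxy.example.com` and port `8080` are placeholders, substitute your own:

```sh
# Hypothetical proxy host/port; replace with your actual values.
export SBT_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
sbt compile
```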
Installing Spark on Windows:
- Install a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.html. Keep track of where you installed the JDK; you'll need that later.
- Download a pre-built version of Apache Spark from https://spark.apache.org/downloads.html
- If necessary, download and install WinRAR from http://www.rarlab.com/download.htm so you can extract the `.tgz` file you downloaded.
- Create a `C:\spark` directory, then extract the Spark archive and copy its contents into it. You should end up with directories like `C:\spark\bin`, `C:\spark\conf`, etc.
- Download `winutils.exe` from https://sundog-spark.s3.amazonaws.com/winutils.exe and move it into a `C:\winutils\bin` folder that you've created. (Note: this is a 64-bit application. If you are on a 32-bit version of Windows, you'll need to search for a 32-bit build of `winutils.exe` for Hadoop.)
- Open the `C:\spark\conf` folder, and make sure "File Name Extensions" is checked in the "View" tab of Windows Explorer. Rename the `log4j.properties.template` file to `log4j.properties`. Edit this file (using WordPad or something similar) and change the log level from `INFO` to `ERROR` for `log4j.rootCategory`. The relevant line is shown below.
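After the edit, the `log4j.rootCategory` line near the top of the file should read as follows (the stock Spark template ships it as `INFO, console`):

```
log4j.rootCategory=ERROR, console
```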
- Right-click your Windows menu, select Control Panel, System and Security, and then System. Click on "Advanced System Settings" and then the "Environment Variables" button.
- Add the following new USER variables:
  a. `SPARK_HOME` = `c:\spark`
  b. `JAVA_HOME` = the path you installed the JDK to in step 1, for example `C:\Program Files\Java\jdk1.8.0_101`
  c. `HADOOP_HOME` = `c:\winutils`
- Add the following paths to your PATH user variable:
  `%SPARK_HOME%\bin`
  `%JAVA_HOME%\bin`
- Close the environment variable screen and the control panels.
- Install the latest Scala IDE from http://scala-ide.org/download/sdk.html
- Test it out!
  a. Open up a Windows command prompt in administrator mode.
  b. Enter `cd c:\spark` and then `dir` to get a directory listing.
  c. Look for a text file we can play with, like `README.md` or `CHANGES.txt`.
  d. Enter `spark-shell`.
  e. At this point you should have a `scala>` prompt. If not, double-check the steps above.
  f. Enter `val rdd = sc.textFile("README.md")` (or whatever text file you've found).
  g. Enter `rdd.count()`.
  h. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
  i. Hit `Ctrl-D` to exit the Spark shell, and close the console window.
  j. You've got everything set up! Hooray! (A slightly longer smoke test is sketched below.)
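If the shell works, you can stretch the smoke test a little by chaining a transformation before the action. A minimal sketch to paste into `spark-shell`; the filter predicate is just an illustrative choice:

```scala
// `sc` (the SparkContext) is predefined inside spark-shell.
val rdd = sc.textFile("README.md")                 // one RDD element per line of the file
println(rdd.count())                               // total number of lines
println(rdd.filter(_.contains("Spark")).count())   // lines that mention "Spark"
```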
Installing Spark on macOS:
- Install Apache Spark using Homebrew:
  a. Install Homebrew if you don't have it already by entering this from a terminal prompt: `/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"`
  b. Enter `brew install apache-spark`
  c. Create a `log4j.properties` file via:
     `cd /usr/local/Cellar/apache-spark/2.0.0/libexec/conf`
     `cp log4j.properties.template log4j.properties`
     (substitute the version actually installed for 2.0.0)
  d. Edit the `log4j.properties` file and change the log level from `INFO` to `ERROR` on `log4j.rootCategory`.
- Install the Scala IDE from http://scala-ide.org/download/sdk.html
- Test it out!
  a. `cd` to the directory apache-spark was installed to (such as `/usr/local/Cellar/apache-spark/2.0.0/libexec/`) and then `ls` to get a directory listing.
  b. Look for a text file we can play with, like `README.md` or `CHANGES.txt`.
  c. Enter `spark-shell`.
  d. At this point you should have a `scala>` prompt. If not, double-check the steps above.
  e. Enter `val rdd = sc.textFile("README.md")` (or whatever text file you've found).
  f. Enter `rdd.count()`.
  g. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
  h. Hit `Ctrl-D` to exit the Spark shell, and close the console window.
  i. You've got everything set up! Hooray!
Installing Spark on Linux (or another OS):
- Install Java, Scala, and Spark according to the particulars of your specific OS. A good starting point is http://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm (but be sure to install Spark 2.0 or newer).
- Install the Scala IDE from http://scala-ide.org/download/sdk.html
- Test it out!
  a. `cd` to the directory apache-spark was installed to and then `ls` to get a directory listing.
  b. Look for a text file we can play with, like `README.md` or `CHANGES.txt`.
  c. Enter `spark-shell`.
  d. At this point you should have a `scala>` prompt. If not, double-check the steps above.
  e. Enter `val rdd = sc.textFile("README.md")` (or whatever text file you've found).
  f. Enter `rdd.count()`.
  g. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
  h. Hit `Ctrl-D` to exit the Spark shell, and close the console window.
  i. You've got everything set up! Hooray!
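Once the shell works, the next step is running code like the samples in this repository as a standalone application instead of typing into `spark-shell`. A minimal sketch of what such a program looks like; the object name, app name, and input path are hypothetical, not taken from this repo (Spark 2.x RDD API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone equivalent of the spark-shell smoke test above.
object LineCounter {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; no cluster needed for learning.
    val conf = new SparkConf().setAppName("LineCounter").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("README.md")   // illustrative input path
    println(s"Line count: ${lines.count()}")

    sc.stop()                              // release Spark resources cleanly
  }
}
```

Build it with SBT (or run it from the Scala IDE) and it behaves just like the `spark-shell` session above, which is the pattern the samples here build on.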