This example runs on MapR 6.1 with Spark 2.3.1 and greater.

Install and start the Sandbox using the instructions here:
http://maprdocs.mapr.com/home/SandboxHadoop/c_sandbox_overview.html
____________________________________________________________________

Step 1: Log into the Sandbox, create the data directory, MapR Event Store topic and MapR Database table:

Use an SSH client such as PuTTY (Windows) or Terminal (Mac) to log in, with userid: mapr and password: mapr.

For VMware use:
$ ssh mapr@ipaddress

For VirtualBox use:
$ ssh mapr@127.0.0.1 -p 2222

After logging into the sandbox, at the Sandbox Unix command line, create a directory for the data for this project:

mkdir /user/mapr/data
or
hadoop fs -mkdir /user/mapr/data

The stream, topic and table are created in the Structured Streaming preparation section further down.
____________________________________________________________________

Step 2: Copy the data file to the MapR sandbox or your MapR cluster

Copy the data file from the project data folder to the sandbox directory /user/mapr/data using scp.

For VMware use:
$ scp *.json mapr@<ipaddress>:/mapr/demo.mapr.com/user/mapr/data/.

For VirtualBox use:
$ scp -P 2222 data/*.json mapr@127.0.0.1:/mapr/demo.mapr.com/user/mapr/data/.

This puts the data file into the cluster directory:
/mapr/<cluster-name>/user/mapr/data
____________________________________________________________________

Step 3: To run the code in the Spark Shell:

/opt/mapr/spark/spark-*/bin/spark-shell --master local[2]

To run on YARN, change the --master parameter to yarn-client: "--master yarn-client"
(Spark 2.x reports yarn-client as deprecated; the equivalent form is "--master yarn --deploy-mode client".)
____________________________________________________________________

Step 4: To submit the code as a Spark application: build the project and copy the JAR file

Build the project with Maven and/or load it into your IDE and build.
You can build this project with Maven using IDEs like IntelliJ, Eclipse, or NetBeans, and then copy the JAR file to your MapR Sandbox, or you can install Maven on your sandbox and build from the Linux command line. For more information on Maven, Eclipse, or NetBeans, search online.

The build creates the following JAR in the target directory:
mapr-spark-flightdelay-1.0.jar

After building the project on your laptop, you can use scp to copy your JAR file from the project target folder to the MapR Sandbox. From your laptop command line or with an scp tool, use userid: mapr and password: mapr.

For VMware use:
$ scp nameoffile.jar mapr@ipaddress:/mapr/demo.mapr.com/user/mapr/.

For VirtualBox use:
$ scp -P 2222 target/*.jar mapr@127.0.0.1:/mapr/demo.mapr.com/user/mapr/.

This puts the JAR file into the directory: /user/mapr
____________________________________________________________________

Step 5: To run the application code for Datasets, DataFrames and Spark SQL

From the Sandbox command line:
/opt/mapr/spark/spark-*/bin/spark-submit --class dataset.Flight --master local[2] mapr-spark-flightdelay-1.0.jar

This reads from the file "/mapr/demo.mapr.com/data/flights20170102.json".
You can optionally pass the file as an input parameter (take a look at the code to see what it does).
____________________________________________________________________

To run the application code for Machine Learning Classification

From the Sandbox command line:
/opt/mapr/spark/spark-*/bin/spark-submit --class machinelearning.Flight --master local[2] mapr-spark-flightdelay-1.0.jar

This reads from the file mfs:///mapr/demo.mapr.com/data/flight.json
You can optionally pass the file as an input parameter (take a look at the code to see what it does).
____________________________________________________________________

Preparation for Structured Streaming with MapR Event Store for Kafka and MapR Database:

Use the MapR command line interface (maprcli) to create a stream and a topic, get info on the topic, and create a table:

maprcli stream create -path /user/mapr/stream -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -path /user/mapr/stream -topic flights

To get info on the flights topic:
maprcli stream topic info -path /user/mapr/stream -topic flights

Create the MapR Database table that will be written to:
maprcli table create -path /user/mapr/flighttable -tabletype json -defaultreadperm p -defaultwriteperm p

Run the Java code to publish events to the topic:
java -cp ./mapr-spark-flightdelay-1.0.jar:`mapr classpath` streams.MsgProducer

This client reads lines from the file "/mapr/demo.mapr.com/data/flight.csv" and publishes them to the topic /user/mapr/stream:flights.
You can optionally pass the file and topic as input parameters <file topic>.

Optional: run the MapR Streams Java consumer to see what was published:
java -cp mapr-spark-flightdelay-1.0.jar:`mapr classpath` streams.MsgConsumer
____________________________________________________________________

Run the Spark Structured Streaming client to consume events, enrich them, and write them to MapR Database (in separate consoles if you want to run it at the same time as the Java publisher).

From the Sandbox command line:
/opt/mapr/spark/spark-*/bin/spark-submit --class stream.StructuredStreamingConsumer --master local[2] mapr-spark-flightdelay-1.0.jar

This Spark streaming client consumes from the topic /user/mapr/stream:flights, enriches the events using the saved model at /mapr/demo.mapr.com/data/flightmodel, and writes to the table /user/mapr/flighttable.
You can optionally pass the input parameters <topic model table>.
Use Ctrl-C to stop.

In another window, while the streaming code is running, run the code to query from MapR Database:
/opt/mapr/spark/spark-*/bin/spark-submit --class sparkmaprdb.QueryFlight --master local[2] \
mapr-spark-flightdelay-1.0.jar

Use the MapR Database shell to query the data. Start the shell and run find to see results:
$ /opt/mapr/bin/mapr dbshell
maprdb mapr:> jsonoptions --pretty true --withtags false
maprdb mapr:> find /user/mapr/flighttable --limit 5
____________________________________________________________________

To run the application code for GraphFrames

To read from MapR Database into GraphFrames, from the Sandbox command line:
/opt/mapr/spark/spark-*/bin/spark-submit --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 --class graphmaprdb.Flight --master local[2] mapr-spark-flightdelay-1.0.jar
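
The spark-submit commands in this README differ only in the --class value, the optional trailing parameters, and the extra --packages option for GraphFrames. As a convenience, they could be wrapped in a small shell function. This is only a sketch: run_flight_app and the JAR, MASTER and DRY_RUN variables are hypothetical names that are not part of the project, and the function defaults to printing the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper, not part of the project: builds the spark-submit
# command used throughout this README for a given main class.
# Plain POSIX sh has no "local", so the helper variables are global.

JAR="${JAR:-mapr-spark-flightdelay-1.0.jar}"   # JAR built in Step 4
MASTER="${MASTER:-local[2]}"                   # same default as the README
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print only, 0 = execute

run_flight_app() {
  class="$1"
  shift

  # Resolve the spark-* glob from the README; keep the literal pattern
  # when Spark is not installed at that location.
  submit=""
  for candidate in /opt/mapr/spark/spark-*/bin/spark-submit; do
    if [ -x "$candidate" ]; then
      submit="$candidate"
      break
    fi
  done
  submit="${submit:-/opt/mapr/spark/spark-*/bin/spark-submit}"

  # Common arguments, followed by any optional application parameters.
  set -- --class "$class" --master "$MASTER" "$JAR" "$@"

  # The GraphFrames example needs the extra package from the README.
  if [ "$class" = "graphmaprdb.Flight" ]; then
    set -- --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 "$@"
  fi

  if [ "$DRY_RUN" = "1" ]; then
    echo "$submit $*"
  else
    "$submit" "$@"
  fi
}
```

For example, after sourcing the function, "run_flight_app dataset.Flight" prints the Step 5 command, and "DRY_RUN=0 run_flight_app machinelearning.Flight" would actually submit the Machine Learning job from the directory that contains the JAR.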