This repository supports my talk entitled Getting Started with Spark Streaming.
This demo is easiest to run in IntelliJ IDEA, although you can certainly run it via spark-submit or in another IDE like Eclipse.
- Apache Spark must be installed locally. Instructions for this are available in my talk entitled Getting Started with Apache Spark.
- `ncat` must be installed. This comes pre-installed on macOS and Linux. For Windows, you can find it on the Nmap website.
- Run `ncat` to open up a socket on port 9999:

```bash
ncat -kl 9999
```
- Execute `HelloSparkStreaming.scala` or `HelloSparkStreamingDataFrame.scala`.
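For orientation before opening the project, here is a minimal sketch of a socket-based word count using the classic DStream API. This is illustrative only, not the actual contents of `HelloSparkStreaming.scala`; `HelloSparkStreamingDataFrame.scala` covers the same ground with the Structured Streaming DataFrame API (`spark.readStream.format("socket")`).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read lines from the ncat socket opened on port 9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Type a few lines into the ncat console and the per-batch word counts print every five seconds.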
This demo is easiest to run in IntelliJ IDEA, although you can certainly run it via spark-submit or in another IDE like Eclipse.
- Apache Spark must be installed locally. Instructions for this are available in my talk entitled Getting Started with Apache Spark.
- `ncat` must be installed. This comes pre-installed on macOS and Linux. For Windows, you can find it on the Nmap website.
- Docker must be installed and should be configured to run Linux-based containers rather than Windows-based containers.
- Apache Kafka must be installed. My preferred option is to use Confluent Platform on Docker, as this works well on Windows.
- Cassandra must be installed. My preferred option is to use Cassandra in a Docker container.
- Start up Confluent Platform:

```bash
git clone https://github.com/confluentinc/cp-all-in-one
cd cp-all-in-one
git checkout 5.5.1-post
cd cp-all-in-one/
docker-compose up -d
```
- Start up Cassandra:

```bash
docker run -p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 -p 9160:9160 --name spark-cassandra -d cassandra
```
- Connect to Cassandra. One option is to use the Cassandra workbench in Visual Studio Code. Run the following script to create the keyspace:

```sql
CREATE KEYSPACE public WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
```

After creating the keyspace, run the following script to create the table:

```sql
CREATE TABLE public.car("Name" text PRIMARY KEY, "Cylinders" int, "Horsepower" int);
```
- Create a new Kafka topic named `car`. The Confluent Control Center UI is one option; a code-based alternative is sketched below.
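If you prefer to create the topic from code rather than the UI, here is a hypothetical sketch using Kafka's `AdminClient` (the broker address assumes the cp-all-in-one defaults):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateCarTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // cp-all-in-one exposes the broker on localhost:9092 by default.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)
    try {
      // One partition with replication factor 1 is plenty for a local demo.
      admin.createTopics(Collections.singletonList(new NewTopic("car", 1, 1.toShort))).all().get()
    } finally {
      admin.close()
    }
  }
}
```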
- Run `Job.scala` in the `ksc` project's `Spark` folder. A sketch of what a job of this shape looks like follows below.
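For reference, a Kafka-to-Cassandra job of this shape typically reads the `car` topic, parses the JSON payload against the table's columns, and writes each micro-batch out with `foreachBatch`. The sketch below is illustrative only, not the actual contents of `Job.scala`, and assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object KafkaToCassandraSketch {
  // Write one micro-batch with the batch DataFrame API, since the
  // Cassandra connector does not expose a native streaming sink here.
  def writeBatch(batch: DataFrame, batchId: Long): Unit =
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "public")
      .option("table", "car")
      .mode("append")
      .save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaToCassandraSketch")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Columns match the public.car table created earlier.
    val schema = new StructType()
      .add("Name", StringType)
      .add("Cylinders", IntegerType)
      .add("Horsepower", IntegerType)

    val cars = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "car")
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("car"))
      .select("car.*")

    cars.writeStream
      .option("checkpointLocation", "/tmp/car-checkpoint")
      .foreachBatch(writeBatch _)
      .start()
      .awaitTermination()
  }
}
```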
- Load data from `data\cars.json` into the `car` topic. Here is an example on Windows:

```bash
kafka-console-producer --broker-list localhost:9092 --topic car < C:\SourceCode\Getting-Started-With-Spark-Streaming\data\cars.json
```
- Run `SELECT * FROM public.car` against Cassandra and notice that the data has loaded into the Cassandra table.
These examples are derived from the Microsoft.Spark samples for F#.
- Docker must be installed and should be configured to run Linux-based containers rather than Windows-based containers.
- Apache Kafka must be installed locally if you wish to run the Kafka experiment. My preferred option is to use Confluent Platform on Docker, as this works well on Windows.
- Start up Confluent Platform:

```bash
git clone https://github.com/confluentinc/cp-all-in-one
cd cp-all-in-one
git checkout 5.5.1-post
cd cp-all-in-one/
docker-compose up -d
```
- In the Confluent Control Center (by default, http://localhost:9021), navigate to Cluster settings, choose Listener, and ensure that the advertised.listeners property has the IP address for your host machine. For example, if your host machine is at IP address 192.168.1.10, the advertised listener should be at that IP address rather than 127.0.0.1 or localhost. Otherwise, the .NET Kafka example will not work. An example of the corresponding docker-compose setting appears below.
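If you would rather make this change in the compose file than in the Control Center UI, the setting lives under the broker service's environment section in the cp-all-in-one docker-compose.yml; for example (192.168.1.10 stands in for your host's IP address):

```yaml
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://192.168.1.10:9092
```

Re-run `docker-compose up -d` after changing this value so the broker picks it up.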
- Create a topic in Kafka named `Flights` if you wish to run the Kafka demo.
- Build the Dockerfile in this repository:

```bash
docker build . -t gswss
```
- Choose the demo you want to run. Both `vi` and `nano` are installed with the image, so pick a text editor and modify the `run_spark_dotnet_demo` file to pick a specific demo:

```bash
docker run --name gswss -it gswss bash
cd /root
vi run_spark_dotnet_demo
```
- Run ncat. To do this, open a new console and run the following:

```bash
docker exec -it gswss /bin/bash
nc -kl 9999
```

Alternatively, you can load a large file with the following:

```bash
docker exec -it gswss /bin/bash
nc -kl 9999 < /root/data/WarAndPeace.txt
```
- Execute the `run_spark_dotnet_demo` script to run the chosen demo:

```bash
./run_spark_dotnet_demo
```