/lambda-architecture-jeeconf-kyiv

Simple Lambda Architecture implementation based on Apache Spark (Core, SQL, Streaming)

Sample Application for "Lambda Architecture with Apache Spark" Presentation

Presentation

Link to the presentation: http://www.slideshare.net/tmatyashovsky/lambda-architecture-with-apache-spark

Idea

Provide statistics on the hashtags used in tweets filtered by a particular hashtag, e.g. #morningatlohika, #jeeconf, etc. Statistics should cover all time up to today plus right now.

Batch View

Batch view creation is simplified, i.e. it is assumed that there is a batch layer that has processed all historical tweets and produced some output. 'BatchTest' can be used to generate a batch view in .parquet format with the data listed below (a sketch of reading such a view follows the list):

  • apache – 6
  • architecture – 12
  • aws – 3
  • java – 4
  • jeeconf – 7
  • lambda – 6
  • morningatlohika – 15
  • simpleworkflow – 14
  • spark – 5
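For illustration, a minimal sketch of loading and caching such a batch view is shown below. It uses the SparkSession API; the class name, path and (hashtag, count) schema are assumptions for this sketch, not necessarily what 'BatchTest' produces.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchViewSketch {

    public static void main(String[] args) {
        // Local Spark session for development; on a cluster the master URL would differ.
        SparkSession spark = SparkSession.builder()
                .appName("batch-view-sketch")
                .master("local[2]")
                .getOrCreate();

        // Read the pre-computed batch view and keep it in memory for fast query-time merges.
        // The path and the (hashtag, count) schema are illustrative assumptions.
        Dataset<Row> batchView = spark.read()
                .parquet("/path/to/batch-view.parquet")
                .cache();

        batchView.show(); // e.g. morningatlohika – 15, jeeconf – 7, ...

        spark.stop();
    }
}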

Real-time View

Assuming the application has received a new tweet like "Cool presentation by @tmatyashovsky about #lambda #architecture using #apache #spark at #jeeconf", the real-time view will be as follows (a counting sketch follows the list):

  • apache – 1
  • architecture – 1
  • jeeconf – 1
  • lambda – 1
  • spark – 1
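A standalone illustration of what the per-tweet hashtag counting boils down to is shown below. This is not the repo's code; the class name and regex are purely illustrative.

import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagCount {

    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Extract hashtags from a tweet's text and count occurrences per hashtag.
    public static Map<String, Long> countHashtags(String tweet) {
        Map<String, Long> counts = new TreeMap<>();
        Matcher matcher = HASHTAG.matcher(tweet);
        while (matcher.find()) {
            counts.merge(matcher.group(1).toLowerCase(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countHashtags(
                "Cool presentation by @tmatyashovsky about #lambda #architecture "
                        + "using #apache #spark at #jeeconf"));
        // {apache=1, architecture=1, jeeconf=1, lambda=1, spark=1}
    }
}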

Query, i.e. Batch View + Real-time View

Upon receiving a request, the query merges the batch view and the real-time view to produce the following result (a merge sketch follows the list):

  • apache – 7
  • architecture – 7
  • aws – 3
  • java – 4
  • jeeconf – 8
  • lambda – 7
  • morningatlohika – 15
  • simpleworkflow – 14
  • spark – 6
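A hedged sketch of the query-time merge is below: hashtag counts from the batch view and the real-time view are summed per key. The class and method names are illustrative, not the repo's API.

import java.util.Map;
import java.util.TreeMap;

public class ViewMerger {

    // Sum counts per hashtag across the batch view and the real-time view.
    public static Map<String, Long> merge(Map<String, Long> batchView,
                                           Map<String, Long> realTimeView) {
        Map<String, Long> result = new TreeMap<>(batchView);
        realTimeView.forEach((hashtag, count) -> result.merge(hashtag, count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        // Small excerpts of the views above, just to show the merge.
        Map<String, Long> batch = Map.of("jeeconf", 7L, "spark", 5L, "aws", 3L);
        Map<String, Long> realTime = Map.of("jeeconf", 1L, "spark", 1L);
        System.out.println(merge(batch, realTime)); // {aws=3, jeeconf=8, spark=6}
    }
}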

Simplified Steps

  • Create batch view (.parquet) via Apache Spark
  • Cache batch view in Apache Spark
  • Start streaming application connected to Twitter
  • Focus on real-time #jeeconf tweets*
  • Build incremental real-time views
  • Query, i.e. merge batch and real-time views on the fly

* A stream from the file system (used for testing) can be used as a backup
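The steps above roughly map onto a Spark Streaming job. A hedged sketch of the streaming part is below; it assumes the Spark 2.x Java API and the spark-streaming-twitter module, and it is not the repo's actual driver. The 15-second batch duration mirrors spark.streaming.batch.duration.seconds.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;

import java.util.Arrays;

public class StreamingSketch {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("hashtag-stream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));

        TwitterUtils.createStream(jssc)                              // credentials from twitter4j.properties
                .map(Status::getText)
                .filter(text -> text.contains("#jeeconf"))           // twitter.filter.text
                .flatMap(text -> Arrays.stream(text.split("\\s+"))
                        .filter(word -> word.startsWith("#"))
                        .map(word -> word.substring(1).toLowerCase())
                        .iterator())
                .countByValue()                                      // hashtag counts per mini-batch
                .print();

        jssc.start();
        jssc.awaitTermination();
    }
}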

Build, Configure and Run

Build

Standard build:

./gradlew clean build shadowJar

Quick build without tests:

./gradlew clean build shadowJar -x test

Configuration

All available configuration properties are spread across 3 files:

  • application.properties - contains business-logic-specific properties
  • spark.properties - contains Spark-specific properties
  • twitter4j.properties - contains the credentials needed to authorize calls to Twitter

All properties are self-explanatory, but a few of the most important ones are listed explicitly below.

Application Properties

  • server.port (Integer, default: 9090) – the port to listen on for incoming HTTP requests
  • batch.view.file.path (String) – path to the batch view
  • batch.view.force.precache (Boolean, default: true) – whether to pre-cache the batch view or not
  • twitter.filter.text (String, default: #jeeconf) – to filter only specific tweets
  • twitter.filter.date (Date, default: 20160519) – to filter only new tweets
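For example, an application.properties built from the defaults above could look like this (the batch view path is simply the test resource used elsewhere in this README; adjust values to your needs):

server.port=9090
batch.view.file.path=<path_to_your_repo>/spark-driver/src/test/resources/batch-views/batch-view.parquet
batch.view.force.precache=true
twitter.filter.text=#jeeconf
twitter.filter.date=20160519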

Spark Properties

  • spark.master (String, default: spark://127.0.0.1:7077) – the URL of the Spark master. For development purposes, you can use local[n], which will run Spark on n threads on the local machine without connecting to a cluster, for example local[2].
  • spark.distributed-libraries (String) – path to the distributed library that should be loaded into each worker of a Spark cluster.
  • spark.streaming.context.enable (Boolean, default: true) – whether to start the Spark streaming context as it should be on production, or just create it and let tests configure it for their own purposes.
  • spark.streaming.batch.duration.seconds (Long, default: 15) – interval in seconds at which new data is triggered into the stream, i.e. the interval between mini-batches.
  • spark.streaming.remember.duration.seconds (Long, default: 6000) – you can also run SQL queries on tables defined on streaming data from a different thread (that is, asynchronous to the running StreamingContext). Just make sure that you set the StreamingContext to remember a sufficient amount of streaming data such that the query can run. Otherwise the StreamingContext, which is unaware of any asynchronous SQL queries, will delete old streaming data before the query can complete.
  • spark.streaming.checkpoint.directory (String, default: /tmp/checkpoint) – directory used for checkpointing.
  • spark.streaming.file.stream.enable (Boolean, default: false) – whether to use a file stream instead of the real Twitter stream. Its main purpose is to serve as a backup plan when there is no Internet connection.
  • spark.streaming.directory (String) – directory used as a backup plan for streaming when Twitter is not available.
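Putting the defaults together, an illustrative spark.properties for local development could look like this (local[2] instead of a real master URL, and the library path taken from the sample configuration below):

spark.master=local[2]
spark.distributed-libraries=<path_to_your_repo>/spark-distributed-library/build/libs/spark-distributed-library-1.0-SNAPSHOT-all.jar
spark.streaming.context.enable=true
spark.streaming.batch.duration.seconds=15
spark.streaming.remember.duration.seconds=6000
spark.streaming.checkpoint.directory=/tmp/checkpoint
spark.streaming.file.stream.enable=false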

Twitter Properties

For more information on how to generate an OAuth access token, please refer to this page: https://dev.twitter.com/oauth/overview/application-owner-access-tokens

Sample configuration for a local development environment

Create application.properties (for instance, in your user home directory) and override any of the described properties. For instance, the minimum set of values that should be specified for your local environment is listed below:

spark.distributed-libraries=<path_to_your_repo>/spark-distributed-library/build/libs/spark-distributed-library-1.0-SNAPSHOT-all.jar

batch.view.file.path=<path_to_your_repo>/spark-driver/src/test/resources/batch-views/batch-view.parquet

Create twitter4j.properties (for instance, in your user home directory) and use the following properties for your local environment:

oauth.consumerKey=<replace with your own>
oauth.consumerSecret=<replace with your own>
oauth.accessToken=<replace with your own>
oauth.accessTokenSecret=<replace with your own>

Make sure that you have entered the correct values that correspond to your Twitter account.

Testing

The streaming pipeline is covered with tests in order to demonstrate how easily streaming applications can be tested via Spark. A stream from the file system is used to mock the stream from Twitter. Also, tweets.txt is generated with some dummy tweets (and particular hashtags) to verify correctness.
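As a rough illustration of this mocking approach (not the actual test code), the Twitter source can be replaced with a file-based stream pointed at a directory into which a generated tweets.txt is dropped; the directory and class name below are assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamSketch {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("file-stream-sketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Only files created in this directory after the context starts are picked up,
        // so a test would write tweets.txt here once the stream is running.
        JavaDStream<String> tweets = jssc.textFileStream("/tmp/tweets-input");

        tweets.print(); // the same hashtag-counting logic can then be applied to these lines

        jssc.start();
        jssc.awaitTermination();
    }
}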

Run

From your favourite IDE, please run the ApplicationConfiguration main method. This will use the default configuration bundled in the source code.

In order to run the application with a custom configuration, please add the spring.config.location parameter pointing to the directory that contains your custom application.properties and twitter4j.properties (in our example, your user home directory), or just enumerate the files explicitly, for instance:

spring.config.location=/Users/<your user>/application.properties,/Users/<your user>/twitter4j.properties
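For instance, assuming the shadow jar produced by the build above (the exact artifact path may differ in your checkout), the application could be started from the command line like this:

java -jar spark-driver/build/libs/spark-driver-1.0-SNAPSHOT-all.jar --spring.config.location=/Users/<your user>/application.properties,/Users/<your user>/twitter4j.properties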