
Spark Streaming Sessionizer

Primary language: Scala. License: Apache License 2.0 (Apache-2.0).

Sessionizer

About

Sessionizer is a real-time library, built on Spark Streaming, that "sessionizes" web traffic. In other words, it assigns session IDs to requests by inferring sessions from the traffic itself.

I based this on a post on Cloudera's blog; the code from that post did not work out of the box.

Blog: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Developer:

Status

The sample code comes from https://github.com/tmalaska/SparkStreaming.Sessionization and has been modified further for this project's needs:

  • I removed HBase, so the results are written only to a file in HDFS.
  • I also updated the Maven build and added an SBT build.
  • Note: more documentation is in the 'doc' directory.

Problem

This is an example of how to use Spark Streaming to sessionize web log data by IP address.
Sessionization happens in near real time (NRT), and the results land on HDFS.

  • Number of events
  • Number of active sessions
  • Average session time
  • Number of new sessions
  • Number of dead sessions
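To make the idea of sessionizing by IP address concrete, here is a minimal sketch in plain Scala, without Spark. It is not the project's implementation. The `Event` and `Session` types, the `SessionizeSketch` object, and the 30-minute inactivity timeout are all assumptions made for illustration; the sketch only shows one common way to infer sessions (a new session starts when the gap between two events from the same IP exceeds the timeout) and how an event count and an average session time can be derived from the result.

```scala
// Hypothetical event type: this README does not describe the log schema.
final case class Event(ip: String, timestampMs: Long)

// One inferred session: consecutive events from one IP with no gap
// longer than the timeout.
final case class Session(ip: String, startMs: Long, endMs: Long, eventCount: Int) {
  def durationMs: Long = endMs - startMs
}

object SessionizeSketch {
  // Assumed inactivity timeout (30 minutes); the real project may differ.
  val TimeoutMs: Long = 30L * 60L * 1000L

  // Group events by IP, order them by time, and start a new session
  // whenever the gap to the previous event exceeds the timeout.
  def sessionize(events: Seq[Event], timeoutMs: Long = TimeoutMs): Seq[Session] =
    events.groupBy(_.ip).toSeq.flatMap { case (ip, evs) =>
      val times = evs.map(_.timestampMs).sorted
      times.foldLeft(List.empty[Session]) {
        // The event falls inside the current session: extend it.
        case (current :: done, t) if t - current.endMs <= timeoutMs =>
          current.copy(endMs = t, eventCount = current.eventCount + 1) :: done
        // First event for this IP, or the gap was too long: open a new session.
        case (sessions, t) =>
          Session(ip, t, t, 1) :: sessions
      }.reverse
    }

  // Average session length in milliseconds (0.0 when there are no sessions).
  def averageSessionMs(sessions: Seq[Session]): Double =
    if (sessions.isEmpty) 0.0
    else sessions.map(_.durationMs).sum.toDouble / sessions.size
}
```

With this sketch, three events from one IP at 0, 10, and 50 minutes produce two sessions, because the 40-minute gap exceeds the assumed timeout. The number of events is the sum of `eventCount` over all sessions. Tracking new and dead sessions would additionally require state carried between batches, which this sketch leaves out.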

Input Data

How to Build

  1. Build using:
mvn clean package

or

sbt clean package

How to Run

Spark Streaming is not able to run on local input (though it can produce local output). I do not support running from HDFS files, as this never worked even in the original Cloudera project.

Aggregations

I am keeping track of counts in an HDFS file (in the current approach). Here is some information on the aggregations we do:

Miscellaneous