Sessionizer is a real-time library built on Spark Streaming that "sessionizes" web traffic. In other words, it assigns session IDs inferred from the traffic itself.
I based this on Cloudera's blog (which did not work out of the box).
Developer:
- Tim Fox, tfox@createksolutions.com
The sample code comes from https://github.com/tmalaska/SparkStreaming.Sessionization and has been modified further for this project's needs:
- I removed HBase, so we are just writing to a file in HDFS.
- I also updated the Maven build and added an SBT build.
- Note: more documentation is in the 'doc' directory
This is an example of how to use Spark Streaming to sessionize web log data by IP address.
Sessionization happens in near real time (NRT), and the results land on HDFS.
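To make the idea of sessionizing by IP address concrete, here is a minimal plain-Python sketch. It is not this project's Spark code: the function name `sessionize`, the `(ip, timestamp)` event shape, the session ID format, and the 30-minute inactivity timeout are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed inactivity gap in seconds; not taken from the project.
SESSION_TIMEOUT_SECS = 30 * 60


def sessionize(events, timeout=SESSION_TIMEOUT_SECS):
    """Assign a session ID to each (ip, timestamp) event.

    A new session starts for an IP when it is seen for the first time or
    when the gap since its previous event exceeds `timeout` seconds.
    """
    last_seen = {}               # ip -> timestamp of its previous event
    counters = defaultdict(int)  # ip -> number of sessions started so far
    out = []
    # Process events in time order so gaps are measured correctly.
    for ip, ts in sorted(events, key=lambda e: e[1]):
        if ip not in last_seen or ts - last_seen[ip] > timeout:
            counters[ip] += 1
        last_seen[ip] = ts
        # Hypothetical ID format: "<ip>-<n>" for the n-th session of that IP.
        out.append((ip, ts, f"{ip}-{counters[ip]}"))
    return out
```

For example, two hits from the same IP one minute apart would share a session ID, while a hit arriving more than 30 minutes after the previous one would start a new session.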
- Number of events
- Number of active sessions
- Average session time
- Number of new sessions
- Number of dead sessions
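The counts listed above could be computed per micro-batch roughly as follows. This is a plain-Python sketch rather than the project's implementation, and the exact definitions are assumptions: a session is "dead" once it has been idle longer than a timeout, "active" sessions are those still in the state after the batch, and the average session time is taken over active sessions. The function name `update_sessions` and the state layout are hypothetical.

```python
def update_sessions(state, batch, now, timeout=1800):
    """Fold one batch of (ip, timestamp) events into `state`.

    `state` maps ip -> [session_start, last_event_time] and is mutated
    in place. `now` is the batch time in seconds. Returns the five counts.
    """
    new_sessions = 0
    dead_sessions = 0
    for ip, ts in sorted(batch, key=lambda e: e[1]):
        session = state.get(ip)
        if session is not None and ts - session[1] <= timeout:
            session[1] = ts          # event extends the current session
            continue
        if session is not None:
            dead_sessions += 1       # old session expired before this event
        state[ip] = [ts, ts]         # start a fresh session for this IP
        new_sessions += 1

    # Sessions idle longer than `timeout` as of `now` are considered dead.
    expired = [ip for ip, (_, last) in state.items() if now - last > timeout]
    for ip in expired:
        del state[ip]
    dead_sessions += len(expired)

    durations = [last - start for start, last in state.values()]
    return {
        "events": len(batch),
        "active_sessions": len(state),
        "avg_session_secs": sum(durations) / len(durations) if durations else 0.0,
        "new_sessions": new_sessions,
        "dead_sessions": dead_sessions,
    }
```

Calling `update_sessions` once per batch with the same `state` dictionary mimics, in a very simplified way, carrying session state from one streaming interval to the next.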
- Build using:
mvn clean package
or
sbt clean package
Spark Streaming is not able to run on local input (though it can produce local output). I do not support running from HDFS files, as this never worked, even in the original Cloudera project.
I am keeping track of counts in an HDFS file (in the current approach). Here is some information on the aggregations we do: