/marseille

A real time streaming implementation of markov chain based fraud detection

Primary LanguageScalaOtherNOASSERTION

An example implementation of a Markov chain based credit card fraud detection

More information on our blog: http://

In order to run our example, we need to install the followings:

Building the examples:

$ sbt assembly - this creates [installation_dir]/marseille/target/scala-2.10/spark-kafka-fraud-detection-assembly-1.0.jar

Let's start all the components running.

Start spark in standalone mode

$ [spark_installation_dir]/sbin/start-all.sh

you should be able to bring up the spark monitoring page at http://localhost:8080 at this point. The page should indicate that there is one work node (default setting).

Start hbase

$ [hbase_installation_dir]/bin/start-hbase.sh

This would also start the attached zookeeper instance.

Start kafka. this will reuse the zookeeper instance started with hbase. Do not start a second instance.

$ [kafka_installation_dir]/bin/kafka-server-start.sh config/server.properties

Start hdfs

$ [hadoop_installation_dir]/sbin/start-all.sh

Use jps to check both spark and hbase are running

$ jps

You should see lines like 10461 Kafka 10322 HRegionServer 14175 Jps 13669 Main 97904 10222 HMaster 13893 Master 13546 SecondaryNameNode 13452 DataNode 13377 NameNode

Now start all the monitoring tools to keep an eye on each stage of execution

Create 3 kafka topics $ [kafka_installation_dir]/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic real-time-cc $ [kafka_installation_dir]/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic user-trans-history $ [kafka_installation_dir]/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic candidate-fraud-trans

Start 3 listeners for these 3 topics using kafka's console consumer $ [kafka_installation_dir]/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic real-time-cc $ [kafka_installation_dir]/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic user-trans-history $ [kafka_installation_dir]/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic candidate-fraud-trans

The first task to run is creating a training data set.

Run the TrainerApp to create the training set. The number of customers is set to 500 by default in the app_config.prop file $ /opt/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --class com.jcalc.app.TrainerApp --master spark://yourhostname:7077 spark-kafka-fraud-detection-assembly-1.0.jar

this creates three hdfs files : /data/streaming_analysis/markovmodel.txt - this is the model file. it contains the transition probability for all state pairs /data/streaming_analysis/training_sequence - this is the raw transaction sequence /data/streaming_analysis/training_transaction - this is transaction sequence, which groups all transaction sequence per customer

Use spark-submit to start MarkovPredictor which listens on the real-time-cc topic, and publish to topics user-trans-history and candidate-fraud-trans $ /opt/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --class com.jcalc.feed.MarkovPredictor --master spark://yourhostname:7077 spark-kafka-fraud-detection-assembly-1.0.jar

Finally, feed simulated credit card transaction data into the real-time-cc topic to feed MarkovPredictor. 50 is a typical number of customers to use. $java -cp spark-kafka-fraud-detection-assembly-1.0.jar com.jcalc.feed.CCStreamingFeed 50

When you feed this simulated data, you should see data being printed out on all three topics. In addition, you can check that the transaction history is recorded in hbase 'cchistory' table. Use hbase shell to see the records. $ [hbase_install_dir]/bin/hbase shell hbase(main)> scan 'cchistory'

Hbase record dump looks like the following

 UHRMHKBO1D                               column=transactions:records, timestamp=1418713010152, value=LNS,NNL,NNS,LHL,NNN,LHL,NNL,LNN,NNN,NNS,LNN                 
 UM2RPYPHRX                               column=transactions:records, timestamp=1418713010327, value=NNL,LHN,NNN,NNL,NNN,NNL,NHL,NNN,NNS,NNL,NNN,LHN,LNL         
 W19EW2550B                               column=transactions:records, timestamp=1418713010198, value=NNN,HHN,NNN,NHL,NNS,NNL,NHN,NNL,HNL,LNS,LNL                 
 X30RNTW3KG                               column=transactions:records, timestamp=1418713010215, value=NNN,LNN,NNL,LHS,LNN,NNL,NNN,LHN,LNN                         
 XM7YLRIDL8                               column=transactions:records, timestamp=1418713010173, value=LNN,LNL,NNL,NNL,LNL,NNL,HNN,NNN,NNN,NNN,LNL                 
 Y3LNSKDK02                               column=transactions:records, timestamp=1418713010207, value=NNS,NNN,NNN,NNN,HNN,LNN,HNL,NNL,NNS,NNL,NNL,LNN,LNN,LHL,NNN 
50 row(s) in 0.0520 seconds

Repeat the simulated data feed as often as you like.