Twitter Follow-Back Prediction

Tsinghua University - Big Data Summer Camp 2016

Algorithm Used - Random Forest Classifier

Features Used/Model Performance/Sample Input/Output - SEE BELOW


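How to Run -

  1. Start the Spark shell: spark-shell --executor-memory 10G (or simply spark-shell)
  2. Enter paste mode by typing :paste
  3. Copy and paste the entire code into the shell
  4. Press CTRL+D to leave paste mode and run the code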

Note - Make sure that graph_cb.txt and interaction_list_all.txt have been copied to the root directory of HDFS before running the code


Features Used -

  1. Number of users that are following USER#1 (Popularity metric #1)
  2. Number of users that are following USER#2 (Popularity metric #1)
  3. Number of users that USER#1 is following (Popularity metric #2)
  4. Number of users that USER#2 is following (Popularity metric #2)
  5. Number of times that USER#1 has been mentioned on Twitter (Popularity metric #3)
  6. Number of times that USER#2 has been mentioned on Twitter (Popularity metric #3)
  7. Number of times USER#1 has mentioned USER#2
  8. Number of times USER#2 has mentioned USER#1
  9. Number of days after which a connection between USER#1 and USER#2 occurred, encoded as a score:
     Case 1 - If both users connected on the same day => Score = 200
     Case 2 - If both users connected after X days => Score = abs(X)
     Case 3 - If both users did not connect at all in the past => Score = -1
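The nine features above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the project's actual code: the Follow and Mention case classes, the FeatureSketch object, and the in-memory Seq inputs are hypothetical, since the formats of graph_cb.txt and interaction_list_all.txt are not described here. The daysApart argument stands in for however the gap between the two users' connections is derived.

```scala
// Hypothetical record types; the real input file formats are not documented here.
final case class Follow(follower: Long, followee: Long)
final case class Mention(from: Long, to: Long)

object FeatureSketch {
  // Feature 9: None = never connected, Some(0) = same day, Some(x) = x days apart.
  def connectionScore(daysApart: Option[Int]): Double = daysApart match {
    case None    => -1.0
    case Some(0) => 200.0
    case Some(x) => math.abs(x).toDouble
  }

  // Builds the feature vector in the order of the list above (features 1 to 9).
  def features(user1: Long, user2: Long,
               follows: Seq[Follow], mentions: Seq[Mention],
               daysApart: Option[Int]): Array[Double] = {
    def followers(u: Long): Double = follows.count(_.followee == u).toDouble
    def following(u: Long): Double = follows.count(_.follower == u).toDouble
    def mentioned(u: Long): Double = mentions.count(_.to == u).toDouble
    def mentionsOf(a: Long, b: Long): Double =
      mentions.count(m => m.from == a && m.to == b).toDouble

    Array(
      followers(user1), followers(user2),
      following(user1), following(user2),
      mentioned(user1), mentioned(user2),
      mentionsOf(user1, user2), mentionsOf(user2, user1),
      connectionScore(daysApart)
    )
  }
}
```

On real data the counts would more likely be precomputed once per user (for example with a grouped aggregation) than rescanned for every pair; the linear scans here just keep the sketch short.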

Label -> 1.0 if USER#2 goes on to follow USER#1 back, given that USER#1 already follows USER#2; 0.0 otherwise
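The label rule above could be expressed along these lines. The LabelSketch object and the representation of the follow graph as a Set of (follower, followee) pairs are assumptions made for illustration; the sketch returns None when USER#1 does not follow USER#2, because the rule only applies to pairs where that first follow exists.

```scala
object LabelSketch {
  // follows holds (follower, followee) pairs; this layout is an assumption.
  def label(user1: Long, user2: Long, follows: Set[(Long, Long)]): Option[Double] =
    if (!follows.contains((user1, user2))) None           // USER#1 never followed USER#2
    else if (follows.contains((user2, user1))) Some(1.0)  // USER#2 followed back
    else Some(0.0)                                        // no follow-back observed
}
```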


Sample Input/Output, one (label,[features]) pair per line -

(1.0,[3.0,6.0,5.0,6.0,3.0,1.0,0.0,1.0,200.0])
(1.0,[3.0,2.0,5.0,3.0,3.0,1.0,0.0,0.0,200.0])
(0.0,[3.0,12.0,5.0,9.0,3.0,35.0,0.0,0.0,-1.0])
(0.0,[3.0,10.0,5.0,2.0,3.0,0.0,0.0,0.0,-1.0])
(0.0,[3.0,4.0,5.0,1.0,3.0,4.0,0.0,0.0,-1.0])
(0.0,[0.0,47.0,1.0,11.0,0.0,0.0,0.0,0.0,-1.0])
(0.0,[3.0,6.0,3.0,1.0,0.0,0.0,0.0,0.0,-1.0])
(1.0,[3.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,200.0])
(1.0,[3.0,5.0,3.0,4.0,0.0,2.0,0.0,0.0,200.0])
(1.0,[1.0,2.0,8.0,7.0,1.0,19.0,1.0,1.0,200.0])
.
.
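For readers who want to inspect rows like the ones above outside Spark, one row can be split into its label and feature values with a small plain-Scala parser. This is a hedged sketch: the SampleParseSketch object and its regular expression are illustrative assumptions, not part of the project, and they only cover the (label,[f1,f2,...]) layout shown in the sample.

```scala
object SampleParseSketch {
  // Matches "(label,[f1,f2,...])"; anything else is rejected.
  private val Row = """\(([-\d.]+),\[([^\]]*)\]\)""".r

  // Returns the label and the feature values, or None if the text is malformed.
  def parse(text: String): Option[(Double, Array[Double])] = text.trim match {
    case Row(label, feats) =>
      val values = feats.split(",").map(_.trim).filter(_.nonEmpty).map(_.toDouble)
      Some((label.toDouble, values))
    case _ => None
  }
}
```

For example, parsing the third sample row yields the label 0.0 and nine feature values, the last of which is the connection score -1.0.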


Model Performance -

testErr: Double = 1.0

scala> precision.foreach { case (t, p) =>
     |   println(s"Threshold: $t, Precision: $p")
     | }
Threshold: 1.0, Precision: 1.0
Threshold: 0.0, Precision: 0.6063996042409183

scala> model
res21: org.apache.spark.mllib.tree.model.RandomForestModel = TreeEnsembleModel classifier with 9 trees

scala> val recall = metrics.recallByThreshold
recall: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[376] at map at BinaryClassificationMetrics.scala:216

scala> recall.foreach { case (t, r) =>
     |   println(s"Threshold: $t, Recall: $r")
     | }
Threshold: 1.0, Recall: 1.0
Threshold: 0.0, Recall: 1.0

scala> val PRC =
PRC: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[379] at union at BinaryClassificationMetrics.scala:111

scala> val f1Score = metrics.fMeasureByThreshold
f1Score: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[380] at map at BinaryClassificationMetrics.scala:216

scala> f1Score.foreach { case (t, f) =>
     |   println(s"Threshold: $t, F-score: $f, Beta = 1")
     | }
Threshold: 1.0, F-score: 1.0, Beta = 1
Threshold: 0.0, F-score: 0.7549797729531488, Beta = 1

scala> val auPRC = metrics.areaUnderPR
auPRC: Double = 1.0

scala> println("Area under precision-recall curve = " + auPRC)
Area under precision-recall curve = 1.0

scala> val thresholds =
thresholds: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[386] at map at :77

scala> val roc = metrics.roc
roc: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[390] at UnionRDD at BinaryClassificationMetrics.scala:92

scala> val auROC = metrics.areaUnderROC
auROC: Double = 1.0

scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
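To make the per-threshold numbers in the transcript easier to read, the following plain-Scala sketch shows how precision, recall, and the F-score (Beta = 1) relate at a single threshold. It is an illustration only: MetricsSketch, atThreshold, and f1 are hypothetical names, the (score, label) pairs in the usage note are made up, and the convention of returning precision 1.0 when nothing is predicted positive is an assumption of this sketch. Only two thresholds (1.0 and 0.0) appear in the transcript, which would be consistent with the model emitting hard 0.0/1.0 predictions; at threshold 0.0 every example counts as positive, so precision there equals the share of positive examples in the evaluated data.

```scala
object MetricsSketch {
  // F-score with Beta = 1: the harmonic mean of precision and recall.
  def f1(precision: Double, recall: Double): Double =
    if (precision + recall == 0.0) 0.0
    else 2 * precision * recall / (precision + recall)

  // Returns (precision, recall, F1) when every score >= threshold counts as positive.
  def atThreshold(scoreAndLabels: Seq[(Double, Double)],
                  threshold: Double): (Double, Double, Double) = {
    val predictedPos = scoreAndLabels.filter { case (score, _) => score >= threshold }
    val truePos = predictedPos.count { case (_, label) => label == 1.0 }.toDouble
    val actualPos = scoreAndLabels.count { case (_, label) => label == 1.0 }.toDouble
    val precision = if (predictedPos.isEmpty) 1.0 else truePos / predictedPos.size
    val recall = if (actualPos == 0.0) 0.0 else truePos / actualPos
    (precision, recall, f1(precision, recall))
  }
}
```

With three correctly scored positives and two negatives, atThreshold at 0.0 gives precision 0.6 and recall 1.0. Passing the reported threshold-0.0 values (precision 0.6063996042409183, recall 1.0) to f1 gives roughly 0.75498, in line with the F-score printed in the transcript.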


