http://104.197.20.219/yiqin/facebook_hackathon_hacker.html
We get the data from Facebook using the Facebook Graph API. The dataset is from Hackathon Hackers (HH), which has become the biggest Facebook group for hackathon attendees. It currently has over 18k members. It's a place to discuss hackathons, tech news, college, high school, and dank memes. It had 53 different public subgroups. One of the HBase tables I built lists the top 20 active users in each of the 25 most popular subgroups.
Facebook restricts developers' access to public graph data for privacy reasons. The dataset, as a csv file, is 245.5 MB. It covers 7/1/2014 to 8/20/2015, which is more than one year. This is the largest dataset I could get from the Facebook Graph API. It has more than 1,300,000 items and includes information about members' likes, comments, and posts.
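The top-20-users-per-subgroup table described above could be sketched roughly as follows. This is only an illustration in Python, not the Pig script itself. It assumes that "popular" means the subgroup with the most items and that "active" means the user with the most items in that subgroup; the function name and the (subgroup, user) record shape are hypothetical.

```python
from collections import Counter, defaultdict

def top_active_users(records, top_groups=25, top_users=20):
    """Return {subgroup: [(user, item_count), ...]} for the largest subgroups.

    records: iterable of (subgroup, user) pairs, one pair per item
    (post, comment, or like). This record shape is an assumption.
    """
    records = list(records)

    # Rank subgroups by how many items they contain.
    group_sizes = Counter(group for group, _ in records)
    popular = [group for group, _ in group_sizes.most_common(top_groups)]

    # Count items per user inside each subgroup.
    per_group = defaultdict(Counter)
    for group, user in records:
        per_group[group][user] += 1

    return {group: per_group[group].most_common(top_users) for group in popular}

# Hypothetical sample data, not taken from the real dataset.
sample = [
    ("HH Design", "ann"),
    ("HH Design", "ann"),
    ("HH Design", "bob"),
    ("HH iOS", "cat"),
]
result = top_active_users(sample, top_groups=1, top_users=1)
```

With the defaults (25 subgroups, 20 users) the result has the same shape as the table described above.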
I uploaded all files to svn. These files are also uploaded to different directories in the cluster.
- create_Facebook_Graph_Data_Table.pig: creates the HBase table for the most popular subgroups.
- top_user.pig: creates the HBase table for the top active users in each subgroup. I use a macro to reuse code.
- general_information.pig: computes other statistics. These are not stored in HBase tables because of the time limit. It includes the type counts, the like ranking, and the average likes and comments.
- uber-yiqin-0.0.1-SNAPSHOT.jar: Java application. It includes Thrift and SerializeFacebookGraphNode.
- input: includes the .csv files, which are the raw data from the Facebook Graph API.
  Found 1 items
  -rw-r--r-- 2 mpcs53013 hdfs 212250910 2015-12-08 04:46 /mnt/scratch/yiqin/Facebook_Graph_data
- background.jpeg, elegant-aero.css, table.css: these three files are for the UI elements in the HTML page.
- facebook_hackathon_hacker.html: HTML file served at the URL above. It includes three parts: most popular subgroups, top active users, and submitting new Facebook graph node data.
- facebook_Graph_Data_Table.pl: gets the HBase table of the most popular subgroups.
- top_user.pl: gets the HBase table of the top active users.
- submit_new_data.pl: sends new data to Kafka.
I have created a topic, yiqin-facebook, on Kafka.
###############################################
###############################################
The batch layer is ready. The csv dataset is uploaded to /mnt/scratch/yiqin/input in the cluster. facebookGraphNode.thrift is created to process the data. uber-yiqin-0.0.1-SNAPSHOT.jar is also in the cluster. create_Facebook_Graph_Data_Table.pig creates the table and displays it, so you can see that the data are stored.
Facebook graph data are so difficult to deal with, but contain interesting information. I'm going to finish the project in the next 4 days.
A subgroup must satisfy one of these prerequisites:
- 1 week old, 5 posts, and 250 members
- 3 weeks old, 10 posts, and 100 members
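The two prerequisites above can be written as a small predicate. This is a minimal sketch that assumes each number is a minimum ("at least 1 week old", "at least 5 posts", and so on); the Subgroup class and its field names are hypothetical and are not part of the project code.

```python
from dataclasses import dataclass

@dataclass
class Subgroup:
    # Hypothetical record; field names are illustrative only.
    name: str
    age_weeks: float
    posts: int
    members: int

def is_eligible(group: Subgroup) -> bool:
    """True if the subgroup meets at least one of the two prerequisites."""
    # Rule 1: 1 week old, 5 posts, and 250 members (read as minimums).
    rule_one = group.age_weeks >= 1 and group.posts >= 5 and group.members >= 250
    # Rule 2: 3 weeks old, 10 posts, and 100 members (read as minimums).
    rule_two = group.age_weeks >= 3 and group.posts >= 10 and group.members >= 100
    return rule_one or rule_two
```

For example, a two-week-old subgroup with 20 posts and 200 members fails both rules: it has too few members for the first and is too young for the second.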
(Hackathon Hackers,522870)
(HH: What Are You Working On?,40649)
(HH Design,39851)
(HH Hacker Problems,14539)
(Hackathon Hackers EU,14432)
(HH Data Hackers,13657)
(HH Webdev,9903)
(HH iOS,5349)
(HH Throw a Hackathon,5047)
(HH: Snackathon Snackers,4122)
(HH: VR,2902)
(HH Free Stuff,2772)
(HH Growthhacking,2409)
(HH CTF,1481)
(HH Canada Eh?,1290)
(HH Skillshare,1180)
(HH Blog Posts,1066)
(HH Connect,1042)
(HH FIRST + VEX,976)
(HH: Book Club,719)
(HH EdTech,695)
(HH South,542)
(HH Python,479)
(HH Texas,419)
(HH Social Good,392)
(HH Systems Programming,374)
(HH Africa,358)
(HH Hardware Hackers,301)
(HH Futurism,298)
(HH Internet of Things,197)
(HH: Code Reviews,191)
(Hackathon Hackers Asia,186)
(HH Constructive Debates,128)
(HH Product Launch,86)
(HH: Share Your Projects,67)
(HH λ,62)
(Hackathon Hackers South East Asia (SEA),58)
(like,516458)
(comment,155369)
(status,11430)
(link,6118)
(photo,859)
(video,688)
(event,164)
(note,2)
(offer,1)
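The two listings above have the shape of a group-and-count result, sorted from largest to smallest. A minimal Python sketch of that operation is shown here; it is not the Pig script, and the "subgroup" and "type" field names and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def group_and_count(rows, field):
    """Count rows per distinct value of `field`, largest group first."""
    counts = Counter(row[field] for row in rows)
    return counts.most_common()

# Hypothetical sample rows; the real csv columns may be named differently.
rows = [
    {"subgroup": "Hackathon Hackers", "type": "like"},
    {"subgroup": "Hackathon Hackers", "type": "comment"},
    {"subgroup": "HH Design", "type": "like"},
]
by_type = group_and_count(rows, "type")
by_subgroup = group_and_count(rows, "subgroup")
```

Each returned pair corresponds to one tuple in the listings, for example (like,516458) or (HH Design,39851).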
(Hackathon Hackers,1000,status,#hackerinchief)
(Hackathon Hackers,1000,status,Mac is now supporting Windows!)
(Hackathon Hackers,903,photo,git commit -m "Fixed interface issues." source: twitter)
(Hackathon Hackers,757,photo,Zuck actually checks his facebook.)
(Hackathon Hackers,645,status,)
(Hackathon Hackers,626,photo,Yo! This guy's license plate says "NODE JS" #paloalto)
(Hackathon Hackers,606,status,ohhhhhhhhhh babyyyyyy ;))
(Hackathon Hackers,586,link,Thinking of dropping out? I wrote a bit on what you'll go through.)
(Hackathon Hackers,540,status,Who's down to bring a hackathon to Ahmed's community? We must encourage that kid to keep building and educate those around him. #hellyeah?)
(Hackathon Hackers,517,status,Seems legit!)
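The like ranking above looks like a top-10 selection ordered by like count. A minimal sketch of such a selection follows, assuming each post is a (subgroup, likes, type, message) tuple as in the listing; the function name and the sample posts are hypothetical.

```python
import heapq

def top_liked(posts, k=10):
    """Return the k posts with the most likes, highest first.

    Each post is assumed to be a (subgroup, likes, type, message) tuple.
    """
    return heapq.nlargest(k, posts, key=lambda post: post[1])

# Hypothetical sample posts, not taken from the real dataset.
posts = [
    ("Hackathon Hackers", 12, "status", "first"),
    ("Hackathon Hackers", 90, "photo", "second"),
    ("HH Design", 45, "link", "third"),
]
ranking = top_liked(posts, k=2)
```

heapq.nlargest avoids sorting the whole dataset when only a small number of rows is needed.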
We only consider the status, link, photo, video, and comment types.
(link,11.275253350768224,5.315299117358614)
(photo,28.21885913853318,8.679860302677533)
(video,9.795058139534884,4.125)
(status,10.356167979002624,9.236482939632547)
(comment,1.9140304693986574,0.0)
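The averages above could be computed along these lines. This sketch assumes each tuple reads (type, average likes, average comments), which fits the comment row having 0.0 in the last position, and that each input row carries a type, a like count, and a comment count. The names are hypothetical and this is not the Pig script.

```python
from collections import defaultdict

# Only these item types are considered, as stated above.
CONSIDERED = {"status", "link", "photo", "video", "comment"}

def average_likes_and_comments(rows):
    """Return {type: (average likes, average comments)}.

    rows: iterable of (item_type, likes, comments) tuples (assumed shape).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # likes sum, comments sum, row count
    for item_type, likes, comments in rows:
        if item_type not in CONSIDERED:
            continue  # skip types such as event, note, offer
        entry = totals[item_type]
        entry[0] += likes
        entry[1] += comments
        entry[2] += 1
    return {
        item_type: (likes_sum / count, comments_sum / count)
        for item_type, (likes_sum, comments_sum, count) in totals.items()
    }

# Hypothetical sample rows, not taken from the real dataset.
rows = [("photo", 10, 2), ("photo", 20, 4), ("event", 5, 5), ("comment", 3, 0)]
averages = average_likes_and_comments(rows)
```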
gcloud compute copy-files facebook_Graph_Data_Table.pl webserver:/tmp
sudo mv /tmp/facebook_Graph_Data_Table.pl /var/www/cgi-bin/yiqin/
sudo chmod 755 /var/www/cgi-bin/yiqin/facebook_Graph_Data_Table.pl
gcloud compute copy-files uber-yiqin-0.0.1-SNAPSHOT.jar hadoop-m:/mnt/scratch/yiqin
hadoop jar uber-yiqin-0.0.1-SNAPSHOT.jar edu.uchicago.yiqin.SerializeFacebookGraphNode /mnt/scratch/yiqin/input
Don't forget the class name; it is the class that contains main.
http://104.197.20.219/cgi-bin/yiqin/facebook_Graph_Data_Table.pl
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic yiqin-facebook
kafka-console-consumer.sh --zookeeper localhost:2181 --topic yiqin-facebook --from-beginning