The dataset contains 19,850 posts from 83 news organizations and personalities, covering up to the last 250 page posts from each as of July 14th, 2017. Each post has up to 100 comments, for a total of 1,025,403 comments.
You can alternatively access the data through BigQuery: https://bigquery.cloud.google.com/dataset/jbencina-144002:fb_news
- 2021-10-10: Updated readme
- 2019-11-21: Updated the comments data to obfuscate user id and user name with hashed values using Python's blake2b
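For readers unfamiliar with this kind of obfuscation, here is a minimal sketch of hashing an identifier with `hashlib.blake2b` from the Python standard library. The helper name `obfuscate`, the 16-byte digest size, the UTF-8 encoding, and the sample values are all assumptions made for illustration; the exact settings used for this dataset are not documented here.

```python
import hashlib


def obfuscate(value: str, digest_size: int = 16) -> str:
    """Return a hex BLAKE2b digest of `value`.

    digest_size=16 is an assumed setting, not the dataset's documented one.
    """
    return hashlib.blake2b(value.encode("utf-8"), digest_size=digest_size).hexdigest()


# Made-up inputs; the real user ids and names are not part of this example.
hashed_id = obfuscate("1234567890")
hashed_name = obfuscate("Jane Example")
```

The same input always produces the same digest, so comments by one user can still be grouped together after hashing.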
- post_scraper_github.py - script used to scrape pages
- fb_news_comments_1000K.7z - comments file
- fb_news_posts_20K.csv - posts file
- fb_news_pagenames.csv - page names
Apologies for the 7Z format, but it compressed the best.
This script loops through a dictionary of Facebook page ids and retrieves the last N posts and up to 100 comments for each post. The results are optionally cached as individual data files and ultimately stored as a set of data frames: one for posts, one for comments. They can be linked by the common post_id field.
While originally used for news sites, this script can accommodate any Facebook page. One could also loop through the comments to get more than the last 100.
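Looping through comments to get more than the first batch could look roughly like the sketch below. It is not taken from the scraper script. It assumes each response is a JSON object with a `data` list and an optional `paging.next` URL, and it takes the HTTP call as a hypothetical `fetch` function so the logic can be shown without credentials. The names `iter_comments`, `fetch`, `max_pages`, and `fake_pages` are illustrative only.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_comments(
    first_url: str,
    fetch: Callable[[str], Dict],
    max_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield comments page by page, following `paging.next` until it runs out."""
    url = first_url
    pages = 0
    while url and (max_pages is None or pages < max_pages):
        payload = fetch(url)  # hypothetical: returns the decoded JSON response
        yield from payload.get("data", [])
        url = payload.get("paging", {}).get("next")  # None when there is no next page
        pages += 1


# Stand-in for real responses so the sketch runs offline.
fake_pages = {
    "page-1": {"data": [{"message": "a"}, {"message": "b"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"message": "c"}]},
}
comments = list(iter_comments("page-1", fake_pages.__getitem__))
```

Passing `max_pages` caps the number of requests per post, which helps keep a long scrape within rate limits.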
To join the two datasets, split the post_id and post_name fields (from the Posts and Comments files, respectively) on the underscore. The resulting ID allows for a clean join between both files.
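The join described above can be sketched with plain Python as follows. It assumes the shared key is the part after the underscore, which is what the field descriptions for post_id and post_name suggest. The helper `join_key` and the sample rows are made up for illustration and do not come from the data files.

```python
from collections import defaultdict


def join_key(identifier: str) -> str:
    """Return the part after the last underscore (assumed to be the shared post id)."""
    return identifier.rsplit("_", 1)[-1]


# Made-up rows mimicking the two files.
posts = [
    {"post_id": "111_900", "message": "first post"},
    {"post_id": "111_901", "message": "second post"},
]
comments = [
    {"post_name": "aaa_900", "message": "comment one"},
    {"post_name": "bbb_900", "message": "comment two"},
    {"post_name": "ccc_901", "message": "comment three"},
]

# Group comment text by the derived key, then look up each post's comments.
comments_by_post = defaultdict(list)
for comment in comments:
    comments_by_post[join_key(comment["post_name"])].append(comment["message"])

joined = {
    post["post_id"]: comments_by_post.get(join_key(post["post_id"]), [])
    for post in posts
}
```

With pandas, the same key could be derived with `Series.str.split` and the two frames combined with `DataFrame.merge`.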
- created_time
- description: only for posts with links
- link
- message: post contents
- page_id
- post_id: two part identifier of [page_id]_[post_id]
- react_angry
- react_haha
- react_like
- react_love
- react_sad
- react_wow
- scrape_time
- shares
- created_time
- from_id: user id (obfuscated)
- from_name: publicly visible user name (obfuscated)
- message: text
- post_name: two part identifier of [from_id]_[post_id]
May be subject to future additions.
'bbc': '228735667216',
'fox_news': '15704546335',
'abc_news': '86680728811',
'nbc_news': '155869377766434',
'cbs_news': '131459315949',
'cnn': '5550296508',
'msnbc': '273864989376427',
'npr': '10643211755',
'politico': '62317591679',
'reuters': '114050161948682',
'wastington_post': '6250307292',
'ny_times': '5281959998',
'economist': '6013004059',
'financial_times': '8860325749',
'the_guardian': '10513336322',
'daily_mail': '164305410295882',
'ny_dailynews': '268914272540',
'breitbart': '95475020353',
'infowars': '80256732576',
'huffington_post': '18468761129',
'daily_kos': '43179984254',
'salon': '120680396518',
'the_hill': '7533944086',
'national_review': '15779440092',
'usa_today': '13652355666',
'wsj': '8304333127',
'buzzfeed_news': '618786471475708',
'cnbc': '97212224368',
'newsweek': '18343191100',
'associated_press': '249655421622',
'bloomberg_politics': '1481073582140028',
'yahoo_news': '338028696036',
'chicago_tribune': '5953023255',
'la_times': '5863113009',
'daily_beast': '37763684202',
'drudge_report': '1416139158459267',
'the_blaze': '140738092630206',
'young_cons': '147772245840',
'daily_caller': '182919686769',
'newsmax': '85452072376',
'wordl_net_daily': '119984188013847',
'independent_journal_review': '687156898054966',
'los_angeles_times': '5863113009',
'time': '10606591490',
'us_worldnews_report': '5834919267',
'business_insider': '20446254070',
'slate': '21516776437',
'vox': '223649167822693',
'think_progress': '200137333331078',
'democratic_undergound': '455410617809655',
'talking_points_memo': '98658495398',
'the_nation': '7629206115',
'mother_jones': '7642602143',
'raw_story': '20324257234',
'pro_publica': '13320939444',
'townhall': '41632789656',
'washington_examiner': '40656699159',
'daily_signal': '300341323465160',
'weekly_standard': '11643473298',
'the_atlantic': '29259828486',
'the_newyorker': '9258148868',
'morning_joe': '90692553761',
'vice_news': '235852889908002',
'rt': '326683984410',
'al_jazeera': '7382473689',
'one_america_news': '220198801458577',
'christian_science_monitor': '14660729657',
'pbs_newshour': '6491828674',
'miami_herald': '38925837299',
'person_alex_jones': '6499393458',
'person_anderson_cooper': '60894670532',
'person_rachel_maddow': '25987609066',
'person_sean_hannity': '69813760388',
'person_chris_matthews': '114114045339706',
'person_megyn_kelly': '1425464424382692',
'person_neil_cavuto': '101988643193353',
'person_chris_hayes': '153005644864469',
'person_shepard_smith': '131010790489',
'person_erin_burnett': '102938906478343',
'person_joe_scarobourgh': '144128236879',
'person_rush_limbaugh': '136264019722601',
'person_bill_mahar': '62507427296',
'person_ann_coulter': '695526053890545'