Facebook News Scraper

Overview

Dataset contains 19,850 posts from 83 various news organizations & personalities representing up to the last 250 page posts made as of July 14th, 2017. Each post has up to 100 comments for a total of 1,025,403 comments.

You can alternatively access the data through BigQuery: https://bigquery.cloud.google.com/dataset/jbencina-144002:fb_news

Files

post_scraper_github.py - script used to scrape pages
fb_news_comments_1000K.7z - comments file
fb_news_posts_20K.csv - posts file
fb_news_pagenames.csv - page names

Apologizes for the 7Z format, but it compressed the best.

Script

This script loops though a dictionary of Facebook page ids and retrieves the last N posts and up to 100 comments for each post. The results are optionally cached as individual data files and ultimately stored as a set of data frames: one for posts, one for comments. They can be linked by the common post_id field.

While originally used for news sites, this script can accommodate any Facebook page. One could also loop though the comments to get more than the last 100.

Fields - Posts

created_time
description: only for posts with links
link
message: post contents
page_id
post_id
react_angry
react_haha
react_like
react_love
react_sad
react_wow
scrape_time
shares

Fields - Comments

created_time
from_id: user id
from_name: publicly visible name of poster
message
post_id: parent post_id

Pages Scraped

May be subject to future additions.

'bbc': '228735667216',
'fox_news': '15704546335',
'abc_news': '86680728811',
'nbc_news': '155869377766434',
'cbs_news': '131459315949',
'cnn': '5550296508',
'msnbc': '273864989376427',
'npr': '10643211755',
'politico': '62317591679',
'reuters': '114050161948682',
'wastington_post': '6250307292',
'ny_times': '5281959998',
'economist': '6013004059',
'financial_times': '8860325749',
'the_guardian': '10513336322',
'daily_mail': '164305410295882',
'ny_dailynews': '268914272540',
'breitbart': '95475020353',
'infowars': '80256732576',
'huffington_post': '18468761129',
'daily_kos': '43179984254',
'salon': '120680396518',
'the_hill': '7533944086',
'national_review': '15779440092',
'usa_today': '13652355666',
'wsj': '8304333127',
'buzzfeed_news': '618786471475708',
'cnbc': '97212224368',
'newsweek': '18343191100',
'associated_press': '249655421622',
'bloomberg_politics': '1481073582140028',
'yahoo_news': '338028696036',
'chicago_tribune': '5953023255',
'la_times': '5863113009',
'daily_beast': '37763684202',
'drudge_report': '1416139158459267',
'the_blaze': '140738092630206',
'young_cons': '147772245840',
'daily_caller': '182919686769',
'newsmax': '85452072376',
'wordl_net_daily': '119984188013847',
'independent_journal_review': '687156898054966',
'los_angeles_times': '5863113009',
'time': '10606591490',
'us_worldnews_report': '5834919267',
'business_insider': '20446254070',
'slate': '21516776437',
'vox': '223649167822693',
'think_progress': '200137333331078',
'democratic_undergound': '455410617809655',
'talking_points_memo': '98658495398',
'the_nation': '7629206115',
'mother_jones': '7642602143',
'raw_story': '20324257234',
'pro_publica': '13320939444',
'townhall': '41632789656',
'washington_examiner': '40656699159',
'daily_signal': '300341323465160',
'weekly_standard': '11643473298',
'the_atlantic': '29259828486',
'the_newyorker': '9258148868',
'morning_joe': '90692553761',
'vice_news': '235852889908002',
'rt': '326683984410',
'al_jazeera': '7382473689',
'one_america_news': '220198801458577',
'christian_science_monitor': '14660729657',
'pbs_newshour': '6491828674',
'miami_herald': '38925837299',
'person_alex_jones': '6499393458',
'person_anderson_cooper': '60894670532',
'person_rachel_maddow': '25987609066',
'person_sean_hannity': '69813760388',
'person_chris_matthews': '114114045339706',
'person_megyn_kelly': '1425464424382692',
'person_neil_cavuto': '101988643193353',
'person_chris_hayes': '153005644864469',
'person_shepard_smith': '131010790489',
'person_erin_burnett': '102938906478343',
'person_joe_scarobourgh': '144128236879',
'person_rush_limbaugh': '136264019722601',
'person_bill_mahar': '62507427296',
'person_ann_coulter': '695526053890545'