alexrutherford/facebook_page_scraping

Notebook to robustly query public Facebook pages based on keywords

Summary

Notebook and script to query the Facebook API and write the results to MongoDB. The steps are as follows:

Define the keywords used to find pages (a keyword matches the page description and/or title)
Grab the matching pages and their associated metadata
Grab the latest posts from each page, limited to ~250 posts and a recent time range
Page through all comments and likes on each post
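
The last step, paging through comments and likes, can be sketched as below. This is a minimal illustration, not the notebook's actual code: the function name iter_paged and the get_json parameter are assumptions introduced here. It relies on the Graph API convention that each response carries its items under "data" and, while more results remain, a "paging" object with a "next" URL.

```python
def iter_paged(first_url, get_json=None):
    """Yield every item from a paged Graph API style response.

    iter_paged and get_json are hypothetical names used for this sketch.
    """
    if get_json is None:
        # Default fetcher uses Requests; imported here so the sketch can
        # also be exercised with a stub and no network access.
        import requests

        def get_json(url):
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()

    url = first_url
    while url:
        payload = get_json(url)
        for item in payload.get("data", []):
            yield item
        # A "next" URL is present only while more results remain.
        url = payload.get("paging", {}).get("next")
```

Passing a stub as get_json makes it possible to check the paging logic offline before pointing it at a real comments or likes endpoint.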

Logic

The high-level flow of the script is as follows:

Keywords to find pages are defined
A call to the API returns all page IDs matching these keywords
A second call gets the full information for these pages
Another API call gets the latest posts on these pages
Two separate calls are made to get the likes and the comments on each post

At each point, the returned data is checked against the DB by its unique ID: records that already exist are ignored, and new ones are added.
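
The check-then-insert step could look roughly like the sketch below. The helper name insert_if_new and the "id" key are assumptions for illustration; the helper only relies on the find_one and insert_one methods that a pymongo collection provides. InMemoryCollection is a hypothetical stand-in so the example runs without a MongoDB server.

```python
def insert_if_new(collection, record, key="id"):
    """Insert record unless a document with the same key value exists.

    Returns True if the record was added, False if it was ignored.
    insert_if_new is a hypothetical helper, not part of the notebook.
    """
    if collection.find_one({key: record[key]}) is not None:
        return False
    collection.insert_one(record)
    return True


class InMemoryCollection:
    """Hypothetical stand-in with the two collection methods used above."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first document whose fields match every query entry.
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(doc)


posts = InMemoryCollection()
print(insert_if_new(posts, {"id": "123_456", "message": "hello"}))  # True
print(insert_if_new(posts, {"id": "123_456", "message": "hello"}))  # False
```

Against a real database, a unique index on the ID field would be a reasonable extra safeguard, since a separate lookup and insert can race when several writers run at once.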

MongoDB Structure

Requires a MongoDB database with four collections:

Pages
Posts
Comments
Likes

Dependencies

Requests: for robust HTTP requests
Langid: for detecting non-English content
TextBlob: abstraction for translating non-English content with Google Translate
MongoDB connector: to easily write results to the DB in JSON format
Facebook API key: for API authentication
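
As an illustration of how Langid might be used to flag non-English content, here is a minimal sketch. The function name is_non_english and the classify parameter are assumptions made for this example; langid.classify returns a (language code, score) pair, and only the language code is used here.

```python
def is_non_english(text, classify=None):
    """Return True when the detected language of text is not English.

    is_non_english is a hypothetical helper; classify can be swapped for
    a stub so the logic runs without langid installed.
    """
    if classify is None:
        # Imported lazily so the sketch does not require langid when a
        # custom classifier is supplied.
        import langid

        classify = langid.classify

    language, _score = classify(text)
    return language != "en"
```

Comments or posts for which this returns True would be the ones passed on for translation.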