/facebook_page_scraping

Notebook to robustly query public Facebook pages based on keywords

Primary language: Jupyter Notebook

Summary

Notebook and script to query the Facebook API and write the results to MongoDB. The steps are as follows:

  1. Define keywords to find pages (matched against description and/or title)
  2. Grab the matching pages and their associated metadata
  3. Grab the latest posts, limited to ~250 and a recent time range
  4. Page through all comments and likes on each post
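Step 4 relies on paging through results. The sketch below is not the notebook's actual code; it assumes the Graph API response shape of a `data` list plus an optional `paging.next` URL, and the names `fetch_json` and `iter_items` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode its JSON body (hypothetical helper)."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_items(first_url, fetch=fetch_json):
    """Yield every item across all result pages.

    Assumes each response looks like
    {"data": [...], "paging": {"next": "<url>"}} and that the
    "next" key is absent on the last page.
    """
    url = first_url
    while url:
        payload = fetch(url)
        for item in payload.get("data", []):
            yield item
        # Follow the cursor to the next page, or stop when there is none.
        url = payload.get("paging", {}).get("next")
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; the same generator can serve both the comments and the likes calls.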

Logic

The high-level flow of the script is as follows:

  1. Keywords to find pages are defined
  2. A call to the API returns all page IDs matching these keywords
  3. A second call gets the full information for these pages
  4. Another API call gets the latest posts on these pages
  5. Two separate calls are made to get the likes and the comments on each post

At each point, the returned data is checked against the DB by its unique ID: if a record with that ID already exists it is ignored, otherwise it is added.
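One way to implement this add-or-ignore check is an upsert with `$setOnInsert`, sketched below. This is an assumption about how it could be done, not necessarily how the script does it: the function name `insert_if_new` is hypothetical, and so is the choice to store the Facebook ID (assumed to be in the `id` field) as MongoDB's `_id`.

```python
def insert_if_new(collection, doc, id_field="id"):
    """Insert doc unless a document with the same unique ID exists.

    `collection` is expected to behave like a pymongo Collection.
    Returns True if the document was inserted, False if it was
    already present and therefore ignored.
    """
    result = collection.update_one(
        {"_id": doc[id_field]},   # match on the unique ID
        {"$setOnInsert": doc},    # write fields only when inserting
        upsert=True,
    )
    # upserted_id is set only when a new document was created.
    return result.upserted_id is not None
```

Doing the check and the insert in a single operation avoids a separate lookup query for each record.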

MongoDB Structure

Requires a MongoDB database with four collections:

  1. Pages
  2. Posts
  3. Comments
  4. Likes
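As a rough sketch, the four collections could be looked up from a database handle as shown here. The lowercase collection names and the helper `get_collections` are assumptions for illustration; the actual names used by the script may differ.

```python
# Assumed collection names; adjust to match the real database.
COLLECTION_NAMES = ("pages", "posts", "comments", "likes")


def get_collections(db):
    """Return a dict mapping each collection name to its handle.

    `db` is expected to support item access by collection name,
    as a pymongo Database does (db["pages"]).
    """
    return {name: db[name] for name in COLLECTION_NAMES}
```

With pymongo installed, `db` could be obtained as `pymongo.MongoClient()["facebook"]`, where the database name `facebook` is hypothetical.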

Dependencies