Scraper for Facebook's Archive of Ads with Political Content ... until Facebook provides an API.
fb-ad-archive-scraper will produce:
- CSV containing the text and metadata of the ads.
- Screenshots of each ad.
- A README file.
Like any scraper, fb-ad-archive-scraper is fragile. It will break if Facebook changes the structure / code of the Archive. If fb-ad-archive-scraper breaks, let me know.
Tickets / PRs are welcome.
-
Clone the repo:
git clone https://github.com/justinlittman/fb-ad-archive-scraper.git
-
Change to the directory:
cd fb-ad-archive-scraper
-
Optionally, create a virtual environment:
virtualenv -p python3 ENV source ENV/bin/activate
-
Install requirements:
pip install -r requirements.txt
-
Install Chromedriver. On a Mac, this is:
brew cask install chromedriver
If already installed, upgrade Chromedriver with:
brew cask upgrade chromedriver
usage: scraper.py [-h] [--limit LIMIT] [--headed] [--wait WAIT]
[--country {ALL,US,BR}]
[--type {news_ads,political_and_issue_ads}]
[--status {all,active,inactive}]
email password query [query ...]
Scrape Facebook's Archive of Ads with Political Content
positional arguments:
email Email address for FB account
password Password for FB account
query Query
optional arguments:
-h, --help show this help message and exit
--limit LIMIT Limit on number of ads to scrape
--headed Use a headed chrome browser
--wait WAIT Seconds to sleep between requests
--country {ALL,US,BR}
Limit ads by country. Choices: ALL, US, BR. Default is
ALL.
--type {news_ads,political_and_issue_ads}
Limit ads by type. Choices: news_ads,
political_and_issue_ads. Default is
political_and_issue_ads.
--status {all,active,inactive}
Limit ads by status. Choices: all, active, inactive.
Default is all.
For example:
python scraper.py fbuser@gmail.com password pelosi
Notes:
- fb-ad-archive-scraper uses a headless Chrome browser. This means that you will not see the browser at work.
- The output of each run will be placed in a separate directory and include a README, CSV file summarizing the ads, PNG images, and JSON retrieved from Facebook.
The approach of extracting data from XHRs came from Ranjit Hatnagar.