fb-ad-archive-scraper

Scraper for Facebook's Archive of Ads with Political Content ... until Facebook provides an API.

fb-ad-archive-scraper will produce:

CSV containing the text and metadata of the ads.
Screenshots of each ad.
A README file.

Like any scraper, fb-ad-archive-scraper is fragile. It will break if Facebook changes the structure / code of the Archive. If fb-ad-archive-scraper breaks, let me know.

Tickets / PRs are welcome.

Install

Clone the repo:

 git clone https://github.com/justinlittman/fb-ad-archive-scraper.git

Change to the directory:
```
 cd fb-ad-archive-scraper
```

Optionally, create a virtual environment:

 virtualenv -p python3 ENV
 source ENV/bin/activate

Install requirements:
```
 pip install -r requirements.txt
```
Install Chromedriver. On a Mac, this is:
```
 brew cask install chromedriver
```
If already installed, upgrade Chromedriver with:
```
 brew cask upgrade chromedriver
```

Usage

    usage: scraper.py [-h] [--limit LIMIT] [--headed] [--wait WAIT]
                      [--country {ALL,US,BR}]
                      [--type {news_ads,political_and_issue_ads}]
                      [--status {all,active,inactive}]
                      email password query [query ...]
    
    Scrape Facebook's Archive of Ads with Political Content
    
    positional arguments:
      email                 Email address for FB account
      password              Password for FB account
      query                 Query
    
    optional arguments:
      -h, --help            show this help message and exit
      --limit LIMIT         Limit on number of ads to scrape
      --headed              Use a headed chrome browser
      --wait WAIT           Seconds to sleep between requests
      --country {ALL,US,BR}
                            Limit ads by country. Choices: ALL, US, BR. Default is
                            ALL.
      --type {news_ads,political_and_issue_ads}
                            Limit ads by type. Choices: news_ads,
                            political_and_issue_ads. Default is
                            political_and_issue_ads.
      --status {all,active,inactive}
                            Limit ads by status. Choices: all, active, inactive.
                            Default is all.

For example:

    python scraper.py fbuser@gmail.com password pelosi

Notes:

fb-ad-archive-scraper uses a headless Chrome browser. This means that you will not see the browser at work.
The output of each run will be placed in a separate directory and include a README, CSV file summarizing the ads, PNG images, and JSON retrieved from Facebook.

The approach of extracting data from XHRs came from Ranjit Hatnagar.