- `ad_search_template.py` is a generic file - a customised version of this is the basis for an ad scrape. To scrape ads with certain keywords and/or from certain actors, copy `ad_search_template.py`, fill in the relevant details, and run the file.
- This lightweight `ad_search.py` file points to a utils file `ad_api_utils.py` containing all the python functions that actually carry out the scrape.
- The ad search file can be in a different directory from this repo, as long as the variable `UTILS_FOLDER_PATH` correctly points to the parent folder of the `ad_api_utils.py` file (i.e., this repo).
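A customised search file might start with something like the following sketch (the `sys.path` approach to importing the utils is an assumption here; check how `ad_search_template.py` actually handles the import):

```python
import sys

# Point this at the folder containing ad_api_utils.py (i.e. a local copy of this repo).
UTILS_FOLDER_PATH = "/path/to/this/repo"

# One possible way to make the utils importable from another directory (assumed, not
# necessarily how the template does it).
sys.path.append(UTILS_FOLDER_PATH)

import ad_api_utils  # contains the functions that actually carry out the scrape
```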
- The full list of search parameters that can be used with the Meta Ads API is found at https://developers.facebook.com/docs/graph-api/reference/ads_archive/
- The only parameter absolutely required by the API is `ad_reached_countries`. In practice, to avoid returning too many ads, at least one of `search_terms`, `search_page_ids`, `ad_delivery_date_min`, and `ad_delivery_date_max` should also be used. These have generic default values as placeholders in the `ad_search_template.py` file.
- Other custom parameters can be specified by adding entries to the `custom_params` dict. These may include regions, languages, platforms, or string search types.
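As an illustration only, a filled-in parameter block might look roughly like this (the parameter names follow the ads_archive reference linked above; the exact variable layout in `ad_search_template.py` and the placeholder values are assumptions, not recommendations):

```python
# Required by the API: list of country codes the ads reached.
ad_reached_countries = ["GB"]

# Recommended narrowing parameters (placeholder values).
search_terms = "example keyword"
search_page_ids = None                # or a list of page IDs for specific actors
ad_delivery_date_min = "2024-01-01"
ad_delivery_date_max = "2024-03-31"

# Any other API parameters can be added here (keys shown are illustrative examples).
custom_params = {
    "languages": ["en"],
    "publisher_platforms": ["facebook", "instagram"],
}
```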
- Some parameters have default values included in the `ad_api_utils.py` file itself - these are sensible defaults for most purposes, and it is not recommended that they are changed. However, if changing them is required (e.g. if you are searching for only currently active ads), any values specified in the `ad_search` file will take precedence (i.e. will overwrite the `ad_api_utils` defaults).
- By default, all fields are returned in each search; a full list of these is found at https://developers.facebook.com/docs/marketing-api/reference/archived-ad/
- To customise fields, simply specify a list of `custom_fields` in the `ad_search` file.
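For example (field names are taken from the archived-ad reference above; treating `custom_fields` as a plain list is an assumption about the template):

```python
# Return only a subset of fields instead of the full default set.
custom_fields = [
    "id",
    "page_name",
    "ad_delivery_start_time",
    "ad_delivery_stop_time",
    "spend",
    "impressions",
]
```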
- Once the customised `ad_search_template.py` file is run, the ad data (in the form of a pickled pandas dataframe) and a log file (.txt) will be saved to the `SAVE_FOLDER` you have specified (by default, simply `/data/` in whichever directory the `ad_search` file is in).
- This log file contains details of the parameters used for the search, as well as the time the search was run.
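The saved data can be read back with pandas, e.g. (the filename here is a placeholder for whatever `SAVE_FNAME` you chose):

```python
import pandas as pd

# Load a previously saved scrape from the SAVE_FOLDER (default: data/ next to the ad_search file).
raw_df = pd.read_pickle("data/scrape.pkl")
print(raw_df.shape)
print(raw_df.columns.tolist())
```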
- Running the search file again (with the same save name - `SAVE_FNAME`) will not overwrite the first set of data; further scrapes will be saved with incrementing numbers appended (e.g. scrape_2.pkl, scrape_3.pkl, etc.).
- It is recommended that one folder is used to store just the raw scrapes (and their logs), and any data processing is done in another directory. This avoids issues with the incrementing file naming.
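The naming behaviour is roughly equivalent to the following sketch (illustrative only - this is not the repo's actual implementation, and whether the very first file gets a number appended is not specified here):

```python
import os

def next_save_path(save_folder: str, save_fname: str) -> str:
    """Illustrative: find the next unused filename (scrape.pkl, scrape_2.pkl, scrape_3.pkl, ...)."""
    path = os.path.join(save_folder, f"{save_fname}.pkl")
    n = 2
    while os.path.exists(path):
        path = os.path.join(save_folder, f"{save_fname}_{n}.pkl")
        n += 1
    return path
```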
- The "raw" data obtained from the API contains some fields that require processing before being easy to use. For example, spend is returned as a dictionary: `{'lower_bound': X, 'upper_bound': Y}`.
- To process the data into a usable form, import the `preproc` function from the `processing_utils.py` file in the repo and run it on the raw dataframe. This function:
  - converts datetime columns to datetime,
  - splits range fields into two columns,
  - converts spend columns into USD (using historical exchange rates at the time of ad delivery start, falling back on current rates),
  - calculates the minimum possible spend based on the ad's duration (0.5 or 1 USD per day, depending on the currency used),
  - calculates a lower bound for the spend: the higher of the API-returned spend lower bound (often 0) and the minimum possible spend based on duration. This provides a meaningful lower spend value and avoids summing zeros (for high-frequency, low-spend ad campaigns).
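Typical usage in a separate processing directory might look like this (a minimal sketch - it assumes `preproc` takes the raw dataframe and returns the processed one, which follows from the description above rather than a documented signature, and that `processing_utils.py` is importable, e.g. via `sys.path`):

```python
import pandas as pd
from processing_utils import preproc

raw_df = pd.read_pickle("data/scrape.pkl")   # a raw scrape saved earlier
df = preproc(raw_df)                         # datetimes parsed, ranges split, spend converted to USD
```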
- Other useful processing functions are included:
  - `get_daily_active_matrix`: returns a boolean dataframe with an ad_id index and date columns - True if that ad is active on that date.
  - `get_daily_spend_matrix`: equivalent to the above, but instead lists the ad's amortized spend (i.e. averaged over its active lifetime) for that date.
  - `get_regional_impressions`: gets a dataframe of ad impressions by region (sub-country regions in most cases).
  - `get_country_impressions`: aggregates the above regions into countries. Region identification was part-automated (using geolocating libraries) and part-manual (based on co-occurring regions with impressions, and the complete absence of certain countries from the dataset). NB: these caveats mean this data should be framed as an estimate and quoted to perhaps 2 or 3 significant figures.