Note: This is not an officially supported Google product. It is a reference implementation.

Oculi

Oculi is a Google Cloud-based pipeline for tagging large sets of images or videos with labels based on their content, generating a BigQuery dataset for further analysis. Content tagging is done through Cloud's pre-trained computer vision models (Vision API and Video Intelligence API).

The primary use case is for analyzing creatives (images and videos) in digital advertising. Combined with creative performance data, the output from this pipeline can be used to explore correlations between advertising content and performance (e.g. creatives with a human model tend to perform better).

Creative Sources

This pipeline supports three sources of creatives ¹:

A Google Campaign Manager (CM) account. Oculi will attempt to extract all creatives on the account that have an image or video asset in a suitable format ², then download the asset and save a copy to Cloud Storage. Users of DV360 may be able to use this option (see FAQ).
A BigQuery table of URLs. URLs must point to images or videos, and be accessible without login. Oculi will download the asset and save a copy to Cloud Storage. The required table columns are:
- Creative_ID, a unique integer for each image or video
- Advertiser_ID, an integer identifying a parent entity
- Creative_Name, a text field
- Full_URL, the URL to the image or video file
A Google Cloud Storage (GCS) bucket of creative files. Files must be image or video files (JPG and MP4 preferred) at the top level of the bucket.
- If the filenames follow the convention {numeric_id}_{other_stuff}.jpg, then the numeric_id will be used as the creative_id.
- Otherwise, a creative_id is generated by calling the Python hash() function on the entire filename.
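
The filename rule above can be sketched as follows. This is a minimal illustration, not Oculi's actual code: the helper name `derive_creative_id` and the regular expression are assumptions, and the sketch accepts any file extension rather than only `.jpg`. One caveat: in Python 3, `hash()` on strings is salted per process unless `PYTHONHASHSEED` is set, so the fallback ID is only repeatable across runs under Python 2.7 defaults or with a fixed seed.

```python
import re

# Hypothetical helper; Oculi's own implementation may differ.
_NUMERIC_PREFIX = re.compile(r"^(\d+)_")


def derive_creative_id(filename):
    """Return the numeric prefix as an int, else hash() of the whole filename."""
    match = _NUMERIC_PREFIX.match(filename)
    if match:
        return int(match.group(1))
    # Fallback described in the text: hash the entire filename.
    return hash(filename)


print(derive_creative_id("12345_summer_banner.jpg"))  # numeric prefix is used
print(derive_creative_id("summer_banner.jpg"))        # falls back to hash()
```

If stable IDs matter for joining against performance data, following the numeric prefix convention avoids relying on `hash()` at all.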

Google Cloud components

This pipeline uses:

Dataflow for orchestration,
Vision API and Video Intelligence API for content tagging,
Storage for storing creative assets (images and videos),
BigQuery for storing content tagging output.

Prerequisites

Basic requirements for setup include:

Python 2.7
Google Cloud SDK

Full requirements can be found in requirements.txt and will be installed for you through setup scripts.

Running the pipeline

1. (Optional) If you would like a virtual environment set up for you:
```
source presetup_virtualenv.sh
```
2. Install dependencies, configure the Google Cloud SDK environment, and obtain credentials:
```
./setup_environment.sh
```
This step should generate a client_secrets.json file in your current directory.
3. Create a jobfile in jobs/ using sample.yaml as a reference. A few tips:
- Common acronyms used in this file: gcp = Google Cloud Platform, gcs = Google Cloud Storage, bq = BigQuery.
- All entities referenced in this file must already exist, e.g. the data_destination.bq_dataset will not be created for you.
- In creative_source_details, you can delete the sections for the sources you're not using to avoid confusion.
- If using CM, the cm_profile_id must refer to a User Profile matching the email in client_secrets.json. For more information, see the FAQ below.
4. Run the job:
```
python run.py jobs/JOBFILE.yaml [--limit 50] [--local]
```
- Use the optional --limit flag to cap the number of creatives for testing.
- Use the optional --local flag to run on your local machine for testing (not fully supported).
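
To make the command-line shape concrete, here is a hypothetical sketch of how a script with this interface could parse its arguments. How run.py actually handles them is not shown in this README; the function name `parse_args` and the help strings are assumptions for illustration only.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser mirroring the documented command line.
    parser = argparse.ArgumentParser(description="Run an Oculi job.")
    parser.add_argument("jobfile", help="path to a YAML jobfile in jobs/")
    parser.add_argument("--limit", type=int, default=None,
                        help="cap the number of creatives (for testing)")
    parser.add_argument("--local", action="store_true",
                        help="run on the local machine instead of Dataflow")
    return parser.parse_args(argv)


args = parse_args(["jobs/JOBFILE.yaml", "--limit", "50", "--local"])
print(args.jobfile, args.limit, args.local)
```

Both flags are optional, so omitting them leaves `limit` as `None` and `local` as `False` in this sketch.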

Frequently Asked Questions

How do I pull creatives from DV360 rather than Campaign Manager?

Your DV360 creatives may be accessible from Campaign Manager; check with your Google account manager or gTech contact to confirm.

When using Campaign Manager as a source, how do I fill out cm_profile_id?

Any request to Campaign Manager must be made through a User Profile (see this Help Center article for more information). Oculi requires its own User Profile, tied to the service account (robot email account) that will be used to run this pipeline. If you don't have access to Campaign Manager, you'll need a trafficker with access to create this profile for you.

To confirm which email should be used, check inside client_secrets.json after it has been generated by setup_environment.sh. The desired field is client_email.
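
A quick way to read that field is sketched below. The helper name `read_client_email` is an assumption; the only details taken from this README are the filename client_secrets.json and the client_email field.

```python
import json
import os


def read_client_email(path="client_secrets.json"):
    """Return the client_email field from the generated credentials file."""
    with open(path) as f:
        return json.load(f)["client_email"]


# Only attempt the lookup if setup_environment.sh has already produced the file.
if os.path.exists("client_secrets.json"):
    print(read_client_email())
```

The printed address is the one the Campaign Manager User Profile should be tied to.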

How do I pull creatives from Google Ads, YouTube, or some other source?

There are two ways to extend this pipeline for sources other than CM. The simplest method is to extract creatives from another source separately, generating either a list of URLs to use as a BigQuery source, or a collection of files to use as a GCS source.
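
As a sketch of the first method, the snippet below renders externally extracted creatives as CSV text with the four columns required for a BigQuery source (see Creative Sources). It is written for Python 3 and runs separately from the pipeline. The function name `build_url_table_csv` and the example URL are hypothetical, and loading the resulting file into BigQuery is left to whichever tool you prefer.

```python
import csv
import io

# Column names taken from the Creative Sources section of this README.
COLUMNS = ["Creative_ID", "Advertiser_ID", "Creative_Name", "Full_URL"]


def build_url_table_csv(creatives):
    """Render creatives (dicts keyed by COLUMNS) as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    seen_ids = set()
    for creative in creatives:
        row = dict(creative)
        # Both ID columns must be integers, and Creative_ID must be unique.
        row["Creative_ID"] = int(row["Creative_ID"])
        row["Advertiser_ID"] = int(row["Advertiser_ID"])
        if row["Creative_ID"] in seen_ids:
            raise ValueError("duplicate Creative_ID: %d" % row["Creative_ID"])
        seen_ids.add(row["Creative_ID"])
        writer.writerow(row)
    return buffer.getvalue()


example = [
    {"Creative_ID": 1, "Advertiser_ID": 10, "Creative_Name": "Summer banner",
     "Full_URL": "https://example.com/creatives/summer.jpg"},
]
print(build_url_table_csv(example))
```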

If you want to fully integrate your creative extraction into the pipeline, another option is to write a Python function to generate an in-memory list of URLs. This function can then be used to start the pipeline; see the current CM extraction method in main.py and cm_helper.py for an example. Since this method requires an in-memory list, we don't recommend it for massive volume.

The pipeline finished, but I see fewer creatives than expected.

There are many potential causes; the easiest way to investigate is by clicking on specific nodes in the Dataflow job graph and looking at the Input and Output collections. If you see a dropoff...

... in the Output of Pull creatives from CM: creatives were skipped because they were outside the date range, or because they were unsuitable formats (e.g. tracking pixels). Note that the --limit option is applied before these filters, so you may end up with fewer creatives than the limit you set.
... between Input and Output of Copy assets to GCS: creatives failed to download. Oculi attempts to find an asset URL for every creative (see cm_helper.py), but this method is imperfect and fails for older creatives in particular.
... between Input and Output of an Extract step: creatives were filtered out because they weren't relevant for this table, e.g. only creatives with faces detected will end up in the face_annotations table.
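
The interaction between --limit and the CM filters can be illustrated with a toy model. This is not the pipeline's code: the function `pull_creatives` and the `suitable` field are hypothetical stand-ins for the date-range and format checks.

```python
def pull_creatives(creatives, limit=None):
    """Toy model: apply the limit first, then drop unsuitable creatives."""
    limited = creatives[:limit] if limit is not None else creatives
    return [c for c in limited if c["suitable"]]


creatives = [
    {"id": 1, "suitable": True},
    {"id": 2, "suitable": False},  # e.g. a tracking pixel
    {"id": 3, "suitable": True},
    {"id": 4, "suitable": True},
]

result = pull_creatives(creatives, limit=3)
print(len(result))  # 2: three were taken, then one was filtered out
```

Because the cap is applied before filtering, a limit of 3 yields only 2 creatives here even though 3 suitable ones exist in the full set.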

setup_environment.sh is failing, complaining about version mismatches and/or EnvironmentErrors.

Try using a virtual environment (step 1) to avoid any mismatches between the versions of libraries you already have installed on your system vs. the versions specified in requirements.txt.

Authors

Oly Bhaumik (sisrikshab@google.com)
Christopher Bian (cbian@google.com)
Aritra Biswas (aritrab@google.com)
David Letts (dgletts@google.com)
Eddie Ye (edwardye@google.com)

¹ The term "creative" is used to indicate just the image or video content in an ad, excluding other components like targeting preferences.
² This excludes dynamic creatives. The goal of this pipeline is to break images and videos down into their content components; dynamic creatives are already broken into content components. Rather, analysis of the data from this pipeline can be used to inform a dynamic creative strategy.
