Note: This is not an officially supported Google product. It is a reference implementation.
Oculi is a Google Cloud-based pipeline for tagging large sets of images or videos with labels based on their content, generating a BigQuery dataset for further analysis. Content tagging is done through Cloud's pre-trained computer vision models (Vision API and Video Intelligence API).
The primary use case is for analyzing creatives (images and videos) in digital advertising. Combined with creative performance data, the output from this pipeline can be used to explore correlations between advertising content and performance (e.g. creatives with a human model tend to perform better).
This pipeline supports three sources of creatives [1]:
- A Google Campaign Manager (CM) account. Oculi will attempt to extract all creatives on the account that have an image or video asset in a suitable format [2], then download the asset and save a copy to Cloud Storage. Users of DV360 may be able to use this option (see FAQ).
- A BigQuery table of URLs. URLs must point to images or videos and be accessible without login. Oculi will download the asset and save a copy to Cloud Storage. The required table columns are:
  - `Creative_ID`, a unique integer for each image or video
  - `Advertiser_ID`, an integer identifying a parent entity
  - `Creative_Name`, a text field
  - `Full_URL`, the URL to the image or video file
- A Google Cloud Storage (GCS) bucket of creative files. Files must be image or video files (JPG and MP4 preferred) at the top level of the bucket.
  - If the filenames follow the convention `{numeric_id}_{other_stuff}.jpg`, then the `numeric_id` will be used as the `creative_id`.
  - Otherwise, a `creative_id` is generated by calling the Python `hash()` function on the entire filename.
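The filename rule for the GCS source can be sketched roughly as follows. This is only an illustration of the convention described above: the helper name `creative_id_from_filename` and the regular expression are assumptions, not Oculi's actual code.

```python
import re

# Hypothetical pattern for the {numeric_id}_{other_stuff} convention:
# one or more digits at the start of the name, followed by an underscore.
_NUMERIC_PREFIX = re.compile(r"^(\d+)_")


def creative_id_from_filename(filename):
    """Return a creative_id for a file at the top level of the bucket."""
    match = _NUMERIC_PREFIX.match(filename)
    if match:
        # The filename follows the convention, so reuse its numeric_id.
        return int(match.group(1))
    # Otherwise fall back to hashing the entire filename.
    return hash(filename)
```

Note that on Python 3, `hash()` of a string varies between interpreter runs unless `PYTHONHASHSEED` is fixed, whereas Python 2.7 (which this pipeline targets) does not randomize string hashes by default. Using the numeric prefix convention avoids relying on hashed IDs.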
This pipeline uses:
- Dataflow for orchestration,
- Vision API and Video Intelligence API for content tagging,
- Storage for storing creative assets (images and videos),
- BigQuery for storing content tagging output.
Basic requirements for setup include:
- Python 2.7
- Google Cloud SDK
Full requirements can be found in `requirements.txt` and will be installed for you through setup scripts.
1. (Optional) If you would like a virtual environment set up for you:
   `source presetup_virtualenv.sh`
2. Install dependencies, configure the Google Cloud SDK environment, and obtain credentials:
   `./setup_environment.sh`
   This step should generate a `client_secrets.json` file in your current directory.
3. Create a jobfile in `jobs/` using `sample.yaml` as a reference. A few tips:
   - Common acronyms used in this file: `gcp` = Google Cloud Platform, `gcs` = Google Cloud Storage, `bq` = BigQuery.
   - All entities referenced in this file must already exist, e.g. the `data_destination.bq_dataset` will not be created for you.
   - In `creative_source_details`, you can delete the sections for the sources you're not using to avoid confusion.
   - If using CM, the `cm_profile_id` must refer to a User Profile matching the email in `client_secrets.json`. For more information, see the FAQ below.
4. Run the job:
   `python run.py jobs/JOBFILE.yaml [--limit 50] [--local]`
   - Use the optional `--limit` flag to cap the number of creatives for testing.
   - Use the optional `--local` flag to run on your local machine for testing (not fully supported).
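To make the shape of the run command concrete, here is a minimal `argparse` sketch that accepts the same arguments as the usage line above. It is an assumption about how such a command line could be parsed, not a copy of what `run.py` does; the `build_parser` name and the help strings are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring: run.py JOBFILE [--limit N] [--local]."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("jobfile", help="path to a jobfile, e.g. jobs/JOBFILE.yaml")
    parser.add_argument("--limit", type=int, default=None,
                        help="cap the number of creatives (for testing)")
    parser.add_argument("--local", action="store_true",
                        help="run on the local machine instead of Dataflow")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["jobs/JOBFILE.yaml", "--limit", "50", "--local"])
```

Both flags are optional: omitting `--limit` leaves `args.limit` as `None`, and omitting `--local` leaves `args.local` as `False`.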
How do I pull creatives from DV360 rather than Campaign Manager?
Your DV360 creatives may be accessible from Campaign Manager; work with your Google account manager or gTech contact.
When using Campaign Manager as a source, how do I fill out `cm_profile_id`?
Any request to Campaign Manager must be done through a User Profile (see this Help Center article for more information). Oculi requires its own User Profile, tied to the service account (robot email account) which will be used to run this pipeline. If you don't have access to Campaign Manager, you'll need a trafficker with access to create this for you.
To confirm which email should be used, check inside `client_secrets.json` after it has been generated by `setup_environment.sh`. The desired field is `client_email`.
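If you prefer not to open the file by hand, the lookup can be scripted. The sketch below assumes `client_email` is a top-level key of the JSON file, as it is in standard service account key files; the function name `service_account_email` is hypothetical.

```python
import json


def service_account_email(path="client_secrets.json"):
    """Return the client_email field from the generated credentials file."""
    with open(path) as f:
        secrets = json.load(f)
    # Assumption: client_email sits at the top level of the JSON object.
    return secrets["client_email"]
```

The returned address is the one the Campaign Manager User Profile should be tied to.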
How do I pull creatives from Google Ads, YouTube, or some other source?
There are two ways to extend this pipeline for sources other than CM. The simplest method is to extract creatives from another source separately, generating either a list of URLs to use as a BigQuery source, or a collection of files to use as a GCS source.
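For the BigQuery route, it can help to check the extracted rows against the required columns before loading them into a table. The following is a rough sketch under that assumption; `validate_rows` and its error messages are hypothetical, and how you load the rows into BigQuery is left to your own tooling.

```python
import numbers

# Column names required for a BigQuery table of URLs.
REQUIRED_COLUMNS = ("Creative_ID", "Advertiser_ID", "Creative_Name", "Full_URL")


def validate_rows(rows):
    """Check each row dict against the required columns; return a list."""
    seen_ids = set()
    checked = []
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError("missing columns: %s" % ", ".join(missing))
        # Creative_ID and Advertiser_ID must be integers.
        for col in ("Creative_ID", "Advertiser_ID"):
            if not isinstance(row[col], numbers.Integral):
                raise ValueError("%s must be an integer" % col)
        # Creative_ID must be unique across the table.
        if row["Creative_ID"] in seen_ids:
            raise ValueError("duplicate Creative_ID: %d" % row["Creative_ID"])
        seen_ids.add(row["Creative_ID"])
        checked.append(row)
    return checked
```

This check does not verify that each `Full_URL` is reachable without login; that still needs to be confirmed separately.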
If you want to fully integrate your creative extraction into the pipeline, another option is to write a Python function to generate an in-memory list of URLs. This function can then be used to start the pipeline; see the current CM extraction method in `main.py` and `cm_helper.py` for an example. Since this method requires an in-memory list, we don't recommend it for massive volume.
The pipeline finished, but I see fewer creatives than expected.
There are many potential causes; the easiest way to investigate is by clicking on specific nodes in the Dataflow job graph and looking at the Input and Output collections. If you see a dropoff...
- ... in the Output of Pull creatives from CM: creatives were skipped because they were outside the date range, or because they were in unsuitable formats (e.g. tracking pixels). Note that the `--limit` option is applied before these filters, so you may end up with fewer creatives than your limit.
- ... between Input and Output of Copy assets to GCS: creatives failed to download. Oculi attempts to find an asset URL for every creative (see `cm_helper.py`), but this method is imperfect and fails for older creatives in particular.
- ... between Input and Output of an Extract step: creatives were filtered out because they weren't relevant for this table, e.g. only creatives with faces detected will end up in the `face_annotations` table.
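The effect of applying `--limit` before the format filters can be seen in a toy example. The function and field names below are made up for illustration and do not come from Oculi's code.

```python
def pull_creatives(creatives, limit, is_suitable):
    """Apply the limit first, then drop unsuitable creatives."""
    limited = creatives[:limit]  # the cap is applied before filtering
    return [c for c in limited if is_suitable(c)]


# Hypothetical creatives: one of the first three is a tracking pixel.
creatives = [
    {"id": 1, "format": "image"},
    {"id": 2, "format": "tracking_pixel"},
    {"id": 3, "format": "video"},
    {"id": 4, "format": "image"},
]

result = pull_creatives(
    creatives,
    limit=3,
    is_suitable=lambda c: c["format"] in ("image", "video"),
)
# Only 2 creatives survive (ids 1 and 3), even though the limit was 3.
```

If you need a specific number of creatives in a test run, set the limit somewhat higher than your target.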
`setup_environment.sh` is failing, complaining about version mismatches and/or `EnvironmentError`s.
Try using a virtual environment (step 1) to avoid any mismatches between the versions of libraries you already have installed on your system vs. the versions specified in `requirements.txt`.
- Oly Bhaumik (sisrikshab@google.com)
- Christopher Bian (cbian@google.com)
- Aritra Biswas (aritrab@google.com)
- David Letts (dgletts@google.com)
- Eddie Ye (edwardye@google.com)
Footnotes
1. The term "creative" is used to indicate just the image or video content in an ad, excluding other components like targeting preferences.
2. This excludes dynamic creatives. The goal of this pipeline is to break images and videos down into their content components; dynamic creatives are already broken into content components. Instead, analysis of the data from this pipeline can be used to inform a dynamic creative strategy.