The TV News Ingest Pipeline is a series of scripts designed to extract data and
metadata from videos (specifically broadcast news). The pipeline is intended for
use with the TV News Viewer, where this module extracts (and formats) the data
for the viewer to load, but it can also be used on its own and produces
human-readable information from the videos it processes. If you intend to use
the pipeline outputs for the TV News Viewer, be sure to read
`TVNEWS_VIEWER_README.md` in the `docs` directory.
The usage of the pipeline is explained further in the coming sections, but the idea is to take in a video file (or batch of video files) and seamlessly run a series of operations on the videos, like detecting faces, identifying celebrities, classifying genders, and more. At the end of the pipeline, there will be outputs for each video in the specified output directory in an easy-to-use structure and format.
Note: the TV News pipeline is only "officially" supported on Linux; however, advanced users can attempt to run it on macOS.
- Install Python 3 (requires Python 3.6 or up).
- Install Rust (specifically with [rustup](https://rustup.rs)).
- Clone this repository. The following instructions all take place within this repo.
- Run `./install_deps.sh`. This will install submodule dependencies and install Python dependencies with `pip3`. Installing Gentle (one of the dependencies) is a lengthy process and could take upwards of 40 minutes. It in turn will install many other dependencies, so if you wish to do a more manual installation, you can attempt to follow the chain of installation scripts and execute the commands yourself.
- If you want to enable celebrity face identification using AWS, you will need to set up an account with AWS and add your credentials to a `config.yml` file (see Configuration). Learn more at https://docs.aws.amazon.com/rekognition/latest/dg/setting-up.html.
- If you plan to use the TV News Viewer, please follow the instructions listed in `TVNEWS_VIEWER_README.md`.
I highly encourage you to read the rest of this document before getting started, but this section details a simple use case on the sample videos we've provided. Those videos and their captions are located within the `examples` directory and are licensed by MSNBC under the Creative Commons Attribution license (reuse allowed).
To begin, make sure you are in the root of the `tv-news-ingest-pipeline` repo.

We will first create a configuration file to make our lives a bit easier while we run these commands (you can read more in the Configuration section). Copy over the `config.yml` file that we've provided in `examples` by running:

```
cp examples/config.yml ./
```
If you haven't set up Amazon's Rekognition service, you are going to want to
disable face identification. Open up `config.yml` in your favorite editor
and uncomment the lines as specified below:

```yaml
disable:
# - face_component
# - black_frames
- identities
- identity_propogation
# - genders
# - captions_copy
# - caption_alignment
# - commercials
```
This will tell the pipeline to disable those two stages, without having to pass in extra command line arguments specifying what to disable.
Now, we will create a batch file containing all of the video file paths for the videos we want to process. We will do the same for the captions as well. Run:

```
ls examples/sample_videos/* > sample_batch.txt
ls examples/sample_captions/* > sample_batch_captions.txt
```
Feel free to inspect either of these files to better understand what goes into a batch file.
Now we will run the pipeline on the sample videos and have it output into
the directory `sample_output`. Run:

```
python3 pipeline.py sample_batch.txt --captions=sample_batch_captions.txt sample_output
```
And that's it! If everything was installed correctly, this should finish with
all of the outputs in `sample_output` for you to inspect.
You can follow the instructions in `TVNEWS_VIEWER_README.md` for how to set up a local web app from these outputs.
As mentioned in the intro, the TV News Ingest Pipeline is a script that systematically takes a collection of videos (and video captions) through various analyses that take place in separate components. The outputs of the pipeline are the outputs of each component (images, JSON files, SRT files) in an organized and easy-to-use format. These files can be used however you'd like, but scripts are included in this repo to turn these outputs into the data necessary to power the TV News Viewer (again, read `TVNEWS_VIEWER_README.md` for more information).
In order to easily take the videos through each stage of the pipeline, the output directory is used to communicate the videos that are being processed and which stages have been completed. For this reason, it is very important that you don't place any other files or folders in the output directory that are not created by the pipeline, as this may affect results. Further, it is recommended that a new output directory be specified for each batch of videos (by "new", we mean empty).
For a more visual overview of what the pipeline does check out our flowchart.
- Videos can be located anywhere on the filesystem.
- Make sure you have the necessary capabilities to run each component that is active (read the Components section for more information).
To run for just a single video with path `my_video.mp4`, for example, run:

```
python3 pipeline.py my_video.mp4 output_dir
```
This will produce the following output:
```
output_dir/
└── my_video.mp4
    ├── alignment_stats.json        # info about the caption alignment
    ├── bboxes.json                 # bounding boxes per face
    ├── black_frames.json           # black frame locations
    ├── captions.srt                # time-aligned captions
    ├── captions_orig.srt           # original captions (copied over)
    ├── commercials.json            # detected commercial intervals
    ├── crops                       # cropped images of each face
    │   ├── 0.png
    │   ├── 1.png
    │   └── ...
    ├── embeddings.json             # FaceNet embeddings for each face
    ├── genders.json                # male/female gender per face
    ├── identities.json             # celebrity identities per identified face
    ├── identities_propogated.json  # identities propagated to unlabeled faces
    └── metadata.json               # number of frames, fps, name, width, height
```
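As a quick sanity check on a finished run, you can load a few of these JSON files directly. The sketch below is only an illustration that assumes the layout shown above; the exact keys inside `metadata.json` may differ slightly from what your run produces.

```python
import json
from pathlib import Path

# Hypothetical output directory for a single processed video.
video_dir = Path("output_dir/my_video.mp4")

# metadata.json holds basic video properties (frame count, fps, name, dimensions).
with open(video_dir / "metadata.json") as f:
    metadata = json.load(f)
print("metadata:", metadata)

# bboxes.json is a list of [face_id, info] pairs (see the Components section).
with open(video_dir / "bboxes.json") as f:
    bboxes = json.load(f)
print(f"{len(bboxes)} faces detected")

# Count the cropped face images written by the face crops component.
num_crops = len(list((video_dir / "crops").glob("*.png")))
print(f"{num_crops} face crops saved")
```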
The more common usage of the pipeline, however, is to process videos in larger batches. In order to do so, the scripts take as input a text file of video file paths. For example, `batch_videos.txt`:

```
path/to/my_video1.mp4
different/path/to/my_video2.mp4
```

Then run:

```
python3 pipeline.py batch_videos.txt output_dir
```
This will produce the same output as with a single video, but with one subdirectory per video in the batch:

```
output_dir/
├── my_video1
│   └── ...   # same as in the example in the previous section
└── my_video2
    └── ...
```
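A batch file is just a plain list of paths, so you can generate one however you like. Here is a minimal sketch; the `media` directory and glob pattern are only illustrative.

```python
from pathlib import Path

# Collect every .mp4 under a hypothetical "media" directory (adjust to your layout).
video_paths = sorted(Path("media").rglob("*.mp4"))

# A batch file is simply one video path per line.
with open("batch_videos.txt", "w") as f:
    for path in video_paths:
        f.write(f"{path}\n")

print(f"Wrote {len(video_paths)} paths to batch_videos.txt")
```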
If you want to run any of the pipeline components as individual scripts, use the `-s, --script` option. For instance, if you want to run just the face component, run:

```
python3 pipeline.py batch.txt output_dir --script=face_component
```

or if you want to run just the gender classification, run:

```
python3 pipeline.py batch.txt output_dir --script=genders
```
Note, however, that you are responsible for making sure the requisite inputs
exist for the component you would like to run (`embeddings.json` for the gender
classifier, for example).
Captions can be specified with the `--captions` option either as a single path
to the `.srt` file or as a batch text file, just as with videos. Make sure that
the filename format of the captions is `<video_name>.srt` so that each caption
file can be matched with its video. Captions are not required in order to
perform the rest of the video processing.

Run:

```
python3 pipeline.py my_video.mp4 --captions=my_video.srt output_dir
```

or

```
python3 pipeline.py batch_videos.txt --captions=batch_captions.txt output_dir
```
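Because captions are matched to videos purely by filename, it can help to confirm that every video in a batch has a correspondingly named `.srt` file before kicking off a long run. A small sketch; the batch file name and captions directory are illustrative.

```python
from pathlib import Path

# Illustrative paths: the video batch file and a directory holding the .srt files.
video_paths = [Path(line.strip()) for line in open("batch_videos.txt") if line.strip()]
caption_dir = Path("captions")

missing = []
for video in video_paths:
    srt = caption_dir / f"{video.stem}.srt"  # captions must be named <video_name>.srt
    if not srt.exists():
        missing.append(video.name)

if missing:
    print("Videos without matching captions:", ", ".join(missing))
else:
    print("Every video has a matching caption file.")
```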
You might not want to run all pipeline components on all videos or all the time.
In this case, components can be disabled with the `-d` or `--disable` options.
Currently, the components that can be disabled are:

- `face_component` (skips the entire face detection, face embeddings, and face crops component)
- `black_frames` (skips black frame detection)
- `identities` (skips face identification with AWS)
- `identity_propogation` (skips propagating identities to similar unlabeled faces)
- `genders` (skips gender classification)
- `captions_copy` (skips copying over captions to output)
- `caption_alignment` (skips time-aligning the video captions)
- `commercials` (skips detecting commercials)
If you wanted to skip detecting black frames and detecting commercials, for instance, you would run:

```
python3 pipeline.py batch_videos.txt output_dir --disable black_frames commercials
```
If you run the pipeline once with certain features disabled, and then want to add them back later, you can rerun the pipeline with the same output directory and it will attempt to redo only what is necessary for the missing outputs.
In addition to the options mentioned previously, there are several other
useful options and flags, all of which can be listed with `python3 pipeline.py --help`.

- `-i, --init-run`: Skips checks for existing or cached outputs. Specify this if it is the first time you are running the script on the set of videos.
- `-f, --force`: Forces recomputation of all outputs, overwriting existing ones.
- `-p, --parallel`: Runs the tasks that depend on the outputs of `face_component` in parallel with the tasks that are independent of those outputs. (Note: the output of these two paths will not be synchronized, and so will overlap.)
Further options can be configured in the config file, including things like
credentials for the AWS identification service. A sample is provided in this
repo here. The current configuration options are:

- `stride`: the interval in seconds between frames in which to look for faces.
- `montage_width`: the number of columns in the face image montage sent to AWS.
- `montage_height`: the number of rows in the face image montage sent to AWS.
- `aws_access_key_id`: your Amazon Rekognition access key ID.
- `aws_secret_access_key`: your Amazon Rekognition secret access key.
- `aws_region`: the region name that you'd like to make your queries to.
- `disable`: a list of the named components to disable.
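As a rough illustration of how these keys fit together, the sketch below loads a `config.yml` with PyYAML and reports what the run would use. The key names follow the list above, but the defaults shown are illustrative only and the pipeline's actual loading code may differ.

```python
import yaml  # PyYAML; assumed to be available since the pipeline reads YAML config

with open("config.yml") as f:
    config = yaml.safe_load(f) or {}

# Key names follow the list above; the fallback values here are illustrative only.
stride = config.get("stride", 3)
montage_cols = config.get("montage_width", 6)
montage_rows = config.get("montage_height", 4)
disabled = set(config.get("disable") or [])

print(f"stride={stride}s, montage={montage_cols}x{montage_rows} (cols x rows)")
print("disabled components:", ", ".join(sorted(disabled)) or "none")
```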
Here we describe the components in some detail. Not all components will be relevant for your use case, so be sure to disable what you don't need. Additionally, users are encouraged to write their own components that provide the same or a similar interface to the existing components (it is not difficult to figure out how to add your own if you look in `pipeline.py`).
The face component consists of face detection, computing FaceNet embeddings, and extracting face image crops from frames. These are grouped together to reduce the overhead of decoding video frames.
The face detection component outputs a list of bounding boxes, in
`bboxes.json`, one for each face detected. It is simply a JSON list of face
IDs paired with the frame number they were located in and the bounding box
information. Here is an example of one element in the list (where `27` is the
ID of the face; face IDs count up from 0 per video):

```json
[27, {"frame_num": 5580, "bbox": {"y2": 0.6054157018661499, "x2": 0.28130635619163513, "y1": 0.3943322002887726, "score": 0.9999775886535645, "x1": 0.1300494372844696}}]
```
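The coordinates in the example appear to be fractions of the frame, so converting them to pixels requires the width and height from `metadata.json`. A minimal sketch, assuming those keys are named `width` and `height`:

```python
import json

video_dir = "output_dir/my_video.mp4"
with open(f"{video_dir}/metadata.json") as f:
    meta = json.load(f)
with open(f"{video_dir}/bboxes.json") as f:
    bboxes = json.load(f)

face_id, info = bboxes[0]
box = info["bbox"]

# The bbox values look normalized, so scale by the frame dimensions for pixels.
x1, x2 = box["x1"] * meta["width"], box["x2"] * meta["width"]
y1, y2 = box["y1"] * meta["height"], box["y2"] * meta["height"]
print(f"face {face_id} in frame {info['frame_num']}: "
      f"({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f}), score {box['score']:.3f}")
```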
The face embeddings component outputs a list of FaceNet embeddings, in
`embeddings.json`, one for each face detected. It too is a JSON list of face
IDs paired with the embedding vector of that face. Here is an example of one
element in the list (where 13 is the ID of the face; the embeddings are quite
long, so it's been truncated here):

```json
[13, [-0.061142485588788986, ... , -0.007883959449827671]]
```
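These vectors are mainly useful for comparing faces: two crops of the same person should produce embeddings that are close together. A quick illustration of comparing two faces by cosine similarity (the face IDs chosen here are arbitrary examples):

```python
import json
import math

with open("output_dir/my_video.mp4/embeddings.json") as f:
    embeddings = {face_id: vec for face_id, vec in json.load(f)}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Compare the first two detected faces (IDs 0 and 1, if both exist).
# A value close to 1 suggests the two crops are of the same person.
sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"similarity between face 0 and face 1: {sim:.3f}")
```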
The face crops component outputs one image file per face detected; these
reside in the `crops` directory and are named `<face_id>.png`. This image is
the crop defined by the bounding box of the face (dilated a bit to give more
room along the edges). This output can be quite large, so be sure to
delete the `crops` folder afterward if you aren't planning on using the face
crops. (They are required for face identification, but if you don't want to run
that component either, you should disable `face_crops`.)
This component detects black frames in the video (currently just for use in
commercial detection). It outputs a list of the frame numbers of detected
black frames, in `black_frames.json`. For example:

```json
[22205, 22206, 22207, 23105, ..., 101293, 101294, 101295]
```
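Since the output is frame numbers rather than timestamps, converting to seconds requires the video's frame rate from `metadata.json`. A small sketch, assuming that file exposes an `fps` field:

```python
import json

video_dir = "output_dir/my_video.mp4"
with open(f"{video_dir}/metadata.json") as f:
    fps = json.load(f)["fps"]  # assumed key name; check your metadata.json
with open(f"{video_dir}/black_frames.json") as f:
    black_frames = json.load(f)

# Print the approximate timestamp of the first few detected black frames.
for frame_num in black_frames[:5]:
    seconds = frame_num / fps
    print(f"frame {frame_num} ~ {int(seconds // 60)}m{seconds % 60:04.1f}s")
```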
This component attempts to recognize known celebrity identities from the face
crop images. We currently use Amazon's Rekognition service to do so. In order
to set up this component you need to make an AWS account and set your
credentials in a `config.yml` file. You can view an example configuration file
here, and you can learn more about Amazon Rekognition here.

It outputs a list of celebrity identities, in `identities.json`, for the face
images it was able to identify. It is a JSON list of face IDs paired with their
guessed identity and a confidence score of the guess between 0 and 100. Here
is an example of one element in that list:

```json
[213, "George W. Bush", 100.0]
```
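Since each guess carries a confidence score, downstream code typically filters on it. A brief sketch of keeping only high-confidence labels and tallying them (the cutoff of 90 is illustrative, not a value used by the pipeline):

```python
import json
from collections import Counter

with open("output_dir/my_video.mp4/identities.json") as f:
    identities = json.load(f)

# Keep only labels with confidence of at least 90 (illustrative cutoff).
confident = [(face_id, name) for face_id, name, score in identities if score >= 90]

# Tally how many confidently labeled face crops each celebrity accounts for.
counts = Counter(name for _, name in confident)
for name, count in counts.most_common(5):
    print(f"{name}: {count} faces")
```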
This component attempts to further identify faces that the original face
identification failed to label, by comparing how similar unlabeled faces are to
those that were labeled and propagating those identities to the unidentified
faces if certain thresholds are satisfied. The output is in the same format as
for face identification, with these new identities appended to the end, in
`identities_propogated.json`.
This component attempts to classify binary gender from the computed FaceNet
embeddings. It outputs a list of guessed genders, in `genders.json`, one for
each face detected. It is a JSON list of face IDs paired with a guess of male
or female and a confidence score between 0 and 1. Here is an example of one
item from that list:

```json
[357, "F", 1.0]
```
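These guesses can be aggregated per video, for example as a rough tally of detected faces by gender (a simplification, since one person can appear in many frames). The confidence cutoff and the "M"/"F" labels are assumptions based on the example above:

```python
import json
from collections import Counter

with open("output_dir/my_video.mp4/genders.json") as f:
    genders = json.load(f)

# Count faces by predicted label ("M"/"F" assumed), ignoring low-confidence guesses.
counts = Counter(label for _, label, score in genders if score >= 0.9)
total = sum(counts.values()) or 1
for label, count in counts.most_common():
    print(f"{label}: {count} faces ({100 * count / total:.1f}%)")
```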
This component simply copies the specified captions file into the output
directory as `captions_orig.srt`. It is mainly useful for debugging purposes.
This component uses the Gentle forced aligner to time-align
the captions and the video's audio to give more accurate information about when
each word occurs in the video. It outputs the aligned captions file as
`captions.srt`, which is just like any other SRT file, but with one word per
line. Be aware that this component can take a long time to run.
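Because the aligned file keeps the standard SRT structure with one word per entry, it is straightforward to read back into (word, start, end) tuples. A rough sketch using only the standard library; it assumes well-formed `HH:MM:SS,mmm` timestamps and is not the parser the pipeline itself uses:

```python
import re

def parse_srt_words(path):
    """Yield (word, start_seconds, end_seconds) from a one-word-per-entry SRT file."""
    ts = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"

    def to_sec(h, m, s, ms):
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

    with open(path) as f:
        for block in f.read().strip().split("\n\n"):
            lines = block.splitlines()
            if len(lines) < 3:
                continue
            match = re.match(rf"{ts} --> {ts}", lines[1])
            if match:
                t = match.groups()
                yield lines[2], to_sec(*t[:4]), to_sec(*t[4:])

# Print the first few aligned words with their timings.
for word, start, end in list(parse_srt_words("output_dir/my_video.mp4/captions.srt"))[:10]:
    print(f"{start:8.2f} - {end:8.2f}  {word}")
```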
This component detects commercials through a combination of black frame
locations and caption details. This will not work for all videos, but for
our use case we have found that black frames occur before commercials, and the
captions tend to be non-existent during the commercials. It outputs a list of
intervals during which there are expected to be commercials, as
`commercials.json`. Each interval in the list is simply a tuple of the start
and end frame of the commercial.
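For example, the intervals can be combined with the frame rate to estimate how much of a broadcast is commercials. A small sketch, again assuming an `fps` field in `metadata.json`:

```python
import json

video_dir = "output_dir/my_video.mp4"
with open(f"{video_dir}/metadata.json") as f:
    fps = json.load(f)["fps"]  # assumed key name
with open(f"{video_dir}/commercials.json") as f:
    commercials = json.load(f)

# Each interval is a (start_frame, end_frame) pair.
total_seconds = sum((end - start) / fps for start, end in commercials)
print(f"{len(commercials)} commercial blocks, ~{total_seconds / 60:.1f} minutes of commercials")
```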