This repository contains the official inference and training implementation for the paper:
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Ali Athar*, Sabarinath Mahadevan*, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe
ECCV 2020 | Paper | Video | Project Page
- Python 3.7
- PyTorch 1.4, 1.5 or 1.6
- OpenCV, numpy, imgaug, pillow, tqdm, pyyaml, tensorboardX, scipy, pycocotools (see `requirements.txt` for exact versions in case you encounter issues)
- Clone the repository and append it to the `PYTHONPATH` variable:

      git clone https://github.com/sabarim/STEm-Seg.git
      cd STEm-Seg
      export PYTHONPATH=$(pwd):$PYTHONPATH
- Download the required datasets from their respective websites and the trained model checkpoints from the given links. For inference, you only need the validation sets of the target dataset. For training, the table below shows which dataset(s) you will need:

  | Target Dataset | Datasets Required for Training | Model Checkpoint |
  |----------------|--------------------------------|------------------|
  | DAVIS          | DAVIS'17, YouTube-VIS, COCO Instance Segmentation, PascalVOC | link |
  | YouTube-VIS    | YouTube-VIS, COCO Instance Segmentation, PascalVOC | link |
  | KITTI-MOTS     | Mapillary images, KITTI-MOTS, sequence `0002` from MOTSChallenge | link |
File paths to datasets and model checkpoints are configured using environment variables.
- `STEMSEG_JSON_ANNOTATIONS_DIR`: To streamline the code, we reorganized the annotations and file paths for every dataset into a standard JSON format. These JSON files can be downloaded from here. Set this variable to the directory holding these JSON files.
- `STEMSEG_MODELS_DIR`: Base directory where models are saved to by default. Only required for training. You can initially point this to any empty directory.
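As a quick sanity check, a short script like the one below (not part of the repository; variable names taken from this README) can confirm that these variables are visible to Python before you launch anything:

```python
import os

# Variables described above; extend this list with the dataset-specific
# variables listed further below (e.g. DAVIS_BASE_DIR), depending on your target dataset.
required_vars = [
    "STEMSEG_JSON_ANNOTATIONS_DIR",
    "STEMSEG_MODELS_DIR",  # only needed for training
]

missing = [name for name in required_vars if not os.getenv(name)]
if missing:
    raise RuntimeError("Missing environment variables: " + ", ".join(missing))
print("All required environment variables are set.")
```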
For inference, you only need to set the relevant variable for the target dataset. For training, since multiple datasets are used, multiple variables will be required (as mentioned below).
- `DAVIS_BASE_DIR`: Set this to the full path of the `JPEGImages/480p` directory for the DAVIS dataset. The image frames for all 60 training and 30 validation videos should be present in the directory. This variable is required for training/inference on DAVIS'19 Unsupervised.
- `YOUTUBE_VIS_BASE_DIR`: Set this to the parent directory of the `train` and `val` directories for the YouTube-VIS dataset. This variable is required for training/inference on YouTube-VIS and also for training for DAVIS.
- `KITTIMOTS_BASE_DIR`: Set this to the `images` directory which contains the directories holding images for each video sequence.
- `COCO_TRAIN_IMAGES_DIR`: Set this to the `train2017` directory of the COCO instance segmentation dataset. Remember to use the 2017 train/val split and not the 2014 one. This variable is required for training for DAVIS and YouTube-VIS.
- `PASCAL_VOC_IMAGES_DIR`: Set this to the `JPEGImages` directory of the PascalVOC dataset. This variable is required for training for DAVIS and YouTube-VIS.
- `MAPILLARY_IMAGES_DIR`: You will need to do two extra things here: (i) put all the training and validation images into a single directory (18k + 2k = 20k images in total); (ii) since Mapillary images are very large, we first down-sampled them. The expected size for each image is given in `stemseg/data/metainfo/mapillary_image_dims.json` as a dictionary from the image file name to a (width, height) tuple. Please use OpenCV's `cv2.resize` method with `interpolation=cv2.INTER_LINEAR` to ensure the best consistency between your down-sampled images and the annotations we provide in our JSON file (see the sketch after this list). This variable is required for training for KITTI-MOTS.
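As referenced in the `MAPILLARY_IMAGES_DIR` item, here is a minimal down-sampling sketch. It is not part of the repository: the source path is a placeholder, and it assumes the JSON file maps each file name to its expected (width, height) as described above:

```python
import json
import os

import cv2

src_dir = "/path/to/original/mapillary/images"  # placeholder: the merged 20k train+val images
dst_dir = os.environ["MAPILLARY_IMAGES_DIR"]    # down-sampled images go here

# Expected (width, height) per image file name, shipped with the repository.
with open("stemseg/data/metainfo/mapillary_image_dims.json") as f:
    expected_dims = json.load(f)

os.makedirs(dst_dir, exist_ok=True)
for filename, (width, height) in expected_dims.items():
    image = cv2.imread(os.path.join(src_dir, filename), cv2.IMREAD_COLOR)
    # cv2.resize expects the target size as (width, height).
    resized = cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR)
    cv2.imwrite(os.path.join(dst_dir, filename), resized)
```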
Assuming the relevant dataset environment variables are correctly set, just run the following commands:
- DAVIS:

      python stemseg/inference/main.py /path/to/downloaded/checkpoints/davis.pth -o /path/to/output_dir --dataset davis

- YouTube-VIS:

      python stemseg/inference/main.py /path/to/downloaded/checkpoints/youtube_vis.pth -o /path/to/output_dir --dataset ytvis --resize_embeddings

- KITTI-MOTS:

      python stemseg/inference/main.py /path/to/downloaded/checkpoints/kitti_mots.pth -o /path/to/output_dir --dataset kittimots --max_dim 1948
For each dataset, the output written to `/path/to/output_dir` will be in the format required by that dataset's official evaluation tool. To obtain visualizations of the generated segmentation masks, you can add a `--save_vis` flag to the above commands.
- Make sure the required environment variables are set as mentioned in the above sections.
- Run `mkdir $STEMSEG_MODELS_DIR/pretrained` and place the pre-trained backbone file in this directory.
- Optional: To verify if the data loading pipeline is correctly configured, you can separately visualize the training clips by running `python stemseg/data/visualize_data_loading.py` (see `--help` for the list of options).
For DAVIS, the final inference reported in the paper is done using clips of length 16 frames. Training end-to-end with such long clips requires too much GPU VRAM, however, so we train in two steps:
- First, we train end-to-end with 8-frame clips:

      python stemseg/training/main.py --model_dir some_dir_name --cfg davis_1.yaml

- Then, we freeze the encoder network (backbone and FPN) and train only the decoders with 16-frame clips:

      python stemseg/training/main.py --model_dir another_dir_name --cfg davis_2.yaml --initial_ckpt /path/to/last/ckpt/from/previous/step.pth
The training code creates a directory at `$STEMSEG_MODELS_DIR/checkpoints/DAVIS/some_dir_name` and places all checkpoints and logs for that training session inside it. For the second step we want to restore the final weights from the first step, hence the additional `--initial_ckpt` argument.
For YouTube-VIS, the final inference was done on 8-frame clips, so the model can be trained in a single step:

    python stemseg/training/main.py --model_dir some_dir_name --cfg youtube_vis.yaml

The training output for this will be placed in `$STEMSEG_MODELS_DIR/checkpoints/youtube_vis/some_dir_name`.
For KITTI-MOTS as well, the final inference was done on 8-frame clips, but we trained in two steps:
- First, on augmented images from the Mapillary dataset:

      python stemseg/training/main.py --model_dir some_dir_name --cfg kitti_mots_1.yaml

- Then, on the KITTI-MOTS dataset itself:

      python stemseg/training/main.py --model_dir another_dir_name --cfg kitti_mots_2.yaml --initial_ckpt /path/to/last/ckpt/from/previous/step.pth
For this step, we included video sequence `0002` from the MOTSChallenge training set in our training set. Simply copy the images directory for this video to `$KITTIMOTS_BASE_DIR` and rename the directory to `0050` (this is done because a video named `0002` already exists in KITTI-MOTS).
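For reference, a minimal sketch of that copy step (the MOTSChallenge path is a placeholder):

```python
import os
import shutil

# Placeholder: wherever the MOTSChallenge images for sequence 0002 live on your machine.
motschallenge_0002 = "/path/to/MOTSChallenge/train/images/0002"

# Copy the sequence into the KITTI-MOTS images directory under the new name 0050.
destination = os.path.join(os.environ["KITTIMOTS_BASE_DIR"], "0050")
shutil.copytree(motschallenge_0002, destination)
```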
- In general, you will need at least 16GB VRAM for training any of the models. The VRAM requirement can be lowered by reducing the image dimensions in the config YAML file (`INPUT.MIN_DIM` and `INPUT.MAX_DIM`). Alternatively, you can also use mixed precision training by installing Nvidia apex and setting the `TRAINING.MIXED_PRECISION` option in the config YAML to true. In general, both these techniques will reduce performance.
- Multi-GPU training is possible and has been implemented using `torch.nn.parallel.DistributedDataParallel` with one GPU per process. To utilize multiple GPUs, the above commands have to be modified as follows:

      python -m torch.distributed.launch --nproc_per_node=<num_gpus> stemseg/training/main.py --model_dir some_dir_name --cfg <dataset_config.yaml> --allow_multigpu

- You can visualize the training progress using tensorboard by pointing it to the `logs` sub-directory in the training directory.
- By default, checkpoints are saved every 10k iterations, but this frequency can be modified using the `--save_interval` argument.
- It is possible to terminate training and resume from a saved checkpoint by using the `--restore_session` argument and pointing it to the full path of the checkpoint.
- We fix all random seeds prior to training, but the results reported in the paper may not be exactly reproducible when you train the model on your own.
- Run `python stemseg/training/main.py --help` for the full list of options.
Extending the training/inference to other datasets should be easy since most of the code is dataset agnostic.
For inference on a new dataset, see the if/else block in the `main` method in `inference/main.py`. You will just have to implement a class that converts the segmentation masks produced by the framework into whatever format you want (see any of the scripts in `stemseg/inference/output_utils` for examples).
For training on a new dataset, you will first have to convert the annotations for your dataset to the standard JSON format used by this code. Inspect any of the given JSON files to see what the format should look like. The segmentation masks are encoded in RLE format using pycocotools, as shown in the example below. To better understand the file format, you can also see `stemseg/data/generic_video_dataset_parser.py` and `stemseg/data/generic_image_dataset_parser.py`, where these files are read and parsed.
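For example, a binary instance mask can be converted to (and from) the RLE representation with pycocotools as follows; the toy mask below is only for illustration, and the exact JSON field names should be checked against the provided annotation files:

```python
import numpy as np
from pycocotools import mask as mask_utils

# Toy binary mask (height x width); in practice this comes from your dataset's annotations.
binary_mask = np.zeros((480, 854), dtype=np.uint8)
binary_mask[100:200, 300:500] = 1

# pycocotools expects a Fortran-contiguous uint8 array.
rle = mask_utils.encode(np.asfortranarray(binary_mask))

# Round-trip check: decoding the RLE recovers the original mask.
assert np.array_equal(mask_utils.decode(rle), binary_mask)

# The 'counts' field is bytes; convert it to a string before writing it to a JSON file.
rle_for_json = {"size": rle["size"], "counts": rle["counts"].decode("utf-8")}
```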
Once this is done, you can utilize the `VideoDataset` API in `stemseg/data/video_dataset.py` to do most of the pre-processing and augmentations. You just have to inherit this class and implement the `parse_sample_at` method (see `stemseg/data/davis_data_loader.py` for an example of how to do this).
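The skeleton below only illustrates the subclassing pattern; the constructor arguments and the exact signature and return format of `parse_sample_at` are not reproduced here and should be copied from `stemseg/data/davis_data_loader.py`:

```python
from stemseg.data.video_dataset import VideoDataset


class MyVideoDataset(VideoDataset):
    """Hypothetical dataset wrapper for a new video dataset."""

    def parse_sample_at(self, idx):  # signature is illustrative; mirror the DAVIS loader
        # Load the frames and per-instance masks for the idx-th sample from your
        # JSON annotations and return them in the structure expected by the base
        # class (see stemseg/data/davis_data_loader.py for the actual format).
        raise NotImplementedError
```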
Use the following BibTeX to cite our work:
@inproceedings{Athar_Mahadevan20ECCV,
title={STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos},
author={Athar, Ali and Mahadevan, Sabarinath and O{\v{s}}ep, Aljo{\v{s}}a and Leal-Taix{\'e}, Laura and Leibe, Bastian},
booktitle={ECCV},
year={2020}
}
