Useful approaches, practices and tools used in solving greenfield computer vision problems

The Challenge

In the xView2 challenge, competitors are asked to identify damaged buildings in satellite imagery following a natural disaster.
Two images: before and after.
Find building footprints using the 'before' image.
Classify each of those footprints into one of four damage categories based on the 'after' image.

Before	After

This problem is an analytical bottleneck in disaster response, and solving it has imminent real world benefits.
It's feasible: there is sufficient high quality annotated data.
It's really interesting from a technical perspective: at first glance it seems simple, but there are a lot of potentially valid approaches.
It's very well supported and positioned to be implemented after the competition.

Cf.: Understanding Clouds from Satellite Images.
- Poor annotator agreement.
- Indirectly applicable.
- Annotation requires lots of finicky manual processing.

Semantic	Instance

Semantic segmentation: some tricks with upscaling and pooling across layers, but mostly pretty simple – encoder + decoder setup.
Instance segmentation: this came out of object detection (bounding boxes), and is a little crazy. 1000s of overlapping potential bounding boxes are proposed by one RoI (region-of-interest) network, before each one is pooled into a fixed size representation and classified/adjusted. Lots of post-processing required to handle the overlapping boxes, and an absurd number of hyperparams. There are much nicer solutions (e.g. [Objects as points].)

Semantic segmentation on 'before' image to define regions of interest.
RoI-pooling + classification on 'after' images.
Why semantic segmentation on 'before' image? Much better recall: instance segmentation comes from the world of e.g. CoCo – where evaluation is measured by the number of 'good enough' masks. This task is evaluated on the pixel level.
- In addition, we can – one of the issues with instance segmentation is going from mask -> individual instances. We're evaluated at pixel level so this isn't essential.
Why RoI pooling/object detection approach on the second image? This corresponds with the way that the images are labeled e.g. each polygon is classified into one of several categories.
- It also addresses the labeling drift – bounding boxes are much closer than the masks.
- Alternative: segmentation. A little more finicky to set-up.
Heavy data augmentation, half-precision training, Linknet segmentation with efficientnet backbone.

All experiments managed using Wandb.
- Config management.
- Models saved to cloud.
- Experimental management.
- Recreating runs.
- Easy to setup.
Try and move from big to little picture. E.g. architecture -> loss -> model, etc.
Be aware of metric – f-score is sensitive to threshold and noise – loss is more stable, but less interpretable.
Find minimum experimental setup: e.g. performance @ 512x512 resolution generalizes to 1024x1024 (and trains 4x faster).
Notebook + lean codebase (frequently re-iterate).

Wandb.
Apex – half-precision/mixed-precision training from NVIDIA.
Segmentation Models Pytorch: lightweight, nice API for segmentation.
Detectron2: heavy-weight, more deeply integrated, harder to use – far more features and state of the art models. Complexity partly required by much more complicated approach.

Prioritising single metric over maintainability, readability, performance, generalisability etc.
Great for learning new skills.

Solving the right problem.
In the right way.
Lean experimental setup.
Lightweight codebase – the right balance between saving time now or saving time later.