/vision

👀 Deep Learning for image and video


👀 Vision

https://kornia.github.io/

https://blog.miguelgrinberg.com/post/video-streaming-with-flask

Index

Part 1: Traditional CV

  • Finding Descriptors (SIFT, SURF, FAST, BRIEF, ORB, BRISK)
  • Image Stitching (Brute-Force, FLANN, RANSAC)

Resources


Image theory

Part 1: Traditional vision

Perspective transform

import cv2
import numpy as np
import matplotlib.pyplot as plt

paper = cv2.imread('./Photos/book.jpg')

pts1 = np.float32([ [219,209], [612,8], [380,493], [785,271] ]) # Corners you want to perspective-transform
pts2 = np.float32([     [0,0], [500,0],   [0,400], [500,400] ]) # Size of the transformed image

for val in pts1: cv2.circle(paper, (int(val[0]), int(val[1])), 5, (0,255,0), -1)  # Draw the source points

# Get transformation matrix M
M = cv2.getPerspectiveTransform(pts1, pts2)                 # When you manually pick exactly 4 point pairs
#M, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)  # When there are many matched points and some of them are outliers

dst = cv2.warpPerspective(paper, M, (500,400))
plt.imshow(dst)

Feature Detection and Description

|                  | SIFT | SURF | FAST | BRIEF | ORB | BRISK |
|------------------|------|------|------|-------|-----|-------|
| Year             | 1999 | 2006 | 2006 | 2010  | 2011 | 2011 |
| Feature detector | Difference of Gaussian | Fast Hessian | Binary comparison | - | FAST | FAST or AGAST |
| Spectra          | Local gradient magnitude | Integral box filter | - | Local binary | Local binary | Local binary |
| Orientation      | Yes | Yes | - | No | Yes | Yes |
| Feature shape    | Square | HAAR rectangles | - | Square | Square | Square |
| Feature pattern  | Square | Dense | - | Random point-pair pixel compares | Trained point-pair pixel compares | Trained point-pair pixel compares |
| Distance func.   | Euclidean | Euclidean | - | Hamming | Hamming | Hamming |
| Pros             | Accurate | Accurate | Fast (real time) | Fast (real time) | Fast (real time) | Fast (real time) |
| Cons             | Slow, patented | Slow, patented | Large number of points | Not scale or rotation invariant | Less scale invariant | Less scale invariant |

References

Image Stitching

Steps:

  1. Detecting keypoints (DoG, Harris, etc.) and extracting local invariant descriptors (SIFT, SURF, etc.) from two input images
  2. Matching the descriptors between the images (overlapping area)
  3. Using the RANSAC algorithm to estimate a homography matrix using our matched feature vectors
  4. Applying a warping transformation using the homography matrix obtained from Step #3
    • Apply perspective transformation on one image using the other image as a reference frame
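
A minimal OpenCV sketch of these four steps, assuming two overlapping images at the hypothetical paths left.jpg and right.jpg; ORB is used here instead of the patented SIFT/SURF detectors:

import cv2
import numpy as np

img1 = cv2.imread('left.jpg')   # image to be warped (hypothetical path)
img2 = cv2.imread('right.jpg')  # reference frame (hypothetical path)

# 1. Detect keypoints and extract local descriptors
orb = cv2.ORB_create(5000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# 2. Match descriptors (Hamming distance, since ORB descriptors are binary)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# 3. Estimate the homography with RANSAC from the matched points
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# 4. Warp one image onto the reference frame and paste the reference on top
h, w = img2.shape[:2]
panorama = cv2.warpPerspective(img1, H, (w * 2, h))
panorama[0:h, 0:w] = img2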

References

Motion and Optical Flow

http://datahacker.rs/013-optical-flow-using-horn-and-schunck-method/


Part 2: Deep Learning

https://arthurdouillard.com/deepcourse/

  • Convolutional Neural Network (CNN): For fixed-size ordered data, like images
    • Variable input size: use adaptive pooling, so the final layers become (see the sketch after this list):
      • Option 1: AdaptiveAvgPool2d((1, 1)) -> Linear(num_features, num_classes) (less computation)
      • Option 2: Conv2d(num_features, num_classes, 3, padding=1) -> AdaptiveAvgPool2d((1, 1))
  • To speed up JPEG image I/O from disk, do not use PIL, skimage or even OpenCV; look at libjpeg-turbo or PyVips instead.
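
A minimal PyTorch sketch of the two head options above; num_features, num_classes and the input size are placeholder values, not anything from this repo:

import torch
import torch.nn as nn

num_features, num_classes = 512, 10  # placeholder values

# Option 1: global average pool, then a linear classifier (cheapest)
head1 = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),          # [B, 512, H, W] -> [B, 512, 1, 1]
    nn.Flatten(),                          # [B, 512]
    nn.Linear(num_features, num_classes),  # [B, 10]
)

# Option 2: 3x3 conv to per-class maps, then global average pool
head2 = nn.Sequential(
    nn.Conv2d(num_features, num_classes, 3, padding=1),  # [B, 10, H, W]
    nn.AdaptiveAvgPool2d((1, 1)),                         # [B, 10, 1, 1]
    nn.Flatten(),                                         # [B, 10]
)

x = torch.randn(2, num_features, 11, 17)  # any spatial size works
print(head1(x).shape, head2(x).shape)     # torch.Size([2, 10]) in both cases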

Data Augmentation

Separable convolution

SOTA CNNs

| Name | Description | Paper |
|------|-------------|-------|
| Inception v3 | | Dec 2015 |
| Resnet | | Dec 2015 |
| SqueezeNet | | Feb 2016 |
| Densenet | Concatenate previous layers | Aug 2016 |
| Xception | Depthwise Separable Convolutions | Oct 2016 |
| ResNext | | Nov 2016 |
| DPN | Dual Path Network | Jul 2017 |
| SENet | Squeeze and Excitation (channels weights) | Sep 2017 |
| EfficientNet | Rethinking Model Scaling | May 2019 |
| Noisy Student | Self-training | Nov 2019 |
  • Small nets: Useful for mobile phones.
    • SqueezeNet (2016): v1.0: 58.108, v1.1: 58.250. paper.
    • Mobilenet v1 (2017): 69.600. The standard convolution is decomposed into two (depthwise + pointwise). Accuracy similar to Resnet-18. paper
    • Shufflenet (2017): The most efficient net 67.400. paper.
    • NASNet-A-Mobile (2017): 74.080. paper
    • Mobilenet v2 (2018): 71.800. paper
    • SqueezeNext (2018): 62.640. Hardware-Aware Neural network design. paper.
  • Common nets:
    • Inception v3 (2015): 77.294 paper, blog
    • Resnet (2015): Every 2 convolutions (3x3->3x3), the original input is added back (see the residual-block sketch after this list). paper Wide ResNet?
      • Resnet-18: 70.142
      • Resnet-34: 73.554
      • Resnet-50: 76.002. SE-ResNet50: 77.636. SE-ResNeXt50 (32x4d): 79.076
      • Resnet-101: 77.438. SE-ResNet101: 78.396. SE-ResNeXt101 (32x4d): 80.236
      • Resnet-152: 78.428. SE-ResNet152: 78.658
    • Densenet (2016): Every 2 convolutions (3x3->1x1) concatenate the original input. paper
      • DenseNet-121: 74.646
      • DenseNet-169: 76.026
      • DenseNet-201: 77.152
      • DenseNet-161: 77.560
    • Xception (2016): 78.888 paper
    • ResNext (2016): paper
      • ResNeXt101 (32x4d): 78.188
      • ResNeXt101 (64x4d): 78.956
    • Dual Path Network (DPN): paper
      • DualPathNet98: 79.224
      • DualPathNet92_5k: 79.400
      • DualPathNet131: 79.432
      • DualPathNet107_5k: 79.746
    • SENet (2017): Squeeze and Excitation network. Net is allowed to adaptively adjust the weighting of each feature map in the convolution block. paper
      • SE-ResNet50: 77.636
      • SE-ResNet101: 78.396
      • SE-ResNet152: 78.658
      • SE-ResNeXt50 (32x4d): 79.076 USE THIS ONE FOR A MEDIUM NET
      • SE-ResNeXt101 (32x4d): 80.236 USE THIS ONE FOR A BIG NET
  • Giants nets: Useful for competitions.
    • Inception v4: 80.062, Inception-ResNet: 80.170 paper
    • PolyNet: 81.002
    • SENet-154: 81.304
    • NASNet-A-Large: 82.566 Created with AutoML. paper
    • PNASNet-5-Large: 82.736
    • AmoebaNet: 83.000 paper
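
A minimal PyTorch sketch contrasting the ResNet connection (sum the block input) with the DenseNet connection (concatenate it), as described in the Resnet and Densenet entries above; channel sizes are illustrative and batch norm is omitted:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style: two 3x3 convs, then ADD the block input (identity shortcut)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)            # sum keeps the channel count

class DenseLayer(nn.Module):
    """DenseNet-style: 1x1 then 3x3 convs, then CONCAT the input along channels."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 4 * growth, 1)
        self.conv2 = nn.Conv2d(4 * growth, growth, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return torch.cat([x, out], dim=1)    # channels grow by `growth` each layer

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
print(DenseLayer(64, 32)(x).shape)  # torch.Size([1, 96, 32, 32])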

CNN explainability

link 1, link 2

  • Features: Average features on the channel axis. This shows all classes detected. [512, 11, 11]-->[11, 11].
  • CAM: Class Activation Map. Final features multiplied by a single class's weights and then averaged (see the sketch after this list). [512, 11, 11]*[512]-->[11, 11]. paper.
  • Grad-CAM: Final features multiplied by class gradients and then averaged. paper.
  • SmoothGrad paper.
  • Extra: Distill: feature visualization
  • Extra: Distill: building blocks
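
A minimal sketch of the Features and CAM computations above, assuming a [512, 11, 11] feature map and a [512] class-weight vector already extracted from some CNN (both tensors here are placeholders):

import torch

features = torch.randn(512, 11, 11)  # last conv feature map (placeholder)
class_w = torch.randn(512)           # final linear-layer weights for one class (placeholder)

# Features: average over the channel axis -> one coarse "everything detected" map
feat_map = features.mean(dim=0)                        # [11, 11]

# CAM: weight each channel by the class weight, then sum over channels
# (the "average" in the text differs from this sum only by a constant factor)
cam = (features * class_w[:, None, None]).sum(dim=0)   # [11, 11]

# Normalize to [0, 1] before upsampling and overlaying on the input image
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)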

Libraries

Object detection

Get bounding boxes.

Decoding: State Of The Art Object Detection

| Name | Description | Date | Type |
|------|-------------|------|------|
| R-CNN | | Nov 2013 | Region-based |
| Fast R-CNN | | Apr 2015 | Region-based |
| Faster R-CNN | | Jun 2015 | Region-based |
| YOLO v1 | You Only Look Once | Jun 2015 | Single-shot |
| SSD | Single Shot Detector | Dec 2015 | Single-shot |
| FPN | Feature Pyramid Network | Dec 2016 | Single-shot |
| YOLO v2 | Better, Faster, Stronger | Dec 2016 | Single-shot |
| Mask R-CNN | | Mar 2017 | Region-based |
| RetinaNet | Focal Loss | Aug 2017 | Single-shot |
| PANet | Path Aggregation Network | Mar 2018 | Single-shot |
| YOLO v3 | An Incremental Improvement | Apr 2018 | Single-shot |
| EfficientDet | Based on EfficientNet | Nov 2019 | Single-shot |
| YOLO v4 | Optimal Speed and Accuracy | Apr 2020 | Single-shot |

Segmentation

Get pixel-level classes. Note that the model backbone can be a resnet, densenet, inception...

| Name | Description | Date | Instances |
|------|-------------|------|-----------|
| FCN | Fully Convolutional Network | 2014 | |
| SegNet | Encoder-decoder | 2015 | |
| Unet | Concatenate like a densenet | 2015 | |
| ENet | Real-time video segmentation | 2016 | |
| PSPNet | Pyramid Scene Parsing Net | 2016 | |
| FPN | Feature Pyramid Networks | 2016 | Yes |
| DeepLabv3 | Increasing dilation & field-of-view | 2017 | |
| LinkNet | Adds like a resnet | 2017 | |
| PANet | Path Aggregation Network | 2018 | Yes |
| Panoptic FPN | Panoptic Feature Pyramid Networks | 2019 | ? |
| PointRend | Image Segmentation as Rendering | 2019 | ? |

Feature Pyramid Networks (FPN): slides

Depth segmentation

Learning the Depths of Moving People by Watching Frozen People (mannequin challenge) paper

Surface normal segmentation

GANs

Reference

Applications:

  • Image to image problems
  • New images
    • From latent vector
    • From noise image

Training

  1. Generate labeled dataset
    • Edit ground truth images to become the input images.
    • This step depends on the problem: input data could be crappified, black & white, noisy, a vector, ...
  2. Train the GENERATOR (most of the time)
    • Model: UNET with pretrained ResNet backbone + self attention + spectral normalization
    • Loss: Mean squared pixel error or L1 loss
    • Better Loss: Perceptual Loss (aka Feature Loss); see the perceptual-loss sketch after this list
  3. Save generated images.
  4. Train the DISCRIMINATOR (aka Critic) with real vs generated images.
    • Model: Pretrained binary classifier + spectral normalization
  5. Train BOTH nets (ping-pong) with 2 losses (original and discriminator).
    • With a NoGAN approach, this step is very quick (about 5% of the total training time, more or less)
    • With a traditional progressively-sized GAN approach, this step is very slow.
    • If you train this step for too long, you start seeing artifacts and glitches introduced in the renderings.
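
A minimal sketch of the perceptual (feature) loss mentioned in step 2, comparing images in the feature space of a frozen pretrained VGG16; the layer cut-off and the 1:1 weighting of the two terms are assumptions, and it expects a recent torchvision:

import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self, layer_idx=16):
        super().__init__()
        # Frozen VGG16 features up to an intermediate layer (layer_idx is an assumed choice)
        vgg = models.vgg16(weights="DEFAULT").features[:layer_idx]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()
        self.l1 = nn.L1Loss()

    def forward(self, generated, target):
        # Inputs are assumed to be ImageNet-normalized 3-channel images.
        # Pixel L1 plus L1 in VGG feature space; the equal weighting is an assumption.
        return self.l1(generated, target) + self.l1(self.vgg(generated), self.vgg(target))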

Tricks

  • Self-Attention GAN (SAGAN): For spatial coherence between regions of the generated image
  • Spectral normalization
  • Video

GANs (order chronologically)

| Paper | Name | Date | Creator |
|-------|------|------|---------|
| GAN | Generative Adversarial Net | Jun 2014 | Goodfellow |
| CGAN | Conditional GAN | Nov 2014 | Montreal U. |
| DCGAN | Deep Convolutional GAN | Nov 2015 | Facebook |
| GAN v2 | Improved GAN | Jun 2016 | Goodfellow |
| InfoGAN | | Jun 2016 | OpenAI |
| CoGAN | Coupled GAN | Jun 2016 | Mitsubishi |
| Pix2Pix | Image to Image | Nov 2016 | Berkeley |
| StackGAN | Text to Image | Dec 2016 | Baidu |
| WGAN | Wasserstein GAN | Jan 2017 | Facebook |
| CycleGAN | Cycle GAN | Mar 2017 | Berkeley |
| ProGAN | Progressive growing of GAN | Oct 2017 | NVIDIA |
| SAGAN | Self-Attention GAN | May 2018 | Goodfellow |
| BigGAN | Large Scale GAN Training | Sep 2018 | Google |
| StyleGAN | Style-based GAN | Dec 2018 | NVIDIA |

2014 (GAN) → 2015 (DCGAN) → 2016 (CoGAN) → 2017 (ProGAN) → 2018 (StyleGAN)

GANS (order by type)

  • Better error function
  • CGAN: Only one particular class generation (instead of blurry multiclass).
  • InfoGAN: Disentangled representation (Dec. 2016, OpenAI)
    • CycleGAN: Domain adaptation (Oct. 2017, Berkeley)
    • SAGAN: Self-Attention GAN (May. 2018, Google)
    • Relativistic GAN: Rethinking adversary (Jul. 2018, LD Institute)
    • Progressive GAN: One step at a time (Oct 2017, NVIDIA)
  • DCGAN: Deep Convolutional GAN (Nov. 2015, Facebook)
    • BigGAN: SotA for image synthesis. Same GAN techniques, but larger. Increase model capacity & batch size.
    • BEGAN: Balancing Generator (May. 2017, Google)
    • WGAN: Wasserstein GAN. Learning distribution (Dec. 2017, Facebook)
  • VAEGAN: Improving VAE by GANs (Feb. 2016, TU Denmark)
  • SeqGAN: Sequence learning with GANs (May 2017, Shanghai Univ.)

Product placement

Technology

  • Background-foreground segmentation so images simply slide behind objects in the front zone.
  • Optical flow analysis helps determine the overall movement of virtual ads.
  • Planar tracking helps smooth positioning.
  • Image color adjustment is optimized according to the environment.

Papers

  • CASE dataset paper
  • ALOS dataset paper
  • Identifying Candidate Spaces with CASE ds paper

Companies

📉 Loss functions

  • Segmentation: Usually Loss = IoU + Dice + 0.8*BCE
    • Pixel-wise cross entropy: each pixel individually, comparing the class predictions (depth-wise pixel vector)
    • IoU (F0): (Pred ∩ GT)/(Pred ∪ GT) = TP / (TP + FP + FN)
    • Dice (F1): 2 * (Pred ∩ GT)/(Pred + GT) = 2·TP / (2·TP + FP + FN) (see the soft-Dice sketch after this list)
      • Ranges from 0 (worst) to 1 (best)
      • In order to formulate a loss function which can be minimized, we simply use 1 − Dice
  • Generation
    • Pixel MSE: Flatten the 2D images and compare them with regular MSE.
    • Discriminator/Critic: the loss comes from a pretrained binary classifier (e.g. a resnet) that tells real from fake images.
    • Feature losses, also called perceptual losses.
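
A minimal sketch of the soft Dice loss (1 − Dice) for binary masks, as referenced in the Dice bullet above; the smoothing term eps is a common implementation detail, not something from the text:

import torch

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss: 1 - 2*(Pred ∩ GT) / (Pred + GT).

    pred:   probabilities in [0, 1], shape [B, 1, H, W] (after sigmoid)
    target: binary ground-truth mask, same shape
    """
    pred, target = pred.flatten(1), target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    dice = (2 * intersection + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    return 1 - dice.mean()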

Image preprocessing

Normalization

  1. Mean subtraction: center the data at zero: x = x - x.mean(). Fights vanishing and exploding gradients.
  2. Standardize: put the data on the same scale: x = x / x.std(). Improves convergence speed and accuracy.
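
A minimal numpy sketch of these two steps on a placeholder batch of images (per-channel statistics would follow the same pattern):

import numpy as np

x = np.random.rand(100, 32, 32, 3).astype(np.float32)  # placeholder batch of images

x = x - x.mean()  # 1. mean subtraction: center the data at zero
x = x / x.std()   # 2. standardize: put the data on the same scale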

PCA and Whitening

  1. Mean subtraction: Center the data in zero. x = x - x.mean()
  2. Decorrelation or PCA: Rotate the data until there is no correlation anymore.
  3. Whitening: Put the data on the same scale. whitened = decorrelated / np.sqrt(eigVals + 1e-5)

Whitening with zero-phase component analysis (ZCA) is a very similar process.
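
A minimal numpy sketch of PCA whitening following the steps above, plus the ZCA variant (rotate the whitened data back onto the original axes); the data matrix is a placeholder and the epsilon matches the one in step 3:

import numpy as np

X = np.random.rand(1000, 64).astype(np.float32)  # placeholder: N samples x D features

# 1. Mean subtraction
X = X - X.mean(axis=0)

# 2. Decorrelation / PCA: rotate onto the eigenvectors of the covariance matrix
cov = np.cov(X, rowvar=False)
eigVals, eigVecs = np.linalg.eigh(cov)
decorrelated = X @ eigVecs

# 3. Whitening: divide each component by the square root of its eigenvalue
whitened = decorrelated / np.sqrt(eigVals + 1e-5)

# ZCA whitening: rotate the whitened data back to the original axes
zca = whitened @ eigVecs.T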

Subtract Local Mean

CLAHE: Contrast Limited Adaptive Histogram Equalization
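
A minimal OpenCV sketch of CLAHE on a grayscale image; the image path, clip limit and tile grid size are assumed values:

import cv2

img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path

# Contrast Limited Adaptive Histogram Equalization: equalize per tile, clipping the histogram
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(img)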

Dicom


Resources