[CV_Pose Estimation] DeepPose: Human Pose Estimation via Deep Neural Networks

1. Introduction

Previous challenges (Limitations)

  • Localization of human joints using local detectors
    • hard cases : strong articulation, small barely-visible joints, occlusions -> need to capture context
    • models only a small subset of all interactions between body parts

  • Holistic approaches have been proposed, but with limited success on real-world problems

DNN (Deep Neural Networks)

  • Strong results on visual classification tasks and object localization

Holistic human Pose estimation as DNN

  • Pose estimation <=> joint regression (the location of each joint is regressed)
  • Input : full image, fed to a 7-layer generic convolutional DNN
  • Captures the full context of each body joint
  • Simpler to formulate : no need to hand-design feature representations, part detectors, or models of interactions between joints
  • Cascade of DNN-based pose predictors : increased precision of joint localization
  • SOTA or better than SOTA on 4 benchmarks

2. Related Work

  • Pictorial Structures (PSs) : distance transform trick for efficient inference
  • Tree-based pose models with simple binary potentials
  • Richer part detectors : enriching representational power while maintaining tractability
  • Mixture models of parts at full scale
  • Richer higher-order spatial relationships
  • Transferring joint locations in a nearest-neighbor setup
  • Semi-global classifier for part configuration : linear -> less expressive representation (arms only)
  • Pose regression : 3D pose
  • CNNs with Neighborhood Component Analysis to regress pose : no cascade
  • NN-based pose embedding : contrastive loss

3. Deep Learning Model for Pose Estimation

  • Encoding locations of all k body joints in Pose vector

    y = (..., y_i^T, ...)^T,  i ∈ {1, ..., k}

    • x : Input Image data
    • k : # of body joints
    • y : GT pose vector (2k Dim)
    • y_i : 2D vector of the (x, y) coordinates of the i-th joint (absolute image coordinates)
  • Normalized y_i wrt bounding box b

    N(y_i; b) = diag(1/b_w, 1/b_h) · (y_i - b_c)

    • b = (b_c, b_w, b_h)
    • b_c : center of b (2D)
    • b_w : width of b
    • b_h : height of b
  • Normalized Pose vector

    N(y; b) = (..., N(y_i; b)^T, ...)^T
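
A minimal NumPy sketch of this box normalization and its inverse (the function and argument names are my own, not from the paper):

```python
import numpy as np

def normalize_pose(y, b_c, b_w, b_h):
    """N(y; b): translate joints by the box center, then scale by box width/height.
    y: (k, 2) joints in absolute image coordinates; b_c: (2,) box center."""
    scale = np.array([1.0 / b_w, 1.0 / b_h])
    return (np.asarray(y, dtype=float) - np.asarray(b_c, dtype=float)) * scale

def denormalize_pose(y_n, b_c, b_w, b_h):
    """N^{-1}(y_n; b): map box-normalized coordinates back to absolute pixels."""
    scale = np.array([float(b_w), float(b_h)])
    return np.asarray(y_n, dtype=float) * scale + np.asarray(b_c, dtype=float)
```

The cascade in 3.2 applies the same N / N^{-1} pair, just with joint-centered boxes b_i instead of the person box.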

3.1 Pose Estimation as DNN-based Regression [Initial stage]

  • Architecture

    (Figure : generic convolutional DNN architecture, conv layers followed by fully connected layers that regress the pose vector)

    y* = N^{-1}( φ(N(x); θ) )

    • x : input image data
    • φ : regression function based on a conv DNN
      • Input : 220 x 220 image -> 55 x 55 feature map after the first conv layer (stride = 4)
      • 7 layers (5 conv layers with filter sizes 11x11, 5x5, 3x3, 3x3, 3x3 + 2 fully connected layers)
      • Pooling : applied after 3 of the layers
      • Total # of params : ~40M
      • Generic DNN architecture -> holistic modeling; all internal features are shared by all joint regressors
    • θ : learnable parameters of the model
    • y* : pose prediction vector in absolute image coordinates (obtained by applying N^{-1})
  • Loss function and Training

    θ* = arg min_θ Σ_{(x,y) ∈ D_N} Σ_{i=1}^{k} || y_i - φ_i(x; θ) ||_2^2

    • L2 loss : minimize distance between predicted and true pose vectors
    • Trained on the normalized training set D_N
    • Optimization over individual joints (if not all joints are labeled, omit those terms)
    • Mini-batch size = 128, learning rate = 0.0005
    • Data augmentation : randomly translated crops, left/right flips
    • Dropout regularization rate = 0.6
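
A minimal PyTorch sketch of the initial-stage regressor and the masked L2 loss described above. The AlexNet-style layer configuration follows the notes (5 conv layers with 11/5/3/3/3 filters + 2 fully connected layers, dropout 0.6); the exact channel counts, padding, and visibility-mask handling are my own assumptions, not the paper's verbatim setup:

```python
import torch
import torch.nn as nn

class DeepPoseRegressor(nn.Module):
    """AlexNet-style regressor: 220x220 crop -> 2k normalized joint coordinates."""
    def __init__(self, num_joints: int = 14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((6, 6))      # fixes the FC input size
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.6), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(p=0.6), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 2 * num_joints),          # normalized (x, y) per joint
        )

    def forward(self, x):
        return self.regressor(self.pool(self.features(x)))

def pose_l2_loss(pred, target, visible):
    """L2 loss over labeled joints only.
    pred, target: (B, 2k) in interleaved (x1, y1, x2, y2, ...) layout (my convention);
    visible: (B, k) float mask, 0 for unlabeled joints (their terms are omitted)."""
    mask = visible.repeat_interleave(2, dim=1)        # (B, k) -> (B, 2k)
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

model = DeepPoseRegressor(num_joints=14)
pred = model(torch.randn(2, 3, 220, 220))             # (2, 28)
loss = pose_l2_loss(pred, torch.randn(2, 28), torch.ones(2, 14))
```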

3.2 Cascade of Pose Regressors

  • Purpose : to overcome the limited capacity for detail imposed by the fixed 220 x 220 input size and achieve better precision

  • Same network architecture for all stages of the cascade, but different learned parameters

  • 1st stage : estimate an initial pose
    y^1 = N^{-1}( φ(N(x; b^0); θ_1); b^0 ),  b^0 : initial bounding box

  • Subsequent stages : predict a displacement y^s - y^(s-1) to refine the previous joint locations
    y_i^s = y_i^(s-1) + N^{-1}( φ_i(N(x; b); θ_s); b ),  for b = b_i^(s-1)
    b_i^s = (y_i^s, σ·diam(y^s), σ·diam(y^s))

    • θ_s : learned network params
    • φ_i : pose displacement regressor
    • y_i : joint location
    • b_i : joint bbox
    • diam(y^s) : distance bw opposing joints on human torso
    • σ : scale parameter for diam(y^s)
  • Process

    • Using predicted joint locations to focus on relevant parts of img
    • Cropping sub-imgs around predicted joint location
    • Applying pose displacement regressor on sub-imgs
  • Result : higher-resolution sub-imgs -> finer features -> higher precision

  • Full augmented Training data

    • Data Augmentation : multiple normalizations
    • Using predictions from previous stage + simulated predictions (generated by randomly displacing GT)
      D_A^s = { (N(x; b), N(y_i; b)) | (x, y_i) ∈ D, δ ~ N_i^(s-1), b = (y_i + δ, σ·diam(y)) }
      (N_i^(s-1) : normal distribution fit to the displacements y_i^(s-1) - y_i observed at the previous stage)
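
A rough sketch of one refinement stage of the cascade. It assumes `stage_regressor(crop, joint_index)` resizes the crop to the network input size and returns that joint's displacement normalized to the crop frame, and that the torso joint indices used for diam(y) are known; all names here are illustrative, not from the paper:

```python
import numpy as np

def torso_diameter(joints, shoulder_idx=0, hip_idx=3):
    """diam(y): distance between opposing torso joints.
    The joint indices here are placeholders; use the dataset's actual indexing."""
    return np.linalg.norm(joints[shoulder_idx] - joints[hip_idx])

def refine_joints(image, joints, stage_regressor, sigma=1.0):
    """One cascade refinement stage (sketch).
    joints: (k, 2) current estimates in absolute pixel coordinates."""
    side = sigma * torso_diameter(joints)             # sub-image box side length
    refined = joints.astype(float).copy()
    h, w = image.shape[:2]
    for i, (cx, cy) in enumerate(joints):
        # crop a square sub-image centred on the current estimate of joint i
        x0 = int(np.clip(cx - side / 2, 0, max(w - 1, 0)))
        y0 = int(np.clip(cy - side / 2, 0, max(h - 1, 0)))
        crop = image[y0:y0 + int(side), x0:x0 + int(side)]
        disp = stage_regressor(crop, i)               # normalized displacement (2,)
        refined[i] += np.asarray(disp) * side         # back to absolute pixels
    return refined
```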

4. Empirical Evaluation

4.1 Setup

Datasets

  • Frames Labeled In Cinema (FLIC)
    • ~4000 training + ~1000 test images from Hollywood movies
    • diverse poses and clothing
    • 10 upper-body joints labeled for each person
  • Leeds Sports Dataset (LSP)
    • 11000 training + 1000 test images of sports activities
    • most people are roughly 150 pixels tall
    • 14 full-body joints labeled for each person

Metrics

  • Percentage of Correct Parts (PCP) : a limb is detected if the distance between the predicted and true joint locations is at most half of the limb length -> penalizes shorter limbs (e.g. lower arms), which are harder to detect
  • Percentage of Detected Joints (PDJ) : a joint is detected if the distance between the predicted and true joint is within a given fraction of the torso diameter -> all joints share the same distance threshold, and detection rates are reported over varying thresholds
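
A small sketch of the PDJ computation (my own helper, assuming predictions and ground truth are given as arrays of absolute joint coordinates with a per-image torso diameter):

```python
import numpy as np

def pdj(pred, gt, torso_diam, threshold=0.2):
    """Percentage of Detected Joints.
    pred, gt: (N, k, 2) absolute joint coordinates; torso_diam: (N,) per-image
    torso diameters. A joint is detected when its error is within
    threshold * torso diameter (same threshold for every joint)."""
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)   # (N, k)
    detected = err <= threshold * np.asarray(torso_diam)[:, None]
    return detected.mean(axis=0)                                       # per-joint rate
```

Averaging pdj(..., threshold=0.2) over all joints gives the single number used below for hyperparameter selection.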

Experimental Details

  • FLIC : rough initial bbox estimated by a face-based body detector
  • LSP : full image used as the initial bbox
  • To choose optimal hyperparameters, use the average PDJ at threshold 0.2 across all joints
  • To improve generalization, augment data by sampling 40 randomly translated crop boxes per image
  • Running time : ~0.1 s per image on a 12-core CPU
  • Training complexity is higher than at test time

4.2 Results and Discussion

(result figures omitted)

5. Conclusion

  • First application of DNNs to human pose estimation
  • Capturing context and reasoning about pose in a holistic manner
  • A generic CNN designed for classification tasks can be applied to the localization task