[CV_Pose Estimation] DeepPose: Human Pose Estimation via Deep Neural Networks

1. Introduction

Previous challenges (Limitations)

  • Localization of human joints using local detectors
    • hard cases : strong articulation, small barely-visible joints, occlusions -> need to capture context
    • models only a small subset of all interactions between body parts

  • Holistic approaches have been proposed, but with limited success on real-world problems

DNN (Deep Neural Networks)

  • Strong results on visual classification tasks and object localization

Holistic human Pose estimation as DNN

  • Pose estimation <=> joint regression (the location of each joint is regressed)
  • Input : full image, fed to a 7-layer generic convolutional DNN
  • Captures the full context of each body joint
  • Simpler to formulate : no need to hand-design feature representations, part detectors, or models of interactions between joints
  • Cascade of DNN-based pose predictors : increased precision of joint localization
  • SOTA or better than SOTA on 4 benchmarks

2. Related Work

  • Pictorial Structures (PSs) : distance transform trick for efficient inference
  • Tree-based pose models with simple binary potentials
  • Richer part detectors : enriching representational power while maintaining tractability
  • Mixture models of parts at full scale
  • Richer higher-order spatial relationships
  • Transferring joint locations in a nearest-neighbor setup
  • Semi-global classifier for part configuration : linear -> less expressive representation (arms only)
  • Pose regression : 3D pose
  • CNNs with Neighborhood Component Analysis to regress pose : no cascade
  • NN-based pose embedding : contrastive loss

3. Deep Learning Model for Pose Estimation

  • Encoding locations of all k body joints in Pose vector

    y = (..., y_i^T, ...)^T,  i ∈ {1, ..., k}

    • x : Input Image data
    • k : # of body joints
    • y : GT pose vector (2k Dim)
    • y_i : 2D vector of the (x, y) coordinates of the i-th joint (absolute image coordinates)
  • Normalized y_i wrt bounding box b

    N(y_i; b) = diag(1/b_w, 1/b_h) · (y_i - b_c)

    • b = (b_c, b_w, b_h)
    • b_c : center of b (2D)
    • b_w : width of b
    • b_h : height of b
  • Normalized Pose vector

    N(y; b) = (..., N(y_i; b)^T, ...)^T
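
A minimal NumPy sketch of this box normalization and its inverse (the function and argument names are my own, not from the paper):

```python
import numpy as np

def normalize_pose(y, b_c, b_w, b_h):
    """N(y; b): translate joints by the box center, then scale by box width/height.
    y: (k, 2) joints in absolute image coordinates; b_c: (2,) box center."""
    scale = np.array([1.0 / b_w, 1.0 / b_h])
    return (np.asarray(y, dtype=float) - np.asarray(b_c, dtype=float)) * scale

def denormalize_pose(y_n, b_c, b_w, b_h):
    """N^{-1}(y_n; b): map box-normalized coordinates back to absolute pixels."""
    scale = np.array([float(b_w), float(b_h)])
    return np.asarray(y_n, dtype=float) * scale + np.asarray(b_c, dtype=float)
```

The cascade in 3.2 applies the same N / N^{-1} pair, just with joint-centered boxes b_i instead of the person box.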

3.1 Pose Estimation as DNN-based Regression [Initial stage]

  • Architecture

    (Figure : generic convolutional DNN architecture, conv layers followed by fully connected layers that regress the pose vector)

    y* = N^{-1}( φ(N(x); θ) )

    • x : input image data
    • φ : regression function based on a conv DNN
      • Input : 220 x 220 image -> 55 x 55 feature map after the first conv layer (stride = 4)
      • 7 layers (5 conv layers with filter sizes 11x11, 5x5, 3x3, 3x3, 3x3 + 2 fully connected layers)
      • Pooling : applied after 3 of the layers
      • Total # of params : ~40M
      • Generic DNN architecture -> holistic modeling; all internal features are shared by all joint regressors
    • θ : learnable parameters of the model
    • y* : pose prediction vector in absolute image coordinates (obtained by applying N^{-1})
  • Loss function and Training

    θ* = arg min_θ Σ_{(x,y) ∈ D_N} Σ_{i=1}^{k} || y_i - φ_i(x; θ) ||_2^2

    • L2 loss : minimize distance between predicted and true pose vectors
    • Trained on the normalized training set D_N
    • Optimization over individual joints (if not all joints are labeled, omit those terms)
    • Mini-batch size = 128, learning rate = 0.0005
    • Data augmentation : randomly translated crops, left/right flips
    • Dropout regularization rate = 0.6
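
A minimal PyTorch sketch of the initial-stage regressor and the masked L2 loss described above. The AlexNet-style layer configuration follows the notes (5 conv layers with 11/5/3/3/3 filters + 2 fully connected layers, dropout 0.6); the exact channel counts, padding, and visibility-mask handling are my own assumptions, not the paper's verbatim setup:

```python
import torch
import torch.nn as nn

class DeepPoseRegressor(nn.Module):
    """AlexNet-style regressor: 220x220 crop -> 2k normalized joint coordinates."""
    def __init__(self, num_joints: int = 14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((6, 6))      # fixes the FC input size
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.6), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(p=0.6), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 2 * num_joints),          # normalized (x, y) per joint
        )

    def forward(self, x):
        return self.regressor(self.pool(self.features(x)))

def pose_l2_loss(pred, target, visible):
    """L2 loss over labeled joints only.
    pred, target: (B, 2k) in interleaved (x1, y1, x2, y2, ...) layout (my convention);
    visible: (B, k) float mask, 0 for unlabeled joints (their terms are omitted)."""
    mask = visible.repeat_interleave(2, dim=1)        # (B, k) -> (B, 2k)
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

model = DeepPoseRegressor(num_joints=14)
pred = model(torch.randn(2, 3, 220, 220))             # (2, 28)
loss = pose_l2_loss(pred, torch.randn(2, 28), torch.ones(2, 14))
```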

3.2 Cascade of Pose Regressors

  • Purpose : to overcome the limited capacity for detail imposed by the fixed 220 x 220 input size and achieve better precision

  • Same network architecture for all stages of the cascade, but different learned parameters

  • 1st stage : estimate an initial pose
    y^1 = N^{-1}( φ(N(x; b^0); θ_1); b^0 ),  b^0 : initial bounding box

  • Subsequent stages : predict a displacement y^s - y^(s-1) to refine the previous joint locations
    y_i^s = y_i^(s-1) + N^{-1}( φ_i(N(x; b); θ_s); b ),  for b = b_i^(s-1)
    b_i^s = (y_i^s, σ·diam(y^s), σ·diam(y^s))

    • θ_s : learned network params
    • φ_i : pose displacement regressor
    • y_i : joint location
    • b_i : joint bbox
    • diam(y^s) : distance bw opposing joints on human torso
    • σ : scale parameter for diam(y^s)
  • Process

    • Using predicted joint locations to focus on relevant parts of img
    • Cropping sub-imgs around predicted joint location
    • Applying pose displacement regressor on sub-imgs
  • Result : higher-resolution sub-imgs -> finer features -> higher precision

  • Full augmented Training data

    • Data Augmentation : multiple normalizations
    • Using predictions from previous stage + simulated predictions (generated by randomly displacing GT)
      D_A^s = { (N(x; b), N(y_i; b)) | (x, y_i) ∈ D, δ ~ N_i^(s-1), b = (y_i + δ, σ·diam(y)) }
      (N_i^(s-1) : normal distribution fit to the displacements y_i^(s-1) - y_i observed at the previous stage)
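
A rough sketch of one refinement stage of the cascade. It assumes `stage_regressor(crop, joint_index)` resizes the crop to the network input size and returns that joint's displacement normalized to the crop frame, and that the torso joint indices used for diam(y) are known; all names here are illustrative, not from the paper:

```python
import numpy as np

def torso_diameter(joints, shoulder_idx=0, hip_idx=3):
    """diam(y): distance between opposing torso joints.
    The joint indices here are placeholders; use the dataset's actual indexing."""
    return np.linalg.norm(joints[shoulder_idx] - joints[hip_idx])

def refine_joints(image, joints, stage_regressor, sigma=1.0):
    """One cascade refinement stage (sketch).
    joints: (k, 2) current estimates in absolute pixel coordinates."""
    side = sigma * torso_diameter(joints)             # sub-image box side length
    refined = joints.astype(float).copy()
    h, w = image.shape[:2]
    for i, (cx, cy) in enumerate(joints):
        # crop a square sub-image centred on the current estimate of joint i
        x0 = int(np.clip(cx - side / 2, 0, max(w - 1, 0)))
        y0 = int(np.clip(cy - side / 2, 0, max(h - 1, 0)))
        crop = image[y0:y0 + int(side), x0:x0 + int(side)]
        disp = stage_regressor(crop, i)               # normalized displacement (2,)
        refined[i] += np.asarray(disp) * side         # back to absolute pixels
    return refined
```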

4. Empirical Evaluation

4.1 Setup

Datasets

  • Frames Labeled In Cinema (FLIC)
    • ~4000 training + ~1000 test images from Hollywood movies
    • diverse poses and clothing
    • 10 upper-body joints labeled for each person
  • Leeds Sports Dataset (LSP)
    • 11000 training + 1000 test images of sports activities
    • most people are roughly 150 pixels tall
    • 14 full-body joints labeled for each person

Metrics

  • Percentage of Correct Parts (PCP) : a limb is detected if the distance between the predicted and true joint locations is at most half of the limb length -> penalizes shorter limbs (e.g. lower arms), which are harder to detect
  • Percentage of Detected Joints (PDJ) : a joint is detected if the distance between the predicted and true joint is within a given fraction of the torso diameter -> all joints share the same distance threshold, and detection rates are reported over varying thresholds
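
A small sketch of the PDJ computation (my own helper, assuming predictions and ground truth are given as arrays of absolute joint coordinates with a per-image torso diameter):

```python
import numpy as np

def pdj(pred, gt, torso_diam, threshold=0.2):
    """Percentage of Detected Joints.
    pred, gt: (N, k, 2) absolute joint coordinates; torso_diam: (N,) per-image
    torso diameters. A joint is detected when its error is within
    threshold * torso diameter (same threshold for every joint)."""
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)   # (N, k)
    detected = err <= threshold * np.asarray(torso_diam)[:, None]
    return detected.mean(axis=0)                                       # per-joint rate
```

Averaging pdj(..., threshold=0.2) over all joints gives the single number used below for hyperparameter selection.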

Experimental Details

  • FLIC : rough initial bbox estimated by a face-based body detector
  • LSP : full image used as the initial bbox
  • To choose optimal hyperparameters, use the average PDJ at threshold 0.2 across all joints
  • To improve generalization, augment data by sampling 40 randomly translated crop boxes per image
  • Running time : ~0.1 s per image on a 12-core CPU
  • Training complexity is higher than at test time

4.2 Results and Discussion

(result figures omitted)

5. Conclusion

  • First application of DNNs to human pose estimation
  • Capturing context and reasoning about pose in a holistic manner
  • A generic CNN designed for classification tasks can be applied to the localization task