[CV_Pose Estimation] DeepPose: Human Pose Estimation via Deep Neural Networks
jeonggg119 commented
DeepPose: Human Pose Estimation via Deep Neural Networks
1. Introduction
Previous challenges (Limitations)
- Localization of human joints using local detectors
  - Difficulties : strong articulations, small and barely visible joints, occlusions, need to capture context
  - Local detectors model only a small subset of all interactions between body parts
- Holistic approaches have been proposed, but with limited success on real-world problems
DNN (Deep Neural Networks)
- Strong performance on visual classification tasks and, more recently, object localization
Holistic human pose estimation as a DNN
- Pose estimation <=> joint regression (the location of each joint is regressed)
- Input : full img, fed to a 7-layered generic convolutional DNN
- Captures the full context of each body joint
- Simpler to formulate : no need to explicitly design feature representations, part detectors, or a model of interactions between joints
- Cascade of DNN-based pose predictors : increased precision of joint localization
- SOTA or better than SOTA results on 4 benchmarks
2. Related Work
- Pictorial Structures (PSs) : distance transform trick
- Tree-based pose models with simple binary potentials
- Richer part detectors : enriching representational power + maintaining tractability
- Mixture models on full scale
- Richer higher-order spatial relationships
- Transfer joint locations, Nearest neighbor setup
- Semi-global classifier for part config : linear -> less expressive representation (only arms)
- Pose regression : 3D pose
- CNNs with Neighborhood component analysis to regress : No cascade
- NN-based pose embedding : contrastive loss
3. Deep Learning Model for Pose Estimation
- Encoding locations of all k body joints in a pose vector
  - x : input image data
  - k : # of body joints
  - y : GT pose vector (2k-dimensional)
  - y_i : 2D vector of (x, y) coordinates of the i-th joint (absolute img coordinates)
- Normalizing y_i wrt bounding box b
  - b = (b_c, b_w, b_h)
  - b_c : center of b (2D)
  - b_w : width of b
  - b_h : height of b
- Normalized pose vector N(y; b) : all joints y_i normalized wrt b
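
The normalization above can be sketched in a few lines. This is a minimal illustration, assuming the usual reading of the definitions: each joint is translated by the box center b_c and then divided by the box width b_w and height b_h. The function names and the example box are hypothetical, not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, float, float]  # (b_c, b_w, b_h): center, width, height


def normalize_joint(y_i: Point, b: Box) -> Point:
    """Translate by the box center, then scale by the box size."""
    (cx, cy), w, h = b
    return ((y_i[0] - cx) / w, (y_i[1] - cy) / h)


def denormalize_joint(n_i: Point, b: Box) -> Point:
    """Inverse of normalize_joint: back to absolute img coordinates."""
    (cx, cy), w, h = b
    return (n_i[0] * w + cx, n_i[1] * h + cy)


def normalize_pose(y: List[Point], b: Box) -> List[Point]:
    """Apply the same box normalization to every joint of the pose."""
    return [normalize_joint(y_i, b) for y_i in y]


# Hypothetical example: a 200 x 400 box centered at (100, 200).
b = ((100.0, 200.0), 200.0, 400.0)
pose = [(100.0, 200.0), (200.0, 400.0), (0.0, 0.0)]
print(normalize_pose(pose, b))  # [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5)]
```

Joints on the box border map to +/-0.5, so the regression targets stay in a small range regardless of the person's size in the img.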
3.1 Pose Estimation as DNN-based Regression [Initial stage]
- Architecture
  - x : input image data
  - φ : regression function based on a conv DNN
    - Input : 220 x 220 img -> 55 x 55 feature map in the first layer (stride = 4)
    - 7 layers : 5 conv (filter sizes 11x11, 5x5, 3x3, 3x3, 3x3) + 2 fully connected
    - Pooling : applied after 3 of the layers
    - Total # of params : about 40M
    - Generic DNN arch -> holistic modeling, and all internal features can be shared across joints
  - θ : parameters of the model
  - y* : pose prediction vector (in absolute img coordinates)
- Loss function and training
  - L2 loss : minimize the distance between the predicted and the true pose vector
  - Uses the normalized training set D_N
  - Optimization over individual joints (if not all joints are labeled, the terms for the missing joints are omitted)
  - Mini-batch size = 128, learning rate = 0.0005
  - Data augmentation : randomly translated crops, left/right flips
  - Dropout regularization rate = 0.6
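
The per-joint form of the L2 loss is what makes it possible to skip unlabeled joints. A minimal sketch for a single training example, assuming a missing label is represented as `None` (that encoding and the function name are illustrative choices, not from the paper):

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def l2_pose_loss(pred: List[Point], target: List[Optional[Point]]) -> float:
    """Sum of squared L2 distances over the labeled joints only."""
    loss = 0.0
    for p, t in zip(pred, target):
        if t is None:  # unlabeled joint: its term is omitted from the sum
            continue
        loss += (p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2
    return loss


# Hypothetical normalized predictions and targets; the second joint is unlabeled.
pred = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
target = [(0.0, 0.5), None, (0.0, 1.0)]
print(l2_pose_loss(pred, target))  # 1.25
```

In training, this quantity would be summed over all examples of the normalized set D_N and minimized wrt θ.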
3.2 Cascade of Pose Regressors
- Purpose : overcome the limited capacity for detail caused by the fixed input size, and achieve better precision
- Same network arch for all stages of the cascade, but different learned parameters per stage
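- Subsequent stages (s >= 2) : predict a displacement y^s - y^(s-1) that refines the joint locations of the previous stage
  - θ_s : learned network params of stage s
  - φ_i : pose displacement regressor for joint i
  - y_i : location of joint i
  - b_i : bbox around joint i
  - diam(y^s) : distance between opposing joints on the human torso (e.g., left shoulder and right hip)
  - σ : scale parameter applied to diam(y^s) to set the bbox size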
- Process
  - Use the predicted joint locations to focus on the relevant parts of the img
  - Crop sub-imgs around the predicted joint locations
  - Apply the pose displacement regressor to the sub-imgs
- Result : higher-resolution imgs -> finer features -> higher precision
- Full augmented training data
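
One refinement stage of the cascade can be sketched as follows. This is a rough illustration under stated assumptions: the regressor is a stub that returns a fixed displacement in box-normalized units (a real one would run the stage-s network on the sub-img cropped by the bbox), the displacement is only rescaled by the box size before being added, and the torso joint indices `a` and `c` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, float, float]  # (b_c, b_w, b_h)
# Hypothetical interface: (joint index, joint bbox) -> displacement in box-normalized units.
Regressor = Callable[[int, Box], Point]


def diam(y: List[Point], a: int, c: int) -> float:
    """Distance between two opposing torso joints (indices a, c are assumptions)."""
    return math.hypot(y[a][0] - y[c][0], y[a][1] - y[c][1])


def joint_boxes(y: List[Point], sigma: float, a: int, c: int) -> List[Box]:
    """Square bbox of side sigma * diam(y), centered on each current joint estimate."""
    side = sigma * diam(y, a, c)
    return [(y_i, side, side) for y_i in y]


def refine(y_prev: List[Point], boxes: List[Box], regressor: Regressor) -> List[Point]:
    """One cascade stage: add the rescaled predicted displacement to each joint."""
    y_next = []
    for i, (y_i, b) in enumerate(zip(y_prev, boxes)):
        dx, dy = regressor(i, b)
        _, w, h = b
        y_next.append((y_i[0] + dx * w, y_i[1] + dy * h))
    return y_next


def dummy_regressor(i: int, b: Box) -> Point:
    """Stand-in for the learned displacement regressor; ignores its inputs."""
    return (0.5, -0.25)


y0 = [(0.0, 0.0), (3.0, 4.0)]                 # stage-1 estimate (made-up values)
boxes = joint_boxes(y0, sigma=2.0, a=0, c=1)  # diam = 5 -> box side 10
y1 = refine(y0, boxes, dummy_regressor)
print(y1)  # [(5.0, -2.5), (8.0, 1.5)]
```

Repeating `joint_boxes` and `refine` with a separately trained regressor per stage gives the full cascade; because every box is resized to the same network input, later stages see each joint at a higher effective resolution.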
4. Empirical Evaluation
4.1 Setup
Datasets
- Frames Labeled In Cinema (FLIC)
  - 4000 train imgs + 1000 test imgs from Hollywood movies
  - diverse poses and clothing
  - 10 upper-body joints labeled for each person
- Leeds Sports Dataset (LSP)
  - 11000 train imgs + 1000 test imgs from sports activities
  - majority of people are about 150 pixels tall
  - 14 full-body joints labeled for each person
Metrics
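- Percentage of Correct Parts (PCP) : a limb counts as detected if the distance between its two predicted joint locations and the true limb joint locations is at most half of the limb length -> penalizes shorter limbs such as lower arms, which are harder to detect
- Percentage of Detected Joints (PDJ) : a joint counts as detected if the distance between the predicted and the true joint is within a certain fraction of the torso diameter; varying this fraction gives detection rates at varying degrees of localization precision -> all joints are evaluated with the same distance threshold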
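
Both metrics can be sketched directly from these definitions. The PCP variant below requires both endpoints of a limb to be within half the limb length, which is one common reading of the definition; the joint coordinates, limb list, and torso diameter are made-up values.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(p: Point, q: Point) -> float:
    return math.hypot(p[0] - q[0], p[1] - q[1])


def pdj(pred: List[Point], true: List[Point], torso_diam: float, fraction: float) -> float:
    """Share of joints whose error is within fraction * torso diameter."""
    hits = sum(dist(p, t) <= fraction * torso_diam for p, t in zip(pred, true))
    return hits / len(true)


def pcp(pred: List[Point], true: List[Point], limbs: List[Tuple[int, int]]) -> float:
    """Share of limbs whose two endpoints are both within half the limb length."""
    hits = 0
    for a, c in limbs:
        half_len = 0.5 * dist(true[a], true[c])
        if dist(pred[a], true[a]) <= half_len and dist(pred[c], true[c]) <= half_len:
            hits += 1
    return hits / len(limbs)


true = [(0.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
pred = [(1.0, 0.0), (10.0, 3.0), (12.0, 2.0)]  # joint errors: 1, 3, 2 pixels
limbs = [(0, 1), (1, 2)]                       # a long limb (10 px) and a short one (2 px)

print(pcp(pred, true, limbs))                            # 0.5
print(pdj(pred, true, torso_diam=10.0, fraction=0.25))   # 2 of 3 joints detected
```

The short limb fails PCP even though its joint errors are comparable to those of the long limb, which is the bias against short limbs noted above; PDJ applies the same 2.5-pixel threshold to every joint.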
Experimental Details
- FLIC : rough initial bbox estimated by a face-based body detector
- LSP : full img as initial bbox
- To measure optimality of the params, use the average over PDJ at 0.2 across all joints
- To improve generalization, augment the data for cascade stages s >= 2 by sampling 40 randomly translated crop boxes per joint
- Running time : approx. 0.1s per img on a 12-core CPU
- Training complexity is higher than that of prior approaches
4.2 Results and Discussion
5. Conclusion
- First application of DNNs to human pose estimation (to the authors' knowledge)
- Captures context and reasons about pose in a holistic manner
- A generic CNN originally designed for classification tasks can be applied to a localization task