
Clarifications for implementation


Hi @dluvizon,

Congrats on the nice work. I have been trying to reproduce your results and implemented the network by building on the code provided in your pose regression repo. I have a few questions/clarifications. It would be great if you could respond to the following:

  1. Can you share the exact details of which portion of the MPII dataset you used for training 2D action recognition?
  2. Can you share the parameters (mean/variance) you used to generate GT heatmaps for the pose estimation network?


Hi @ashar-ali ,

  1. For the MPII single-person task, there is a standard split between training and validation samples.
    I uploaded the file mpii_annotations.mat with this split.

  2. We use soft-argmax to regress joint coordinates directly, so the method does not rely on generating GT heatmaps. Please take a look at the paper for more details.
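
For readers unfamiliar with the idea, the general soft-argmax operation can be sketched as follows. This is a minimal pure-Python illustration, not the implementation from the paper or the repo: the function name `soft_argmax_2d`, the toy 8x8 map, and the use of pixel indices as coordinates are all assumptions made for the example. It applies a softmax over the whole map and returns the expected (x, y) position, which is differentiable and therefore can be trained with a loss on coordinates instead of on heatmaps.

```python
import math

def soft_argmax_2d(heatmap):
    """Expected (x, y) position under a softmax taken over the whole map."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    # Weighted average of column indices (x) and row indices (y).
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

# Toy 8x8 map (hypothetical data) with one strong activation at column 2, row 5.
heatmap = [[-20.0] * 8 for _ in range(8)]
heatmap[5][2] = 20.0
x, y = soft_argmax_2d(heatmap)
print(x, y)
```

With a sharply peaked map like this one, the result lands very close to the peak location; with a flatter map it drifts toward the centre of mass of the activations.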


Oh great,

I figured out the soft-argmax part. Could you please re-check the link you provided above for the file download? It returns a 404 (not found) error when I click on it. Also, I remember you released the weights for MPII yesterday, but I am not able to access them today.

Would be a great help.

Thanks @dluvizon

The links should work now!


Thanks a ton for providing these. Do you also plan to release models for pose estimation and/or activity recognition for the Penn Action dataset anytime soon?


I am planning to release the weights fine-tuned for action recognition, for both Penn Action and NTU.

Thanks @dluvizon ,

As a sanity check experiment, I was also trying to train the action recognition nets independently for a few epochs.

By independently, I mean I just took the pose ground truth and tried to learn actions with categorical cross-entropy.
Similarly, I extracted the appearance features and pose heat maps offline and tried to learn the action categories with the hyperparameters mentioned in Appendix B of the paper.


  1. In both of the above cases, I could only reach about 8% accuracy on the training data itself after 3-4 epochs. Is this performance expected, or should I get at least some reasonable accuracy with this kind of offline, independent learning?

  2. Would you suggest jumping directly to joint training with the pose estimation networks (after 2 epochs), as mentioned in the paper?

P.S. All of the discussion above is based on experiments I ran on the Penn Action dataset. Features and probability heat maps were extracted from the 2D pose estimator (trained on MPII) that you provided.

Hi @ashar-ali ,

  1. Considering that PennAction contains 15 classes, you are getting random predictions.
    In my experience, after 3-4 epochs it should be close to 80% using visual features and a bit lower using only pose.

  2. If your net is not learning at all, I guess that the problem is not with the pose data.

PennAction is a pretty easy dataset, so even a naive method should attain 80% relatively fast.
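
To make the "random predictions" point concrete, here is a small sketch of the chance-level numbers for a 15-class problem. It assumes roughly balanced classes, which the thread does not state. A classifier that outputs a uniform distribution gets 1/15 accuracy and a categorical cross-entropy of ln(15), so a training loss that stays near that value suggests the network has not started learning.

```python
import math

num_classes = 15  # PennAction class count, as stated in the reply above

# Accuracy of uniform guessing (assumes roughly balanced classes).
chance_accuracy = 1.0 / num_classes

# Categorical cross-entropy of a uniform prediction: -log(1 / num_classes).
chance_loss = math.log(num_classes)

print(f"chance accuracy: {chance_accuracy:.3f}")        # 0.067
print(f"chance cross-entropy: {chance_loss:.3f}")       # 2.708
```

An observed training accuracy of about 8% is within a percentage point or so of this baseline, which is consistent with the diagnosis given in the reply.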

Hi @dluvizon ,

Thanks for sharing these insights. I think it could also be because I am not using any data augmentation for now, since I was only doing a proof of concept, and the architecture is not converging because of that.

Now that you have uploaded the full code for the action model as well as the weights, I will try to reproduce its results.

Can you please point me to the annotations.mat file for the Penn Action dataset?
It would also be great if you could verify that I am encoding the labels correctly:

'baseball_pitch' - 0
'baseball_swing' - 1
'bench_press' - 2
'bowl' - 3
'clean_and_jerk' - 4
'golf_swing' - 5
'jump_rope' - 6
'jumping_jacks' - 7
'pullup' - 8
'pushup' - 9
'situp' - 10
'squat' - 11
'strum_guitar' - 12
'tennis_forehand' - 13
'tennis_serve' - 14


The file should be OK now at https://github.com/dluvizon/deephar/releases/download/v0.3/penn_annotations.mat

You can check the penn action labels by doing:

    print(penn_seq.action_labels)

just after loading the dataset. That gives:

['baseball_pitch' 'baseball_swing' 'bench_press' 'bowl' 'clean_and_jerk'
 'golf_swing' 'jump_rope' 'jumping_jacks' 'pullup' 'pushup' 'situp' 'squat'
 'strum_guitar' 'tennis_forehand' 'tennis_serve']

which corresponds to your list.
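
The same comparison can be done programmatically rather than by eye. The sketch below hard-codes the label order printed above so that it runs on its own; in practice `action_labels` would come from `penn_seq.action_labels`. The names `proposed` and `label_to_index` are hypothetical and exist only for this example.

```python
# Label order as printed by penn_seq.action_labels in this thread.
action_labels = [
    'baseball_pitch', 'baseball_swing', 'bench_press', 'bowl',
    'clean_and_jerk', 'golf_swing', 'jump_rope', 'jumping_jacks',
    'pullup', 'pushup', 'situp', 'squat',
    'strum_guitar', 'tennis_forehand', 'tennis_serve',
]

# Encoding proposed in the question (hypothetical variable name).
proposed = {
    'baseball_pitch': 0, 'baseball_swing': 1, 'bench_press': 2, 'bowl': 3,
    'clean_and_jerk': 4, 'golf_swing': 5, 'jump_rope': 6, 'jumping_jacks': 7,
    'pullup': 8, 'pushup': 9, 'situp': 10, 'squat': 11,
    'strum_guitar': 12, 'tennis_forehand': 13, 'tennis_serve': 14,
}

# Build the index implied by the dataset's label order and compare.
label_to_index = {name: i for i, name in enumerate(action_labels)}
mismatches = {k: (v, label_to_index.get(k))
              for k, v in proposed.items() if label_to_index.get(k) != v}
print("encoding matches" if not mismatches else mismatches)
```

Any entry in `mismatches` would show the proposed index next to the index implied by the dataset order, which makes an off-by-one or a swapped pair easy to spot.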

Sounds great,

Thanks a lot for all your help @dluvizon