paninski-lab/lightning-pose

How many labeled frames should we have?

Closed this issue · 3 comments

Hi Lightning Pose team,
In your preprint, Figure 4 ("Unlabeled frames improve pose estimation (raw network predictions)"), panels 4C and 4D show that with 75 labeled frames the semi-supervised context model performs best. But at 631 labeled frames, the different models (DLC, baseline, semi-supervised, semi-supervised context) perform very similarly, and all do better than with 75 labeled frames.
So how many frames should I extract and label at the very beginning? As few as tens, or as many as hundreds?

The typical workflow I would recommend is to start with ~100 labeled frames from 2-4 different videos. With this you should be able to train models that give somewhat reasonable predictions, which you can then use for some preliminary analyses. This is a good regime if you think you might later change the experimental setup, the specific keypoints you're labeling, etc.
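To make that concrete, here is a minimal sketch (not anything Lightning Pose-specific) of one way to pull an evenly spaced set of frames out of a few videos for that first labeling round; the file names and counts are placeholders, adapt them to your data:

```python
# Minimal sketch: uniformly sample an initial set of frames to label from a few
# videos. Paths and counts below are placeholders, not part of Lightning Pose.
import os

import cv2  # pip install opencv-python
import numpy as np

video_files = ["session_01.mp4", "session_02.mp4", "session_03.mp4"]  # hypothetical
frames_per_video = 35  # ~100 frames total across 3 videos
out_dir = "frames-to-label"
os.makedirs(out_dir, exist_ok=True)

for video_file in video_files:
    cap = cv2.VideoCapture(video_file)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # evenly spaced indices give coverage of the whole session
    idxs = np.linspace(0, n_frames - 1, frames_per_video).astype(int)
    name = os.path.splitext(os.path.basename(video_file))[0]
    for idx in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(os.path.join(out_dir, f"{name}_frame_{idx:06d}.png"), frame)
    cap.release()
```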

Once you are happy with your experimental setup and plan to acquire a lot of data, then it is time to reassess how good the pose tracking is, and how good it actually needs to be for your scientific question of interest. If all you end up analyzing with the pose estimates is where an animal is located in an open field, then maybe super precise tracking of the keypoints isn't necessary. But if you care about very subtle changes in pose then precision is much more important.

If you decide you need better predictions, I would recommend labeling another 100-200 frames across multiple videos (maybe 20-50 frames per video), training another model, and reassessing the output (more on this in the next paragraph). And then repeat this process until you are happy with the results. Even once you reach a point where you think you are satisfied, I would recommend recording information about videos/frames that you come across with poor predictions. You might find that after a couple months you have collected enough of these problematic frames to warrant another round of labeling.
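For keeping that running record of problematic videos/frames, even something as simple as appending to a CSV works; this is just a sketch, and the file name and columns are arbitrary choices:

```python
# Minimal sketch: keep a running CSV of videos/frames with poor predictions so
# you can revisit them for the next labeling round. Not part of Lightning Pose.
import csv
import os


def log_problem_frame(video, frame_idx, note, log_path="problem_frames.csv"):
    new_file = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["video", "frame_idx", "note"])
        writer.writerow([video, frame_idx, note])


log_problem_frame("session_07.mp4", 10452, "paw occluded, jumpy prediction")
```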

Regarding evaluation of the models: there are multiple ways to assess model performance, and each has its advantages and disadvantages. One option is to keep a test set of labeled frames from animals the model is never trained on, and compute pixel error on this test set for each new network (similar to Figure 1 in the LP preprint). The advantage is that you get a quantitative number that is easy to track over multiple network trainings. The disadvantage is that this will only ever be a small subset of frames, and will miss important aspects of the model predictions. A second option is to keep a test set of unlabeled videos from animals the model was never trained on, and compute the various metrics on the outputs of multiple networks. The disadvantage is that you don't know exactly which predictions are right or wrong, but the advantage is that you can look over a much larger number of frames.
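For the first option, the pixel error computation itself is simple. Here is a rough sketch, assuming you already have ground-truth and predicted (x, y) coordinates as arrays of shape (n_frames, n_keypoints, 2); how you load them depends on your pipeline:

```python
# Minimal sketch: pixel error on a held-out labeled test set.
# preds_test and gt_test are assumed to be arrays of shape
# (n_frames, n_keypoints, 2) that you have already loaded.
import numpy as np


def pixel_error(preds, ground_truth):
    """Euclidean distance per keypoint per frame; NaNs mark unlabeled keypoints."""
    return np.linalg.norm(preds - ground_truth, axis=-1)  # (n_frames, n_keypoints)


err = pixel_error(preds_test, gt_test)
print("mean pixel error per keypoint:", np.nanmean(err, axis=0))
print("overall mean pixel error:", np.nanmean(err))
```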

Besides plotting these quantitative metrics, I will always, always recommend just watching snippets of the video overlaid with model predictions; you'll get a much better (though qualitative) feel for how the model is doing.
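If you want to save such snippets to disk rather than view them interactively, something like the following OpenCV sketch works. It assumes `preds` is an (n_frames, n_keypoints, 2) array of (x, y) predictions aligned with the video frames, which is an assumption about how you store your outputs, not a Lightning Pose API:

```python
# Minimal sketch: overlay predicted keypoints on a short video snippet and
# write it to a new file. `preds` is assumed to be (n_frames, n_keypoints, 2).
import cv2
import numpy as np


def write_labeled_snippet(video_file, preds, out_file, start=0, n_frames=300):
    cap = cv2.VideoCapture(video_file)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_file, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    for i in range(start, min(start + n_frames, preds.shape[0])):
        ok, frame = cap.read()
        if not ok:
            break
        for x, y in preds[i]:
            if not (np.isnan(x) or np.isnan(y)):
                cv2.circle(frame, (int(x), int(y)), 4, (0, 0, 255), -1)
        writer.write(frame)
    cap.release()
    writer.release()
```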

Finally, I'll mention that training multiple models with different random seeds (training.rng_seed_data_pt) and computing the variance of the predictions for each keypoint on each frame is a great way to assess model performance. This "ensembling variance" can (1) identify problematic frames for future labeling; and (2) provide another metric that you can continually compute on held-out test videos as you add more labels. This ensembling variance is highly correlated with pixel error.
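As a rough sketch of what I mean by ensembling variance (again assuming each model's predictions are stored as an (n_frames, n_keypoints, 2) array, one per seed; the array layout and how you rank frames are up to you):

```python
# Minimal sketch: "ensembling variance" across models trained with different
# random seeds. preds_seed0, preds_seed1, preds_seed2 are hypothetical arrays
# of shape (n_frames, n_keypoints, 2), one per trained model.
import numpy as np


def ensemble_variance(pred_list):
    """Variance of (x, y) predictions across ensemble members, summed over x and y.

    pred_list: list of arrays, each (n_frames, n_keypoints, 2).
    Returns an array of shape (n_frames, n_keypoints).
    """
    stack = np.stack(pred_list, axis=0)  # (n_models, n_frames, n_keypoints, 2)
    var_xy = np.var(stack, axis=0)       # variance across models
    return var_xy.sum(axis=-1)           # combine x and y variance


ens_var = ensemble_variance([preds_seed0, preds_seed1, preds_seed2])
# frames with the highest variance are good candidates for the next labeling round
worst_frames = np.argsort(np.nanmax(ens_var, axis=1))[::-1][:50]
```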

@danbider anything to add here?

EDIT: I realized I never explicitly answered your question about how many frames to label. TL;DR: start with 100; you might find that you need to go up to something like 500. If you have particularly complex behaviors or a lot of variability across animals/sessions, you might need to get closer to 1000. These are just ballpark estimates; again, it really comes down to your specific setup and how precise the predictions need to be to answer your scientific questions.

I have updated the Pose-app docs to include a version of this answer in the FAQ: https://pose-app.readthedocs.io/en/latest/source/faqs.html