MSU-AI/SignLanguageTranslator

Testing Overall Accuracy and Performance


The original project states that about a dozen reference videos are needed for high-accuracy sign recognition, and around 5 for 'decent' accuracy.

We should test these numbers ourselves and confirm they hold. Ideally we should find the minimum number of videos needed for 'basic' accuracy, as well as the number needed for 'high' accuracy.
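One way to pin these numbers down is a hold-one-out sweep over the number of reference videos per sign. Below is a rough sketch of that idea, assuming a dataset laid out as sign → list of landmark sequences and a hypothetical `recognize(references, query)` helper; neither name is the project's actual API, just placeholders for whatever the matcher ends up exposing.

```python
import random

def accuracy_for_n_references(dataset, recognize, n_refs, seed=0):
    """Estimate top-1 accuracy when each sign keeps only n_refs reference videos.

    dataset:   dict mapping sign name -> list of landmark sequences for that sign
    recognize: callable(references, query) -> predicted sign name (hypothetical)
    Holds one video out per sign as the query; the remaining videos
    (capped at n_refs) become that sign's references.
    """
    rng = random.Random(seed)
    references, queries = {}, []
    for sign, videos in dataset.items():
        if len(videos) < 2:
            continue  # need at least one reference plus one held-out query
        shuffled = videos[:]
        rng.shuffle(shuffled)
        queries.append((sign, shuffled[0]))
        references[sign] = shuffled[1:1 + n_refs]

    correct = sum(1 for sign, query in queries
                  if recognize(references, query) == sign)
    return correct / len(queries) if queries else 0.0

# Example sweep over the range discussed above (1 to 12 references per sign):
# for n in range(1, 13):
#     print(n, accuracy_for_n_references(dataset, recognize, n))
```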

Another thing to note is the performance impact of a large dataset. Each set of frames under consideration is checked against every sign, so adding more videos may slow recognition down. We should find the maximum number of videos we can use before performance becomes unacceptable.
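To quantify that, a minimal timing sketch like the one below could be run at each dataset size. It assumes the same hypothetical `recognize(references, query)` interface as above and simply averages wall-clock time per query:

```python
import statistics
import time

def latency_per_query(references, recognize, queries, repeats=5):
    """Average seconds per recognition call against a given reference set.

    references: dict mapping sign name -> list of reference landmark sequences
    recognize:  callable(references, query) -> predicted sign name (hypothetical)
    queries:    list of landmark sequences to run through the recognizer
    """
    per_query_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            recognize(references, query)
        per_query_times.append((time.perf_counter() - start) / len(queries))
    return statistics.mean(per_query_times)
```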

Finally, both results should be combined to find the number of videos that gives a level of accuracy and performance we find acceptable. This may require deciding which we care about more: speed or accuracy.
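Once both sweeps produce numbers, picking the count could be as simple as the sketch below; the accuracy and latency thresholds here are placeholders we would still need to agree on, not measured values.

```python
def pick_reference_count(results, min_accuracy=0.90, max_latency_s=0.25):
    """Smallest reference count that meets both targets, or None if none does.

    results: iterable of (n_videos, accuracy, latency_seconds) tuples
             collected from the accuracy and latency sweeps above.
    """
    for n_videos, accuracy, latency in sorted(results):
        if accuracy >= min_accuracy and latency <= max_latency_s:
            return n_videos
    return None
```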

This is not an easy problem to tackle! It will likely be revisited and worked upon many times. We can put findings and discussion about speed and accuracy here.

I'm attempting to benchmark the performance of this thing, and I've run into some issues. Noting them here:

  • Some videos' landmarks cannot be extracted. I'm not quite sure what's going on, but I'm skipping them for now. My feeling is that a couple of videos here and there aren't a big deal, but examples include `home-home-sports ASL` and `stop-How To Sign The Word Stop In ASL`.
  • Of greater concern is the model's inability to differentiate between signs like "please" and "me". Using `benchmark_signs` in f90a845, most things displayed as "unknown sign". But reading the console output of distances, it still correctly ranked signs most of the time (see the sketch after this list). More investigation needed.
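For the second point, one way to get past the "unknown sign" threshold is to score the printed distances directly as rankings. Here's a sketch, assuming we can collect the per-sign distances for each test clip into a dict; that isn't something `benchmark_signs` exposes today, as far as I know, so this would need a small change there first.

```python
def top_k_accuracy(distance_results, k=3):
    """Score the matcher's rankings directly from per-sign distances,
    bypassing the 'unknown sign' threshold that hides correct matches.

    distance_results: list of (true_sign, distances) pairs, where distances
    maps candidate sign -> distance (lower means closer).
    """
    hits = 0
    for true_sign, distances in distance_results:
        ranked = sorted(distances, key=distances.get)
        if true_sign in ranked[:k]:
            hits += 1
    return hits / len(distance_results) if distance_results else 0.0
```

If top-3 accuracy turns out to be much higher than top-1, that would point at the threshold (or tie-breaking between close signs like "please" and "me") rather than the distance metric itself.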