sign-language-processing/sign-language-processing.github.io

Update "Corpus NGT": Features and broken link

Closed this issue · 11 comments

TODO:

Related:

One issue: it seems the NGT Corpus may be superseded by a newer dataset?

https://signbank.cls.ru.nl/ says:

[screenshot]

Second issue: which link?

The original NGT Corpus is still available at:

Third issue: the video is RGB, but it also has multiple views/angles of multiple signers. The fact that these are conversations between two signers seems relevant. Is there a way to capture all this? Or do we just say video:RGB?

I wasn't sure what the vocabulary of glosses was, so I just... counted it. Turns out it's 3185.

Here's the dump of gloss counts:

ngt_gloss_counts.json

ngt_gloss_counts_sorted.csv
Here they are sorted. Looks like only about 800 glosses have 10 or more examples, and about 2300 have more than 1. The rest, about 800ish, are one-offs.

Regardless, "3185" is the gloss vocabulary size (number of distinct glosses), I suppose.
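
For anyone who wants to redo the tallies above, here's a minimal sketch of the kind of counting involved. It assumes the counts are available as a flat gloss -> number-of-examples mapping; `summarize_gloss_counts` is a name made up for illustration, and the glosses below are toy values, not real Corpus NGT data.

```python
from collections import Counter


def summarize_gloss_counts(gloss_counts):
    """Summarize a mapping of gloss -> number of examples."""
    counts = list(gloss_counts.values())
    return {
        # Number of distinct glosses, i.e. the vocabulary size.
        "vocabulary": len(counts),
        # Glosses with 10 or more examples.
        "at_least_10": sum(1 for c in counts if c >= 10),
        # Glosses with more than one example.
        "more_than_1": sum(1 for c in counts if c > 1),
        # Glosses that occur exactly once.
        "one_offs": sum(1 for c in counts if c == 1),
    }


# Toy annotations for illustration only (NOT taken from the corpus).
toy_glosses = ["HUIS"] * 12 + ["FIETS"] * 3 + ["BOOM"]
toy_counts = Counter(toy_glosses)

summary = summarize_gloss_counts(toy_counts)
print(summary)
```

If the attached JSON dump really is a flat object of gloss -> count (an assumption, I haven't described its layout here), loading it with `json.load` and passing the result to the same function should give the numbers quoted above.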

As for the number of conversations, at the moment the official website lists 2278, not 2375. And then the dataloader lists 2280.

Not sure where "15 hours" comes from; the official citation said 12 at the time.

Similarly, the number of signers is all over the place depending on source.

Gonna go with this for "samples":

"#samples": "~2375 multi-cam, multi-signer sessions",

which is within the length limits and shorter than other entries in the list of datasets.
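
A quick way to sanity-check that kind of claim is to compare the candidate string against the values other entries already use. This is only a sketch: `fits_with_existing` is a hypothetical helper, and the "existing" strings are invented stand-ins, not actual entries from the dataset list.

```python
def fits_with_existing(candidate, existing_values):
    """Return True if candidate is no longer than the longest existing value."""
    return len(candidate) <= max(len(v) for v in existing_values)


# Hypothetical "#samples" strings standing in for other dataset entries.
existing = [
    "~1,000 videos of isolated signs",
    "About 25,000 sentence-level clips from broadcasts",
]
candidate = "~2375 multi-cam, multi-signer sessions"

print(len(candidate), fits_with_existing(candidate, existing))
```

Comparing against the longest existing value is just one reasonable stand-in for "the length limits"; if the site enforces a specific character limit, checking against that number directly would be better.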