Update "Corpus NGT": Features and broken link
First issue: it seems the NGT Corpus may have been superseded by a newer dataset?
Second issue: which link should we use?
The original NGT Corpus is still available at:
- https://corpusngt.nl/, which seems to be the official site.
- https://hdl.handle.net/1839/8e5a77a3-8d1a-492a-bc86-9a3398b0809c, which seems to be a "persistent identifier" for an archived version at The Language Archive. That said, this is the link SignBank's official site points to.
Third issue: there's RGB video... and it's got multiple views/angles of multiple signers. The fact that these are conversations between two signers seems relevant. Is there a way to capture all this? Or do we just say video:RGB?
Here's the dump of gloss counts:
ngt_gloss_counts_sorted.csv
Here they are sorted. Looks like only about 800 glosses have 10 or more examples, and about 2300 have more than 1 (that figure includes the 800). The rest, about 800ish, are one-offs.
Regardless, "3185" is the total gloss count I suppose.
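For anyone who wants to re-check those buckets, here's a rough sketch of how the tally could be done. It assumes the CSV has the gloss in the first column and its count in the second, which I haven't confirmed against the attached file; the function names and the threshold of 10 are just placeholders for illustration.

```python
import csv
import io


def read_counts(fileobj):
    """Yield (gloss, count) pairs from a CSV file object.

    ASSUMPTION: column 0 is the gloss and column 1 is its count.
    Rows whose second column is not an integer (e.g. a header) are skipped.
    """
    for row in csv.reader(fileobj):
        if len(row) < 2:
            continue
        try:
            count = int(row[1])
        except ValueError:
            continue  # header row or malformed line
        yield row[0], count


def bucket_gloss_counts(rows, frequent_threshold=10):
    """Tally glosses into the buckets discussed above.

    "frequent" is a subset of "repeated": a gloss with 12 examples
    is counted in both.
    """
    total = frequent = repeated = singletons = 0
    for _gloss, count in rows:
        total += 1
        if count >= frequent_threshold:
            frequent += 1
        if count > 1:
            repeated += 1
        elif count == 1:
            singletons += 1
    return {
        "total": total,
        "frequent": frequent,
        "repeated": repeated,
        "singletons": singletons,
    }


if __name__ == "__main__":
    # Toy data only, not the real NGT counts.
    toy = io.StringIO("gloss,count\nAAA,12\nBBB,3\nCCC,1\n")
    print(bucket_gloss_counts(read_counts(toy)))
```

To run it on the real dump, swap the toy `StringIO` for `open("ngt_gloss_counts_sorted.csv", newline="")`, adjusting the column order if the file turns out to be laid out differently.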
Not sure where "15 hours" comes from; the official citation says 12 hours at the time.
Similarly, the number of signers is all over the place depending on source.
Gonna go with this for "samples":
"#samples": "~2375 multi-cam, multi-signer sessions",
which is within the length limits and shorter than other entries in the list of datasets.