sign-language-processing/datasets

dataset(wikidata): add wikidata parallel dataset for SignWriting and spoken languages


Wikidata includes some data written in SignWriting, for example https://www.wikidata.org/wiki/Special:EntityData/Q14759.json

Under "ase" it shows:

    "ase": {
        "language": "ase",
        "value": "M528x523S14c02497x497S14c0a472x500S2e85e483x478 M525x535S2e748483x510S10011501x466S2e704510x500S10019476x475 M551x515S1dc50504x485S1dc58474x485S26512449x501S26506536x501"
    },
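
The value above looks like three space-separated signs in Formal SignWriting (FSW). As a rough sketch of how a loader could split it into signs and symbols, assuming a simplified FSW grammar (a box marker `B`/`L`/`M`/`R` with a coordinate, followed by positioned symbols; optional prefixes and punctuation are not handled), something like this might work. `parse_fsw` and both regexes are hypothetical helpers, not part of any existing library:

```python
import re

# Assumed, simplified FSW grammar: a symbol key is "S" + 3 hex digits (base)
# + fill (0-5) + rotation (0-f), followed by an "XXXxYYY" position.
SYMBOL = re.compile(r"(S[0-9a-f]{3}[0-5][0-9a-f])(\d{3})x(\d{3})")
SIGN = re.compile(
    r"^([BLMR])(\d{3})x(\d{3})((?:S[0-9a-f]{3}[0-5][0-9a-f]\d{3}x\d{3})*)$"
)


def parse_fsw(text):
    """Split a space-separated FSW string into signs and their symbols."""
    signs = []
    for token in text.split():
        match = SIGN.match(token)
        if match is None:
            raise ValueError(f"not a sign in the simplified grammar: {token!r}")
        box, x, y, body = match.groups()
        symbols = [(key, int(sx), int(sy)) for key, sx, sy in SYMBOL.findall(body)]
        signs.append({"box": box, "coord": (int(x), int(y)), "symbols": symbols})
    return signs


# The "ase" value quoted in this issue.
value = (
    "M528x523S14c02497x497S14c0a472x500S2e85e483x478 "
    "M525x535S2e748483x510S10011501x466S2e704510x500S10019476x475 "
    "M551x515S1dc50504x485S1dc58474x485S26512449x501S26506536x501"
)
signs = parse_fsw(value)
print(len(signs), [len(sign["symbols"]) for sign in signs])  # 3 [3, 4, 4]
```

Raising on unmatched tokens, rather than skipping them, would make it easier to spot entries that use parts of FSW this sketch ignores.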

All the data with SignWriting:
https://w.wiki/6LDX

Each SignWriting entry is also attached to a concept ID, and that concept has a word in many spoken languages. This means that while there are only 543 entries, there could be up to 54K multilingual parallel examples (roughly 100 spoken-language words per entry).
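
A minimal sketch of how one entity could be expanded into parallel examples, assuming the "ase" block sits in the entity's `labels` map next to the spoken-language labels (that placement is an assumption based on the snippet's shape). `parallel_pairs` is a hypothetical helper, and the spoken-language values below are placeholders, not real Wikidata labels:

```python
def parallel_pairs(entity, sign_lang="ase"):
    """Pair the SignWriting label of one entity with each of its other labels.

    Assumes `entity` is shaped like one entry of the "entities" map in the
    Special:EntityData JSON, with a "labels" dict keyed by language code.
    """
    labels = entity.get("labels", {})
    sign = labels.get(sign_lang)
    if sign is None:
        return []
    return [
        {
            "id": entity.get("id"),
            "sign_language": sign_lang,
            "sign_writing": sign["value"],
            "spoken_language": lang,
            "text": label["value"],
        }
        for lang, label in sorted(labels.items())
        # Note: labels in other sign languages, if any, are not filtered here.
        if lang != sign_lang
    ]


# Abbreviated, partly hypothetical entity: only the "ase" value is from the issue.
entity = {
    "id": "Q14759",
    "labels": {
        "ase": {
            "language": "ase",
            "value": "M528x523S14c02497x497S14c0a472x500S2e85e483x478 "
            "M525x535S2e748483x510S10011501x466S2e704510x500S10019476x475 "
            "M551x515S1dc50504x485S1dc58474x485S26512449x501S26506536x501",
        },
        "en": {"language": "en", "value": "<English label placeholder>"},
        "fr": {"language": "fr", "value": "<French label placeholder>"},
    },
}

pairs = parallel_pairs(entity)
for pair in pairs:
    print(pair["spoken_language"], pair["text"])
```

Emitting one example per (entry, spoken language) pair is what would turn a few hundred entries into tens of thousands of examples; the real count depends on how many labels each concept actually has.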