thammegowda/mtdata

CCMatrix?

Closed this issue · 1 comments

kpu commented

CCMatrix seems to be missing from the corpora, perhaps due to an old OPUS crawl? https://opus.nlpl.eu/CCMatrix.php

Most likely because of an old crawl. I am not sure when CCMatrix was added to OPUS.

Going forward, we should have an easy way to sync with OPUS.

Taking notes (for myself) for resolving this issue:

# download all datasets as JSON
$ curl "https://opus.nlpl.eu/opusapi/?preprocessing=moses" > opus_all.json 
# JSON is 34MB

# Parse JSON to TSV (one row per corpus, version, source, target)
$ jq -r '.corpora[] | [.corpus, .version, .source, .target] | @tsv' opus_all.json | sort > opus_all.tsv
# TSV is 2.5MB; 124011 datasets
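
For the sync step itself, here is a rough Python sketch of the same flattening, plus a check for corpus names that are not yet indexed. It assumes the JSON has the shape implied by the jq filter above (a top-level `corpora` list whose entries carry `corpus`, `version`, `source` and `target`). The helper names `corpora_to_rows` and `missing_corpora`, the `known_names` set, and the sample records are all hypothetical, made up for illustration; they are not mtdata code or real OPUS output.

```python
import json

FIELDS = ("corpus", "version", "source", "target")


def corpora_to_rows(payload):
    """Flatten payload["corpora"] into sorted (corpus, version, source, target) tuples.

    Missing or null fields become empty strings, as jq's @tsv does for null.
    """
    rows = []
    for entry in payload.get("corpora", []):
        values = (entry.get(key) for key in FIELDS)
        rows.append(tuple("" if v is None else str(v) for v in values))
    return sorted(rows)


def missing_corpora(rows, known_names):
    """Return corpus names present in rows but absent from known_names."""
    return sorted({row[0] for row in rows} - set(known_names))


# Hand-written sample in the assumed shape (NOT real OPUS data).
sample = json.loads("""
{"corpora": [
  {"corpus": "Europarl", "version": "v8", "source": "de", "target": "en"},
  {"corpus": "CCMatrix", "version": "v1", "source": "en", "target": "de"}
]}
""")

rows = corpora_to_rows(sample)
# known_names stands in for whatever the local index already lists.
new_names = missing_corpora(rows, known_names={"Europarl"})

for row in rows:
    print("\t".join(row))
print("not yet indexed:", new_names)
```

With the real download, `sample` would be replaced by `json.load(open("opus_all.json"))`, and `known_names` by the corpus names the current index already covers.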