Corpus?

Question

dgerding opened this issue 5 years ago · 3 comments

Can you point to the Universal Dependencies data you used? Or could you include it, I'm guessing, in the Corpus project? I'm really excited to be able to try training.

Thanks
Dave G

dgerding commented 5 years ago

Thanks!

Answer 1 · 2019-10-26T16:46:49.000Z

Hi Dave,

The training data used for the Catalyst.Training project can be found below:

FastText Language Detection: Original post: https://fasttext.cc/blog/2017/10/02/blog-post.html and dataset here: http://downloads.tatoeba.org/exports/sentences.tar.bz2
CLD2 Language Detection: https://github.com/CLD2Owners/cld2/ (although I cannot find where exactly I got the data)
Universal Dependencies: https://universaldependencies.org/#download
OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19 (the website seems to be offline as of today, but a cached version is available here: https://webcache.googleusercontent.com/search?q=cache:KsBYQVqINjQJ:https://catalog.ldc.upenn.edu/LDC2013T19+&cd=1&hl=en&ct=clnk&gl=de ). Unfortunately, OntoNotes has quite a restrictive license, so I can't send you a direct link. In any case, OntoNotes is not necessary for training; you can train with only the UD dataset.
WikiNER: https://github.com/dice-group/FOX/tree/master/input/Wikiner

You can also use the pre-trained models available in the online repository, for example:

// Configures the model storage to use the online repository, backed by the local folder ./catalyst-models/
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.English, version: Version.Latest, tag: "WikiNER"));
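
For completeness, here is a minimal sketch of the same snippet as a full console program that also runs the pipeline on one sentence. It assumes the Catalyst API as it stood when this answer was written; the using directives, the sample sentence, and the ProcessSingle and ToJson calls are taken from my recollection of the Catalyst samples rather than from this thread, so treat them as assumptions and check them against the version you install.

```csharp
using System;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
// Assumed alias: avoids the clash between System.Version and Mosaik.Core.Version.
using Version = Mosaik.Core.Version;

public static class Program
{
    public static async Task Main()
    {
        // Same storage setup as in the snippet above: online repository,
        // cached in the local folder ./catalyst-models/
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));

        // Load the default English pipeline and add the WikiNER entity recognizer.
        var nlp = await Pipeline.ForAsync(Language.English);
        nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(
            language: Language.English, version: Version.Latest, tag: "WikiNER"));

        // Hypothetical sample sentence, used only for illustration.
        var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

        // Run every model in the pipeline on the document, then dump the annotations.
        nlp.ProcessSingle(doc);
        Console.WriteLine(doc.ToJson());
    }
}
```

The first run needs network access, since the models are fetched from the online repository and cached in the local folder.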

If you want, I can also provide you with a direct download link for all the data - it's about 3.4 GB without the OntoNotes dataset.

Answer 2 · 2021-06-17T12:02:00.000Z

Hi! I know this issue is long closed, but I would be grateful if that download link was published :^)