Corpus?
dgerding opened this issue · 3 comments
dgerding commented
Can you point to the Universal Dependencies data you used? Or could you include it, I'm guessing in the Corpus project? I'm really excited to be able to try training.
Thanks
Dave G
theolivenbaum commented
Hi Dave,
The training data used for the Catalyst.Training project can be found below:
- FastText Language Detection: Original post: https://fasttext.cc/blog/2017/10/02/blog-post.html and dataset here: http://downloads.tatoeba.org/exports/sentences.tar.bz2
- CLD2 Language Detection: https://github.com/CLD2Owners/cld2/ (although I cannot find where exactly I got the data)
- Universal Dependencies: https://universaldependencies.org/#download
- OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19 (the website seems to be offline as of today; you can see a [cached version here](https://webcache.googleusercontent.com/search?q=cache:KsBYQVqINjQJ:https://catalog.ldc.upenn.edu/LDC2013T19+&cd=1&hl=en&ct=clnk&gl=de)). Unfortunately, OntoNotes has quite a restrictive license, so I can't send you a direct link. In any case, OntoNotes isn't necessary for training; you can train with only the UD dataset.
- WikiNER: https://github.com/dice-group/FOX/tree/master/input/Wikiner
You can also use the pre-trained models available in the online repository, for example:
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;
// Configure the model storage to use the online repository, backed by the local folder ./catalyst-models/
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.English, version: Version.Latest, tag: "WikiNER"));
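Once the pipeline is set up, running it over some text might look roughly like the sketch below. It assumes the `Document` constructor, `ProcessSingle` and `ToJson` calls as they appear in the Catalyst README, and the namespaces in the `using` lines are likewise taken from there; the sample sentence is made up for illustration. The exact calls may differ between Catalyst versions, so treat this as a starting point rather than the definitive usage.

```csharp
using System;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

// Same setup as in the snippet above: online repository, backed by ./catalyst-models/
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.English, version: Version.Latest, tag: "WikiNER"));

// Hypothetical sample text, not taken from any of the training datasets
var doc = new Document("Barack Obama was born in Hawaii.", Language.English);

// Run every step of the pipeline, including the entity recognizer added above
nlp.ProcessSingle(doc);

// Dump the annotated document as JSON to inspect what the models produced
Console.WriteLine(doc.ToJson());
```

This uses top-level statements, so it needs C# 9 or later; on older compilers the same lines would go inside an `async Task Main()` method.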
If you want, I can also provide you a direct download link for all the data - it's about 3.4GB without the OntoNotes dataset.
dgerding commented
Thanks!
ADD-eNavarro commented
Hi! I know this issue is long closed, but I would be grateful if that download link was published :^)