curiosity-ai/catalyst

Catalyst.Training Details Request: OntoText & UD Version

dgerding opened this issue · 2 comments

Hi,
I'm trying to add the closest match UD resources and Ontonotes resources to run WikiNERTraining.

Can you point me to which US English UD files you are using? Is it UD_English-EWT?

And which Ontonotes data? Is it CoNLL-formatted and/or 5.0 (like https://github.com/ontonotes/conll-formatted-ontonotes-5.0/tree/master/conll-formatted-ontonotes-5.0/data )?

Thanks!

I'm going to assume you're using Ontonotes 5 from LDC like everyone else.

Still wondering about EWT version of UD.

Hi @dgerding
Yesterday I updated our models to use the data from UD 2.7. I also switched to a new distribution model over NuGet and fixed a couple of issues with the training data from some English files that had the text removed.

For WikiNER, we use the data provided here.
And obviously for English we use Ontonotes - that dataset is unfortunately not available for direct download, but you can request access here.