Details about training corpus, availability of pre-trained model and data
Closed this issue · 10 comments
The "train_1b.tgt" file is built from the One Billion Benchmark. Tokenization was applied to each sentence.
I will share them later.
Thank you
Can you please share the pre-processed data and pre-trained models?
Can you share ablation metrics for the pre-trained model with and without a spell-corrector? Thanks.
I didn't run that ablation, but I can later provide the spell-corrected source sentences of the CoNLL-2014 test set.
Thanks.
Can you please provide the spell-corrected source sentences for the CoNLL-2014 test set?
Thanks.
Already shared.
Can you please provide the link to download train_1b.tgt?
It is the tokenized One Billion Word Benchmark dataset; you can download and tokenize it yourself.
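For reference, a sketch of fetching and extracting the benchmark before running the tokenization step above. The URL is the one commonly cited for the r13 release on statmt.org, so treat it as an assumption and verify it before use:

```python
# Sketch: download and extract the One Billion Word Benchmark.
# The URL is an assumption (commonly cited r13 release); verify before use.
import tarfile
import urllib.request

URL = ("http://www.statmt.org/lm-benchmark/"
       "1-billion-word-language-modeling-benchmark-r13output.tar.gz")

urllib.request.urlretrieve(URL, "1b-benchmark.tar.gz")
with tarfile.open("1b-benchmark.tar.gz") as tar:
    tar.extractall("1b-benchmark")  # then tokenize the shards as sketched above
```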