kanyun-inc/fairseq-gec

Details about the training corpus and availability of pre-trained models and data

Thanks for open-sourcing the code. I have a few queries:

  • For pretraining and adding noise, 'train_1b.tgt' is used. Is this the same as concatenating all the files from the Google 1B dataset? Was any further pre-processing done?
  • Can a link to the pre-processed data and pre-trained files be provided?
    Thanks!
  • The "train_1b.tgt" file is built from the One Billion Benchmark. Tokenization was applied to each sentence.

  • I will share them later.
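For reference, here is a minimal sketch of how such a file could be built: concatenate the benchmark's shard files and write one tokenized sentence per line. The shard path pattern and the use of NLTK's word tokenizer are assumptions for illustration; the repository may use a different layout or tokenizer, and the standard benchmark release already ships whitespace-tokenized shards.

```python
# Minimal sketch (assumptions: shard layout of the One Billion Word Benchmark
# download, and NLTK's word tokenizer; not necessarily what fairseq-gec uses).
import glob
import nltk

nltk.download("punkt")  # tokenizer models, needed once

shard_pattern = (
    "1-billion-word-language-modeling-benchmark-r13output/"
    "training-monolingual.tokenized.shuffled/news.en-*"
)

with open("train_1b.tgt", "w", encoding="utf-8") as out:
    for shard in sorted(glob.glob(shard_pattern)):
        with open(shard, encoding="utf-8") as f:
            for line in f:
                tokens = nltk.word_tokenize(line.strip())
                out.write(" ".join(tokens) + "\n")
```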

Thank you

Can you please share the pre-processed data and pre-trained models?

Can you share the ablation metrics for the pre-trained model with and without using a spell-corrector? Thanks.

I didn't run that comparison, but I can later provide the spell-corrected source sentences of the CoNLL-2014 test set.

Can you please provide the spell-corrected source sentences for the CoNLL-2014 test set?
Thanks.

Already shared.

Can you please provide the link to download train_1b.tgt?

It is the tokenized One Billion Word Benchmark dataset. You can download and tokenize it yourself.
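Since the thread also mentions that train_1b.tgt is used "for pretraining and adding noise", below is a hedged sketch of the kind of noise injection commonly used to synthesize corrupted source sentences from clean target sentences for denoising pre-training. The operations (random deletion, duplication, local shuffling) and probabilities are illustrative assumptions, not the repository's exact scheme.

```python
# Illustrative sketch of synthesizing a noisy source sentence from a clean
# target sentence for denoising pre-training. The operations and probabilities
# are assumptions, not the exact procedure used by fairseq-gec.
import random

def add_noise(tokens, p_drop=0.1, p_dup=0.1, shuffle_window=3):
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue              # randomly drop a token
        noisy.append(tok)
        if r > 1.0 - p_dup:
            noisy.append(tok)     # randomly duplicate a token
    # lightly shuffle tokens within a small local window
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + random.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

clean = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(add_noise(clean)))
```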