kanyun-inc/fairseq-gec

Details about the training corpus and availability of pre-trained models and data

Thanks for open-sourcing the code. I have a few queries:

  • For pretraining and adding noise, 'train_1b.tgt' is used. Is this the same as concatenating all the files from the Google 1B dataset? Was any further pre-processing done?
  • Can a link to the pre-processed data and pre-trained files be provided?
    Thanks!
  • The "train_1b.tgt" file is built from the One Billion Benchmark. Tokenization was applied to each sentence.

  • I will share them later.
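For reference, here is a minimal sketch of how such a file could be built: concatenate the benchmark's shard files and write one tokenized sentence per line. The shard path pattern and the use of NLTK's word tokenizer are assumptions for illustration; the repository may use a different layout or tokenizer, and the standard benchmark release already ships whitespace-tokenized shards.

```python
# Minimal sketch (assumptions: shard layout of the One Billion Word Benchmark
# download, and NLTK's word tokenizer; not necessarily what fairseq-gec uses).
import glob
import nltk

nltk.download("punkt")  # tokenizer models, needed once

shard_pattern = (
    "1-billion-word-language-modeling-benchmark-r13output/"
    "training-monolingual.tokenized.shuffled/news.en-*"
)

with open("train_1b.tgt", "w", encoding="utf-8") as out:
    for shard in sorted(glob.glob(shard_pattern)):
        with open(shard, encoding="utf-8") as f:
            for line in f:
                tokens = nltk.word_tokenize(line.strip())
                out.write(" ".join(tokens) + "\n")
```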

Thank you

Can you please share the pre-processed data and pre-trained models?

Can you share the ablation metrics for the pre-trained model with and without using a spell-corrector? Thanks.

I didn't run that comparison, but I can later provide the spell-corrected source sentences of the CoNLL-2014 test set.

Can you please provide the spell-corrected source sentences for the CoNLL-2014 test set?
Thanks.

Already shared.

Can you please provide the link to download train_1b.tgt?

It is the tokenized One Billion Word Benchmark dataset. You can download and tokenize it yourself.
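Since the thread also mentions that train_1b.tgt is used "for pretraining and adding noise", below is a hedged sketch of the kind of noise injection commonly used to synthesize corrupted source sentences from clean target sentences for denoising pre-training. The operations (random deletion, duplication, local shuffling) and probabilities are illustrative assumptions, not the repository's exact scheme.

```python
# Illustrative sketch of synthesizing a noisy source sentence from a clean
# target sentence for denoising pre-training. The operations and probabilities
# are assumptions, not the exact procedure used by fairseq-gec.
import random

def add_noise(tokens, p_drop=0.1, p_dup=0.1, shuffle_window=3):
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue              # randomly drop a token
        noisy.append(tok)
        if r > 1.0 - p_dup:
            noisy.append(tok)     # randomly duplicate a token
    # lightly shuffle tokens within a small local window
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + random.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

clean = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(add_noise(clean)))
```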