thu-coai/DA-Transformer

Would you like to share the distilled datasets?

Closed this issue · 4 comments

Hi,
Thanks for your nice paper and code!
Would you like to share the distilled datasets used in this paper?

Yes. You can download our distilled data from the following links. All data are distilled from Transformer-big.

| Dataset | Direction | Direction | Split |
| --- | --- | --- | --- |
| WMT14 | En-De | De-En | valid/test |
| WMT17 | Zh-En | En-Zh | valid/test |

Sorry, but note that I lost the files containing the source texts of WMT14 and had to recover them from the original dataset. It is unlikely, but if you find any problem when using the WMT14 data (such as some sources being mistakenly paired with the wrong targets), please let me know. Thank you very much!

Some details:
The distilled dataset contains fewer samples than the original one because some sentences exceed the max length of our AT teacher. To recover the source file, I had to align the source and target texts. Fortunately, the sample order remains unchanged, so I achieved the alignment simply by removing some lines from the original dataset.
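The alignment procedure described above can be sketched roughly as follows. This is a hypothetical illustration (the function name, the whitespace tokenization, and the max-length value are assumptions, not the authors' actual script): pairs whose source or target exceeds the teacher's length limit are dropped, and because the remaining lines keep their original order, source and target stay aligned.

```python
# Hypothetical sketch of length-based filtering for a parallel corpus.
# Assumption: the teacher rejects sentences longer than `max_len` tokens,
# and filtering both sides in one pass preserves the pairwise alignment.
def filter_parallel(src_lines, tgt_lines, max_len=250):
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        # Keep a pair only if both sides fit within the teacher's max length.
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

Running the same filter over the original source file (with the same length criterion and order) would then reproduce a source file aligned line-by-line with the distilled targets.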

Thank you very much!
Are you using the original dictionaries of WMT14 En-De and WMT17 En-Zh?

Yes. Here are the dictionaries: En-De Zh-En

OK. Thanks again!!