thu-coai/DA-Transformer

Would you like to share the distilled datasets?

Closed this issue · 4 comments

Hi,
Thanks for your nice paper and code!
Would you like to share the distilled datasets used in this paper?

Yes. You can download our distilled data from the following links. All data are distilled from Transformer-big.

| Dataset | Direction | Direction | Split |
| --- | --- | --- | --- |
| WMT14 | En-De | De-En | valid/test |
| WMT17 | Zh-En | En-Zh | valid/test |

Sorry, but note that I lost the files containing the source texts of WMT14 and had to recover them from the original dataset. It is unlikely, but if you find any problem when using the WMT14 data (such as some sources being mistakenly paired with the wrong targets), please let me know. Thank you very much!

Some details:
The distilled dataset contains fewer samples than the original one because some sentences exceed the max length of our AT teacher. To recover the source file, I had to align the source and target texts. Fortunately, the sample order remains unchanged, so I achieved the alignment simply by removing some lines from the original dataset.
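The alignment procedure described above can be sketched roughly as follows. This is a hypothetical illustration (the function name, the whitespace tokenization, and the max-length value are assumptions, not the authors' actual script): pairs whose source or target exceeds the teacher's length limit are dropped, and because the remaining lines keep their original order, source and target stay aligned.

```python
# Hypothetical sketch of length-based filtering for a parallel corpus.
# Assumption: the teacher rejects sentences longer than `max_len` tokens,
# and filtering both sides in one pass preserves the pairwise alignment.
def filter_parallel(src_lines, tgt_lines, max_len=250):
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        # Keep a pair only if both sides fit within the teacher's max length.
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

Running the same filter over the original source file (with the same length criterion and order) would then reproduce a source file aligned line-by-line with the distilled targets.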

Thank you very much!
Are you using the original dictionaries of WMT14 En-De and WMT17 En-Zh?

Yes. Here are the dictionaries: En-De Zh-En

OK. Thanks again!!