Is the edit space consistent between pre-training and fine-tuning?
Serenade-J opened this issue · 1 comment
Hi, I have a question: is the edit space (\Sigma_a) consistent between pre-training and fine-tuning? Is it derived from the Lang-8 dataset, even though the distribution of the synthetic data differs from that of Lang-8?
Yes, the edit space is consistent across the synthetic pre-training and fine-tuning stages of the GEC model. It was generated as \Sigma_a from here.
\Sigma_a here is composed of word-piece unigrams (common_inserts.p) and bigrams (common_multitoken_inserts.p).
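For reference, a minimal sketch of how these pickles can be inspected; the `pickles/` path is an assumption, so point it at wherever the files live in your checkout:

```python
import pickle

# Paths are assumptions; adjust to your local copy of the repo.
with open("pickles/common_inserts.p", "rb") as f:
    unigram_inserts = pickle.load(f)      # word-piece unigrams in \Sigma_a
with open("pickles/common_multitoken_inserts.p", "rb") as f:
    multitoken_inserts = pickle.load(f)   # word-piece bigrams in \Sigma_a

print(len(unigram_inserts), "unigram edits,",
      len(multitoken_inserts), "multi-token edits")
```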
However, while generating the pseudo data we did not use a word-piece tokenizer, so the pickle files in the errorify directory are somewhat different: they contain whole words rather than word-pieces. In addition, the replace pickle in the errorify directory contains a mapping from words to their commonly substituted replacements. This is helpful for introducing systematic errors while generating synthetic GEC data, roughly as sketched below.
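Here is a hedged sketch of how such a replacement mapping could be used to inject systematic errors into clean text. The file name `common_replaces.p` and the dict-of-counts layout are illustrative assumptions, not the repo's exact format:

```python
import pickle
import random

# Hypothetical file name/layout; the real pickle may be structured differently.
with open("errorify/common_replaces.p", "rb") as f:
    common_replaces = pickle.load(f)  # e.g. {"their": {"there": 37, "they're": 12}}

def errorify(tokens, p_replace=0.1):
    """Randomly swap words for their commonly confused counterparts."""
    noisy = []
    for tok in tokens:
        candidates = common_replaces.get(tok)
        if candidates and random.random() < p_replace:
            # Sample a replacement in proportion to how often it was observed.
            words, counts = zip(*candidates.items())
            noisy.append(random.choices(words, weights=counts, k=1)[0])
        else:
            noisy.append(tok)
    return noisy

print(" ".join(errorify("I like their new house".split(), p_replace=0.5)))
```

Sampling replacements by observed frequency is what makes the injected errors systematic rather than uniformly random noise.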
All the pickles were obtained by extracting diffs from the Lang-8, NUCLE, and FCE datasets.
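For intuition, something like the following word-level diff can harvest insert and replace counts from parallel (erroneous, corrected) pairs. Python's `difflib` is used here as a stand-in; the repo's actual extraction procedure may differ:

```python
from collections import Counter
from difflib import SequenceMatcher

def collect_edits(src_tokens, trg_tokens, inserts, replaces):
    """Accumulate insert and one-for-one replace counts from one sentence pair."""
    sm = SequenceMatcher(a=src_tokens, b=trg_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":  # tokens present only in the corrected side
            for w in trg_tokens[j1:j2]:
                inserts[w] += 1
        elif op == "replace" and (i2 - i1) == (j2 - j1) == 1:
            replaces[(src_tokens[i1], trg_tokens[j1])] += 1  # single-token substitution

inserts, replaces = Counter(), Counter()
collect_edits("She go to school".split(),
              "She goes to the school".split(),
              inserts, replaces)
print(inserts.most_common())   # [('the', 1)]
print(replaces.most_common())  # [(('go', 'goes'), 1)]
```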