Is the edit space consistent between pre-training and fine-tuning?
Serenade-J opened this issue · 1 comment
Hi, I have a question: is the edit space (\Sigma_a) consistent between pre-training and fine-tuning? Is it derived from the Lang-8 dataset, even though the distribution of the synthetic data differs from that of Lang-8?
Yes, the edit space is consistent across the synthetic pre-training and fine-tuning stages of the GEC model. It was generated as \Sigma_a from here.
\Sigma_a here is composed of word-piece unigrams (common_inserts.p) and bigrams (common_multitoken_inserts.p).
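For reference, a minimal sketch of how these pickles can be inspected; the `pickles/` path is an assumption, so point it at wherever the files live in your checkout:

```python
import pickle

# Paths are assumptions; adjust to your local copy of the repo.
with open("pickles/common_inserts.p", "rb") as f:
    unigram_inserts = pickle.load(f)      # word-piece unigrams in \Sigma_a
with open("pickles/common_multitoken_inserts.p", "rb") as f:
    multitoken_inserts = pickle.load(f)   # word-piece bigrams in \Sigma_a

print(len(unigram_inserts), "unigram edits,",
      len(multitoken_inserts), "multi-token edits")
```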
However, while generating the pseudo data we did not use a word-piece tokenizer, so the pickle files in the errorify directory are somewhat different: they contain whole words rather than word-pieces. In addition, the replace pickle in the errorify directory contains a mapping from words to their commonly substituted replacements. This is helpful for introducing systematic errors while generating synthetic GEC data, roughly as sketched below.
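Here is a hedged sketch of how such a replacement mapping could be used to inject systematic errors into clean text. The file name `common_replaces.p` and the dict-of-counts layout are illustrative assumptions, not the repo's exact format:

```python
import pickle
import random

# Hypothetical file name/layout; the real pickle may be structured differently.
with open("errorify/common_replaces.p", "rb") as f:
    common_replaces = pickle.load(f)  # e.g. {"their": {"there": 37, "they're": 12}}

def errorify(tokens, p_replace=0.1):
    """Randomly swap words for their commonly confused counterparts."""
    noisy = []
    for tok in tokens:
        candidates = common_replaces.get(tok)
        if candidates and random.random() < p_replace:
            # Sample a replacement in proportion to how often it was observed.
            words, counts = zip(*candidates.items())
            noisy.append(random.choices(words, weights=counts, k=1)[0])
        else:
            noisy.append(tok)
    return noisy

print(" ".join(errorify("I like their new house".split(), p_replace=0.5)))
```

Sampling replacements by observed frequency is what makes the injected errors systematic rather than uniformly random noise.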
All the pickles were obtained by extracting diffs from the Lang-8, NUCLE, and FCE datasets.
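For intuition, something like the following word-level diff can harvest insert and replace counts from parallel (erroneous, corrected) pairs. Python's `difflib` is used here as a stand-in; the repo's actual extraction procedure may differ:

```python
from collections import Counter
from difflib import SequenceMatcher

def collect_edits(src_tokens, trg_tokens, inserts, replaces):
    """Accumulate insert and one-for-one replace counts from one sentence pair."""
    sm = SequenceMatcher(a=src_tokens, b=trg_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":  # tokens present only in the corrected side
            for w in trg_tokens[j1:j2]:
                inserts[w] += 1
        elif op == "replace" and (i2 - i1) == (j2 - j1) == 1:
            replaces[(src_tokens[i1], trg_tokens[j1])] += 1  # single-token substitution

inserts, replaces = Counter(), Counter()
collect_edits("She go to school".split(),
              "She goes to the school".split(),
              inserts, replaces)
print(inserts.most_common())   # [('the', 1)]
print(replaces.most_common())  # [(('go', 'goes'), 1)]
```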