nusnlp/mlconvgec2018

About prepare_data.sh

Closed this issue · 5 comments

I saw that we have to remove empty target sentences from the NUCLE development data.
Do we have to do the same for the NUCLE training data?
Thank you very much.

The parallel training data (NUCLE + Lang-8) is cleaned (see https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh#L94) so that only non-empty sentence pairs are retained.
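For illustration, here is a minimal Python sketch of that filtering step. The actual pipeline does this inside prepare_data.sh with shell tools; the function and sample data below are only hypothetical stand-ins for the same idea:

```python
# Sketch of dropping empty sentence pairs from parallel data.
# The real cleaning happens in prepare_data.sh; this is only illustrative.

def filter_nonempty(pairs):
    """Keep only pairs where both source and target are non-empty."""
    return [(src, tgt) for src, tgt in pairs
            if src.strip() and tgt.strip()]

pairs = [
    ("He go to school .", "He goes to school ."),
    ("", "An empty source sentence ."),
    ("An empty target sentence .", ""),
]
print(filter_nonempty(pairs))
```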

In many <incorrect, correct> pairs of the Lang-8 data, there are additional comments. If we feed these to our models as is, I think it will not be very useful.
I guess the current data-cleaning scripts do not remove the additional comments provided along with the correct sentences.
Is there any script that handles this problem?

No, the current pre-processing pipeline does not include any specific rules to remove additional comments. However, the clean-corpus-n.perl script (from the Moses SMT toolkit), which is used within the preprocess.sh script, removes source-target sentence pairs that differ substantially in length.

Thanks,
I see. Removing source-target pairs where len(target) > 1.5 * len(source) rejects around 30% of the data. :(

I think the ratio used is 9, not 1.5. The script also removes sentences longer than 80 tokens or shorter than 1 token.
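For reference, a rough Python sketch of that kind of filtering, mirroring the min/max length and ratio checks of clean-corpus-n.perl with the values mentioned above (min 1 token, max 80 tokens, ratio 9). The exact Moses implementation differs in details, so treat this as an approximation:

```python
def keep_pair(src, tgt, min_len=1, max_len=80, ratio=9.0):
    """Approximation of the clean-corpus-n.perl checks: drop pairs
    whose side lengths fall outside [min_len, max_len] tokens, or
    whose length ratio exceeds `ratio` in either direction."""
    ns, nt = len(src.split()), len(tgt.split())
    if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
        return False  # too short or too long on either side
    if ns > ratio * nt or nt > ratio * ns:
        return False  # lengths too different
    return True
```

With a ratio of 9 rather than 1.5, far fewer pairs should be rejected on length grounds alone.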