nusnlp/mlconvgec2018

About prepare_data.sh

Closed this issue · 5 comments

I saw that we have to remove empty target sentences from the NUCLE development data.
Do we have to do the same for the NUCLE training data?
Thank you very much.

The parallel training data (NUCLE + Lang-8) is cleaned (see https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh#L94) so that only non-empty sentence pairs are retained.
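For illustration, here is a minimal Python sketch of that filtering step. The actual pipeline does this inside prepare_data.sh with shell tools; the function and sample data below are only hypothetical stand-ins for the same idea:

```python
# Sketch of dropping empty sentence pairs from parallel data.
# The real cleaning happens in prepare_data.sh; this is only illustrative.

def filter_nonempty(pairs):
    """Keep only pairs where both source and target are non-empty."""
    return [(src, tgt) for src, tgt in pairs
            if src.strip() and tgt.strip()]

pairs = [
    ("He go to school .", "He goes to school ."),
    ("", "An empty source sentence ."),
    ("An empty target sentence .", ""),
]
print(filter_nonempty(pairs))
```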

In many <incorrect, correct> pairs of the Lang-8 data, there are additional comments. If we feed these to our models as is, I think it will not be very useful.
I guess the current data-cleaning scripts do not remove the additional comments provided along with the correct sentences.
Is there any script that handles this problem?

No, the current pre-processing pipeline does not include any specific rules to remove additional comments. However, the clean-corpus-n.perl script (from the Moses SMT toolkit), which is used within the preprocess.sh script, removes source-target sentence pairs that differ substantially in length.

Thanks,
I see. Removing source-target pairs where len(target) > 1.5 * len(source) rejects around 30% of the data. :(

I think the ratio used is 9, not 1.5. The script also removes sentences longer than 80 tokens or shorter than 1 token.
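For reference, a rough Python sketch of that kind of filtering, mirroring the min/max length and ratio checks of clean-corpus-n.perl with the values mentioned above (min 1 token, max 80 tokens, ratio 9). The exact Moses implementation differs in details, so treat this as an approximation:

```python
def keep_pair(src, tgt, min_len=1, max_len=80, ratio=9.0):
    """Approximation of the clean-corpus-n.perl checks: drop pairs
    whose side lengths fall outside [min_len, max_len] tokens, or
    whose length ratio exceeds `ratio` in either direction."""
    ns, nt = len(src.split()), len(tgt.split())
    if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
        return False  # too short or too long on either side
    if ns > ratio * nt or nt > ratio * ns:
        return False  # lengths too different
    return True
```

With a ratio of 9 rather than 1.5, far fewer pairs should be rejected on length grounds alone.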