Span corruption in data preprocessing does not work as expected
Ethan-yt opened this issue · 1 comments
Ethan-yt commented
The returned mask always has the form: non-noise + noise + non-noise + noise + ... + non-noise + noise.
This means the first token is always non-noise and the final tokens are always noise.
This has little negative effect on long texts. However, when the input is short, for example 20 tokens, the output consists of a single noise span covering the last few tokens.
I think the mask should instead be: non-noise (may be empty) + noise + ... + noise + non-noise (may be empty). That way, the noise spans are randomly distributed across the whole text.
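
To make the proposed layout concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `noise_mask`, `random_segmentation`, and the default `noise_density` and `mean_span` values are hypothetical names and settings chosen for illustration. The idea is to keep the interior non-noise gaps at least one token long, while letting the leading and trailing gaps be empty.

```python
import random


def random_segmentation(num_items, num_segments, rng):
    """Split num_items into num_segments positive lengths at random cut points."""
    cuts = sorted(rng.sample(range(1, num_items), num_segments - 1))
    bounds = [0] + cuts + [num_items]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def noise_mask(length, noise_density=0.15, mean_span=3.0, seed=None):
    """Return a list of booleans (True = noise) of the given length (>= 2).

    Layout: non-noise (may be empty) + noise + ... + noise + non-noise (may be empty).
    """
    rng = random.Random(seed)
    num_noise = min(max(round(length * noise_density), 1), length - 1)
    num_nonnoise = length - num_noise
    num_spans = max(round(num_noise / mean_span), 1)
    # Each interior gap needs one non-noise token, so cap the span count.
    num_spans = min(num_spans, num_nonnoise + 1)

    noise_lens = random_segmentation(num_noise, num_spans, rng)

    # num_spans + 1 gaps: the first and last start at 0, interior gaps start at 1.
    gaps = [0] + [1] * (num_spans - 1) + [0]
    for _ in range(num_nonnoise - (num_spans - 1)):
        gaps[rng.randrange(len(gaps))] += 1

    mask = []
    for gap, span in zip(gaps, noise_lens):
        mask += [False] * gap + [True] * span
    mask += [False] * gaps[-1]
    return mask


if __name__ == "__main__":
    # Print a few 20-token masks; '#' marks noise, '.' marks non-noise.
    for s in range(5):
        print("".join("#" if m else "." for m in noise_mask(20, seed=s)))
```

With a 20-token input this still produces a single noise span, but its position varies across the sequence instead of always sitting at the end. Handing out the spare non-noise tokens one at a time is a simple choice for the sketch and does not make every layout equally likely.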
Ethan-yt commented