google-research/text-to-text-transfer-transformer

Span corruption in data preprocessing not working as expected

Ethan-yt opened this issue · 1 comment

The returned mask consists of: non-noise + noise + non-noise + noise + ... + non-noise + noise.
This means the mask always starts with a non-noise span and always ends with a noise span.
That has little practical effect on long inputs. However, when the input is short, say 20 tokens, the output contains only one noise span, and it always falls on the last few tokens.

(See `random_spans_noise_mask(length, ...)` in `t5/data/preprocessors.py`.)
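To make the effect concrete, here is a minimal NumPy sketch of the interleaving described above. This is not the library code: `current_style_mask` and `_random_partition` are hypothetical helper names and the span-count rounding is simplified, but it follows the same non-noise-then-noise pairing.

```python
import numpy as np

def _random_partition(total, num_segments, rng):
    """Split `total` items into `num_segments` non-empty random parts."""
    cuts = np.sort(rng.choice(total - 1, num_segments - 1, replace=False)) + 1
    return np.diff(np.concatenate([[0], cuts, [total]]))

def current_style_mask(length, noise_density=0.15,
                       mean_noise_span_length=3.0, seed=0):
    """Sketch of the current pattern:
    non-noise + noise + non-noise + noise + ... + non-noise + noise."""
    rng = np.random.default_rng(seed)
    num_noise = max(1, int(round(length * noise_density)))
    num_spans = max(1, int(round(num_noise / mean_noise_span_length)))
    noise_lens = _random_partition(num_noise, num_spans, rng)
    nonnoise_lens = _random_partition(length - num_noise, num_spans, rng)
    mask = np.zeros(length, dtype=bool)
    pos = 0
    for nonnoise_len, noise_len in zip(nonnoise_lens, noise_lens):
        pos += nonnoise_len                  # non-noise segment first ...
        mask[pos:pos + noise_len] = True     # ... then its noise segment
        pos += noise_len
    return mask

# A 20-token input at 15% noise density gets exactly one noise span,
# and that span is forced onto the last few positions:
print(current_style_mask(20).astype(int))
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
```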

I think a better mask layout would be: non-noise (allow empty) + noise + ... + noise + non-noise (allow empty). That way, the noise spans would be distributed randomly across the whole sequence instead of being anchored to the end. A sketch of this layout follows.
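For comparison, here is a sketch of that proposed layout, reusing `_random_partition` and the `numpy` import from the sketch above. The names and the way non-noise tokens are spread over the slots are my own choices (a different split distribution than the library's partitioning); the point is only that the boundary non-noise spans may be empty, so the noise spans can land anywhere.

```python
def proposed_style_mask(length, noise_density=0.15,
                        mean_noise_span_length=3.0, seed=0):
    """Sketch of the proposed pattern:
    non-noise (may be empty) + noise + ... + noise + non-noise (may be empty)."""
    rng = np.random.default_rng(seed)
    num_noise = max(1, int(round(length * noise_density)))
    num_spans = max(1, int(round(num_noise / mean_noise_span_length)))
    noise_lens = _random_partition(num_noise, num_spans, rng)
    # Spread the non-noise tokens over num_spans + 1 slots: the interior
    # slots (separators between noise spans) stay non-empty, while the
    # first and last slots may be empty.
    num_nonnoise = length - num_noise
    free_tokens = num_nonnoise - (num_spans - 1)
    nonnoise_lens = np.bincount(rng.integers(0, num_spans + 1, free_tokens),
                                minlength=num_spans + 1)
    nonnoise_lens[1:-1] += 1   # guarantee interior separators are non-empty
    mask = np.zeros(length, dtype=bool)
    pos = nonnoise_lens[0]
    for noise_len, nonnoise_len in zip(noise_lens, nonnoise_lens[1:]):
        mask[pos:pos + noise_len] = True
        pos += noise_len + nonnoise_len
    return mask

# The single noise span of a 20-token input can now appear anywhere,
# not just at the tail:
for seed in range(3):
    print(proposed_style_mask(20, seed=seed).astype(int))
```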