universal-ie/UIE

bug: DataCollatorForT5MLM

zdgithub opened this issue 2 years ago · 2 comments

zdgithub commented 2 years ago

Hello! I noticed that the function create_sentinel_ids() in your file t5mlm_data_collator.py contains the following line:
sentinel_ids = np.where(sentinel_ids != 0, (sentinel_ids + self.tokenizer.vocab_size - 101), 0)

However, the corresponding line in the official T5 implementation is:
sentinel_ids = np.where(sentinel_ids != 0, (sentinel_ids + self.tokenizer.vocab_size - 1), 0)

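Is this intentional? The official implementation uses the <extra_id_xxx> tokens as sentinel_ids, whereas your code appears to use some tokens from the end of the vocabulary as sentinel_ids. I understand that <extra_id_0> and similar tokens are given a new meaning in SEL, but using ordinary vocabulary tokens directly as the sentinels for span corruption still seems unreasonable.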

luyaojie commented 2 years ago

Hello, and thank you very much for your interest in our work!

In our setup, tokenizer.vocab_size=32100 and len(tokenizer)=32102.

We followed the Flax implementation of T5 in Huggingface:
https://github.com/huggingface/transformers/blob/4c8ec66a7433589436d13d95d48601f274c92b44/examples/flax/language-modeling/run_t5_mlm_flax.py#L378
The approach there is:

sentinel_ids = np.where(sentinel_ids != 0, (len(self.tokenizer) - sentinel_ids), 0)

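We use self.tokenizer.vocab_size rather than len(self.tokenizer) so that the <spot> and <asoc> tokens we appended to the end of the vocabulary are not counted.
With that change, the mask result generated in our code is <extra_id_0> text0 <extra_id_1> text1 ...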
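The effect of that substitution can be checked with plain arithmetic. The following is a minimal sketch rather than the repository's code. It assumes the usual Huggingface T5 layout, in which <extra_id_k> has id vocab_size - 1 - k, and it assumes that the two added tokens occupy the ids 32100 and 32101. The helper extra_id_index is hypothetical and exists only for this illustration.

```python
import numpy as np

VOCAB_SIZE = 32100     # tokenizer.vocab_size, as quoted in this thread
TOKENIZER_LEN = 32102  # len(tokenizer), includes the two added tokens

# Rank of each masked span within a sequence (1 = first span, 2 = second).
span_ranks = np.array([1, 2])

# Counting from len(tokenizer) versus counting from vocab_size.
ids_from_len = TOKENIZER_LEN - span_ranks
ids_from_vocab_size = VOCAB_SIZE - span_ranks


def extra_id_index(token_id):
    """Return k if token_id is <extra_id_k> under the ASSUMED layout
    (<extra_id_k> has id VOCAB_SIZE - 1 - k, k in 0..99), else None."""
    k = VOCAB_SIZE - 1 - int(token_id)
    return k if 0 <= k < 100 else None


print(ids_from_len.tolist())         # [32101, 32100]
print(ids_from_vocab_size.tolist())  # [32099, 32098]
print([extra_id_index(i) for i in ids_from_len])         # [None, None]
print([extra_id_index(i) for i in ids_from_vocab_size])  # [0, 1]
```

Under these assumptions, counting from len(tokenizer) lands on the two appended tokens, while counting from vocab_size lands on <extra_id_0> and <extra_id_1>.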

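As you pointed out, our structure generation already uses <extra_id_0>. As far as I recall, we therefore reversed the order in which the sentinels are assigned: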

sentinel_ids = np.where(sentinel_ids != 0, (sentinel_ids + self.tokenizer.vocab_size - 101), 0)

As a result, the mask result generated in our code is <extra_id_99> text0 <extra_id_98> text1 ...
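To see why the offset of 101 yields <extra_id_99> first, here is a hedged, standalone sketch of a create_sentinel_ids-style function. It is modeled on the general shape of the Huggingface Flax collator and is not copied from this repository. The constant VOCAB_SIZE takes the value quoted above, the helper sentinel_name is hypothetical, and the id layout (<extra_id_k> at id vocab_size - 1 - k) is an assumption.

```python
import numpy as np

VOCAB_SIZE = 32100  # tokenizer.vocab_size, as quoted in this thread


def create_sentinel_ids(mask_indices, offset=101):
    """Assign one sentinel id to the start of each masked span.

    mask_indices: boolean array of shape (batch, seq_len), True where masked.
    Positions inside a span but after its start are marked with -1.
    """
    mask = mask_indices.astype(np.int64)
    # 1 at the first position of every masked span, 0 elsewhere.
    start = mask - np.roll(mask, 1, axis=-1) * mask
    start[:, 0] = mask[:, 0]
    # Number the spans 1, 2, 3, ... at their start positions.
    ranks = np.where(start != 0, np.cumsum(start, axis=-1), start)
    # Map span rank to a token id; offset=101 gives rank 1 -> VOCAB_SIZE - 100.
    sentinel_ids = np.where(ranks != 0, ranks + VOCAB_SIZE - offset, 0)
    # Mark the remaining positions of each span with -1.
    sentinel_ids -= mask - start
    return sentinel_ids


def sentinel_name(token_id):
    """ASSUMED layout: <extra_id_k> has id VOCAB_SIZE - 1 - k."""
    return f"<extra_id_{VOCAB_SIZE - 1 - int(token_id)}>"


mask = np.array([[0, 1, 1, 0, 0, 1, 0]], dtype=bool)
ids = create_sentinel_ids(mask)
print(ids.tolist())  # [[0, 32000, -1, 0, 0, 32001, 0]]
print([sentinel_name(i) for i in ids[0] if i > 0])  # ['<extra_id_99>', '<extra_id_98>']
```

With this layout the first span receives id 32000 and the second 32001, which correspond to <extra_id_99> and <extra_id_98>, so the sentinels stay within the <extra_id_xxx> range but are consumed from the opposite end.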

zdgithub commented 2 years ago

OK, thanks for the reply!