> In WordPiece tokenization, one word may be split into multiple sub-words. How should we handle the label?

Question

Closed this issue 2 years ago · 2 comments

In WordPiece tokenization, one word may be split into multiple sub-words. How should we handle the label in that case?

The first piece of a split word keeps the word's original label, and each additional piece gets a special placeholder label at its position in the label sequence.
For example, using the special flag 'X':
['Nadim', 'Ladki', 'AL-AIN', ','] -----> ['Nadim', 'Ladki', 'AL', '-', '[UNK]', ',']
['B-PER', 'I-PER', 'B-LOC', 'O'] ------> ['B-PER', 'I-PER', 'B-LOC', 'X', 'X', 'O']
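
The rule above can be sketched in a few lines of Python. This is only an illustration: `align_labels` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a stand-in that splits only around hyphens, not a real WordPiece tokenizer (so it yields 'AIN' where the example above shows '[UNK]'). In practice you would pass in your actual tokenizer's per-word tokenize function.

```python
import re
from typing import Callable, List, Tuple


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
    pad_label: str = "X",
) -> Tuple[List[str], List[str]]:
    """Expand word-level labels to sub-word level.

    The first piece of each word keeps the word's label; every
    additional piece gets `pad_label`.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    tokens: List[str] = []
    aligned: List[str] = []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            # Assumption: a word that yields no pieces is dropped with its label.
            continue
        tokens.extend(pieces)
        aligned.append(label)
        aligned.extend([pad_label] * (len(pieces) - 1))
    return tokens, aligned


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for WordPiece: split around hyphens, keep the hyphen.
    return [piece for piece in re.split(r"(-)", word) if piece]


tokens, aligned = align_labels(
    ["Nadim", "Ladki", "AL-AIN", ","],
    ["B-PER", "I-PER", "B-LOC", "O"],
    toy_tokenize,
)
print(tokens)
print(aligned)
```

Keeping the tokenizer as a parameter means the same alignment logic works with whatever sub-word tokenizer you use, as long as it can be called one word at a time.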

Originally posted by @yuanxiaosc in google-research/bert#291 (comment)

Answer 1 · 2022-10-27T16:26:20.000Z

Got it, thank you!

Answer 2 · 2022-10-27T16:28:06.000Z

ANTHONY MARCELLINUS