jzhoubu/vsearch

How to get INVALID_TOKEN_IDS and VALID_TOKEN_IDS?


Hello,

I noticed that you use INVALID_TOKEN_IDS in the code to pre-filter unwanted tokens from the vocabulary. I'm curious how this list was generated.

In the paper, I only found the mention "discard the unused tokens, resulting in a vocabulary V with a size of |V|=29522," but the actual length of VALID_TOKEN_IDS in the code is only 27623. Could you share the specific method for generating INVALID_TOKEN_IDS?

Thank you very much for your attention to this issue.

Hi, @Clementine24, thank you for your interest. Below is the function used to filter valid tokens out of the BERT vocabulary.

import re
import string

def check_valid_token(token):
    # A token is valid if it contains only lowercase letters, digits, and
    # punctuation, and is not a special token such as [CLS], [SEP], or [unusedN].
    punctuation_escaped = re.escape(string.punctuation)
    pattern = f"[a-z0-9{punctuation_escaped}]*"
    return bool(re.fullmatch(pattern, token)) and not (token.startswith('[') and token.endswith(']'))
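For illustration, here is a minimal sketch of how the two ID lists could be built by applying the function over a vocabulary. The tiny `vocab` dict below is a hypothetical stand-in; for the real model you would iterate over the actual BERT vocabulary instead (e.g. via `BertTokenizer.from_pretrained("bert-base-uncased").get_vocab()` from the transformers library).

```python
import re
import string

def check_valid_token(token):
    # Valid: only lowercase letters, digits, and punctuation,
    # and not a bracketed special token like [CLS] or [unused5].
    punctuation_escaped = re.escape(string.punctuation)
    pattern = f"[a-z0-9{punctuation_escaped}]*"
    return bool(re.fullmatch(pattern, token)) and not (token.startswith('[') and token.endswith(']'))

# Hypothetical toy vocabulary standing in for tokenizer.get_vocab()
vocab = {"hello": 0, "[CLS]": 1, "##ing": 2, "World": 3, "!": 4, "[unused5]": 5}

VALID_TOKEN_IDS = sorted(i for t, i in vocab.items() if check_valid_token(t))
INVALID_TOKEN_IDS = sorted(i for t, i in vocab.items() if not check_valid_token(t))

print(VALID_TOKEN_IDS)    # [0, 2, 4]  ("hello", "##ing", "!")
print(INVALID_TOKEN_IDS)  # [1, 3, 5]  (special tokens and uppercase)
```

Note that under this rule, tokens containing uppercase letters (e.g. "World") are rejected along with bracketed special tokens, since the pattern only admits lowercase characters.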