How to get INVALID_TOKEN_IDS and VALID_TOKEN_IDS?
Closed this issue · 1 comment
Clementine24 commented
Hello,
I noticed that you use INVALID_TOKEN_IDS in the code to pre-filter unwanted tokens from the vocabulary. I'm very curious about how this list is generated.
In the paper, I only found the mention of "discard the unused tokens, resulting in a vocabulary V with a size of |V|=29522," but the actual length of VALID_TOKEN_IDS in the code is only 27623. Could you share the specific method for generating INVALID_TOKEN_IDS?
Thank you very much for your attention to this issue.
jzhoubu commented
Hi, @Clementine24, thank you for your interest. Below is a function that helps filter the valid tokens out of the BERT vocabulary.
import re
import string

def check_valid_token(token):
    # Keep tokens made only of lowercase letters, digits, and punctuation,
    # and drop bracketed special tokens such as [CLS], [SEP], [unused0], ...
    punctuation_escaped = re.escape(string.punctuation)
    pattern = f"[a-z0-9{punctuation_escaped}]*"
    return bool(re.fullmatch(pattern, token)) and not (token.startswith('[') and token.endswith(']'))
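For illustration, here is a minimal sketch of how the two ID lists could be derived by applying this filter across a vocabulary. The toy vocabulary below is hypothetical; in practice you would iterate over the full BERT vocabulary (e.g. `tokenizer.get_vocab()` from `bert-base-uncased`), and the helper is repeated so the snippet is self-contained.

```python
import re
import string

def check_valid_token(token):
    # Keep tokens made only of lowercase letters, digits, and punctuation,
    # and drop bracketed special tokens such as [CLS], [SEP], [unused0], ...
    punctuation_escaped = re.escape(string.punctuation)
    pattern = f"[a-z0-9{punctuation_escaped}]*"
    return bool(re.fullmatch(pattern, token)) and not (token.startswith('[') and token.endswith(']'))

# Hypothetical toy vocabulary standing in for the real token -> id mapping.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "hello": 2, "##ing": 3, "Ü": 4, "!": 5}

VALID_TOKEN_IDS = sorted(i for t, i in toy_vocab.items() if check_valid_token(t))
INVALID_TOKEN_IDS = sorted(i for t, i in toy_vocab.items() if not check_valid_token(t))

print(VALID_TOKEN_IDS)    # tokens like "hello", "##ing", "!"
print(INVALID_TOKEN_IDS)  # special tokens and non-lowercase/non-ASCII tokens
```

Note that WordPiece continuation tokens such as `##ing` pass the filter because `#` is in `string.punctuation`, while uppercase and non-ASCII tokens are rejected.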