yxuansu/TaCL

How can I get the vocab from the model?

ersamo opened this issue · 5 comments

Thanks for sharing the code. I'm trying to get the vocab from this model but couldn't. With BERT I used the line below, but how can I get it from your model, please?

vocab = bert.get_tokenizer().get_vocab()

Thank you for your interest in our work. You can get the vocab of our model using the following script.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('cambridgeltl/tacl-bert-base-uncased')
vocab = tokenizer.get_vocab()

Hope this can help you :)
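For reference, get_vocab() returns a plain Python dict mapping each token string to its integer ID, so you can check it directly (assuming the model reuses the standard bert-base-uncased vocabulary):

vocab['[PAD]']   # -> 0
len(vocab)       # -> 30522 for the bert-base-uncased vocabulary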

Thanks for replying, I got it. But how can I get tokenizer.word_index.items() from your model? I'm trying to get it with:

def word_for_id(integer, tokenizer):
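    # note: word_index is an attribute of the Keras Tokenizer,
    # not of Hugging Face tokenizers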
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

but it didn't work.

Hi,

I think you can check the official Hugging Face documentation. The usage of our model should be the same as the original BERT model.
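For example, a minimal sketch of loading the checkpoint the standard BERT way; any of the usual BERT recipes from the Transformers docs should apply unchanged:

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('cambridgeltl/tacl-bert-base-uncased')
model = AutoModel.from_pretrained('cambridgeltl/tacl-bert-base-uncased')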

Thanks a lot for replying. I already checked it but couldn't find the same approach there. Can you help, please? I need to use your model with this method.

Hi,

Please try the following lines:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('cambridgeltl/tacl-bert-base-uncased')
vocab = tokenizer.get_vocab()
# invert the vocab: map each token ID back to its token string
id2word_dict = {}
for key, value in vocab.items():
    id2word_dict[value] = key

# input: id2word_dict[0]
# output: '[PAD]'

For example, if you want to find the word that has an ID of 0, just use id2word_dict[0]. It should give you '[PAD]' as the output. Hope this helps :)
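If you prefer a built-in, Hugging Face tokenizers also provide convert_ids_to_tokens, which does the same ID-to-token lookup without building the dict yourself:

tokenizer.convert_ids_to_tokens(0)   # -> '[PAD]'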