davidnvq/grit

Some issues regarding generating vocab.json files

yan1617262965 opened this issue · 4 comments

Example of how you previously answered other people's questions:

Assuming it is similar to the English tokenizer, you can obtain a vocab.json file as follows:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

# given a list of captions
source = [
    "This is a first caption",
    "This is a second caption",
    # ....
]

text_field.build_vocab(source)
That's how it works.

Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but not "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate a reply as soon as possible.

Thank you for asking. I am not sure which errors you are encountering.

  • itos is short for index_to_string (or index_to_token).
  • Similarly, stoi is short for string_to_index (or token_to_index).

Therefore, I guess you just need a simple line of code to get stoi from itos. For example:

stoi = {string:index for index, string in itos.items()} # if itos is a dictionary

# or 
stoi = {string:index for index, string in enumerate(itos)} # if itos is a list.
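To make the inversion concrete, here is a minimal sketch that handles both cases. The helper name build_stoi and the sample vocabulary are hypothetical, and it assumes the vocab file stores "itos" either as a list or as an object keyed by index. Note that JSON object keys are always strings, so indices read back from a dictionary need to be cast to int.

```python
import json


def build_stoi(itos):
    """Invert an index-to-string mapping into a string-to-index mapping."""
    if isinstance(itos, dict):
        # Keys loaded from JSON are strings, so cast them back to int.
        return {token: int(index) for index, token in itos.items()}
    # Otherwise treat itos as a list, where the position is the index.
    return {token: index for index, token in enumerate(itos)}


# Hypothetical content standing in for a loaded vocab.json.
vocab = json.loads('{"itos": ["a", "first", "caption"], "freqs": {"a": 1}}')
vocab["stoi"] = build_stoi(vocab["itos"])
print(vocab["stoi"]["caption"])  # 2
```

If stoi should be stored in the file as well, the updated dictionary can be written back with json.dump.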


Okay, thank you very much for your reply

Hello author, I would like to ask why some "" tokens may be generated during inference after fine-tuning on my own dataset. Could you help clarify this?

I generated my own vocab.json, but this still occurs after training.

I think in this case you need to put a little effort into the generation code. For example, if the logit of that token is the highest value, choose the second-highest logit at this timestep.
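The idea can be sketched roughly as follows for a single greedy decoding step. This is not GRIT's actual generation code: pick_token and blocked_index are hypothetical names, and plain Python lists stand in for the logit tensor at one timestep.

```python
def pick_token(logits, blocked_index):
    """Return the index of the highest logit, skipping blocked_index if it wins."""
    # Rank token indices from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if ranked[0] == blocked_index and len(ranked) > 1:
        # The unwanted token has the highest logit: fall back to the runner-up.
        return ranked[1]
    return ranked[0]


step_logits = [0.1, 2.5, 1.7, -0.3]
print(pick_token(step_logits, blocked_index=1))  # 2
print(pick_token(step_logits, blocked_index=0))  # 1
```

With tensor logits, a similar effect is usually obtained by setting the unwanted token's logit to negative infinity before taking the argmax, which also carries over to beam search.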