Some issues regarding generating vocab.json files

Question

yan1617262965 opened this issue a year ago · 4 comments

yan1617262965 commented a year ago

Example of how you previously answered other people's questions:

Assuming it works like the English tokenizer, you can obtain a vocab.json file as follows:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

# Given a list of captions
source = [
    "This is a first caption",
    "This is a second caption",
    # ...
]

text_field.build_vocab(source)
That's how it works.

Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but no "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate a reply as soon as possible.

Answer 1 · 2023-06-06T10:47:07.000Z

Thank you for asking. I am not sure which errors you are encountering.

itos is short for index_to_string (or index_to_token).
Similarly, stoi is short for string_to_index (or token_to_index).

So I think a single line of code is enough to get stoi from itos. For example:

stoi = {string:index for index, string in itos.items()} # if itos is a dictionary

# or 
stoi = {string:index for index, string in enumerate(itos)} # if itos is a list.
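
As a minimal sketch of the same idea applied to a saved file: the snippet below assumes vocab.json holds an "itos" entry that is either a list or a dictionary, as described in the question. The helper names build_stoi and add_stoi_to_vocab and the sample vocabulary are hypothetical and not part of the GRIT code. One detail worth noting is that JSON object keys are always strings, so indices loaded from a dictionary-style itos need to be cast back to int.

```python
import json


def build_stoi(itos):
    """Invert an index-to-string mapping into a string-to-index mapping."""
    if isinstance(itos, dict):
        # JSON object keys are always strings, so cast indices back to int.
        return {token: int(index) for index, token in itos.items()}
    # Otherwise treat itos as a list whose positions are the indices.
    return {token: index for index, token in enumerate(itos)}


def add_stoi_to_vocab(vocab):
    """Return a copy of the vocab dict with a "stoi" entry derived from "itos"."""
    updated = dict(vocab)
    updated["stoi"] = build_stoi(vocab["itos"])
    return updated


# Hypothetical contents standing in for json.load(open("vocab.json")).
vocab = json.loads('{"freqs": {"this": 2, "is": 2, "a": 2}, "itos": ["this", "is", "a"]}')
vocab = add_stoi_to_vocab(vocab)
print(vocab["stoi"])
```

The updated dictionary could then be written back with json.dump if the training code expects "stoi" to be present in the file; whether it does is not confirmed in this thread.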

Answer 2 · 2023-06-06T11:37:30.000Z

Okay, thank you very much for your reply

Answer 3 · 2023-06-07T08:52:52.000Z

Hello author, I would like to ask why some "" tokens are generated during inference after fine-tuning on my own dataset. Could you help me understand this?

I generated my own vocab.json, but the problem still occurs after training.

Answer 4 · 2023-06-08T01:59:45.000Z

I think in this case you need to put a little effort into the generation code. For example, if the logit of the unwanted token is the highest value at a timestep, choose the token with the second-highest logit at that timestep instead.
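
The suggestion above can be sketched roughly as follows. This is only an illustration on plain Python lists, not the GRIT generation code, which works on tensors. The function name pick_token and the parameter banned_index are hypothetical; in practice the index of the unwanted token would presumably be looked up in stoi.

```python
def pick_token(logits, banned_index):
    """Greedy choice for one timestep that never returns `banned_index`.

    If the banned token has the highest logit, this returns the index of the
    second-highest logit; otherwise it returns the ordinary argmax.
    """
    best_index = None
    for index, value in enumerate(logits):
        if index == banned_index:
            continue  # skip the unwanted token entirely
        if best_index is None or value > logits[best_index]:
            best_index = index
    return best_index


# Hypothetical logits for a three-token vocabulary at one timestep.
step_logits = [0.1, 2.5, 1.7]
print(pick_token(step_logits, banned_index=1))  # banned token is the argmax
print(pick_token(step_logits, banned_index=0))  # banned token is not the argmax
```

Skipping the banned index during the search is equivalent to setting its logit to negative infinity before taking the argmax, which is the usual way to express this when the logits are a tensor.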