Some issues regarding generating vocab.json files
yan1617262965 opened this issue · 4 comments
Example of how you previously answered other people's questions:
Suppose that it is similar to the English tokenizer; you can obtain a vocab.json file by:
from datasets.caption.field import TextField
text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)
# given a list of captions
source = [
    "This is a first caption",
    "This is a second caption",
    # ....
]
text_field.build_vocab(source)
That's how it works.
Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but no "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate it if you could reply as soon as possible.
Thank you for asking. I am not sure which errors you encounter.
- `itos` is short for `index_to_string` (or `to_token`).
- Similarly, `stoi` is short for `string_to_index` (or `to_index`).
Therefore, I guess you just need a simple line of code to get `stoi` from `itos`. For example:
stoi = {string:index for index, string in itos.items()} # if itos is a dictionary
# or
stoi = {string:index for index, string in enumerate(itos)} # if itos is a list.
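A slightly fuller sketch of the same idea, in case it helps. It assumes the generated vocab.json holds `itos` either as a list or as an object keyed by index; the helper name `build_stoi` and the sample tokens are made up for illustration and are not part of GRIT. One detail worth knowing: if `itos` is stored as a JSON object, its keys come back as strings after loading, so they are cast to `int` here.

```python
import json


def build_stoi(itos):
    """Invert an index-to-string mapping (list or dict) into string-to-index."""
    if isinstance(itos, dict):
        # JSON object keys are always strings, so cast the indices back to int.
        return {token: int(index) for index, token in itos.items()}
    # A list already carries its indices implicitly through position.
    return {token: index for index, token in enumerate(itos)}


# Stand-in for the contents of a generated vocab.json (hypothetical tokens).
vocab_text = json.dumps({
    "freqs": {"this": 2, "is": 2, "a": 2, "caption": 2},
    "itos": ["<pad>", "<unk>", "this", "is", "a", "caption"],
})

vocab = json.loads(vocab_text)
vocab["stoi"] = build_stoi(vocab["itos"])
print(vocab["stoi"]["caption"])
```

For a real file, replacing `json.loads(vocab_text)` with `json.load` on the opened vocab.json should be enough; whether GRIT itself ever reads a stored `stoi` is not confirmed in this thread.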
Okay, thank you very much for your reply
I think in this case you need to put a little effort into the generation code. For example, if the logit is the highest value at a timestep, then choose the second-highest logit at that timestep instead.
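A minimal sketch of that fallback, written in plain Python rather than against the GRIT generation code. The comment does not say which token should be skipped, so `blocked_index` is a hypothetical parameter standing in for it, and `pick_token` is an illustrative name.

```python
def pick_token(logits, blocked_index):
    """Greedy choice that falls back to the runner-up when the top logit is blocked."""
    # Rank token indices from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best = ranked[0]
    if best == blocked_index and len(ranked) > 1:
        # The unwanted token won this timestep, so take the second-highest logit.
        return ranked[1]
    return best


step_logits = [0.1, 2.5, 1.7, -0.3]
print(pick_token(step_logits, blocked_index=1))  # top token is blocked, prints 2
print(pick_token(step_logits, blocked_index=0))  # top token is allowed, prints 1
```

In a batched decoder the same effect is often achieved by setting the unwanted token's logit to negative infinity before taking the argmax, but that is a design choice and not something stated in the thread.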