xichenpan/ARLDM

Why resize token embeddings when the dataset is pororo or flintstones?

LiamTTT opened this issue · 5 comments

Hi!

I am reproducing this work, and I noticed that the token embedding is resized when training on the pororo or flintstones datasets.
My question is:

  1. Why is this done?
  2. Why resize to those particular numbers?

BTW, thanks for open-sourcing this work!
Looking forward to your reply :)

Hi, thanks for your interest. I'm not quite sure what "resized token embedding" means here. Could you please point to the corresponding code with a link?

Thanks for your comments! This is because we added some new tokens for the characters in these two datasets.
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/config.yaml#L35
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/config.yaml#L42
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/datasets/flintstones.py#L36-L39
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/datasets/pororo.py#L37-L40
As a result, the vocab size of the tokenizer changes, and we need to resize the token embeddings so that the embedding layer can still encode the sentences. The numbers in the config can be obtained by printing len(clip_tokenizer) and len(blip_tokenizer) after adding the new tokens.
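
In case it helps, here is a minimal sketch of the pattern (using Hugging Face transformers; the character tokens below are placeholders for illustration, the actual lists live in datasets/pororo.py and datasets/flintstones.py linked above):

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Placeholder character tokens; the real lists are defined in the dataset files.
new_tokens = ["pororo", "loopy", "eddy"]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Adding tokens grows the tokenizer's vocabulary.
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")

# The text encoder's embedding table must be resized to match, otherwise the
# new token ids would index past the end of the embedding matrix.
text_encoder.resize_token_embeddings(len(tokenizer))
```

The same add-then-resize step applies to the BLIP tokenizer and its text encoder; the printed len(tokenizer) values are what go into the config.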

Got it! Thanks!
This is fantastic work! Looking forward to more from you.

Thanks! Feel free to open an issue if you have any further questions!