General question about padding in the setting of soft-prompt tuning
krishnakanthnakkav2 opened this issue · 2 comments
Hello Authors,
I have a general question about padding in the soft-prompt tuning setting.
In a batch, when sequences have different lengths, we typically left-pad the shorter sequences, like:
# first example is left padded, batch size is set to two.
input_tokens = [
[<pad>, <pad>, "my", "name"],
["where", "are", "you", "from"]
]
Do you then prepend the soft-prompt embeddings to the embeddings of the padded tokens, like below?
# se1, se2 are soft-prompt embeddings that need to be tuned
input_embedding = [
[ se1, se2, Embedding(<pad>), Embedding(<pad>), Embedding("my"), Embedding("name"), ],
[ se1, se2, Embedding("where"), Embedding("are"), Embedding("you"), Embedding("from"), ],
]
Is my understanding correct? Section 2.1 in the paper mentions that X is padded to the maximum sequence length.
Another concern I have is: shouldn't we prepend the soft-prompt embeddings to the unpadded input embeddings first, and then pad the shorter sequences? Like,
# please note the change in the first example
input_embedding = [
[ Embedding(<pad>), Embedding(<pad>), se1, se2, Embedding("my"), Embedding("name"), ],
[ se1, se2, Embedding("where"), Embedding("are"), Embedding("you"), Embedding("from"), ],
]
Could you please share your insights on how padding is handled in soft-prompt tuning in general? Thank you for the clarification.
Hi,
Thanks for your question. Yes, you are correct: we add the soft prompt in front of the padded sequence. For the T5 model, the padding is on the right side. For decoder models, we do the same as what you described above.
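For concreteness, here is a minimal PyTorch sketch of that decoder-style layout (my own illustration, not the paper's code; the token ids, tensor sizes, and pad id of 0 are assumptions):
import torch
import torch.nn as nn

# illustrative sizes (assumptions, not from the paper)
vocab_size, embed_dim, prompt_len, batch_size, max_seq_len = 32000, 768, 2, 2, 4

embedding = nn.Embedding(vocab_size, embed_dim)                 # frozen model embedding table
soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim))  # trainable [se1, se2]

# already left-padded token ids, shape (batch_size, max_seq_len); 0 is the pad id here
input_ids = torch.tensor([
    [0, 0, 101, 102],      # [<pad>, <pad>, "my", "name"]
    [201, 202, 203, 204],  # ["where", "are", "you", "from"]
])

token_embeds = embedding(input_ids)                             # (batch, max_seq_len, dim)
prompt_embeds = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)

# prepend the soft prompt in front of the padded sequence, as in the first layout above
input_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)  # (batch, prompt_len + max_seq_len, dim)

# the attention mask is extended so the prompt positions are always attended to
attention_mask = torch.cat(
    [torch.ones(batch_size, prompt_len, dtype=torch.long), (input_ids != 0).long()],
    dim=1,
)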
Regarding whether we should prepend the soft-prompt embeddings to the unpadded input embeddings and then pad the shorter sequences: it is very interesting to try and might help improve model performance. However, it could make it difficult to apply our proposed method, where we need to add the product of the low-rank matrices to the word representations.
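To illustrate that last point with a rough sketch (my own assumption of a LoRA-style decomposition with factors A and B of rank r; the names, shapes, and the omitted scaling are not from the paper): the low-rank product is added position-wise to the padded word representations, so the token positions must stay aligned with the padded layout, which is what re-ordering the padding would complicate.
import torch

max_seq_len, embed_dim, rank, batch_size = 4, 768, 8, 2

# trainable low-rank factors (names and shapes are assumptions)
A = torch.nn.Parameter(torch.randn(max_seq_len, rank))
B = torch.nn.Parameter(torch.zeros(rank, embed_dim))

# padded word representations, shape (batch, max_seq_len, embed_dim)
token_embeds = torch.randn(batch_size, max_seq_len, embed_dim)

# A @ B has shape (max_seq_len, embed_dim) and is added position-wise,
# so each row of the update is tied to a fixed position in the padded sequence
updated_embeds = token_embeds + (A @ B).unsqueeze(0)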
Thank you!