OpenMOSS/AnyGPT

Loss Masking


Thank you for providing the model code and checkpoints.

I'm planning to fine-tune the base model you provided on a downstream task. From what I've seen in the code you shared, there doesn't seem to be any loss masking (i.e., excluding the prompt tokens from the loss so that only the target tokens contribute to the loss and propagate gradients); see the sketch below for what I mean.
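
For reference, this kind of loss masking is commonly implemented by setting the label positions of the prompt to `-100`, which PyTorch's `CrossEntropyLoss` (and Hugging Face causal-LM heads) ignore by default. A minimal sketch, assuming a PyTorch setup where `prompt_len` marks the prompt/response boundary (names here are illustrative, not taken from the AnyGPT code):

```python
import torch

IGNORE_INDEX = -100  # CrossEntropyLoss ignores targets with this value by default


def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy input_ids and mask the prompt span so only response tokens
    contribute to the loss."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss/gradient from prompt tokens
    return labels


# Example: a 10-token sequence whose first 6 tokens are the prompt.
input_ids = torch.arange(10)
labels = build_labels(input_ids, prompt_len=6)
print(labels)  # tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])
```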

I'm curious whether you really computed the loss over all tokens (i.e., without loss masking) during instruction tuning when building the -chat model.

Hi, it's great to hear that you're interested in using our model.

Regarding your question: during training we actually did not compute loss on the prompt; we only computed loss on the response tokens. However, I don't think this detail is particularly important. As far as I know, some models compute the loss over the entire sequence.

Thank you for your response!