OpenMOSS/AnyGPT

Loss Masking


Thank you for providing the model code and checkpoints.

I'm planning to fine-tune the base model you provided on a downstream task. From what I've seen in the code you shared, there doesn't seem to be any loss masking (i.e., excluding the prompt tokens from the loss so that only the target tokens contribute to the loss and propagate gradients); see the sketch below for what I mean.
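
For reference, this kind of loss masking is commonly implemented by setting the label positions of the prompt to `-100`, which PyTorch's `CrossEntropyLoss` (and Hugging Face causal-LM heads) ignore by default. A minimal sketch, assuming a PyTorch setup where `prompt_len` marks the prompt/response boundary (names here are illustrative, not taken from the AnyGPT code):

```python
import torch

IGNORE_INDEX = -100  # CrossEntropyLoss ignores targets with this value by default


def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy input_ids and mask the prompt span so only response tokens
    contribute to the loss."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss/gradient from prompt tokens
    return labels


# Example: a 10-token sequence whose first 6 tokens are the prompt.
input_ids = torch.arange(10)
labels = build_labels(input_ids, prompt_len=6)
print(labels)  # tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])
```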

I'm curious whether you really computed the loss over all tokens (i.e., without loss masking) during instruction tuning when building the -chat model.

Hi, it's great to hear that you're interested in using our model.

Regarding your question: during training we actually did not compute loss on the prompt; we only computed loss on the response tokens. However, I don't think this detail is particularly important. As far as I know, some models compute the loss over the entire sequence.

Thank you for your response!