Thartvigsen/GRACE

About results on GPT2-XL

littlefive5 opened this issue · 8 comments

I'm curious about the hyperparameters for running GPT2-XL.
I tried the default settings and reached a loss of 0.04.
The output before editing is the same as the output after editing.
Is there anything I have overlooked?

Hi there! Could you be more specific about what you actually ran? For this experiment there should be a training phase where a new GPT2-XL model is trained. Did that happen?

Well, I didn't train a new model.
I just wanted to do a simple test, so I modified example.ipynb as follows:
[two screenshots of the modified example.ipynb cells]
I found that the output before editing is the same as the output after editing.
I understand that training the model first could give the editing better performance, but even without that training step, it seems odd that nothing changes after the edit.
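In other words, the check I'm running looks roughly like this (a sketch rather than the exact notebook cells; edited_model is a stand-in for whatever the GRACE edit step in example.ipynb returns):

```python
# Rough sketch of the before/after check described above, not the exact notebook
# cells. `edited_model` is a hypothetical stand-in for the model returned by the
# GRACE edit step in example.ipynb.
import torch

def greedy_continuation(model, tokenizer, prompt, max_new_tokens=10):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[-1]:])

# before = greedy_continuation(model, tokenizer, edit_prompt)
# ... apply the GRACE edit from example.ipynb ...
# after  = greedy_continuation(edited_model, tokenizer, edit_prompt)
# In my runs, `before == after`: the edit does not change the greedy generation.
```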

Thanks for the extra details and for pointing this out! I will try to reproduce this error this week and follow up. It's possible, though unlikely, that the model training is the problem. More likely, the token replacement method in grace/editors/grace_barebones.py is the one that works best for QA models, not for GPT2-XL. The barebones version was written just for a QA example, so this may be a case where it doesn't work as is.

Thanks. I also built a GRACE editor from grace.py using the settings in the config files, but I ran into the same problem.
Maybe the default settings in config/editor/grace.yaml are not meant for GPT-style models?

I'll look into this this week. You could also try setting replacement to replace_last. When you used grace.py this time, did you train the GPT2-XL model as in main.py?
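The intent of replace_last is (roughly) to apply the edit at the last prompt token, which is the position that drives next-token prediction in a decoder-only model like GPT2-XL. A simplified sketch of the idea (this is not the actual grace.py code, and pick_key_activation is a made-up helper name):

```python
# Simplified sketch of why the replacement choice matters for GPT-style models.
# `pick_key_activation` is a hypothetical helper, not part of the GRACE codebase.
import torch

def pick_key_activation(hidden_states: torch.Tensor, replacement: str) -> torch.Tensor:
    """hidden_states: [batch, seq_len, d_model] activations at the edited layer."""
    if replacement == "replace_last":
        # Last prompt token: in a decoder-only LM this hidden state produces the
        # next token, so applying the edit here can actually change generation.
        return hidden_states[:, -1, :]
    # Other strategies (e.g. pooling over the prompt) may suit the QA setup
    # the barebones example was written for, but not GPT2-XL.
    return hidden_states.mean(dim=1)
```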

I did not train it. I will try to train the model this time. By the way, could you share a checkpoint of the trained model, since pretraining is time-consuming?

Yes, I'm working on uploading the checkpoint now and will let you know when it's up!

@littlefive5 I've confirmed that the issue is that we evaluate solely on the log-likelihoods of the correct outputs, not by decoding the model's output and checking it. So this is actually expected behavior! I'll try to add another version of GRACE training for GPT models that fixes this (the fix is easy: just don't mask the prompt tokens out of the labels), but note that we didn't run our experiments this way.
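To make that concrete, here is a minimal sketch of the log-likelihood-style check, using the plain Hugging Face interface for GPT2-XL rather than the repo's evaluation code; the prompt/target strings are just toy examples:

```python
# Minimal sketch (not the repo's evaluation code) of scoring an edit by the
# log-likelihood of the target tokens, with the prompt positions masked out of
# the labels. A low loss here does not guarantee that greedy decoding changes.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

prompt, target = "The capital of France is", " Rome"  # toy edit request
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

labels = input_ids.clone()
labels[:, : prompt_ids.shape[-1]] = -100  # "turn labels off" for the prompt tokens

with torch.no_grad():
    target_nll = model(input_ids, labels=labels).loss  # low value => edit counted as successful

# The fix mentioned above amounts to keeping the prompt positions in `labels`
# (i.e. skipping the -100 masking) when training GRACE for GPT models.
```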