Fine-tuning existing models with new relation labels?
serenalotreck opened this issue · 5 comments
I've taken a look through previous issues, and I don't believe anyone has asked this yet, apologies if that's not the case.
I'm looking to fine-tune the full SciERC model to predict entities and relations for a new domain. I have a (very small) dataset for this new domain with entities and relations labeled (30 abstracts in training set, 14 in validation/development set, and 11 in test set). The ontologies of entity and relation labels used in this new dataset are both different from those used in SciERC. It would be nice to predict the correct entity labels as well; however, I mainly need the correct relation labels to be predicted.
I know from #87 that I can train the model using multiple datasets, and from #85 that I can get new tags by training the model from scratch with a new dataset that contains different labels. However, to my understanding, multi-dataset training is not equivalent to fine-tuning one model with a new dataset: a separate model is created for the second dataset, and while you can get the mean performance of the two datasets' models on a given task, the two models aren't actually used together to obtain optimal predictions. It's also not totally clear to me whether I can train on multiple datasets that contain different target labels. I assume this would work since it creates separate models, but that returns to the problem of it not being true fine-tuning.
Wondering if you have any thoughts about how best to approach this problem?
Thanks!
Thanks for the question!
You can train on multiple datasets that contain different target labels; the details are in the docs on multi-dataset training.
The DyGIE model first builds span representations using BERT or similar, and then makes predictions by training a feedforward classification head on top of the span representations. For multi-dataset training, the span representation part of the model is shared across all datasets, but each dataset gets its own classification head; this is necessary since different datasets have different ontologies / label spaces.
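To make the shared-encoder / per-dataset-head split concrete, here's a toy sketch in plain Python. It is not the DyGIE code: `toy_encoder` is a made-up stand-in for the BERT span representation, the heads are single linear scorers with random weights, and the `new_domain` labels are invented for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class ClassificationHead:
    """Toy stand-in for a feedforward head: one linear scorer per label."""

    def __init__(self, labels: Sequence[str], dim: int, rng: random.Random):
        self.labels = list(labels)
        # One weight vector per label in this dataset's label space.
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in self.labels]

    def predict(self, span_rep: Vector) -> str:
        scores = [sum(w * x for w, x in zip(row, span_rep)) for row in self.weights]
        return self.labels[scores.index(max(scores))]


class MultiDatasetModel:
    """One shared encoder, one classification head per dataset."""

    def __init__(self, encoder: Callable[[Sequence[str]], Vector],
                 label_spaces: Dict[str, Sequence[str]], dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.encoder = encoder  # shared across all datasets
        self.heads = {name: ClassificationHead(labels, dim, rng)
                      for name, labels in label_spaces.items()}

    def predict(self, span: Sequence[str], dataset: str) -> str:
        span_rep = self.encoder(span)                  # same representation for every dataset
        return self.heads[dataset].predict(span_rep)   # dataset-specific label space


def toy_encoder(span: Sequence[str]) -> Vector:
    # Hypothetical featurizer; a real model would use BERT-based span embeddings.
    text = " ".join(span)
    return [float(len(span)), float(len(text)), float(sum(map(ord, text)) % 7)]


model = MultiDatasetModel(
    toy_encoder,
    {"scierc": ["USED-FOR", "PART-OF"], "new_domain": ["activates", "inhibits", "binds"]},
    dim=3,
)
print(model.predict(["neural", "network"], "scierc"))
print(model.predict(["neural", "network"], "new_domain"))
```

The same span gets one representation but is scored against a different label set depending on which dataset it came from, which is why datasets with different ontologies can be mixed.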
In your case, it sounds like you might want to take the model that's already been trained on SciERC and then finetune further. To do this, you could create a new model, then initialize all weights except the classification heads to the values of the trained SciERC model, and then continue training. You could also try freezing the non-classification-head weights. I can try to point to the relevant places in the code if you want to do this, but I can't make any guarantees.
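As a rough sketch of the "initialize everything except the classification heads" step: a PyTorch state dict is a mapping from parameter names to tensors, so the head weights can be filtered out by name before loading. The parameter names and prefixes below (`_embedder.`, `_relation.`, `_ner.`) are assumptions for illustration and would need to be checked against the actual model, and the list values stand in for tensors.

```python
from typing import Any, Dict, Iterable, Mapping


def drop_head_weights(state_dict: Mapping[str, Any],
                      head_prefixes: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of state_dict without entries whose name starts with a head prefix."""
    prefixes = tuple(head_prefixes)
    return {name: value for name, value in state_dict.items()
            if not name.startswith(prefixes)}


# Toy state dict; real values would be tensors from the trained SciERC model.
pretrained = {
    "_embedder.layer.weight": [0.1, 0.2],
    "_relation.scorer.weight": [0.3],
    "_ner.scorer.weight": [0.4],
}

# Hypothetical prefixes for the task-specific heads.
transferable = drop_head_weights(pretrained, ["_relation.", "_ner."])
print(sorted(transferable))
```

With a real PyTorch model, the filtered dict could then be passed to `new_model.load_state_dict(transferable, strict=False)`, which leaves the parameters missing from the dict (here, the heads) at their fresh initialization. Dropping the heads first matters because `strict=False` tolerates missing keys but not tensors whose shapes differ, which the head weights would if the label sets have different sizes.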
At the risk of being obvious: have you checked whether GPT-4 can do this for you? I haven't actually run any IE baselines with GPT-4, so I'm curious whether it can just crush these tasks or whether we still need specialized models.
Thanks for the quick reply!
> In your case, it sounds like you might want to take the model that's already been trained on SciERC and then finetune further. To do this, you could create a new model, then initialize all weights except the classification heads to the values of the trained SciERC model, and then continue training. You could also try freezing the non-classification-head weights. I can try to point to the relevant places in the code if you want to do this, but I can't make any guarantees.
It would be great if you could point me to those places!
> At the risk of being obvious: have you checked whether GPT-4 can do this for you? I haven't actually run any IE baselines with GPT-4, so I'm curious whether it can just crush these tasks or whether we still need specialized models.
I have! tl;dr it shows promise, but effective multi-part prompt engineering seems to be a limiting factor, as asking the model to do it start to finish in a single prompt yields terrible performance. The main reason I want to try fine-tuning these in addition is to see what the comparative performance is, as models like these were the previous SOTA.
Thanks again!
I'll try to send some links to the code this weekend.
As for GPT-4, it's good to know that it can't just trivially solve the task (there's still stuff left for NLP researchers to do!). If you've run any GPT-4 baselines on any of the datasets in this repo, feel free to submit a PR; it would be interesting to compare performance.
Closing because I decided to train from scratch like in the docs rather than do this!
Sorry I dropped the ball on this. If you still wanted to freeze the embedder and just finetune the classification heads, here's the embedder. The text embeddings get created in the forward pass here, and the task-specific heads get invoked starting here.
I'm not totally sure how you'd freeze the embedder -- it's been a while since I looked at the AllenNLP code, which is in maintenance mode now. Sorry I can't help more.
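For what it's worth, the generic PyTorch way to freeze a sub-module is to set `requires_grad = False` on its parameters. Below is a minimal sketch of that idea, written against anything that yields `(name, parameter)` pairs, such as `model.named_parameters()`. The `_embedder.` prefix is an assumption about how the parameters are named, and I haven't verified how this interacts with the AllenNLP trainer.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def freeze_by_prefix(named_parameters: Iterable[Tuple[str, Any]],
                     prefixes: Iterable[str]) -> List[str]:
    """Set requires_grad=False on parameters whose name starts with a prefix.

    Returns the names that were frozen so the result can be inspected.
    """
    prefixes = tuple(prefixes)
    frozen = []
    for name, param in named_parameters:
        if name.startswith(prefixes):
            param.requires_grad = False  # excluded from gradient updates
            frozen.append(name)
    return frozen


# Toy parameters standing in for torch.nn.Parameter objects.
toy_params = [
    ("_embedder.weight", SimpleNamespace(requires_grad=True)),
    ("_relation.weight", SimpleNamespace(requires_grad=True)),
]

frozen = freeze_by_prefix(toy_params, ["_embedder."])
print(frozen)
```

On a real model the call would look like `freeze_by_prefix(model.named_parameters(), ["_embedder."])`, applied before the optimizer is built so that only the head parameters get updated.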