Documentation for multi-dataset training
stepgazaille opened this issue · 15 comments
Hello!
First of all, thank you so much for publishing this code base. It's a great contribution!
I'm curious about the multi-dataset training feature of DyGIE++.
The model.md file explains how to read evaluation results for models trained on multiple datasets, but sadly the corresponding section of config.md remains a to-do.
I've been poking around the code base, but so far I haven't found any clue about how to write a configuration file to train a model on multiple datasets (I must admit that although I'm getting more and more familiar with allennlp, I'm not really what you'd call a "power user" yet).
I also looked at the list of merged PRs and commits and couldn't find anything related.
Is multi-dataset training really integrated into DyGIE++?
If it is, could anyone point me to a good starting place for learning about this feature?
An example configuration file would be amazing, but if no one has that, a reference to a commit or PR would also be a great help.
Thank you for your help!
Hi,
Multi-dataset training should definitely be possible, apologies if it's not well-documented. I'll look into this over the weekend.
Dave
No worries
Thank you for the swift answer and for taking some time to help
Have a good one
Steph
OK, I took a look.
The relevant section of the docs is here, which is pretty sparse, and you're right, there's no example.
Fortunately, I don't think you actually have to change the config at all. Just create your dataset in jsonl format, as described in data.md, making sure to specify a dataset field for each instance indicating which dataset that instance is part of.
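For example, two lines of the merged train.jsonl could look something like this (the doc keys, dataset names, tokens, and labels below are all made up; include whatever annotation fields your data actually has, formatted as described in data.md):

{"doc_key": "doc-1-0001", "dataset": "dataset-1", "sentences": [["Protein", "A", "binds", "protein", "B", "."]], "ner": [[[0, 1, "Protein"], [3, 4, "Protein"]]]}
{"doc_key": "doc-2-0002", "dataset": "dataset-2", "sentences": [["The", "attack", "occurred", "Tuesday", "."]], "ner": [[[1, 1, "Attack"]]]}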
The model will take care of the rest. Let me know if this makes sense. I'll try to clarify the docs at some point - or, if you're willing, feel free to submit a PR with an update to the docs and I'll merge.
Hello David,
Thank you so much for taking time over the weekend to help me with this, it's really appreciated.
After reading your last message and re-reading the data.md doc, what I understand is the following:
If I want to train a model on datasets A and B, I have to merge both datasets' training sets into a single jsonl file (and the same goes for the validation and test sets).
So the model's config file stays the same; I just need to update it to point at the merged jsonl files.
Is this correct?
Yep, that's all that should be necessary. The model should do the right thing, including computing different metrics for the different datasets. If that doesn't happen, post here and we can debug.
Ok so I tested with 2 datasets today. Let's call them datasets A and B.
Dataset A has labels for events, ner, relation and coref.
Dataset B has labels for events, ner and coref.
The target task is events.
I previously trained models on those datasets independently without issues.
Instances from dataset A use the dataset label dataset-a, and instances from dataset B use the dataset label dataset-b.
I merged the datasets into a single set of train, valid and test jsonl files.
I loaded the merged dataset into instances of dygie.data.dataset_readers.document using the provided notebook and didn't notice anything wrong.
Nothing special on the configuration side. Here it is actually:
data_paths: {
  train: 'data/merged/train.jsonl',
  validation: 'data/merged/valid.jsonl',
  test: 'data/merged/test.jsonl',
},
loss_weights: {
  ner: 0.5,
  relation: 0.5,
  coref: 0.5,
  events: 1.0
},
model +: {
  modules +: {
    coref +: {
      coref_prop: 0
    }
  },
},
target_task: "events",
max_span_width: 12
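(For context, those fields just override the standard DyGIE template, so the full file looks roughly like the sketch below; the bert_model and cuda_device values are placeholders, not my real settings.)

local template = import "template.libsonnet";

template.DyGIE {
  // Placeholder encoder and device settings -- not my actual values.
  bert_model: "bert-base-cased",
  cuda_device: 0,
  // ... plus the data_paths, loss_weights, model, target_task, and
  // max_span_width fields shown above ...
}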
Do you see anything wrong with the information above?
Because when I launch training, I get the following exception on the first instance that gets processed (the first processed instance might come from dataset A or B, depending on the run):
Traceback (most recent call last):
  File "dygiepp/scripts/train.py", line 28, in <module>
  File "dygiepp/dygie/models/dygie.py", line 282, in forward
    ner_labels, metadata)
  File "dygiepp/dygie/models/events.py", line 137, in forward
    trigger_embeddings, trigger_mask)
  File "dygiepp/dygie/models/events.py", line 273, in _compute_trigger_scores
    trigger_scorer = self._trigger_scorers[self._active_namespaces["trigger"]]
KeyError: 'dataset-a__trigger_labels'
Sorry for the slow response, I had a deadline last week. What you're doing looks reasonable. Can you attach your training config and data here, and provide the command used to kick off model training? I'll attempt to reproduce the error.
Hello David,
Sorry for the delay, I had to discuss with my supervisors before I could move forward with this.
I was allowed to send you samples of the datasets by email.
Would it work for you if I sent those to the uni email I can see on your YouTube profile page?
I thank you again, and apologies for the inconvenience, but my hands are tied on this :/
Sure, you can send them to dwadden@cs.washington.edu, I won't re-share. I think you meant my GitHub profile rather than my YouTube profile (I don't think I have a YouTube profile)?
No worries, I'm fairly confident this is a bug on my end and it will be good to get it fixed.
Hahaha yes I did mean GitHub and not YouTube.
OK, I sent the bug replication data to your @cs.washington.edu email address.
Thanks again for taking time to look into this
This should be fixed now. Give it a try and let me know what happens.
Ok so I pulled the latest version and it looks like the problem is solved!
Thank you
One last question: is there a way to perform model selection using one particular dataset/task combination, instead of using the average over all datasets for a given task?
For example, I'm training a model for event extraction on 2 datasets, but I'd like to select the model that performs best on one of those 2 datasets, because I think the model might start overfitting on one dataset before the other.
Is this model selection feature already implemented in DyGIE++ by any chance?
OK, glad it worked!
There is a way to do this by modifying the training config. The line you'd need to update from the config is here. I think you can update just a single field using the +: syntax, like here, but I'm rusty, so you should probably double-check the DyGIE docs or the jsonnet documentation to make sure.
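If I remember right, the metric names follow the same dataset__metric pattern you saw in the error message, so the override would be something like the sketch below - but the exact metric name is a guess on my part, so verify it against template.libsonnet or your training logs:

trainer +: {
  // Select checkpoints on dataset A's event metric instead of the mean over
  // all datasets. The leading "+" means higher is better; the metric name
  // here is a guess -- double-check it in template.libsonnet.
  validation_metric: '+dataset-a__arg_class_f1',
},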
If you want to display a different set of metrics during model training, here's the line you'd need to change. But this is just aesthetic; it doesn't influence model validation behavior.
Great!
Thank you so much for everything David.
Have a good one.
Happy to help!