Tweaking
Magpi007 opened this issue · 5 comments
Hi Thilina,
I have done already several iterations of your repo with different data and it works very well.
Now I want to tweak it using the BERT hyperparameters (gradient, learning rate, ...) and by adjusting the config.json parameters (if possible). I am wondering whether there are any rules or guidelines for doing this, or is it just a matter of learning what each one does and experimenting?
Thanks.
I don't think there are any BERT specific rules. But you can take a look at the original paper to see what they recommend. However, unless you are set on using the original BERT, I'd recommend trying RoBERTa (see paper). It's the same architecture as BERT, but with the training process (including hyperparameters) tweaked to obtain significantly better performance.
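For reference, the fine-tuning search space suggested in the BERT paper is deliberately small, so sweeping it exhaustively is usually affordable. Something like the sketch below (the values are the ones from the paper; `train_and_evaluate` is just a hypothetical stand-in for your own training run):

```python
from itertools import product

# Fine-tuning grid reported in the BERT paper (Devlin et al.): batch size,
# learning rate and number of epochs are the only values they sweep.
search_space = {
    "train_batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_train_epochs": [2, 3, 4],
}

# 2 x 3 x 3 = 18 runs in total; pick whichever scores best on your dev set.
for batch_size, lr, epochs in product(*search_space.values()):
    print(f"run: batch_size={batch_size}, lr={lr}, epochs={epochs}")
    # train_and_evaluate(batch_size, lr, epochs)  # hypothetical training function
```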
Either way, I think you'll find it much easier to do hyperparameter tuning with the updated library. I have a new repo that should make it quite straightforward to get started. You don't need to mess around with config files; there's a dictionary with all the useful parameters that you can change directly.
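Just to illustrate the idea (this isn't the repo's exact interface, only a sketch using the Hugging Face transformers library, with placeholder values), keeping every knob in one dictionary looks roughly like this:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import AdamW

# All the knobs you'd normally hunt for in config files, in one place.
train_args = {
    "model_name": "roberta-base",   # swap in "bert-base-uncased" to stay with BERT
    "learning_rate": 2e-5,
    "train_batch_size": 32,
    "num_train_epochs": 3,
    "max_seq_length": 128,
    "weight_decay": 0.01,
}

tokenizer = AutoTokenizer.from_pretrained(train_args["model_name"])
model = AutoModelForSequenceClassification.from_pretrained(
    train_args["model_name"], num_labels=2  # binary classification
)
optimizer = AdamW(
    model.parameters(),
    lr=train_args["learning_rate"],
    weight_decay=train_args["weight_decay"],
)
# ...build your DataLoader with train_args["train_batch_size"] and
# train_args["max_seq_length"], then run the usual fine-tuning loop.
```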
That's very interesting. I want to iterate a bit more on this BERT model alone, and then I'll go through your new repo (I already did a first test, couldn't wait :).
So my point with tweaking is about applying BERT to a real-world problem. For example, if I have a text classification problem and I want to use BERT as part of my decision making to choose the best solution, I would expect to adapt BERT to that specific problem (i.e. based on the language, the data available, the resources...). With these approaches, though, I feel that by only feeding the model the data and hitting play I am not doing very much. But maybe this is something that will happen more and more in the future if models like XLNet, RoBERTa or others are open-sourced for the whole community: maybe most "data scientists(?)" will only have to select an open-sourced model and test/adjust it a bit for their specific problem. What do you think?
Anyway, regarding tweaking specifically: what I have learnt from reading the papers is that the bigger and more varied your data, the better, and that you should use as many resources as you can afford (e.g. 20 epochs rather than 10), right?
The idea is to intentionally "not do too much". If we did a significant amount of training on these pre-trained models, we would lose whatever the model has learned from being trained on huge datasets using incredible amounts of computational power (which most people cannot hope to match; training one of these models is only really feasible with a TPU). These models are already open-sourced. It's just that most of us don't have the computational power to realistically train one from scratch.
There is an option to further pre-train these models (before adding the classification head) on domain-specific data as shown here. It seems to be a promising option for many tasks, particularly when the number of labelled examples is low. I haven't tried it yet, but I probably will soon. I'll add this option to the new repo if and when I do. You can find a good comparison of various training methods in this paper.
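If you want to experiment with it yourself in the meantime, here's a rough sketch of what further masked-LM pre-training on a domain corpus could look like with the Hugging Face transformers library (the corpus file, model name and hyperparameters are only placeholders):

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

corpus_file = "domain_corpus.txt"  # hypothetical file of raw, unlabelled in-domain text

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Turn each non-empty line of the corpus into a masked-LM training example.
with open(corpus_file) as f:
    lines = [line.strip() for line in f if line.strip()]
encodings = tokenizer(lines, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# The collator randomly masks 15% of tokens; the model learns to predict them.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-domain-adapted",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset, data_collator=collator)
trainer.train()
trainer.save_model("bert-domain-adapted")
# Later, load "bert-domain-adapted" with a classification head on top and
# fine-tune on the labelled task data as usual.
```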
Unfortunately, it's not that straightforward when it comes to pre-trained transformers. Training for a higher number of epochs can cause the model to forget what it learned during pre-training, leading to overfitting and worse performance. Keep in mind that these are huge models which can easily overfit even on large datasets. Personally, I can't recall seeing an improvement in performance after 2-3 epochs of fine-tuning.
It would be great to have that together with the repo. For NLP tasks, I think having the option to "tune" the model to the domain is very interesting, if not a must.
What you are saying makes sense. I understand there isn't much room for improvement with these models beyond their already-tuned parameters.
I'll see what I can do about adding the pre-training option to the repo. It will most likely be added to the new repo though.