PruningCallback doesn't work
himkt opened this issue · 6 comments
Thank you so much for diving into allennlp-optuna.
Which storage do you use? (should be one of sqlite3, MySQL, PostgreSQL, Redis)
And could you please share with me a simple reproducible configuration?
Thanks for creating the issue, @himkt. I use the default sqlite3 storage.
local model_name = "models/distilroberta-base-msmarco-v1/0_Transformer";
local num_gpus = 8;
local data_base_url = "data/mydata/processed/";
local batch_size = std.parseInt(std.extVar('batch_size'));
local lr = std.parseJson(std.extVar('lr'));
local model = "my_model";
local dataset_reader = "my_reader";
{
  "train_data_path": data_base_url + "train.tsv.part*",
  "validation_data_path": data_base_url + "valid.tsv.part*",
  "dataset_reader": {
    "type": "sharded",
    "base_reader": {
      "type": dataset_reader,
      "query_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": model_name,
        "max_length": 500,
      },
      "query_token_indexers": {
        "tokens": {
          "type": "pretrained_transformer",
          "model_name": model_name,
          "namespace": "tokens"
        }
      },
    }
  },
  'model': {
    'type': model,
    'transformer_model': model_name,
  },
  "data_loader": {
    "batch_size": batch_size,
    "shuffle": true
  },
  "distributed": {
    "cuda_devices": if num_gpus > 1 then std.range(0, num_gpus - 1) else 0,
  },
  "trainer": {
    "num_epochs": 10,
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": lr,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "correct_bias": true
    },
    "learning_rate_scheduler": {
      "type": "polynomial_decay",
    },
    "use_amp": true,
    "grad_norm": 1.0,
    "validation_metric": "+rec1",
    "epoch_callbacks": [
      {
        "type": "optuna_pruner"
      }
    ]
  }
}
This is the config I was using. You would have to swap in your own model and dataset reader. I can try to reproduce the problem with a simpler example that uses predefined models, but it would take me a while, since I won't be using the multi-GPU cluster for some time.
@vikigenius Thank you for your help.
Let me ask a question: does this configuration work if it runs on a single GPU (that is, with distributed disabled)?
The current implementation of the pruning feature in the AllenNLP integration may not work in a distributed setting.
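For the single-GPU check, a minimal sketch of how the relevant parts of the config above might be adapted. It shows only the keys that change, so the reader, model, data loader, and optimizer sections would still need to be carried over. It assumes that the trainer accepts a `cuda_device` key for single-device runs and that leaving out the top-level `distributed` block is enough to disable distributed training. Neither assumption has been verified in this thread.

```jsonnet
// Sketch only: single-GPU variant of the relevant parts of the config above.
local num_gpus = 1;

{
  // Emit "distributed" only when more than one GPU is requested.
  // In Jsonnet, a field whose computed name evaluates to null is omitted.
  [if num_gpus > 1 then "distributed"]: {
    "cuda_devices": std.range(0, num_gpus - 1),
  },
  "trainer": {
    // Assumption: "cuda_device" selects the single device to train on.
    [if num_gpus <= 1 then "cuda_device"]: 0,
    "num_epochs": 10,
    "validation_metric": "+rec1",
    "epoch_callbacks": [
      { "type": "optuna_pruner" },
    ],
  },
}
```

With `num_gpus = 1` the `distributed` field disappears from the evaluated config. Setting it back to 8 restores the multi-GPU device list without any other changes.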
If your configuration works on a single GPU, I'll investigate the AllenNLP integration in Optuna. However, it may take time, because the mechanism for supporting PruningCallback in the integration is relatively complicated (I implemented...) and I don't have a multi-GPU cluster right now.
Sorry for the inconvenience. 🙇
Related to optuna/optuna#1990.
FYI @vikigenius
I'm working on entirely refactoring the AllenNLP integration in Optuna (optuna/optuna#2796).
Once this PR is merged, PruningCallback should work with distributed training.
In Optuna v3.0.0a0, we finally introduced support for the pruning callback in distributed training.
https://github.com/optuna/optuna/releases/tag/v3.0.0-a0
pip install -U optuna==3.0.0a0