himkt/allennlp-optuna

PruningCallback doesn't work

himkt opened this issue · 6 comments

himkt commented

Apart from #18

#18 (comment)

himkt commented

@vikigenius

Thank you so much for diving into allennlp-optuna.

Which storage do you use? (should be one of sqlite3, MySQL, PostgreSQL, Redis)
And could you please share with me a simple reproducible configuration?

vikigenius commented

Thanks for creating the issue, @himkt. I use the default sqlite3 storage.

local model_name = "models/distilroberta-base-msmarco-v1/0_Transformer";
local num_gpus = 8;
local data_base_url = "data/mydata/processed/";
local batch_size = std.parseInt(std.extVar('batch_size'));
local lr = std.parseJson(std.extVar('lr'));
local model = "my_model";
local dataset_reader = "my_reader";

{
  "train_data_path": data_base_url + "train.tsv.part*",
  "validation_data_path": data_base_url + "valid.tsv.part*",
  "dataset_reader": {
    "type": "sharded",
    "base_reader": {
      "type": dataset_reader,
      "query_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": model_name,
        "max_length": 500,
      },
      "query_token_indexers": {
        "tokens": {
          "type": "pretrained_transformer",
          "model_name": model_name,
          "namespace": "tokens"
        }
      },
    }
  },
  "model": {
    "type": model,
    "transformer_model": model_name,
  },
  "data_loader": {
    "batch_size": batch_size,
    "shuffle": true
  },
  "distributed": {
    "cuda_devices": if num_gpus > 1 then std.range(0, num_gpus - 1) else 0,
  },
  "trainer": {
    "num_epochs": 10,
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": lr,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "correct_bias": true
    },
    "learning_rate_scheduler": {
      "type": "polynomial_decay",
    },
    "use_amp": true,
    "grad_norm": 1.0,
    "validation_metric": "+rec1",
    "epoch_callbacks": [
      {
        "type": "optuna_pruner"
      }
    ]
  }
}

This is the config I was using. You would have to change the model and dataset reader. I can try to reproduce the problem with a simpler example that uses predefined models, but it would take me a while, since I won't have access to the multi-GPU cluster for some time.
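For anyone adapting the config: the two `std.extVar` calls at the top expect the hyperparameters to arrive as strings. Here is a minimal Python sketch of that round trip, assuming the values are passed as plain strings. The `ext_vars` dict and the example values are illustrative, not taken from the actual run.

```python
import json

# Hypothetical hyperparameter values; in a real run they come from the Optuna trial.
ext_vars = {"batch_size": str(32), "lr": str(1e-5)}

# Mirrors std.parseInt(std.extVar('batch_size')) in the config.
batch_size = int(ext_vars["batch_size"])

# Mirrors std.parseJson(std.extVar('lr')): the string is parsed as a JSON number.
lr = json.loads(ext_vars["lr"])

print(batch_size, lr)
```

The learning rate goes through a JSON parse rather than an integer parse because it is a float, possibly in exponent notation.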

himkt commented

@vikigenius Thank you for your help.

Let me ask a question: does this configuration work when it runs on a single GPU (that is, with distributed training disabled)?
The current implementation of the pruning feature in the AllenNLP integration may not work in a distributed setting.
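
One way to set up that single-GPU check is sketched below in Python, assuming the config has already been evaluated into a dict. The `disable_distributed` helper is hypothetical, and setting `cuda_device` on the trainer is an assumption about how the single device would be selected.

```python
import copy


def disable_distributed(config: dict, device: int = 0) -> dict:
    """Return a copy of an AllenNLP-style config with distributed training removed."""
    single = copy.deepcopy(config)
    # Dropping the "distributed" section turns off multi-GPU training.
    single.pop("distributed", None)
    # Assumption: the trainer picks its single device from "cuda_device".
    single.setdefault("trainer", {})["cuda_device"] = device
    return single


# Trimmed-down stand-in for the config above (illustrative values only).
config = {
    "distributed": {"cuda_devices": list(range(8))},
    "trainer": {"num_epochs": 10, "epoch_callbacks": [{"type": "optuna_pruner"}]},
}

single_gpu = disable_distributed(config)
print(sorted(single_gpu))
```

The pruner callback stays in place, so the run shows whether pruning works once distributed training is out of the picture.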

If your configuration works on a single GPU, I'll investigate the AllenNLP integration in Optuna. However, it may take time, because the mechanism for supporting PruningCallback in the integration is relatively complicated (I implemented...) and I don't currently have access to a multi-GPU cluster.

Sorry for the inconvenience. 🙇

himkt commented

Related to optuna/optuna#1990.

himkt commented

FYI @vikigenius

I'm working on a complete refactoring of the AllenNLP integration in Optuna (optuna/optuna#2796).
Once this PR is merged, PruningCallback should work with distributed training.

himkt commented

In Optuna v3.0.0a0, we finally introduced support for the pruning callback in distributed training.
https://github.com/optuna/optuna/releases/tag/v3.0.0-a0

pip install -U optuna==3.0.0a0
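
To confirm that the upgrade took effect, a quick check along these lines may help. This is a sketch only: `supports_distributed_pruning` is a hypothetical helper, and it assumes v3.0.0a0 is the earliest 3.x release, so comparing the major version is enough.

```python
import re
from importlib import metadata


def supports_distributed_pruning(version: str) -> bool:
    """Return True if the Optuna version string has a major version of 3 or higher."""
    match = re.match(r"(\d+)", version)
    return match is not None and int(match.group(1)) >= 3


try:
    installed = metadata.version("optuna")
    print(installed, supports_distributed_pruning(installed))
except metadata.PackageNotFoundError:
    print("optuna is not installed")
```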