wandb/examples

[Request]: Sweeps in mmdetection

David-Biggs opened this issue · 9 comments

Thanks for this amazing tool!! I have been blown away ever since I came across it.

I am using mmdetection to train my models. How do I go about setting my sweep within mmdetection?

Many thanks,
David

Hey @David-Biggs, thanks for the request. I will try to come up with something and let you know when I have an example.

Not sure if you are aware of the PR that was recently merged into the dev branch of MMDetection, which adds some dedicated W&B support to MMDetection.

For sweeps it might require some workaround but a proper integration would be better. I will try to scope it out.

Hey @ayulockin,

Great, thanks so much!

Looking forward to hearing from you.

Many thanks

Hey @David-Biggs, after thinking about this for some time, here's a rough solution for you if you wanna take a stab:

  • W&B requires a sweep config and a train function. You can see the same in this intro-to-sweeps colab:
  • A sweep config is nothing but a dict describing the hyperparameter space you wanna search over/tune. Below is an example of a sweep config:
import wandb

sweep_config = {
    "name": "my-sweep",
    "method": "random",
    "parameters": {
        "epochs": {
            "values": [10, 20, 50]
        },
        "learning_rate": {
            "min": 0.0001,
            "max": 0.1
        }
    }
}
  • You will then generate a sweep id by doing this: sweep_id = wandb.sweep(sweep_config).
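Here's how those pieces fit together end to end (a sketch: the project name and trial count are my own placeholders, and the `wandb.sweep`/`wandb.agent` calls are kept under a main guard since they contact the W&B server):

```python
sweep_config = {
    "name": "my-sweep",
    "method": "random",
    "parameters": {
        "epochs": {"values": [10, 20, 50]},
        "learning_rate": {"min": 0.0001, "max": 0.1},
    },
}

if __name__ == "__main__":
    import wandb

    def train():
        with wandb.init() as run:
            config = wandb.config
            # ... training code that reads config.epochs / config.learning_rate ...

    # Register the sweep with the W&B server, then let the agent
    # call train() once per trial (here capped at 5 trials).
    sweep_id = wandb.sweep(sweep_config, project="mmdet-sweeps")  # project name is illustrative
    wandb.agent(sweep_id, function=train, count=5)
```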
  • The train function looks something like this:
def train():
    with wandb.init() as run:
        config = wandb.config
        model = make_model(config)
        for epoch in range(config["epochs"]):
            loss = model.fit()  # your model training code here
            wandb.log({"loss": loss, "epoch": epoch})
  • The train function will have access to wandb.config. The values come from sweep_config (for a hyperparameter given as a range, the value is sampled based on the optimization method (grid, random, etc.)).
  • MMDetection also has a train_detector function that you can call from the train function. The interesting bit would be managing the MMDetection config.
  • You can load the MMDetection config inside the train function and update the required fields using wandb.config. Something like this:
from mmcv import Config
from mmdet.apis import train_detector

def train():
    with wandb.init() as run:
        config = wandb.config
        # MMDetection config
        config_file = 'mmdetection/configs/mask_rcnn/mask_rcnn_r50_caffe_fpn_mstrain-poly_1x_coco.py'
        cfg = Config.fromfile(config_file)
        cfg.optimizer.lr = config.learning_rate  # override from the sweep
        model = make_model(config)  # pseudocode: build your detector here
        train_detector(model, datasets, cfg, distributed=False, validate=True, meta=meta)
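If you end up tuning more than the learning rate, it can help to route every sweep parameter through dotted config keys. Here is a small hypothetical helper, `apply_overrides` (my own name, not a wandb or mmcv API), sketched against a plain nested dict; mmcv's `Config` is dict-like, so the same idea carries over:

```python
def apply_overrides(cfg, overrides):
    """Write flat dotted keys (e.g. 'optimizer.lr') into a nested dict cfg."""
    for dotted_key, value in overrides.items():
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            # Walk (and create, if missing) intermediate dicts.
            node = node.setdefault(key, {})
        node[leaf] = value
    return cfg

cfg = {"optimizer": {"type": "SGD", "lr": 0.02}, "runner": {"max_epochs": 12}}
apply_overrides(cfg, {"optimizer.lr": 0.001, "runner.max_epochs": 20})
# cfg["optimizer"]["lr"] is now 0.001; cfg["runner"]["max_epochs"] is now 20
```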

It should work. I will also try to write a colab and share it with you.

Hey @ayulockin

Thanks a lot!

I will try this out ASAP ;)

Hi @ayulockin,

So I've been working on it for a while and I made some alterations and have some interesting observations.

Alterations:

  1. In train() you make use of model = make_model(config). I removed this and used model = build_detector(cfg.model) (from mmdetection). I wasn't sure which one to use. I thought I would use the information in the 'current' sweep_config to update the cfg, by doing cfg.optimizer.lr = config.learning_rate etc., then pass cfg.model into build_detector().
  2. I removed meta=meta. The default for meta is None and I wasn't sure where you defined your meta variable.
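Putting the two alterations together, the revised train() might look like this (a sketch assuming the MMDetection 2.x APIs `build_detector`, `build_dataset`, and `train_detector`; the imports are deferred inside the function only so the sketch stands alone without mmdet installed):

```python
def train():
    # Deferred imports: in a real training script these belong at the top.
    import wandb
    from mmcv import Config
    from mmdet.models import build_detector
    from mmdet.datasets import build_dataset
    from mmdet.apis import train_detector

    with wandb.init() as run:
        config = wandb.config
        cfg = Config.fromfile(
            'mmdetection/configs/mask_rcnn/mask_rcnn_r50_caffe_fpn_mstrain-poly_1x_coco.py'
        )
        cfg.optimizer.lr = config.learning_rate  # sweep value -> MMDetection config
        model = build_detector(cfg.model)        # replaces the make_model pseudocode
        datasets = [build_dataset(cfg.data.train)]
        train_detector(model, datasets, cfg, distributed=False, validate=True)
```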

This works... sort of. The model trains and the sweep loops over the different values, but:

Observations:

  1. My training losses are all NaN
  2. During training, I get this error: The testing results of the whole dataset is empty. There are no validation results (no mAP or losses)

I removed all sweep related code and ran the MMDetection train_detector(model, datasets, cfg, distributed=False, validate=True) command and it worked perfectly fine. Losses were real values and I got validation results. I did some digging but could not resolve either of the two issues.

Thanks for trying it out @David-Biggs.

Sorry, I should have clarified that make_model was more pseudocode than an actual API. Glad it worked (sort of).

Were you able to resolve the NaN loss issue?

Hi @ayulockin,

So I found that the reason for the NaN loss issue was the learning rate. All the values I chose were slightly too large. After reducing the values I was able to do a successful sweep.
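For what it's worth, one way to keep a random sweep away from oversized learning rates is to sample them on a log scale; W&B supports this via the "distribution" key in the sweep config (a sketch: the bounds below are illustrative, not the values used in this thread):

```python
# Sampling the learning rate log-uniformly biases the search toward
# smaller values, which helps avoid the NaN-loss region seen above.
sweep_config = {
    "name": "my-sweep",
    "method": "random",
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",  # uniform in log space
            "min": 1e-5,
            "max": 1e-2,
        },
        "epochs": {"values": [10, 20, 50]},
    },
}
```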

Thanks again for your help.

Glad it worked for you. 👯‍♂️

Closing the issue since it's resolved.

@David-Biggs once I have my colab ready, I will share it here.