ARM-software/mango

More iterations run than the configured setting

wt12318 opened this issue · 6 comments

Hi,

When I set num_iteration to 50, the actual number of iterations run is more than 50:

config = dict()
config["optimizer"] = "Bayesian"
config["num_iteration"] = 50

tuner = Tuner(HYPERPARAMETERS, 
              objective=run_one_training,
              conf_dict=config) 
results = tuner.minimize()
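
One way to count the evaluations Mango actually ran, independent of MLflow, is to look at the returned results dictionary; this is a minimal sketch assuming it exposes a 'params_tried' list (key names may differ across Mango versions):

# Count the configurations Mango actually evaluated
# (assumes the results dict contains a 'params_tried' list; this is an assumption,
#  not confirmed from the Mango version used here)
print("evaluations run:", len(results["params_tried"]))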

MLflow shows that 62 iterations have run:
(screenshot: MLflow run list showing 62 runs)

Hi,
Thanks for asking this question.

Internally, Mango runs a few random iterations to do a proper initialization.
By default, the number of these random iterations is 2.
You can change this with the config parameter 'initial_random'.
So, in most cases, your total number of iterations will be num_iteration + initial_random.

However, this random parameter is a suggestion to the optimizer, and in some cases it may run more random iterations to do a proper initialization. This happens for problems where the variation in the objective value is very small, and Mango may internally decide to run more random iterations to make sure it finds good regions in the hyperparameter space. For most problems, setting initial_random keeps the number of iterations bounded as needed.

This may also happen when some of the random iterations fail and your objective function handles those failures, in which case Mango runs additional random iterations to make sure 2 random iterations succeed.
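
For example, here is a minimal sketch of a config that makes the random initialization explicit, using the same Tuner call as in your snippet above:

config = dict()
config["optimizer"] = "Bayesian"
config["num_iteration"] = 50    # Bayesian iterations requested
config["initial_random"] = 2    # random initialization iterations (default: 2)

# In most cases the total number of evaluations will be
# num_iteration + initial_random = 52.
tuner = Tuner(HYPERPARAMETERS,
              objective=run_one_training,
              conf_dict=config)
results = tuner.minimize()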

Thank you

Hi,

When I set initial_random to 1, it still runs more iterations than I set. Also, the total number of combinations of all my parameters is 36, but it runs more than 36 iterations. Why does this happen?

Thank you.

Can you share more details about your parameter space and the definition of your objective function?

Thank you for the reply. Here are my objective function and parameter space:

@scheduler.parallel(n_jobs=36)
def run_one_training(**params):
    with mlflow.start_run() as run:
        # Log parameters used in this experiment
        for key in params.keys():
            mlflow.log_param(key, params[key])

        # Loading the dataset
        print("Loading dataset...")
        train_dataset = TCRpMHCDataset(root="/public/slst/home/wutao2/TCR_neo/data/", filename="train_dt.csv",aaindex=aaindex, test=False, val=False)
        test_dataset = TCRpMHCDataset(root="/public/slst/home/wutao2/TCR_neo/data/", filename="val_dt.csv", aaindex=aaindex, test=False, val=True)

        # Prepare training
        train_loader = DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=params["batch_size"], shuffle=True)

        # Loading the model
        print("Loading model...")
        model_params = {k: v for k, v in params.items() if k.startswith("model_")}
        model = GNN(feature_size=train_dataset[0].x.shape[1], model_params=model_params) 
        model = model.to(device)
        print(f"Number of parameters: {count_parameters(model)}")
        mlflow.log_param("num_params", count_parameters(model))

        # pos_weight < 1 increases precision, > 1 increases recall (not set here)
        loss_fn = torch.nn.BCEWithLogitsLoss()
        optimizer = torch.optim.Adam(model.parameters(), 
                                    lr=params["learning_rate"],
                                    weight_decay=0)
        #scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=params["scheduler_gamma"])
        
        # Start training
        best_loss = 1000
        early_stopping_counter = 0
        for epoch in range(20): 
            if early_stopping_counter <= 5:  # early-stopping patience of 5 test evaluations
                # Training
                model.train()
                loss = train_one_epoch(epoch, model, train_loader, optimizer, loss_fn)
                print(f"Epoch {epoch} | Train Loss {loss}")
                mlflow.log_metric(key="Train loss", value=float(loss), step=epoch)

                # Testing
                model.eval()
                if epoch % 1 == 0:
                    loss = test(epoch, model, test_loader, loss_fn)
                    print(f"Epoch {epoch} | Test Loss {loss}")
                    mlflow.log_metric(key="Test loss", value=float(loss), step=epoch)
                    
                    # Update best loss
                    if float(loss) < best_loss:
                        best_loss = loss
                        # Save the currently best model 
                        mlflow.pytorch.log_model(model, "model", signature=SIGNATURE)
                        
                        early_stopping_counter = 0
                    else:
                        early_stopping_counter += 1

            else:
                print("Early stopping due to no improvement.")
                return [best_loss]
    print(f"Finishing training with best test loss: {best_loss}")
    return [best_loss]

HYPERPARAMETERS = {
    "batch_size": [32,64,128],
    "learning_rate": [0.001,0.0001],
    "model_embedding_size": [32,64,128],
    "model_layers": [2,3],
    "model_dropout_rate": [0.5]
}

torch.set_num_threads(36)
torch.manual_seed(2022060801)
print("Running hyperparameter search...")
config = dict()
config["optimizer"] = "Bayesian"
config["num_iteration"] = 36
config["initial_random"] = 1

tuner = Tuner(HYPERPARAMETERS, 
              run_one_training,
              config) 
results = tuner.minimize()
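
(For reference, the 36 mentioned above is just the product of the number of choices per parameter in HYPERPARAMETERS; a quick sketch to check it:)

import math

# 3 (batch_size) * 2 (learning_rate) * 3 (embedding_size) * 2 (layers) * 1 (dropout) = 36
n_combinations = math.prod(len(v) for v in HYPERPARAMETERS.values())
print("grid combinations:", n_combinations)  # 36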

(screenshot of the MLflow runs)

Hi,
Thanks for providing the details. I have been a bit busy for the last few days due to an imminent deadline.
I will work on reproducing this issue next week and will update you with a solution or more information.