fabsig/GPBoost

Improving performance

Closed this issue · 1 comments

IanQS commented

Hey there! Sorry to bother you with such a specific problem. Say I have a dataset with the following columns:

  • latitude
  • longitude
  • regression_target (predicting %nitrogen in soil)
  • (other things like string categories, etc.)

I tried GPBoost hoping that it would perform better than a naive LightGBM model, but I'm finding that the two are basically on par with one another. To be fair, I only have 70 data points, so perhaps GPBoost hasn't had the opportunity to shine.

I'm hyperparameter searching over

{
        "num_leaves": {
            "values": list(range(10, 101, 10)),
        },
        "n_estimators": {
            "values": list(range(100, 1001, 100)),
        },
        "learning_rate": {
            "values": list(el / 1000 for el in range(1, 100)) + list(el / 10 for el in range(1, 20)),
        },
        "max_depth": {
            "values": list(range(4, 100, 5)),
        },
        "lambda_l1": {
            "values": [0.1, 0.2, 0.3, 0.5, 0.7],
        },
        "lambda_l2": {
            "values": [0.1, 0.2, 0.3, 0.5, 0.7],
        },
        "extra_trees": {
            "values": [True, False],
        },
        "cov_function": {
            "values": ["exponential", "gaussian"]
        }
}

(This uses the Weights & Biases interface, in case it looks unusual to you.)
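To get a feel for how large this search space is relative to 70 data points, the value lists above can be counted directly. This is only a rough sketch: the `search_space` dictionary below is a flattened copy of the configuration (without the `"values"` wrappers), and the name is chosen here for illustration.

```python
from math import prod

# Value lists copied from the sweep configuration above.
search_space = {
    "num_leaves": list(range(10, 101, 10)),
    "n_estimators": list(range(100, 1001, 100)),
    "learning_rate": [el / 1000 for el in range(1, 100)]
    + [el / 10 for el in range(1, 20)],
    "max_depth": list(range(4, 100, 5)),
    "lambda_l1": [0.1, 0.2, 0.3, 0.5, 0.7],
    "lambda_l2": [0.1, 0.2, 0.3, 0.5, 0.7],
    "extra_trees": [True, False],
    "cov_function": ["exponential", "gaussian"],
}

# Number of candidate values per hyperparameter.
sizes = {name: len(values) for name, values in search_space.items()}

# Size of the full grid if every combination were tried.
n_combinations = prod(sizes.values())

print(sizes)
print(n_combinations)
print(min(search_space["learning_rate"]), max(search_space["learning_rate"]))
```

The full grid has 23,600,000 combinations, and the learning-rate list runs from 0.001 up to 1.9, so any search over it samples only a tiny fraction of the candidates.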

I'm setting up training like this:

import gpboost as gpb

# The columns listed in categorical_features are left as strings.
data_train = gpb.Dataset(train_x, train_y, categorical_feature=categorical_features)
# cov_function is the value sampled by the hyperparameter search.
gp_model = gpb.GPModel(gp_coords=coords_train, cov_function=cov_function)
bst = gpb.train(params=config, train_set=data_train, gp_model=gp_model)
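If `config` is the full sampled dictionary, it also carries `cov_function`, which the snippet above passes to `GPModel` rather than to the tree booster. One possible way to keep the two apart is sketched below; `split_config` is a hypothetical helper, not part of GPBoost, and the example values are made up for illustration.

```python
def split_config(config):
    """Separate the GP covariance setting from the booster parameters.

    Hypothetical helper: assumes `config` is a plain dict holding one
    sampled value per hyperparameter, including "cov_function".
    """
    booster_params = dict(config)  # shallow copy, leaves `config` untouched
    cov_function = booster_params.pop("cov_function")
    return booster_params, cov_function


# Example with made-up sampled values.
config = {
    "num_leaves": 10,
    "learning_rate": 0.05,
    "max_depth": 4,
    "cov_function": "gaussian",
}
booster_params, cov_function = split_config(config)
print(booster_params)
print(cov_function)

# The two parts would then feed the calls from the snippet above, e.g.:
# gp_model = gpb.GPModel(gp_coords=coords_train, cov_function=cov_function)
# bst = gpb.train(params=booster_params, train_set=data_train, gp_model=gp_model)
```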

Is there anything that looks glaringly incorrect to you?

Apologies for my slow reply. This looks good. Maybe also include shallower trees:

"max_depth": {
    "values": list(range(2, 98, 5)),
},

"num_leaves": {
    "values": list(range(2, 103, 10)),
},
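As a quick check of what these ranges add, the sketch below lists the suggested values and shows how a depth limit caps the number of leaves. It relies only on the fact that a binary tree of depth d has at most 2^d leaves; `leaf_cap` is a name introduced here for illustration.

```python
# Values produced by the suggested ranges.
max_depth_values = list(range(2, 98, 5))
num_leaves_values = list(range(2, 103, 10))

print(max_depth_values)
print(num_leaves_values)


def leaf_cap(num_leaves, max_depth):
    """Upper bound on leaves: a binary tree of depth d has at most 2**d leaves."""
    return min(num_leaves, 2 ** max_depth)


# With the shallowest setting, max_depth is the binding constraint
# for every num_leaves value larger than 4.
for num_leaves in num_leaves_values:
    print(num_leaves, leaf_cap(num_leaves, max_depth=2))
```

Both lists now start at 2, so the search includes very small trees, which is the point of the suggestion for a dataset this small.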

70 data points is very few. I'm not sure I would use any machine learning model at all...