openml-labs/gama

Cross validation error when running on small classification data set

Closed this issue · 2 comments

Hey,

I apologize if I'm raising too many issues 😅, but I came across another problem when running GAMA on a small data set such as OpenML data set #10. Running the following code produces an error:

import openml
from gama import GamaClassifier

if __name__ == "__main__":
  dataset = openml.datasets.get_dataset(10)
  X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute, dataset_format="dataframe"
  )

  automl = GamaClassifier(max_total_time=180, store="nothing")
  print("Starting `fit` which will take roughly 3 minutes.")
  automl.fit(X, y)

With the following error:

Starting `fit` which will take roughly 3 minutes.
Traceback (most recent call last):
  File "/Users/chris/Development/gradproject/issues/gama/gama/./test.py", line 13, in <module>
    automl.fit(X, y)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/gama.py", line 549, in fit
    self._search_phase(warm_start, timeout=fit_time)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/gama.py", line 610, in _search_phase
    self._search_method.search(self._operator_set, start_candidates=pop)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/search_methods/async_ea.py", line 66, in search
    self.output = async_ea(
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/search_methods/async_ea.py", line 138, in async_ea
    new_individual = ops.create(current_population, 1)[0]
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/genetic_programming/operator_set.py", line 97, in create
    return self._create_from_population(self, *args, **kwargs)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/genetic_programming/selection.py", line 22, in create_from_population
    parent_pairs = nsga2_select(pop, n, metrics)
  File "/Users/chris/Development/gradproject/issues/gama/gama/gama/genetic_programming/nsga2.py", line 47, in nsga2_select
    raise ValueError("population must be at least size 3 for a pair to be selected")
ValueError: population must be at least size 3 for a pair to be selected

After looking into it, I found that no individual is ever added to current_population in async_ea() (gama/search_methods/async_ea.py). This happens because cross-validation fails with an error on every pipeline evaluation (scikitlearn.py/evaluate_pipeline):

<class 'ValueError'> y_true and y_pred contain different number of classes 2, 4. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [1 2]

I've traced it to sklearn.model_selection.cross_validate, and I already have a working solution. The problem comes from splitting a small data set into validation folds: a fold may not contain every class, so the labels found in y_true no longer match the classes the pipeline predicts (here 2 versus 4). It can be solved by providing the full set of labels to the cross-validator beforehand. I'll make a proper pull request fixing this soon.
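
For illustration, here is a minimal sketch of the failure and of the "provide the labels" fix. It assumes the failing metric is scikit-learn's log_loss, whose error text matches the message above. The arrays and the name all_labels are made up for the example; this is not GAMA's code and not necessarily the change that ends up in the pull request.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical validation fold: the model was trained on 4 classes,
# but only classes 1 and 2 occur among this fold's true labels.
all_labels = [1, 2, 3, 4]  # assumed full label set of the data
y_true = np.array([1, 2, 1, 2])
y_proba = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
])  # one probability column per class in all_labels

# Without `labels`, log_loss infers the classes from y_true (2 of them),
# which does not match the 4 probability columns.
try:
    log_loss(y_true, y_proba)
    raised = False
except ValueError as err:
    raised = True
    print(type(err), err)

# Passing the full label set tells log_loss how to map columns to classes.
loss = log_loss(y_true, y_proba, labels=all_labels)
print("log loss with explicit labels:", loss)
```

The same idea applies inside cross-validation: the scorer needs to know every class up front, because a single fold of a small data set cannot be relied on to contain them all.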

> I apologize if I'm raising too many issues 😅

Never :) They are proper bug reports, and the PRs are icing on the cake. Truly much appreciated.

Fixed in #151