HunterMcGushion/hyperparameter_hunter

Specify the train_dataset variable

Closed this issue · 2 comments

The documentation states:

train_dataset (Pandas.DataFrame, or str path) – The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read file at path via pandas.read_csv()

There are several unclear definitions here.

  • If you supply a pandas DataFrame: which columns are used to train, and which are used as the outputs? Are all columns other than the defined "target" column used as inputs, or are only numerical columns used?
  • If I supply a file path, does the file have to be loadable as a pandas DataFrame directly?
  • Is there an option to supply other types of data, e.g., a catboost.Pool for the CatBoost experiments? This could be required because categorical variables are declared in a Pool.

Making this documentation text a bit more precise would help enormously! Thanks.

Great points, thanks for bringing this up!

  • You are correct: the target_column kwarg to Environment is used for the output. In addition, if an id_column is specified for Environment, it is removed prior to training.
    Apart from those two, the way to specify which columns are used to fit the Experiment is via the feature_selector kwarg to the CrossValidationExperiment class. If feature_selector is not specified, all columns in train_dataset aside from target_column and id_column will be used during training/predicting. All columns are used regardless of type.

  • Currently, the file path is expected to be a .csv file, and as the documentation notes, it will be opened via pandas.read_csv, so anything accepted by pandas.read_csv is fine.

  • No, at the moment, we can only provide DataFrames. I’d like to expand this to include other dataset types, including Pools and standard NumPy arrays. Unfortunately, I don’t have much experience with Pools, so I’d need some help adding that functionality, but I’ve been able to use CatBoost without issues in the meantime.
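
To make the first two answers concrete, here is a minimal sketch of the behavior described above, written in plain pandas. It is not the library's actual code: the helper names load_train_dataset and select_feature_columns are hypothetical, and treating feature_selector as a list of column names is an assumption made only for illustration.

```python
import io

import pandas as pd


def load_train_dataset(train_dataset):
    """Hypothetical helper: accept a DataFrame as-is, otherwise read it as CSV."""
    if isinstance(train_dataset, pd.DataFrame):
        return train_dataset
    # Anything pandas.read_csv accepts (path string, file-like object) works here.
    return pd.read_csv(train_dataset)


def select_feature_columns(df, target_column, id_column=None, feature_selector=None):
    """Hypothetical helper: pick the input columns as described in the answer.

    Assumption: feature_selector, when given, is a list of column names.
    """
    if feature_selector is not None:
        return list(feature_selector)
    excluded = {target_column, id_column}
    # No dtype filtering: non-numeric columns are kept as well.
    return [col for col in df.columns if col not in excluded]


# A tiny in-memory CSV stands in for a file on disk.
csv_text = "id,age,city,target\n1,34,Oslo,0\n2,51,Lima,1\n"
df = load_train_dataset(io.StringIO(csv_text))

features = select_feature_columns(df, target_column="target", id_column="id")
print(features)  # ['age', 'city']
```

Note that the string column city stays in the feature list, which matches "all columns are used regardless of type". Whether the model can consume such a column directly is a separate question that depends on the model.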

Thanks for bringing all this to my attention! I’ll update the documentation shortly!