Specify the train_dataset variable
Closed this issue · 2 comments
The documentation states:
train_dataset (Pandas.DataFrame, or str path) – The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read file at path via pandas.read_csv()
There are several unclear definitions here.
- If you supply a pandas DataFrame: which columns are used as inputs for training and which are used as the outputs? Are all columns apart from the defined "target" column used as inputs, or only the numerical ones?
- If I supply a file path, does it have to be loadable as a pandas DataFrame directly?
- Is there an option to supply other types of data, e.g., a catboost.Pool for the CatBoost experiments? This could be required because categorical variables are declared in a Pool.
Making this documentation text a bit more precise would help enormously! Thanks.
Great points, thanks for bringing this up!
- You are correct, the `target_column` kwarg to `Environment` is used for the output. In addition, if an `id_column` is specified for `Environment`, it is removed prior to training.

  Apart from those two, the actual way to specify which columns are used to `fit` the Experiment is via the `feature_selector` kwarg to the `CrossValidationExperiment` class. If `feature_selector` is not specified, all columns aside from `target_column` and `id_column` in `train_dataset` will be used during training/predicting. All columns are used regardless of type.
- Currently, the file path is expected to be a .csv file, and as the documentation notes, it will be opened via `pandas.read_csv`, so anything accepted by `pandas.read_csv` is fine.
- No, at the moment, we can only provide DataFrames. I’d like to expand this to include other dataset types, including Pools and standard NumPy arrays. Unfortunately, I don’t have much experience with Pools, so I’d need some help adding that functionality, but I’ve been able to use CatBoost without issues in the meantime.
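
To make the column-selection rule from the first point concrete, here is a rough sketch in plain Python. It is not the library's actual code: `select_feature_columns` is a hypothetical helper, and it assumes `feature_selector` is simply a list of column names (other forms are not covered here).

```python
from typing import Iterable, List, Optional


def select_feature_columns(
    columns: Iterable[str],
    target_column: str,
    id_column: Optional[str] = None,
    feature_selector: Optional[Iterable[str]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the rule described above (not library code)."""
    # Assumption: an explicit feature_selector is a list of column names
    # and fully determines which columns are used as inputs.
    if feature_selector is not None:
        return list(feature_selector)

    # Default: every column except the target and (if given) the id column,
    # regardless of the column's data type.
    excluded = {target_column}
    if id_column is not None:
        excluded.add(id_column)
    return [name for name in columns if name not in excluded]


# Made-up column names, purely for illustration
columns = ["id", "age", "city", "income", "target"]

# No feature_selector: all remaining columns are inputs, including the
# non-numerical "city" column
default_features = select_feature_columns(columns, "target", id_column="id")

# Explicit feature_selector: only the listed columns are inputs
chosen_features = select_feature_columns(
    columns, "target", id_column="id", feature_selector=["age", "income"]
)
```

With a real DataFrame, the same idea would apply to `df.columns`. Note that the non-numerical `city` column stays in the default selection, matching the "regardless of type" behavior described above.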
Thanks for bringing all this to my attention! I’ll update the documentation shortly!
Closed by 0ea8077