udellgroup/oboe

More datasets than in the paper?


I was checking the tensor in the repository at oboe/large_files/error_tensor_f16_compressed.npz, and I noticed it covers 551 datasets, while in the [paper](https://people.ece.cornell.edu/cy/_papers/tensor_oboe.pdf) you mention only 215 for meta-training. Did you add more? Moreover, is it possible to get the meta-features of these 551 datasets? Also, how do you compute the best initializations when meta-learning with auto-sklearn?
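For context, here is roughly how I inspected it (a minimal sketch; I don't know the array's key inside the archive ahead of time, so I list the archive contents rather than assume a name):

import numpy as np

# Open the compressed archive and list the arrays stored inside it.
data = np.load('oboe/large_files/error_tensor_f16_compressed.npz')
print(data.files)

# Look at the shape of the first stored array; one axis (of size 551)
# appears to index the datasets.
tensor = data[data.files[0]]
print(tensor.shape)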

Thanks!

Thanks for the question! Yes, I added more datasets and collected the meta-training performance with 5-fold cross-validation (instead of the 3-fold used in the TensorOboe paper) to make the system more robust.

As for meta-features: in Oboe and TensorOboe we use the factors from a matrix or tensor decomposition as the dataset embeddings (i.e., data-driven meta-features), if that's what you are asking about.
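To illustrate the idea with a minimal sketch (not the repository's actual code; the error matrix here is synthetic and the rank is arbitrary), the dataset embeddings are the dataset-side factors of a low-rank decomposition of the error matrix:

import numpy as np

# Synthetic stand-in for an (n_datasets x n_pipelines) error matrix.
rng = np.random.default_rng(0)
E = rng.random((551, 100))

# Rank-k truncated SVD: E ~ U S V^T.
k = 20
U, s, Vt = np.linalg.svd(E, full_matrices=False)
dataset_embeddings = U[:, :k] * s[:k]   # one k-dim embedding per dataset
pipeline_embeddings = Vt[:k, :].T       # one k-dim embedding per pipeline

print(dataset_embeddings.shape)  # (551, 20)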

For the best initializations given by auto-sklearn, we just used the default implementation in the auto-sklearn code repository at the time (I believe it was v0.12.1).
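Concretely, that means leaving auto-sklearn's meta-learning warm start at its defaults; a minimal sketch (the time budget is arbitrary, and X_train/y_train stand for your own data):

from autosklearn.classification import AutoSklearnClassifier

# auto-sklearn warm-starts its Bayesian optimization with configurations
# suggested by its built-in meta-learning; 25 initial configurations is
# the library default, spelled out here only for clarity.
clf = AutoSklearnClassifier(
    time_left_for_this_task=300,  # arbitrary budget for the sketch
    initial_configurations_via_metalearning=25,
)
# clf.fit(X_train, y_train)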

Thank you for your answer!

Would it be possible to get the source of the new datasets? Are they from OpenML? If so, could you share the corresponding task IDs?

Thanks!

The meta-training datasets are all from OpenML, and their IDs are stored in oboe/defaults/TensorOboe/training_index.pkl. You can read the file with:

import pickle
import os

# Path to your local checkout of the repository; adjust as needed.
path = '......oboe/oboe/defaults/TensorOboe'

# IDs holds the OpenML dataset IDs used for meta-training.
with open(os.path.join(path, 'training_index.pkl'), 'rb') as handle:
    IDs = pickle.load(handle)
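Once you have the IDs, you could fetch the corresponding datasets with the openml Python package, for example (a sketch; it assumes IDs is an iterable of OpenML dataset IDs, which is worth verifying after loading):

import openml

# Download one dataset by its OpenML dataset ID and split it into
# features and the default target attribute.
dataset = openml.datasets.get_dataset(list(IDs)[0])
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)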

I did not collect the meta-training data by following specific OpenML task IDs, though. I ran 5-fold stratified cross-validation (sklearn.model_selection.StratifiedKFold with random_state=0) to evaluate the pipelines assembled from the components listed in Table 2 of the paper.
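As a rough sketch of that protocol (the data and pipeline below are placeholders rather than an actual TensorOboe pipeline, and I set shuffle=True since random_state has no effect otherwise):

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data and pipeline standing in for an OpenML dataset and
# one pipeline assembled from the components in Table 2.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold stratified CV; the error (1 - accuracy) is one tensor entry.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(1 - scores.mean())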

Hope this information helps; please feel free to ask if anything else comes up.

Closing this issue for now. Feel free to reopen for anything.