yandex-research/rtdl-num-embeddings

How to use it to evaluate on other datasets and for other embedding algorithms?

herdonyan opened this issue · 2 comments

Should I convert the dataset to a CSV file, an Excel file, or some other format?
Which lines or files should I change if I want to use a new dataset and a new embedding algorithm for evaluation while keeping the awesome hyper-parameter tuning mechanism?

Yura52 commented

How to add new datasets

First, download and unpack the data as described here. This creates a new data/ directory in the repository, which contains the datasets used in the paper.

Then, you have to add your dataset in the data/ directory following the format of other datasets. Let's say your dataset's name is iris. Then you should use np.save and create the directory data/iris with the following content:

  • (only if the dataset has numerical features) X_num_train.npy, X_num_val.npy, X_num_test.npy (numpy arrays of float32)
  • (only if the dataset has categorical features) X_cat_train.npy, X_cat_val.npy, X_cat_test.npy (numpy arrays of strings)
  • y_train.npy, y_val.npy, y_test.npy (numpy arrays of {float32 for regression, int64 for classification}); for classification, the classes must be from range(n_classes)
  • info.json: see the info.json files of the other datasets for the expected content
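A minimal sketch of how the .npy files listed above could be produced. The helper name save_dataset, its parameters, and the random split are assumptions made for illustration and are not part of the repository. The categorical files follow the same train/val/test naming as the numerical ones. info.json is deliberately not written here, since its content should be modelled on the existing datasets.

```python
from pathlib import Path

import numpy as np


def save_dataset(out_dir, y, X_num=None, X_cat=None, y_dtype=np.int64,
                 val_frac=0.15, test_frac=0.15, seed=0):
    """Hypothetical helper: split the data and write the .npy files to out_dir.

    Use y_dtype=np.int64 for classification and np.float32 for regression.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Shuffle the row indices once and cut them into three parts.
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    parts = {
        'test': idx[:n_test],
        'val': idx[n_test:n_test + n_val],
        'train': idx[n_test + n_val:],
    }

    for part, part_idx in parts.items():
        if X_num is not None:  # numerical features: float32
            np.save(out_dir / f'X_num_{part}.npy',
                    np.asarray(X_num)[part_idx].astype(np.float32))
        if X_cat is not None:  # categorical features: strings
            np.save(out_dir / f'X_cat_{part}.npy',
                    np.asarray(X_cat)[part_idx].astype(str))
        np.save(out_dir / f'y_{part}.npy',
                np.asarray(y)[part_idx].astype(y_dtype))
    return {part: len(part_idx) for part, part_idx in parts.items()}


if __name__ == '__main__':
    # Synthetic stand-in for a real dataset (150 rows, 4 numerical features).
    rng = np.random.default_rng(0)
    X_num = rng.normal(size=(150, 4))
    labels = rng.choice(['setosa', 'versicolor', 'virginica'], size=150)
    # Map arbitrary labels to 0..n_classes-1, as required for classification.
    _, y = np.unique(labels, return_inverse=True)
    print(save_dataset('data/iris', y, X_num=X_num))
```

The np.unique(..., return_inverse=True) call is one way to satisfy the requirement that the classes come from range(n_classes). If your dataset already ships with a fixed train/val/test split, save those parts directly instead of reshuffling.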

Let's say you want to run the tuning & evaluation pipeline for MLP on your dataset. Then copy any existing config (for example, this one) and change the path inside the config to point to your dataset ("data/iris" instead of "data/california").

Full script:

export CUDA_VISIBLE_DEVICES="0"
mkdir exp/mlp/iris
cp exp/mlp/california/0_tuning.toml exp/mlp/iris/0_tuning.toml
# edit the new config as described above
python bin/tune.py exp/mlp/iris/0_tuning.toml
python bin/evaluate.py exp/mlp/iris/0_tuning 15
python bin/ensemble.py exp/mlp/iris/0_evaluation
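The copy-and-edit step of the script above could also be automated. The sketch below assumes that the dataset path appears in the config as the literal text data/california; the helper name clone_config is hypothetical and not part of the repository. It fails loudly if the old path is not found, so that an unchanged config is never written by accident.

```python
from pathlib import Path


def clone_config(src, dst, old='data/california', new='data/iris'):
    """Hypothetical helper: copy a config file, replacing the dataset path."""
    src, dst = Path(src), Path(dst)
    text = src.read_text()
    if old not in text:
        # Assumption violated: the path is not stored as a literal string.
        raise ValueError(f'{old!r} not found in {src}')
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text.replace(old, new))
    return dst


if __name__ == '__main__':
    clone_config('exp/mlp/california/0_tuning.toml',
                 'exp/mlp/iris/0_tuning.toml')
```

After that, it is still worth opening the new config and checking that nothing else in it is specific to the original dataset.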

How to add new embedding algorithms

I don't understand the question :) You can use bin/train4.py as a starting point.

Yura52 commented

Feel free to reopen the issue if needed.