Label File/Test dataset

Question

ValerieF412 opened this issue 4 years ago · 1 comments

Hello,
I'm currently trying to use the flowgmm model on my own NLP dataset. However, I found two things quite confusing. Would you mind clarifying?
1. Should all the data in my test dataset be manually labelled, or can I manually label only part of it?
2. I'm not sure how to generate the label file (LABELPATH: path to the label split generated by the data preparation scripts). Should I put the indices of all the unlabelled and labelled data into this file?

Thank you so much for reading this issue!

Answer 1 · 2020-09-23T02:47:11.000Z

Hi, sorry for the delayed response!

I think in our scripts all the test data is labeled because we track the accuracy on the test set. In a practical application, however, you don't need your test data to be labeled at all. Maybe you could use a labeled subset of the training data for validation?
You can create an npz file like this: https://github.com/izmailovpavel/flowgmm/blob/public/data/bin/unpack_mnist.py#L34-L36. The file has to store labeled_indices and unlabeled_indices, which are numpy arrays containing the indices of the labeled and unlabeled data. Here is another example: https://github.com/izmailovpavel/flowgmm/blob/public/data/nlp_datasets/text_preprocessing/AGNewsPreprocessing.ipynb; in that notebook you need to uncomment

np.savez(os.path.join(label_dir, str(i)),
    labeled_indices=indices[mask],
    unlabeled_indices=indices[~mask])
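
For a custom dataset, a standalone version of the same idea might look like the minimal sketch below. It is not taken from the flowgmm repository: the function name `make_label_split`, the random choice of labeled examples, and the assumption that the indices refer to positions in your training set are all illustrative. Only the two array names, labeled_indices and unlabeled_indices, come from the description above.

```python
import os
import tempfile

import numpy as np


def make_label_split(num_examples, num_labeled, out_path, seed=0):
    """Save a label split as an npz file (hypothetical helper, not part of flowgmm).

    The file stores two arrays, labeled_indices and unlabeled_indices,
    which together cover every index in range(num_examples) exactly once.
    """
    rng = np.random.default_rng(seed)
    indices = np.arange(num_examples)

    # Boolean mask marking which examples are treated as labeled.
    # Assumption: labeled examples are picked uniformly at random.
    mask = np.zeros(num_examples, dtype=bool)
    mask[rng.choice(num_examples, size=num_labeled, replace=False)] = True

    np.savez(out_path,
             labeled_indices=indices[mask],
             unlabeled_indices=indices[~mask])

    # np.savez appends ".npz" when the given path does not already end with it.
    return out_path if out_path.endswith(".npz") else out_path + ".npz"


if __name__ == "__main__":
    label_dir = tempfile.mkdtemp()
    path = make_label_split(num_examples=1000, num_labeled=200,
                            out_path=os.path.join(label_dir, "0"))
    split = np.load(path)
    print(path, split["labeled_indices"].shape, split["unlabeled_indices"].shape)
```

If you already know which examples have labels, you would build the mask from that information instead of sampling it at random.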

You would also need to create a class for your data here: https://github.com/izmailovpavel/flowgmm/blob/public/flow_ssl/data/nlp_datasets.py. Then, I believe, you need to add the processing of your dataset here: https://github.com/izmailovpavel/flowgmm/blob/public/flow_ssl/data/ssl_data_utils.py, analogously to AG_News.

Sorry that this is so complicated :) Please let me know if you have any issues.