Question about Unlabeled Data
bibudhlahiri opened this issue · 2 comments
I am using your example code from the following library on a dataset created for one of our applications:
https://www.tensorflow.org/neural_structured_learning/tutorials/graph_keras_lstm_imdb
Can you please clarify how and where in the code the data is split between the labeled and unlabeled sets? I see the concept of a supervision ratio, and I see that you took 90% of the original train_dataset as validation_dataset and kept the remaining 10% as train_dataset. Do you mean the validation_dataset was used as the unlabeled dataset? That's not what it looks like based on the documentation here:
https://www.tensorflow.org/neural_structured_learning/api_docs/python/nsl/keras/GraphRegularization#fit
Thanks for the question, @bibudhlahiri !
Your understanding is correct that the validation_dataset is used as the unlabeled set.
But before training, the unlabeled set is merged with the labeled set using nsl.tools.pack_nbrs to create "neighbor-augmented" training examples. That is, in addition to its own features (words), each example also contains its neighbors' features (NL_nbr_0_words, NL_nbr_1_words) and the corresponding edge weights (NL_nbr_0_weight, NL_nbr_1_weight). In the tutorial, the pack_nbrs step happens before the train-validation split, so effectively we allow both labeled and unlabeled examples to be neighbors of (labeled) training examples.
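To make the augmentation concrete, here is a small pure-Python sketch of what a neighbor-augmented example looks like. This is not the actual `nsl.tools.pack_nbrs` implementation (which operates on TFRecord files of `tf.train.Example`s); the helper name `pack_neighbors` and the toy data are illustrative only, but the `NL_nbr_<i>_words` / `NL_nbr_<i>_weight` feature names follow the tutorial:

```python
def pack_neighbors(example, graph, all_examples, max_nbrs=2):
    """Conceptual sketch of neighbor augmentation (not the real
    nsl.tools.pack_nbrs): copy each neighbor's 'words' feature and
    its edge weight into the example under NL_nbr_<i>_* keys."""
    augmented = dict(example)
    # Take the strongest max_nbrs edges for this example's node.
    neighbors = sorted(graph.get(example["id"], []),
                       key=lambda nbr: -nbr[1])[:max_nbrs]
    for i, (nbr_id, weight) in enumerate(neighbors):
        augmented[f"NL_nbr_{i}_words"] = all_examples[nbr_id]["words"]
        augmented[f"NL_nbr_{i}_weight"] = weight
    return augmented

# Toy data: one labeled example whose neighbor is unlabeled.
all_examples = {
    "ex0": {"id": "ex0", "words": [1, 2, 3], "label": 1},  # labeled
    "ex1": {"id": "ex1", "words": [4, 5, 6]},              # unlabeled
}
graph = {"ex0": [("ex1", 0.8)]}  # similarity edge ex0 -> ex1

aug = pack_neighbors(all_examples["ex0"], graph, all_examples)
# aug keeps its own 'words' and 'label', and gains
# NL_nbr_0_words=[4, 5, 6] and NL_nbr_0_weight=0.8
```

Note that the unlabeled example contributes only its features, never a label, which is why it can safely be a neighbor of a labeled training example.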
When calling graph_reg_model.fit, the train_dataset already contains neighbor-augmented training examples, so we do not have to specify the unlabeled set again. The argument validation_data=validation_dataset is for monitoring validation accuracy, just like in ordinary model training.
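To illustrate why no separate unlabeled set is needed at fit time, here is a simplified pure-Python sketch of a graph-regularized loss computed from a single neighbor-augmented example. The function name, the squared-distance choice, and the `alpha` multiplier are assumptions for illustration (the real `nsl.keras.GraphRegularization` supports configurable distance metrics); the point is that everything the regularizer needs comes from the augmented example itself:

```python
def graph_regularized_loss(supervised_loss, embedding, nbr_embeddings,
                           nbr_weights, alpha=0.1):
    """Sketch: total loss = supervised loss + alpha * weighted distance
    between an example's embedding and its neighbors' embeddings.
    The neighbor embeddings come from the NL_nbr_<i>_* features that
    pack_nbrs already baked into the training example, so fit() never
    needs the unlabeled dataset as a separate argument."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    reg = sum(w * sq_dist(embedding, nbr)
              for nbr, w in zip(nbr_embeddings, nbr_weights))
    return supervised_loss + alpha * reg

loss = graph_regularized_loss(
    supervised_loss=0.5,
    embedding=[1.0, 0.0],          # model output for the example
    nbr_embeddings=[[0.0, 0.0]],   # model output for its one neighbor
    nbr_weights=[0.8],             # from NL_nbr_0_weight
)
# 0.5 + 0.1 * 0.8 * 1.0 = 0.58
```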
Closing this issue due to inactivity. Please feel free to reopen if you have further questions.