facts.txt vs. train.txt
Closed this issue · 2 comments
Hey,
Really enjoyed your work, I just have a quick question.
I'm confused by the distinction between the facts.txt and train.txt files. From what I can tell, together they form the full training dataset (e.g. for FB15K-237 the triple counts of the two files add up to 272,115, which is the size of the training set).
They both seem to populate the self.train_data variable (in DataLoader.shuffle_train), so I guess it doesn't matter. I was just curious if there was some reason behind it.
Thanks,
Harry
Hello, you can check #1 first.
We need this split during training because the training (query) triples should not also appear among the fact triples. Otherwise the model could find the answer to a query directly in the facts it is given, which is information leakage. The split can be random, and the ratio between the two files can be treated as a hyperparameter.
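As a rough illustration of the random split described above, here is a minimal sketch. The function name split_facts_and_queries, the fact_ratio value of 0.75, and the toy triples are all hypothetical and not taken from this repository; the only point is that the two sets are disjoint and together cover the full training set.

```python
import random


def split_facts_and_queries(triples, fact_ratio=0.75, seed=0):
    """Randomly partition training triples into disjoint fact and query sets.

    fact_ratio is a hypothetical hyperparameter: the fraction of triples
    that would go to facts.txt; the remainder would go to train.txt.
    """
    triples = list(triples)      # copy so the caller's list is not reordered
    rng = random.Random(seed)    # local RNG keeps the split reproducible
    rng.shuffle(triples)
    n_fact = int(len(triples) * fact_ratio)
    return triples[:n_fact], triples[n_fact:]


# Toy (head, relation, tail) triples, made up for this example.
triples = [
    ("a", "r1", "b"),
    ("b", "r1", "c"),
    ("c", "r2", "d"),
    ("d", "r2", "a"),
    ("a", "r3", "c"),
    ("b", "r3", "d"),
    ("c", "r1", "a"),
    ("d", "r1", "b"),
]

facts, queries = split_facts_and_queries(triples, fact_ratio=0.75)

# No query triple appears among the facts, so nothing leaks.
assert set(facts).isdisjoint(queries)
```

Re-running the split with a different seed gives a different partition of the same triples, so the facts and queries can vary between runs while staying disjoint.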
Makes sense. Thanks for the quick response!