facts.txt vs. train.txt

Question

Closed this issue 3 years ago · 2 comments

Hey,

Really enjoyed your work, I just have a quick question.

I'm confused by the distinction between the facts.txt and train.txt files. From what I can tell, they combine to form the full training dataset (e.g. for FB15K-237, the triples in the two files add up to 272,115, which is the size of the training set).

They both seem to populate the self.train_data variable (in DataLoader.shuffle_train) so I guess it doesn't matter. I was just curious if there was some reason behind it.

Thanks,
Harry

Answer 1 · 2022-06-09T08:37:11.000Z

Hello, you can check #1 first.

We need this split during training so that the training (query) triples are not covered by the fact triples. Otherwise, there would be some information leakage. The split can be random, and the ratio between the two files can be regarded as a hyperparameter.
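
For illustration, here is a minimal sketch of such a random split. It is not the repository's own preprocessing code: the function name `split_facts_and_queries`, the `fact_ratio` parameter, its 0.75 value, and the toy triples are all assumptions made up for this example. It only shows the idea that the two sets are disjoint and together make up the full training set.

```python
import random


def split_facts_and_queries(triples, fact_ratio=0.75, seed=0):
    """Randomly partition training triples into disjoint fact and query sets.

    Hypothetical helper: fact_ratio is the share of triples kept as facts
    and would be tuned like any other hyperparameter.
    """
    # Drop exact duplicates (keeping first-seen order) so that no triple
    # can end up in both the fact set and the query set.
    unique_triples = list(dict.fromkeys(triples))

    # Shuffle with a local, seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique_triples)

    n_facts = int(len(unique_triples) * fact_ratio)
    facts = unique_triples[:n_facts]      # would be written to facts.txt
    queries = unique_triples[n_facts:]    # would be written to train.txt
    return facts, queries


# Toy (head, relation, tail) triples, invented for this example.
all_triples = [
    ("a", "r1", "b"),
    ("b", "r1", "c"),
    ("c", "r2", "d"),
    ("d", "r2", "a"),
    ("a", "r3", "c"),
    ("b", "r3", "d"),
    ("c", "r1", "a"),
    ("d", "r1", "b"),
]

facts, queries = split_facts_and_queries(all_triples, fact_ratio=0.75)

# No query triple appears among the facts, so nothing leaks.
assert set(facts).isdisjoint(queries)
# The two parts together recover the full training set.
assert len(facts) + len(queries) == len(all_triples)
```

Under these assumptions, changing `fact_ratio` moves triples between the two files without changing their combined size, which matches the observation that the two files add up to the full training set.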

Answer 2 · 2022-06-10T06:27:04.000Z

Makes sense. Thanks for the quick response!