specific number of pairs for training/val/testing

Question

Closed this issue 5 years ago · 12 comments

Hello, thanks for sharing this repo.

As you mention in the paper, you randomly split the dataset 60%/20%/20% into training/validation/testing sets (420, 140, 140 graphs). Does that mean the numbers of pairs for training/validation/testing are 420*420=176400, 140*140=19600, 140*140=19600?

Also, for the AIDS700nef dataset, I cannot find the ground-truth dist_mat file for the testing graphs. All the AIDS700nef files in the /save folder you shared (https://drive.google.com/drive/folders/1Eusvi4_iOKM0AsO1LhxQFkY62kDEtuMq?usp=sharing) only cover the distance matrix between training or validation graphs.
The file 'aids700nef_ged_astar_gidpair_dist_map.pickle' contains 313600 entries (560*560 pairs, only for training/validation graphs). Could you share the A* ground-truth dist map for the testing graphs, or is it somewhere I overlooked?

Thanks,
xiang

yunshengb commented 5 years ago

Yup!
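
The counts quoted in the question can be checked with a few lines of arithmetic. This is only a sanity check of the numbers in this thread, assuming 700 graphs in AIDS700nef (420 + 140 + 140) and ordered pairs; the helper name `split_sizes` is made up for illustration and is not part of the repo.

```python
def split_sizes(n_graphs, percents=(60, 20, 20)):
    """Return (train, val, test) sizes using integer arithmetic."""
    return tuple(n_graphs * p // 100 for p in percents)


# Assumption: AIDS700nef has 700 graphs, as implied by 420 + 140 + 140.
n_train, n_val, n_test = split_sizes(700)

train_pairs = n_train * n_train            # 420 * 420
val_pairs = n_val * n_val                  # 140 * 140
trainval_pairs = (n_train + n_val) ** 2    # 560 * 560, the pickle size in the question

print(n_train, n_val, n_test)
print(train_pairs, val_pairs, trainval_pairs)
```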

Answer 1 · 2019-05-15T00:30:47.000Z

Hi, all.

I have the same problem.
I think there are only train/validation labels.
@yunshengb, could you share the labels for the test set?

Thanks,
Junhyun Lee

Answer 2 · 2019-05-15T08:04:55.000Z

All the ged_astar_gidpair_dist_map.pickles have been updated to include GEDs between test and train graphs (label for testset): https://drive.google.com/drive/folders/1Eusvi4_iOKM0AsO1LhxQFkY62kDEtuMq

Simply download them and put them under /save. Loading each of these pickles gives you a Python dict mapping graph id pairs to their true GED scores (raw, unnormalized).
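
A minimal sketch of loading such a file and looking up one pair might look like the following. It assumes the dict keys are tuples of two graph ids and that either ordering of a pair may be stored, which the thread does not confirm. The helper names and the toy ids are hypothetical, and a temporary toy pickle stands in for the real file under /save.

```python
import pickle
import tempfile
from pathlib import Path


def load_ged_map(path):
    """Load a pickled dict that maps graph id pairs to raw GED scores."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def lookup_ged(ged_map, gid1, gid2):
    """Return the GED for a pair, trying both key orders; None if absent."""
    if (gid1, gid2) in ged_map:
        return ged_map[(gid1, gid2)]
    return ged_map.get((gid2, gid1))


# Toy stand-in for a real pickle; ids and values are made up.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_gidpair_dist_map.pickle"
    with toy_path.open("wb") as f:
        pickle.dump({(6, 101): 4.0, (30, 101): 7.0}, f)
    toy_map = load_ged_map(toy_path)

print(lookup_ged(toy_map, 101, 6))   # pair stored as (6, 101)
print(lookup_ged(toy_map, 6, 30))    # pair not stored
```

For the real data, the same `load_ged_map` call would point at a file such as 'aids700nef_ged_astar_gidpair_dist_map.pickle' in /save. If the pickle was written by a different Python version, loading may need extra options.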

Please let me know if you have trouble finding or using these files. Thanks!

Answer 3 · 2019-05-15T08:25:08.000Z

Thanks @yunshengb .

But the data split does not seem to match your new pickle file.

For example, the AIDS test set on Google Drive contains the pair (6, 30).
But the updated AIDS pickle on Google Drive does not contain the pair (6, 30).

There are even (train, test) pairs in your new pickle file.

Could you share the exact split of your data?

Answer 4 · 2019-05-15T08:30:57.000Z

I see the problem. We do not need (6, 30) because both 6 and 30 are in the test set. As the paper states:

The evaluation reflects the real-world scenario of graph query: For each graph in the testing set, we treat it as a query graph, and let the model compute the similarity between the query graph and every graph in the database. The database graphs are ranked according to the computed similarities to the query.

We only need the test-train GEDs as test labels.
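
The query protocol quoted above can be sketched as follows: for one test graph, collect its true GED to every database graph and sort ascending. This is an illustration only, not the repo's evaluation code. It assumes the label dict is keyed by tuples of graph ids in either order, and the function name `true_ranking` and all ids and values are hypothetical.

```python
def true_ranking(ged_map, query_gid, database_gids):
    """Rank database graphs for one query by true GED, smallest first."""
    scored = []
    for gid in database_gids:
        # Assumption: a pair may be stored in either key order.
        ged = ged_map.get((query_gid, gid), ged_map.get((gid, query_gid)))
        if ged is None:
            raise KeyError(f"no GED label for pair ({query_gid}, {gid})")
        scored.append((ged, gid))
    scored.sort()  # sorts by GED, ties broken by graph id
    return [gid for _, gid in scored]


# Toy labels: query graph 900 against database graphs 1, 2, 3.
toy_labels = {(900, 1): 3, (900, 2): 1, (3, 900): 2}
ranking = true_ranking(toy_labels, 900, [1, 2, 3])
print(ranking)  # [2, 3, 1]
```

Under this protocol a test-test pair such as (6, 30) is never looked up, which is why it does not need to be in the pickle.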

Answer 5 · 2019-05-15T08:33:00.000Z

Oh, I got it. Thanks!

Then can I just randomly select the validation set from within the train-train set?

Answer 6 · 2019-05-15T08:34:56.000Z

Thank you!

Your answer is very helpful to me!

And I really appreciate your fast response (the updated pickle file)!

Answer 7 · 2019-05-15T08:35:37.000Z

You're welcome :)

Answer 8 · 2019-05-15T08:44:06.000Z

@yunshengb

I have one last question.
In your new pickle, the IMDB labels are marked as A*.

Are these the labels used in the paper?
According to the description, the labels are determined by Beam, Hungarian, and VJ.

Answer 9 · 2019-05-15T08:56:45.000Z

Yes. It is just called A* (sorry about the naming), but it is generated by Beam, Hungarian, and VJ.

Answer 10 · 2019-05-15T09:00:03.000Z

Thanks, again.

And congrats on your new paper being accepted at IJCAI-19! :)

Have a nice day!

Answer 11 · 2019-05-15T19:19:07.000Z

Thank you!