Dataset issue

Question

goodman1204 opened this issue 3 years ago · 2 comments

Hi authors,

Thanks for sharing the code. I want to reproduce your results, but I found that only the AIDS dataset is available. The other datasets are missing. I read the papers you cite for the other datasets (Fingerprint, WinMal, and Toxicant), but the datasets in those papers seem to differ in size from the ones used in your experiments.

Could you kindly share the preprocessed datasets used in your experiments? Alternatively, could you share the preprocessing steps applied to the raw datasets, since the paper does not mention them?

Issues with the following datasets:

Fingerprint: this dataset seems to come from the TUDataset collection, but the raw dataset has 2800 graphs with 15 classes.
WinMal: the cited paper [46], "Comparative Analysis of Feature Extraction Methods of Malware Detection", does not share this dataset or point to any link for it.
Toxicant: same issue as with the Fingerprint dataset.

goodman1204 commented 3 years ago

Thanks

Answer 1 · 2022-04-05T04:41:58.000Z

Hi, the Fingerprint and Toxicant datasets come from TUDataset (link). Note that: (1) we remove some labels from the original Fingerprint set and filter out some sparse graphs and isolated nodes from the remaining instances; (2) the original Toxicant datasets are very imbalanced, with about 5% toxicant and 95% benign graphs in almost every Tox set, so we extract the toxicant graphs from all Tox datasets and randomly pick some benign graphs to balance the dataset.
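For readers trying to reproduce step (2), the balancing could be sketched roughly as below. This is not the authors' code: the function name `balance_by_downsampling`, the 1:1 toxicant-to-benign ratio, and the fixed seed are all assumptions, since the answer only says that "some" benign graphs are picked at random.

```python
import random


def balance_by_downsampling(graphs, labels, positive_label=1, seed=0):
    """Keep every positive graph plus an equally sized random sample of negatives.

    Hypothetical helper: the 1:1 ratio is an assumption, not something
    stated in the thread.
    """
    rng = random.Random(seed)
    positives = [g for g, y in zip(graphs, labels) if y == positive_label]
    negatives = [g for g, y in zip(graphs, labels) if y != positive_label]

    # Never ask for more negatives than exist.
    k = min(len(positives), len(negatives))
    sampled_negatives = rng.sample(negatives, k)

    # Relabel as 1 (toxicant) / 0 (benign) and shuffle the merged set.
    balanced = [(g, 1) for g in positives] + [(g, 0) for g in sampled_negatives]
    rng.shuffle(balanced)
    return balanced
```

The graphs can be any objects (for example, graph objects pooled from several Tox sets); only the labels are inspected.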

As for WinMal, we use this graph set directly, but remove all isolated nodes and keep only graphs with 100~1000 nodes for classification.
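The WinMal filtering described above might look something like the following minimal sketch. It is not the authors' code. It assumes each graph is given as a node count plus an edge list of integer pairs, that the 100~1000 bounds are inclusive, and that the node count is taken after isolated nodes are removed; none of these details are stated in the thread, and the helper names are hypothetical.

```python
def remove_isolated_nodes(num_nodes, edges):
    """Drop nodes that appear in no edge and relabel the rest as 0..n-1.

    A node whose only edge is a self-loop is kept here (an assumption).
    """
    used = sorted({u for edge in edges for u in edge})
    remap = {old: new for new, old in enumerate(used)}
    new_edges = [(remap[u], remap[v]) for u, v in edges]
    return len(used), new_edges


def filter_graphs(graphs, min_nodes=100, max_nodes=1000):
    """Remove isolated nodes, then keep graphs within the node-count range."""
    kept = []
    for num_nodes, edges in graphs:
        n, e = remove_isolated_nodes(num_nodes, edges)
        if min_nodes <= n <= max_nodes:  # inclusive bounds are an assumption
            kept.append((n, e))
    return kept
```

Counting nodes before removing isolated ones would give a different subset, so the order of the two steps is worth confirming with the authors when matching the reported dataset sizes.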