susheels/adgcl

Question about ablation study in unsupervised learning

zwb29 opened this issue · 7 comments

zwb29 commented

Hi, thanks for your great work!
I have a question about a detail of NAD-GCL.
In Section 5.1 of the paper it says:

NAD-GCL drops the edges of a graph uniformly at random. We consider NAD-GCL-FIX and NAD-GCL-OPT with different edge drop ratios. NAD-GCL-FIX adopts the edge drop ratio of AD-GCL-FIX at the saddle point of the optimization (Eq. 8), while NAD-GCL-OPT optimally tunes the edge drop ratio over the validation datasets to match AD-GCL-OPT.

I didn't quite understand how the edge drop ratio is defined in NAD-GCL.
What's the difference between NAD-GCL and GraphCL when GraphCL uses EdgePert?
Thank you!

susheels commented

Hi @zwb29

Thanks for writing to us.
Uniformly random edge-drop augmentation for NAD-GCL is implemented in https://github.com/susheels/adgcl/blob/main/unsupervised/utils.py, lines 138-154. You can have a look there to understand it.

The idea is that each edge is either kept or dropped at random, with the total number of dropped edges fixed by the edge-drop ratio. The process is random and non-learnable, hence we term it NAD (non-adversarial).
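For concreteness, here is a minimal sketch of this kind of fixed-ratio uniform edge dropping (a simplified illustration, not the exact code in utils.py; `edge_index` follows the `[2, E]` PyTorch Geometric convention):

```python
import torch

def drop_edges_uniform(edge_index: torch.Tensor, drop_ratio: float) -> torch.Tensor:
    """Drop a fixed fraction of edges uniformly at random.

    edge_index: [2, E] tensor of (source, target) pairs.
    drop_ratio: fraction of edges to remove, e.g. 0.2 drops 20% of edges.
    """
    num_edges = edge_index.size(1)
    num_keep = num_edges - int(drop_ratio * num_edges)
    # Keeping the first num_keep indices of a random permutation is
    # equivalent to sampling the dropped edges without replacement.
    perm = torch.randperm(num_edges)
    return edge_index[:, perm[:num_keep]]

# Example: a 4-node cycle; dropping 25% leaves 3 of 4 edges.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
print(drop_edges_uniform(edge_index, 0.25).shape)  # torch.Size([2, 3])
```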

It is different from GraphCL because, conceptually, GraphCL uses multiple augmentations: combinations of node dropping, edge perturbation/dropping, subgraph sampling, and feature masking. The original GraphCL paper proposes using a combination of augmentations rather than any particular one, and the results it reports are also for such combinations of random augmentations.

We use NAD-GCL as an ablation of AD-GCL to focus on a particular family of augmentations, i.e. edge dropping. That said, the underlying loss function of NAD-GCL and GraphCL is the same InfoNCE loss based on SimCLR.
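For reference, a minimal sketch of that SimCLR-style InfoNCE loss over two batched views (simplified to use only cross-view negatives; the exact implementations in either repo may differ in such details):

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """SimCLR-style InfoNCE: z1[i] and z2[i] embed two views of the same
    graph; all other pairs in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # [B, B] cosine similarities
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Example with random 128-d embeddings for a batch of 32 graphs.
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```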

Thanks

zwb29 commented

Hi @susheels . Thanks for your quick reply!

In the GraphCL experiments, multiple combinations of augmentations (5×5) are tried, and the best result among them is reported as the final result. The best augmentation combinations for the TU datasets can be found here.

We use NAD-GCL as an ablation of AD-GCL to focus on a particular family of augmentations, i.e. edge dropping.

In my understanding, NAD-GCL is the contrast between the original graph (view 1) and an edge-dropped graph with a fixed edge_dropout_ratio (view 2). So NAD-GCL-FIX can be seen as one branch of the GraphCL experiments; in other words, the result of NAD-GCL shouldn't be able to beat GraphCL.

However, NAD-GCL beats GraphCL on some datasets in unsupervised learning performance (Table 1 in the paper), which confuses me.

I also replicated GraphCL based on your code, and on some datasets it surpassed the GraphCL results in your paper (using the best manually selected augmentations mentioned above).

What am I missing?
(By the way, I really love your work; it's quite easy to understand and it automates augmentation elegantly.)

Thanks

susheels commented

Hi @zwb29
Glad that you like our work.

Firstly, the link you mention is not for GraphCL but for GraphCL Automated, which are different papers. The original GraphCL selects augmentations based on domain knowledge.

Yes, you can think of NAD-GCL as a very specific type of GraphCL, but I would say that NAD-GCL's purpose is to serve as an ablation of our AD-GCL.

You bring up an important point that even GraphCL misses: edge dropping is actually really powerful for some datasets, even when implemented as a uniformly random process.

A very important distinction must be made between edge perturbation and edge dropping. The original GraphCL uses edge perturbation, where you both drop and add edges uniformly at random; NAD-GCL considers only edge dropping. It may be the case that edge dropping is a better augmentation family than edge perturbation for some datasets.
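To make the distinction concrete, an illustrative sketch (not GraphCL's exact implementation; real code would typically also avoid self-loops and duplicate edges):

```python
import torch

def perturb_edges(edge_index: torch.Tensor, num_nodes: int, ratio: float) -> torch.Tensor:
    """Edge perturbation: drop `ratio` of the existing edges and add the
    same number of uniformly random new ones. Pure edge dropping, as in
    NAD-GCL, performs only the first step and skips the addition."""
    num_edges = edge_index.size(1)
    num_pert = int(ratio * num_edges)
    perm = torch.randperm(num_edges)
    kept = edge_index[:, perm[num_pert:]]               # drop existing edges
    added = torch.randint(0, num_nodes, (2, num_pert))  # add random node pairs
    return torch.cat([kept, added], dim=1)
```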

I also want to point out that we use a linear downstream classifier, unlike the non-linear ones used in GraphCL. That said, we do provide results for the non-linear case in the appendix.

As far as the GraphCL results are concerned, they are either taken from their paper (non-linear case) or reproduced with the same augmentations they mention for certain datasets, using a linear downstream classifier.

Happy to answer in more detail if the above doesn't fully answer your question.

zwb29 commented

Hi @susheels
Thank you for answering my questions. I have no more questions, so I'll close this issue.

Firstly, the link you mention is not for GraphCL but for GraphCL Automated, which are different papers. The original GraphCL selects augmentations based on domain knowledge.

GraphCL Automated is an improvement on GraphCL by the same authors. Appendix C of the link I mentioned details the GraphCL runs they did as a comparison experiment against GraphCL Automated.

Edge dropping is actually really powerful for some datasets even when implemented as a uniformly random process.

Thank you for telling me this; I didn't know that and will try it.

I also want to point out that we use a linear downstream classifier, unlike the non-linear ones used in GraphCL.

This is clear in your paper and I understand this part well.

Thank you again and wish you success in your research!

zwb29 commented

Sorry, one more question.
Can you explain more about the relationship between the Graph Information Bottleneck and your work?

susheels commented

Hi @zwb29
If you look at the GIB principle, the objective is (roughly)

max_Z I(Y; Z) − β I(X; Z)

In principle it says that if we had access to the labels Y, we could be clever and reduce irrelevant/noisy information about the original data in our representations (the second term has a minus sign) while keeping only task-relevant information (the first term). We extend this to the unsupervised setting, where there is no Y (no labels).

AD-GCL is like playing a game between the augmenter and the encoder, wherein the augmenter tries to corrupt the original input graph while the encoder tries to capture the information required to identify graphs in the dataset. It so happens that this is also the minimum sufficient information (our Thm. 1). We believe this principle helps us get at information that is somewhat relevant, although, unlike GIB, we can't quantify it without label information; the best we can do is bound the information.
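Written out (paraphrasing the min-max objective in the paper, with f the encoder, 𝒯 the family of learnable edge-drop augmenters, and t(G) a sample from the augmenter T):

```latex
% AD-GCL: the augmenter T minimizes the mutual information the encoder f
% can retain between a graph and its augmented view, while f maximizes it.
\min_{T \in \mathcal{T}} \; \max_{f} \;
  I\big( f(G);\, f(t(G)) \big), \qquad t(G) \sim T(G)
```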

Thus the motivation comes from GIB, in that broadly it's about the interplay of how much useful and how much noisy information is present in the representations. Moreover, our motivating experiment highlights that noisy information alone is actually enough to identify graphs in the dataset (MI maximization), so we have to be careful about what information gets encoded in the representations. Adversarial training clearly helps, at least according to our experiments, but the space of augmentations, i.e. the family of augmentation processes, might itself contribute to what is actually captured. More work also has to be done to understand what the adversarial optimization does to the representations.

zwb29 commented

Many thanks, I'll close it.