tensorflow/neural-structured-learning

Massive Label Leakage in GAM/GAM* Implementation

VijayLingam95 opened this issue · 5 comments

Hi GAM Authors,

I have noticed a massive label leakage bug in your implementation of GAM/GAM*. Both edge_iterator and pair_iterator use true labels instead of predicted labels for unlabeled nodes. Below are more details on the label leakage in each of these iterators.

  1. Bug in edge_iterator (defined in trainer_classification_gcn.py)
    Source of label leakage:
    elif labeling == 'lu':
        edges = (
            data.get_edges(src_labeled=True, tgt_labeled=False) +  # Adds edges where the src node is labeled and the tgt node is unlabeled
            data.get_edges(src_labeled=False, tgt_labeled=True))  # Adds edges where the src node is unlabeled and the tgt node is labeled

I have bolded the line of concern (the second get_edges call, with src_labeled=False and tgt_labeled=True).

In line 692, LU_edges and UL_edges are concatenated. Note that unlabeled nodes are added as source nodes in the bolded line.
We can also see this by printing the edges variable in line 710.

While iterating through the edges in line 718, true labels are assigned to unlabeled indices. I have pasted the lines below and highlighted the line of concern.

    for edge in iterator:
        indices_src = edge[:, 0]
        indices_tgt = edge[:, 1]
        features_src = data.get_features(indices_src)
        features_tgt = data.get_features(indices_tgt)
        labels_src = data.get_labels(indices_src)
        labels_tgt = data.get_labels(indices_tgt)
        yield (indices_src, indices_tgt, features_src, features_tgt, labels_src,
               labels_tgt)

data.get_labels() returns true labels for unlabeled indices, which leads to the massive improvements reported in the paper.

After fixing this label-leakage bug (either by removing the UL edges or by reversing them), we observe only marginal improvements over the baselines.
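A minimal sketch of the "reversing UL edges" option, written in plain Python rather than against the GAM code base. The helper name orient_lu_edges and the is_labeled mapping are hypothetical and used only for illustration. It assumes the graph can be treated as undirected, so that swapping the endpoints of an edge is acceptable.

```python
def orient_lu_edges(edges, is_labeled):
    """Keep only labeled/unlabeled edges, with the labeled node as source.

    `edges` is a list of (src, tgt) node-index pairs and `is_labeled` maps a
    node index to True/False. Both are hypothetical stand-ins for the data
    object used in the trainer.
    """
    oriented = []
    for src, tgt in edges:
        if is_labeled[src] and not is_labeled[tgt]:
            # Already an LU edge: keep as is.
            oriented.append((src, tgt))
        elif is_labeled[tgt] and not is_labeled[src]:
            # UL edge: reverse it so the labeled node becomes the source.
            oriented.append((tgt, src))
        # LL and UU edges are dropped from the 'lu' set.
    return oriented


is_labeled = {0: True, 1: False, 2: True, 3: False}
edges = [(0, 1), (3, 2), (1, 3), (0, 2)]
lu_edges = orient_lu_edges(edges, is_labeled)
print(lu_edges)
```

With the edges oriented this way, the source of every 'lu' edge is a labeled node, so looking up the source label never touches the true label of an unlabeled node.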

  2. Bug in pair_iterator:

pair_iterator is defined at trainer_classification_gcn.py::633.
Printing the variables labels_src and labels_tgt in lines 668 and 669 shows that true labels, rather than predicted labels, are assigned for the LU and UU pair iterators.
The _select_from_pool() method invoked by pair_iterator assigns labels using data.get_labels(indices_batch). This call returns true labels for unlabeled indices instead of predicted labels.
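For illustration, here is a rough sketch of the label selection I would expect, in plain Python. The function select_labels and its arguments are hypothetical names and are not part of the GAM code. It assumes predicted labels for the unlabeled nodes are available from the current classifier.

```python
def select_labels(indices, true_labels, predicted_labels, is_labeled):
    """Return the true label for labeled nodes and the prediction otherwise.

    All four arguments are hypothetical stand-ins: `true_labels` and
    `predicted_labels` are indexable by node index, and `is_labeled` says
    whether the true label of a node may be used during training.
    """
    return [
        true_labels[i] if is_labeled[i] else predicted_labels[i]
        for i in indices
    ]


true_labels = [0, 1, 1, 0]
predicted_labels = [0, 1, 0, 1]
is_labeled = [True, True, False, False]

# Nodes 2 and 3 are unlabeled, so their predictions are used.
batch_labels = select_labels([0, 2, 3], true_labels, predicted_labels, is_labeled)
print(batch_labels)
```

In this toy batch the true labels would be [0, 1, 0], so returning them for nodes 2 and 3 would leak information the model should not see.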

Hi Vijay,

Thanks for discovering this issue! I have made a pull request with a fix for the edge_iterator. See pull request #82. Given this change, we need to retune the hyperparameters, so we are rerunning some experiments and will post updates on the GAM repository.

However, I am not sure I understand the issue with "pair_iterator". You are right that it returns the true labels for unlabeled nodes, but these are not used. If you check the function "_construct_feed_dict" line 610 in "trainer_classification_gcn.py", we do not use the labels of the targets of LU edges (which are the unlabeled ones). Based on this, GAM* results should not be affected.
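One way to check a claim like this is a perturbation test: change only the target labels and see whether the constructed feed changes. The sketch below is plain Python. build_feed is a hypothetical stand-in and not the real "_construct_feed_dict", and the key names are assumptions made for the example.

```python
def build_feed(labels_src, labels_tgt, pair_type):
    """Hypothetical feed builder (NOT the real _construct_feed_dict).

    For 'lu' pairs it deliberately ignores the target labels, mirroring the
    behavior described above.
    """
    feed = {"labels_src": list(labels_src)}
    if pair_type == "ll":
        feed["labels_tgt"] = list(labels_tgt)
    return feed


def uses_target_labels(builder, pair_type):
    """Return True if changing only the target labels changes the feed."""
    feed_a = builder([0, 1], [0, 0], pair_type)
    feed_b = builder([0, 1], [1, 1], pair_type)
    return feed_a != feed_b


lu_leaks = uses_target_labels(build_feed, "lu")
ll_uses = uses_target_labels(build_feed, "ll")
print(lu_leaks, ll_uses)
```

If the same kind of check is run against the actual trainer and the feed for LU pairs is unaffected by the target labels, the true labels returned by the iterator cannot influence training through that path.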

Hi, Otilia and Krishna. Can this bug be closed, now that PR #82 has been merged? Thanks.