ndif-team/nnsight

[Bug] Grad Setting Malfunctioning

Closed this issue · 3 comments

I have come across an important issue related to setting .grad to a new tensor object. After setting .grad and saving it, the values returned by the Proxy reflect the original content of .grad before it was set to a new tensor, and are thus incorrect. This indicates that the update was not applied properly. Below is a minimal reproducible example, along with some additional testing I conducted to help pinpoint the exact provenance of the bug.

Note: The problem seems to affect only the user's ability to save and access the correct Grad Proxy value; it does not affect the correctness of the gradient values in the interleaved execution of the model. The updated gradient values still propagate properly within the model and work as intended.

Code:

with model.trace(input):
    model.layer1.output.requires_grad = True

    l1_grad_before = model.layer1.output.grad.clone().save()

    model.layer1.output.grad = model.layer1.output.grad.clone() * 2

    l1_grad_after = model.layer1.output.grad.save()

    loss = model.output.sum()
    loss.backward()

print("L1_Grad_Before: ", l1_grad_before)
print("L1_Grad_After: ", l1_grad_after)

Out:

L1_Grad_Before:  tensor([[-0.0336, -0.0316,  0.5292, -0.1040, -0.0569,  0.0234, -0.1658, -0.3092,
          0.0051,  0.3399]])
L1_Grad_After:  tensor([[-0.0336, -0.0316,  0.5292, -0.1040, -0.0569,  0.0234, -0.1658, -0.3092,
          0.0051,  0.3399]])

Intervention Graph:
Layer 1 Grad Setting

This example shows that the two Layer 1 Grad Proxies are identical and hold the same values even after the modification (L1_Grad_After should be double the value of L1_Grad_Before).
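For reference, here is a minimal sketch (plain PyTorch, no nnsight) of the behavior one would expect: assigning a new tensor to .grad rebinds the attribute, so reading it back returns the doubled values.

```python
import torch

x = torch.randn(1, 10, requires_grad=True)
(x * 2).sum().backward()

before = x.grad.clone()
# In plain PyTorch, assignment rebinds .grad to the new tensor
x.grad = x.grad.clone() * 2

assert torch.equal(x.grad, before * 2)  # the doubled values are read back
```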


To support the claim that this bug stems from the setting operation on the .grad attribute, I want to share the following two examples:

The first example demonstrates that even after .grad is set to a new tensor object, the two saved proxies not only have the same values but also point to the same memory address. This indicates that the setting operation did not replace the reference to the old grad tensor with the new one.

Code:

with model.trace(input):
    model.layer1.output.requires_grad = True

    l1_grad_before = model.layer1.output.grad.save()

    model.layer1.output.grad = model.layer1.output.grad.clone() * 2

    l1_grad_after = model.layer1.output.grad.save()

    loss = model.output.sum()
    loss.backward()

print("L1_Grad_Before: ", l1_grad_before)
print("L1_Grad_Before_Address: ", hex(id(l1_grad_before)))
print("L1_Grad_After: ", l1_grad_after)
print("L1_Grad_After_Address: ", hex(id(l1_grad_after)))

Out:

L1_Grad_Before:  tensor([[-0.0336, -0.0316,  0.5292, -0.1040, -0.0569,  0.0234, -0.1658, -0.3092,
          0.0051,  0.3399]])
L1_Grad_Before_Address:  0x11f3c1ea0
L1_Grad_After:  tensor([[-0.0336, -0.0316,  0.5292, -0.1040, -0.0569,  0.0234, -0.1658, -0.3092,
          0.0051,  0.3399]])
L1_Grad_After_Address:  0x11f3c1ea0

The second example shows that directly modifying the .data attribute of the Grad Proxy does not trigger the bug and produces the expected results. This works because setting .data does not involve a "swap" of the Grad Proxy, since .data is not a Proxy itself but just an attribute. This is also visible in the intervention graph.

Code:

with model.trace(input):
    model.layer1.output.requires_grad = True

    l1_grad_before = model.layer1.output.grad.clone().save()

    model.layer1.output.grad.data = model.layer1.output.grad.clone() * 2

    l1_grad_after = model.layer1.output.grad.save()

    loss = model.output.sum()
    loss.backward()

print("L1_Grad_Before: ", l1_grad_before)
print("L1_Grad_After: ", l1_grad_after)

Out:

L1_Grad_Before:  tensor([[-0.0336, -0.0316,  0.5292, -0.1040, -0.0569,  0.0234, -0.1658, -0.3092,
          0.0051,  0.3399]])
L1_Grad_After:  tensor([[-0.0673, -0.0633,  1.0584, -0.2081, -0.1137,  0.0469, -0.3317, -0.6185,
          0.0102,  0.6798]])

Intervention Graph:
Layer 1 Grad Setting Data Attr
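As a plain-PyTorch analogy (a sketch independent of nnsight's Proxy machinery), assigning to .data swaps the underlying values of the existing tensor without binding the attribute to a new object, which is consistent with why this path avoids the Proxy "swap":

```python
import torch

x = torch.randn(1, 10, requires_grad=True)
(x * 2).sum().backward()  # x.grad is now a tensor of 2s

g = x.grad
x.grad.data = x.grad.clone() * 2  # updates values in place; no new object is bound

assert x.grad is g                                    # still the same tensor object
assert torch.equal(x.grad, torch.full((1, 10), 4.0))  # but with doubled values
```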

I'd like to add a few remarks about this bug. I observed an additional "backward" contribution when setting the grad property.

Setup:

from nnsight import NNsight
import torch

class DebugSubModule(torch.nn.Module):

    def forward(self, x):
        return 5 * x + 1
    
network = DebugSubModule()
model = NNsight(network)

When setting the grad, the constant ends up being added twice in the final gradient [bug]. The first contribution comes from .grad = const, which sets the tensor gradient to const. The other comes from the backward pass, whose gradient (w.r.t. the output) is intervened to const and then accumulated on top. The saved_grad is the original gradient, as expected. Reproduction:

input = torch.randn(1, 1)
const = torch.full_like(input, 3)
with model.trace(input):
    model.input[0][0].requires_grad = True
    output = model.output.save()
    saved_grad = model.input[0][0].grad.save()
    model.input[0][0].grad = const
    output.backward()
print("input.grad: ", input.grad)
print("saved_grad: ", saved_grad)

Output:

input.grad:  tensor([[6.]])
saved_grad:  tensor([[5.]])
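For comparison, here is what a single intervened backward looks like in plain PyTorch using a gradient hook (a sketch of the expected behavior, not nnsight's actual mechanism): const is accumulated exactly once, so input.grad ends up as tensor([[3.]]) rather than tensor([[6.]]).

```python
import torch

x = torch.randn(1, 1, requires_grad=True)
const = torch.full_like(x, 3.0)

# The hook replaces the incoming gradient (w.r.t. x) with const
x.register_hook(lambda g: const)
y = (5 * x + 1).sum()
y.backward()

assert torch.equal(x.grad, const)  # const is accumulated exactly once
```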

When adding another backward pass, I observe the following: the saved grad is the original gradient of the first backward, as expected. Setting the grad to const triggers the same bug, with the grad in fact set to 2 * const [bug]. Additionally, the grad of the second backward pass is added to the gradient (as expected), proving that the intervention indeed targeted the first backward pass.

input = torch.randn(1, 1)
const = torch.full_like(input, 3)
with model.trace(input):
    model.input[0][0].requires_grad = True
    output = model.output.save()
    saved_grad = model.input[0][0].grad.save()
    model.input[0][0].grad = const
    (2*output).backward(retain_graph=True)
    output.backward()
print("input.grad: ", input.grad)
print("saved_grad: ", saved_grad)

Output

input.grad:  tensor([[11.]])
saved_grad:  tensor([[10.]])
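Again for comparison, the expected result under the same two-backward setup can be sketched in plain PyTorch: intervene only on the first backward via a hook, remove it, and run the second backward untouched. The result is const + 5 = 8, versus the 2 * const + 5 = 11 observed above.

```python
import torch

x = torch.randn(1, 1, requires_grad=True)
const = torch.full_like(x, 3.0)

handle = x.register_hook(lambda g: const)
y = (5 * x + 1).sum()
(2 * y).backward(retain_graph=True)  # intervened: contributes const
handle.remove()                      # second backward runs untouched
y.backward()                         # contributes dy/dx = 5

assert torch.equal(x.grad, const + 5)  # tensor([[8.]])
```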

I might have missed a lot and am still digging into this.

Yes it is!