/TorchSnooper

Debug PyTorch code using PySnooper

Primary LanguagePythonMIT LicenseMIT

TorchSnooper

Status:

PyPI PyPI - Downloads

Checks:

Build Status

Deploy (only succeed on tagged commits):

Build Status

Do you want to look at the shape/dtype/etc. of every step of you model, but tired of manually writing prints?

Are you bothered by errors like RuntimeError: Expected object of scalar type Double but got scalar type Float, and want to quickly figure out the problem?

TorchSnooper is a PySnooper extension that helps you debugging these errors.

To use TorchSnooper, you just use it like using PySnooper. Remember to replace the pysnooper.snoop with torchsnooper.snoop in your code.

To install:

pip install torchsnooper

TorchSnooper also support snoop. To use TorchSnooper with snoop, simply execute:

torchsnooper.register_snoop()

or

torchsnooper.register_snoop(verbose=True)

at the beginning, and use snoop normally.

Example 1: Monitoring device and dtype

We're writing a simple function:

def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

and use it like below

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

The above code seems to be correct, but unfortunately, we are getting the following error:

RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'

What is the problem? Let's snoop it! Decorate our function with torchsnooper.snoop():

import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

Run our script, and we will see:

Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call         5 def myfunc(mask, x):
21:41:42.941834 line         6     y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line         7     y.masked_scatter_(mask, x)
21:41:42.944404 exception    7     y.masked_scatter_(mask, x)

Now pay attention to the devices of tensors, we notice

New var:....... y = tensor<(6,), float32, cpu>

Now, it's clear that, the problem is because y is a tensor on CPU, that is, we forget to specify the device on y = torch.zeros(6). Changing it to y = torch.zeros(6, device='cuda'), this problem is solved.

But when running the script again we are getting another error:

RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'

Look at the trace above again, pay attention to the dtype of variables, we notice

Starting var:.. mask = tensor<(6,), int64, cuda:0>

OK, the problem is that, we didn't make the mask in the input a byte tensor. Changing the line into

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)

Problem solved.

Example 2: Monitoring shape

We are building a linear model

class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

and we want to fit y = x1 + 2 * x2 + 3, so we create a dataset:

x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])

We train our model on this dataset using SGD optimizer:

model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()

But unfortunately, the loss does not go down to a low enough number.

What's wrong? Let's snoop it! Putting the training loop inside snoop:

with torchsnooper.snoop():
    for _ in range(100):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()

Part of the trace looks like:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:27:01.024233 line        21     for _ in range(100):
New var:....... _ = 0
22:27:01.024439 line        22         optimizer.zero_grad()
22:27:01.024574 line        23         pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
22:27:01.026442 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
22:27:01.027369 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:27:01.027616 line        26         print(loss.item())
22:27:01.027793 line        27         loss.backward()
22:27:01.050189 line        28         optimizer.step()

We notice that, y has shape (4,), but pred has shape (4, 1). As a result, squared_diff has shape (4, 4) due to broadcasting!

This is not the expected behavior, let's fix it: pred = model(x).squeeze(), now everything looks good:

New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model(  (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0    dampening: 0    lr: 0....omentum: 0    nesterov: False    weight_decay: 0)
22:28:19.778089 line        21     for _ in range(100):
New var:....... _ = 0
22:28:19.778293 line        22         optimizer.zero_grad()
22:28:19.778436 line        23         pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
22:28:19.780250 line        24         squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
22:28:19.781099 line        25         loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:28:19.781361 line        26         print(loss.item())
22:28:19.781537 line        27         loss.backward()
22:28:19.798983 line        28         optimizer.step()

And the final model converge to the desired values.

Example 3: Monitoring nan and inf

Let's say we have a model that output the likelihood of something. For this example, we will just use a mock:

class MockModel(torch.nn.Module):

    def __init__(self):
        super(MockModel, self).__init__()
        self.unused = torch.nn.Linear(6, 4)

    def forward(self, x):
        return torch.tensor([0.0, 0.25, 0.9, 0.75]) + self.unused(x) * 0.0

model = MockModel()

During training, we want to minimize the negative log likelihood, we have code:

for epoch in range(100):
    batch_input = torch.randn(6, 6)
    likelihood = model(batch_input)
    log_likelihood = likelihood.log()
    target = -log_likelihood.mean()
    print(target.item())

    optimizer.zero_grad()
    target.backward()
    optimizer.step()

Unfortunately, we first get inf then nan for our target during training. What's wrong? Let's snoop it:

with torchsnooper.snoop():
    for epoch in range(100):
        batch_input = torch.randn(6, 6)
        likelihood = model(batch_input)
        log_likelihood = likelihood.log()
        target = -log_likelihood.mean()
        print(target.item())

        optimizer.zero_grad()
        target.backward()
        optimizer.step()

We will see the part of the output of the snoop looks like:

19:30:20.928316 line        18     for epoch in range(100):
New var:....... epoch = 0
19:30:20.928575 line        19         batch_input = torch.randn(6, 6)
New var:....... batch_input = tensor<(6, 6), float32, cpu>
19:30:20.929671 line        20         likelihood = model(batch_input)
New var:....... likelihood = tensor<(6, 4), float32, cpu, grad>
19:30:20.930284 line        21         log_likelihood = likelihood.log()
New var:....... log_likelihood = tensor<(6, 4), float32, cpu, grad, has_inf>
19:30:20.930672 line        22         target = -log_likelihood.mean()
New var:....... target = tensor<(), float32, cpu, grad, has_inf>
19:30:20.931136 line        23         print(target.item())
19:30:20.931508 line        25         optimizer.zero_grad()
19:30:20.931871 line        26         target.backward()
inf
19:30:20.960028 line        27         optimizer.step()
19:30:20.960673 line        18     for epoch in range(100):
Modified var:.. epoch = 1
19:30:20.961043 line        19         batch_input = torch.randn(6, 6)
19:30:20.961423 line        20         likelihood = model(batch_input)
Modified var:.. likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.961910 line        21         log_likelihood = likelihood.log()
Modified var:.. log_likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.962302 line        22         target = -log_likelihood.mean()
Modified var:.. target = tensor<(), float32, cpu, grad, has_nan>
19:30:20.962715 line        23         print(target.item())
19:30:20.963089 line        25         optimizer.zero_grad()
19:30:20.963464 line        26         target.backward()
19:30:20.964051 line        27         optimizer.step()

Reading the output, we find that, at the first epoch (epoch = 0), the log_likelihood has has_inf flag. The has_inf flag means, your tensor contains inf in its value. The same flag appears for target. And at the second epoch, starting from likelihood, tensors all have a has_nan flag.

From our experience with deep learning, we would guess this is because the first epoch has inf, which causes the gradient to be nan, and when parameters are updated, these nan propagate to parameters and causing all future steps to have nan result.

Taking a deeper look, we figure out that the likelihood contains 0 in it, which leads to log(0) = -inf. Changing the line

log_likelihood = likelihood.log()

into

log_likelihood = likelihood.clamp(min=1e-8).log()

Problem solved.