hidet-org/hidet

[Bug] Outputs of torch.abs mismatch abnormally on GPU and CPU when applying the commutative law of multiplication

Azyka opened this issue · 15 comments

Describe the bug
The commutative law of multiplication should always hold for torch.mul, so a*b*c == a*c*b.
However, this law fails when the multiplications feeding torch.abs are reordered this way, producing a mismatch in the outputs of torch.abs.
The mismatch is seen on both CUDA and CPU.

To Reproduce
Repro script:

import numpy as np
from numpy import testing
import torch

DEVICE='cuda'

p0 = torch.tensor(4, device=DEVICE, dtype=torch.int8)
p1 = torch.tensor(6, device=DEVICE, dtype=torch.int8)

class Model0(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.v3_0 = p0
        self.v5_0 = p1

    def forward(self, *args):
        v3_0 = self.v3_0
        v5_0 = self.v5_0
        mul = torch.mul(v5_0, v3_0)
        mul_1 = torch.mul(args[0], mul)
        abs_1 = torch.abs(mul_1)
        mul_2 = torch.mul(mul_1, mul)
        return (abs_1, mul_2)

model_0 = Model0()
output_names_0 = ['v5_0', 'v4_0']

class Model1(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.v3_0 = p0
        self.v5_0 = p1

    def forward(self, *args):
        v3_0 = self.v3_0
        v5_0 = self.v5_0
        mul = torch.mul(v3_0, args[0])
        mul_1 = torch.mul(v5_0, mul)
        abs_1 = torch.abs(mul_1)
        return (abs_1, )

model_1 = Model1()
output_names_1 = ['v5_0',]

data = np.random.normal(10, 0.1, size=(53, 33)).astype(np.int8)
input_data_0 = [data]

optmodel_0 = torch.compile(model_0, fullgraph=True, backend='hidet', mode=None)
model_out_0 = optmodel_0(*[torch.from_numpy(v).to(DEVICE) for v in input_data_0])
model_out_0 = [v.to(DEVICE).detach() for v in model_out_0] if isinstance(model_out_0, tuple) else [model_out_0.to(DEVICE).detach()]
model_out_0 = [v.cpu().resolve_conj().numpy() if v.is_conj() else v.cpu().numpy() for v in model_out_0]
output_0 = dict(zip(output_names_0, model_out_0))

input_data_1 = [data]

optmodel_1 = torch.compile(model_1, fullgraph=True, backend='hidet', mode=None)
model_out_1 = optmodel_1(*[torch.from_numpy(v).to(DEVICE) for v in input_data_1])
model_out_1 = [v.to(DEVICE).detach() for v in model_out_1] if isinstance(model_out_1, tuple) else [model_out_1.to(DEVICE).detach()]
model_out_1 = [v.cpu().resolve_conj().numpy() if v.is_conj() else v.cpu().numpy() for v in model_out_1]
output_1 = dict(zip(output_names_1, model_out_1))
output_name_dict = {'v5_0': 'v5_0'}

print('=========================')
try:
    for tensor_name_0, tensor_name_1 in output_name_dict.items():
        testing.assert_allclose(output_0[tensor_name_0], output_1[tensor_name_1], rtol=1, err_msg=f'at {tensor_name_0}, {tensor_name_1}')
    print("hidet does not trigger assertion")
except AssertionError as e:
    print("hidet triggers assertion")
    print(e)
print('=========================')

model_out_0 = model_0(*[torch.from_numpy(v).to(DEVICE) for v in input_data_0])
model_out_0 = [v.to(DEVICE).detach() for v in model_out_0] if isinstance(model_out_0, tuple) else [model_out_0.to(DEVICE).detach()]
model_out_0 = [v.cpu().resolve_conj().numpy() if v.is_conj() else v.cpu().numpy() for v in model_out_0]
output_0 = dict(zip(output_names_0, model_out_0))

model_out_1 = model_1(*[torch.from_numpy(v).to(DEVICE) for v in input_data_1])
model_out_1 = [v.to(DEVICE).detach() for v in model_out_1] if isinstance(model_out_1, tuple) else [model_out_1.to(DEVICE).detach()]
model_out_1 = [v.cpu().resolve_conj().numpy() if v.is_conj() else v.cpu().numpy() for v in model_out_1]
output_1 = dict(zip(output_names_1, model_out_1))

print('=========================')
try:
    for tensor_name_0, tensor_name_1 in output_name_dict.items():
        testing.assert_allclose(output_0[tensor_name_0], output_1[tensor_name_1], rtol=1, err_msg=f'at {tensor_name_0}, {tensor_name_1}')
    print("torch_eager does not trigger assertion")
except AssertionError as e:
    print("torch_eager triggers assertion")
    print(e)
print('=========================')

Output:

=========================
hidet triggers assertion

Not equal to tolerance rtol=1, atol=0
at v5_0, v5_0
Mismatched elements: 1749 / 1749 (100%)
Max absolute difference: 80
Max relative difference: 2.
 x: array([[16, 40, 40, ..., 16, 16, 16],
       [16, 40, 16, ..., 40, 40, 40],
       [40, 16, 40, ..., 40, 16, 40],...
 y: array([[-16, -40, -40, ..., -16, -16, -16],
       [-16, -40, -16, ..., -40, -40, -40],
       [-40, -16, -40, ..., -40, -16, -40],...
=========================
=========================
torch_eager does not trigger assertion
=========================

Expected behavior
The output of torch.abs is expected to be the same for the same input.

Environment

  • OS: Ubuntu 22.04.3 LTS (x86_64)
  • GPU: RTX 1660
  • NVIDIA GPU Driver: 525.147.05
  • Hidet Version: 0.3.0
  • PyTorch Version: 2.1.0+cu118

Hi @Azyka,

Thanks for reporting these bugs (they seem to have been found by some awesome fuzzer)! We will work on them later (good onboarding exercises for interns).

Thanks for your attention, and we may submit more issues found by our fuzzer later. Hope this will not bother you :)

Hope this will not bother you

Not at all. Feel free to submit the issues, which will make our compiler more robust. But because these bugs are not from actual models, they will have low priority on our side, and we will work on them when we have more time and hands.

they will have low priority on our side and we will work on them when we have more time and hands.

Sure. And when you have the time to review the bugs, could you please inform us whether or not they can be reproduced? We appreciate your assistance.

Hi @yaoyaoding. Sorry for bothering you; I just hope to get some feedback from you. Did you try the issues? Are they reproducible?

Hi @Azyka, @Aalanli will help to reproduce the bugs soon.

Hi @Azyka,

I can reproduce all the bugs you have found.

I think that all of these are related to using intermediates as outputs in some way. For this one, when you remove the second output of Model0, the problem goes away.

I managed to reduce it to a minimal example for #381:

import hidet

a = hidet.symbol([1, 2], device='cuda')
b = a.mean(1)
# both outputs alias the same intermediate tensor `b`
g = hidet.trace_from([b, b], a)

cg = g.build()
# the second output is not populated correctly when outputs alias
print(cg(hidet.ones([1, 2], device='cuda'))[1])

Thanks @Azyka! This issue is more serious than I thought.

After another look at the CompileGraph _create_outputs() function, it seems that this function always assumes that outputs do not alias, so it creates two tensors and does not update the second one when they do alias.
@yaoyaoding, I can try to fix this tomorrow and see if that takes care of the bugs.
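A minimal sketch of the fix idea, assuming outputs can be deduplicated by node identity (hypothetical names; this is not hidet's actual _create_outputs implementation): when two graph outputs refer to the same node, reuse the tensor already created for it instead of allocating a second buffer that never gets written.

# Hypothetical sketch only; hidet's real code differs in details.
def create_outputs(output_nodes, allocate_for):
    created = {}   # graph node id -> tensor already allocated for it
    outputs = []
    for node in output_nodes:
        if id(node) in created:
            # Aliased output: reuse the existing tensor instead of
            # creating a second one that the kernel never updates.
            outputs.append(created[id(node)])
        else:
            tensor = allocate_for(node)
            created[id(node)] = tensor
            outputs.append(tensor)
    return outputs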

Sure, go ahead @Aalanli. Thanks for digging into these bugs.

All of the other issues pass with #384; the only remaining problem is this one, which I think is rather different.
The issue lies with some interesting behaviour in C++, namely:

#include <stdio.h>

// abs applied after the product has already been truncated back to char
char f1(char a) {
    return char(0) < a ? a : -a;
}

// abs applied while the product is still promoted to int
char f2(char a) {
    return char(0) < a * char(24) ? a * char(24) : -(a * char(24));
}

int main() {
    for (char i = 0; i < 10; i++) {
        printf("%d %d %d\n", i, f1(i * 24), f2(i));
    }
}

prints:

0 0 0
1 24 24
2 48 48
3 72 72
4 96 96
5 120 120
6 112 -112
7 88 -88
8 64 -64
9 40 -40

The two situations can happen due to operator fusion and produce different outputs due to overflow. I think a * char(24) is promoted to an int at the time of the comparison.
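
The same effect can be simulated in plain Python (an illustration only, not hidet code), assuming ordinary two's-complement int8 wraparound: truncating the product to int8 before taking the absolute value gives the eager result, while taking the absolute value of the promoted product and truncating afterwards matches the mismatched output above.

def wrap_int8(x):
    # wrap an integer into the int8 range [-128, 127]
    return (x + 128) % 256 - 128

a = 6
prod = a * 24                    # 144 in promoted (int) arithmetic

print(abs(wrap_int8(prod)))      # 112: truncate to int8 first, like f1(i * 24)
print(wrap_int8(abs(prod)))      # -112: abs on the promoted value, like f2(i)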

This is the script that reproduces the same behaviour in hidet:

import hidet
import numpy as np

hidet.option.cache_dir('fuzz1')
hidet.option.save_lower_ir()

a = hidet.symbol([53, 33], device='cuda', dtype=hidet.int8)
p0 = hidet.asarray(4, device='cuda', dtype=hidet.int8)
p1 = hidet.asarray(6, device='cuda', dtype=hidet.int8)

a1 = (a * (p0 * p1))
a2 = hidet.ops.abs((a * p0) * p1)

# g1 mirrors Model0 (abs plus the extra product output); g2 mirrors Model1 (abs only)
g1 = hidet.trace_from([hidet.ops.abs(a1), a1 * (p0 * p1)], [a])
g2 = hidet.trace_from([a2], [a])


g1 = hidet.graph.optimize(g1)
g2 = hidet.graph.optimize(g2)

g1c = g1.build()
g2c = g2.build()

data = np.random.normal(10, 0.1, size=(53, 33)).astype(np.int8)
data = hidet.asarray(data, device='cuda', dtype=hidet.int8)

y1, _ = g1c(data)
y2 = g2c(data)

print(y1)
print(y2)

To fix it, one can downcast to char again, e.g. wrap the expression as char(a * char(24)) before the comparison.

Thank you for fixing these bugs. It's good to receive such positive feedback from developers. I will close the fixed issues. Regarding this issue, the patch (pytorch/pytorch#113253) from PyTorch may help.

Hi @Aalanli, thanks for digging into the bugs.

Hi @Azyka, the bug should be fixed in #391. Let me know if it does not work and thanks for reporting!

This does fix the bug! Closing here, and thanks for your insights into this bug! @yaoyaoding @Aalanli