microsoft/knossos-ksc

Bug: Segmentation fault in sqrl_pytorch-PyTorch CUDA

awf opened this issue · 2 comments

awf commented

Just saw this while working on something else. I haven't done much to debug it, but note that the crash is in copydown, on a fairly innocuous operation (aten::sum(Tensor 2) -> Float), so it might have something to do with KS_ALLOCATOR not being defined?
Or it could just be an out-of-memory condition that isn't being caught?

awf commented

The easiest way to reproduce the situation above is to edit launch.json to include

		{
			"name": "(gdb) pytest",
			"type": "cppdbg",
			"request": "launch",
			"program": "/anaconda/envs/knossos/bin/python",
			"args": [
				"-m",
				"pytest",
				"src/bench/",
				"-v",
				"--modulepath=examples/dl-capsule/sqrl",
				"--benchmarkname=sqrl",
			],
			"stopAtEntry": false,
			"cwd": "${workspaceFolder}",
			"environment": [
				{"name":"PYTHONPATH", "value":"./src/python"}
			],
			"externalConsole": false,
			"MIMode": "gdb",
			"setupCommands": [
				{
					"description": "Enable pretty-printing for gdb",
					"text": "-enable-pretty-printing",
					"ignoreFailures": true
				}
			]
		},

Then run "Debug: Select and Start Debugging" in VS Code and pick "(gdb) pytest".

dcrc2 commented

The problem is that we have

@knossos.register
def sqrl(x: torch.Tensor):
    ...


def sqrl_pytorch(x: torch.Tensor):
    return sqrl(x)

which means that sqrl_pytorch isn't actually a PyTorch implementation at all: it calls the Knossos implementation. I think this was accidentally broken by the addition of the knossos.register decorator in #960. We'll need to rewrite sqrl_pytorch so that it's a genuine PyTorch implementation.
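To illustrate why the wrapper ends up on the Knossos path, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `fake_register` is a hypothetical stand-in for `knossos.register` (it only counts calls), the function body is a placeholder rather than the real `sqrl`, and lists of floats stand in for tensors.

```python
import math


def fake_register(fn):
    """Hypothetical stand-in for knossos.register: wraps fn in a stub
    that counts how often the 'compiled' entry point is used."""
    def stub(*args):
        stub.calls += 1
        return fn(*args)
    stub.calls = 0
    return stub


@fake_register
def sqrl(xs):
    # Placeholder body, NOT the real sqrl.
    return sum(math.sin(x) * x for x in xs) / len(xs)


def sqrl_pytorch_broken(xs):
    # The name `sqrl` is now bound to the stub, so this call goes
    # through the registered implementation, not a reference one.
    return sqrl(xs)


def sqrl_pytorch_fixed(xs):
    # A genuine reference implementation repeats the body and never
    # touches the decorated name.
    return sum(math.sin(x) * x for x in xs) / len(xs)


data = [0.5, 1.0, 1.5]
broken = sqrl_pytorch_broken(data)
calls_after_broken = sqrl.calls   # the stub was used
fixed = sqrl_pytorch_fixed(data)
calls_after_fixed = sqrl.calls    # unchanged: the stub was bypassed
```

Both versions return the same value here, which is presumably why the problem went unnoticed until the inputs were placed on the GPU.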

Before #976 was merged this morning, functions defined using @knossos.register were compiled for CPU only; but the "PyTorch CUDA" benchmark puts the input tensors on the GPU. The segmentation fault occurs when the CPU-compiled code tries to read this GPU-resident data.

Now that #976 is merged, the KscStub detects that the input is on the GPU and tries to compile for the GPU, but this raises an error ("Only elementwise operations can be compiled for GPU"), which I think is the correct behaviour. There is no "Knossos CUDA" benchmark for sqrl, because the "Knossos CUDA" benchmark is only enabled for elementwise operations.
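The dispatch behaviour described above can be sketched roughly as follows. This is not the real KscStub: `FakeTensor`, `StubSketch`, the `elementwise` flag and the choice of `NotImplementedError` are all hypothetical names picked for illustration. Only the error message is taken from the comment above.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only records where the data lives."""
    device: str  # "cpu" or "cuda"


class StubSketch:
    """Illustrative only: picks a backend from the input's device."""

    def __init__(self, name, elementwise):
        self.name = name
        self.elementwise = elementwise

    def __call__(self, x):
        if x.device == "cuda" and not self.elementwise:
            # Fail loudly instead of running CPU code on GPU data.
            raise NotImplementedError(
                "Only elementwise operations can be compiled for GPU"
            )
        # Otherwise just report which backend would be used.
        return f"{self.name} compiled for {x.device}"


sqrl_stub = StubSketch("sqrl", elementwise=False)
cpu_result = sqrl_stub(FakeTensor("cpu"))

try:
    sqrl_stub(FakeTensor("cuda"))
    gpu_error = None
except NotImplementedError as err:
    gpu_error = str(err)
```

The point of the sketch is the ordering: the device check happens before any data is read, so a GPU input to a non-elementwise function produces a Python exception rather than a segmentation fault.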