ELEKTRONN/elektronn3

"Invalid" targets with out-of-bounds elements

Closed this issue · 2 comments

mdraw commented

In some of the batches that are created by PatchCreator, the target tensor contains elements that are not inside the expected value range (which is given by the number of unique classes that exist in the data set).

Quoting a comment from a previous commit message (46d0b2b):

I found that the values of the maximum elements of the
invalid targets are usually quite similar. Here are the last few examples
from the warning message at cnndata:145, collected from a few different
training runs at random steps:

65072
39121
65535
63480
65535 # found directly after the previous value
65509
64205

All of those are below 65536, which is 2**16.
Most are only slightly smaller than 65536.
(Why 2**16? Everything should be 32 bit (float) or 64 bit (int)...)

Such invalid targets are automatically detected by PatchCreator and their batches are discarded as a workaround for this problem, but that's certainly not a good way of dealing with it in the long term. We need to find out what's causing this bug.
We may find the root of the problem somewhere around this code block:

if target_src is not None:

or in the numba-jitted functions that are called from there.
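
For illustration, the kind of range check that the workaround above relies on could look roughly like the sketch below. This is not the actual PatchCreator code: the function name `find_invalid_targets` and the `num_classes` parameter are hypothetical, and it assumes that valid class labels are the integers from 0 to `num_classes - 1`.

```python
import numpy as np


def find_invalid_targets(target, num_classes):
    """Return the unique out-of-range values in `target` (empty if all valid).

    Hypothetical helper, assuming labels are integers in [0, num_classes).
    """
    target = np.asarray(target)
    invalid_mask = (target < 0) | (target >= num_classes)
    return np.unique(target[invalid_mask])


# A 2-class target with one corrupted element, similar to the values listed above.
target = np.array([[0, 1, 1], [0, 65535, 1]], dtype=np.int64)
bad = find_invalid_targets(target, num_classes=2)
if bad.size > 0:
    print(f"Discarding batch, invalid target values: {bad.tolist()}")
```

Reporting the offending values instead of only a boolean makes it easier to spot patterns such as the "just below 2**16" clustering described above.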

mdraw commented

Accidentally closed this via a commit message...

mdraw commented

I can now finally reproduce this bug (which wasn't that easy because it happened only once every ~200,000 iterations) and found out what's causing it: The issue happens in the numba-jitted generalized ufunc code at

@numba.guvectorize(['void(float32[:,:,:], float32[:], float32[:], float32[:,],)'],
                   '(x,y,z),(i),(i)->()', nopython=True)  #target='parallel',
def map_coordinates_nearest(src, coords, lo, dest):
    u = np.int32(np.round(coords[0] - lo[0]))
    v = np.int32(np.round(coords[1] - lo[1]))
    w = np.int32(np.round(coords[2] - lo[2]))
    dest[0] = src[u,v,w]

The reported garbage values appear if u, v and/or w point to indices that are outside the bounds of src, so the final line of the function, dest[0] = src[u,v,w], reads from unallocated memory.
I didn't think that was possible because I expected the process to simply segfault, but it turns out that segfaults happen only sometimes in this case; in most cases the dest array is silently filled with garbage values. Segfaults seem to happen more often the further the out-of-bounds access is from the memory that was actually allocated.
It's hard to debug this because we can't set breakpoints, perform shape checks or raise errors in jitted generalized ufuncs, so I'm not yet sure why exactly map_coordinates_nearest() is sometimes called in a way that causes these problems, but I'm working on finding out.
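
Since checks can't be placed inside the jitted function, one possible debugging aid is a plain NumPy version of the same lookup that validates the indices before reading. The sketch below is only an assumption about how such a helper could look; the name `map_coordinates_nearest_checked` is hypothetical and it handles a single coordinate triple rather than the broadcasting a generalized ufunc performs.

```python
import numpy as np


def map_coordinates_nearest_checked(src, coords, lo):
    """Nearest-neighbour lookup in a 3D array with explicit bounds checks.

    Hypothetical, non-jitted debugging counterpart of map_coordinates_nearest().
    """
    # Same rounding as in the jitted function, one index per axis.
    idx = tuple(int(np.round(coords[i] - lo[i])) for i in range(3))
    for axis, (i, size) in enumerate(zip(idx, src.shape)):
        # Negative indices are rejected too: NumPy would silently wrap them around.
        if not 0 <= i < size:
            raise IndexError(
                f"index {i} is out of bounds for axis {axis} with size {size} "
                f"(computed indices: {idx})"
            )
    return src[idx]


src = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
lo = np.zeros(3, dtype=np.float32)

inside = np.array([1.2, 1.9, 3.4], dtype=np.float32)   # rounds to (1, 2, 3)
value = map_coordinates_nearest_checked(src, inside, lo)

outside = np.array([1.2, 1.9, 3.6], dtype=np.float32)  # rounds to (1, 2, 4)
try:
    map_coordinates_nearest_checked(src, outside, lo)
except IndexError as exc:
    print(f"Caught: {exc}")
```

Running the same inputs through a checked version like this turns a silent garbage read into an error that names the axis and the computed indices, which could help to narrow down which coordinates end up outside src.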