DistConv connections to LBANN are too fragile.
benson31 commented
Code like the following can cause problems:
# given layers 'input' and 'x_true' of suitable shapes/types/etc
...
x = lbann.Convolution(input, ..., parallel_strategy=<not None>)
y = lbann.L2Norm2(x)
z = lbann.Subtract(x, x_true)
...
It seems that the split layer introduced by LBANN's runtime between `x` and its `y` and `z` children doesn't gracefully handle the fact that `x`'s tensors are actually managed by DistConv. I was seeing error messages like:
layer "conv_norm" expected an input tensor stored in a 4096 x 1 matrix from layer "convolution_layer_split", but got a 0 x 0 matrix
To fix this, I replaced `x` with:
x = lbann.Identity(lbann.Convolution(input, ..., parallel_strategy=<not None>), parallel_strategy=None)
(where the `parallel_strategy=None` is just to make it very explicit that I do NOT want this layer to be DistConv-managed). This seems to have worked.
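If this pattern shows up in more than one place, the workaround could be wrapped in a small helper. This is only a sketch: `isolate_from_distconv` is a hypothetical name, not part of LBANN, and the LBANN Python frontend module is passed in as an argument so the helper itself has no import-time dependency on it. The only LBANN call it relies on is `lbann.Identity(..., parallel_strategy=None)`, exactly as used in the workaround.

```python
def isolate_from_distconv(lbann, layer):
    """Wrap `layer` in an Identity layer that is explicitly not DistConv-managed.

    Hypothetical helper, not part of LBANN. `lbann` is the imported LBANN
    Python frontend module; `layer` is the DistConv-enabled layer whose
    output will be consumed by more than one child.
    """
    # parallel_strategy=None is the explicit opt-out from the workaround:
    # the Identity layer (and the split layer that follows it) should then
    # see ordinary LBANN-managed tensors.
    return lbann.Identity(layer, parallel_strategy=None)
```

With this in place, the DistConv-enabled convolution would be passed through `isolate_from_distconv(lbann, ...)` before being handed to both `lbann.L2Norm2` and `lbann.Subtract`. I have not verified that this covers every case where a split layer follows a DistConv layer.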