LLNL/lbann

DistConv connections to LBANN are too fragile.

Code like the following can cause problems:

# given layers 'input' and 'x_true' of suitable shapes/types/etc
...
x = lbann.Convolution(input, ..., parallel_strategy=<not None>)
y = lbann.L2Norm2(x)
z = lbann.Subtract(x, x_true)
...

It seems that the split layer that LBANN's runtime inserts between x and its children, y and z, doesn't gracefully handle the fact that x's tensors are actually managed by DistConv. I was seeing error messages like:

layer "conv_norm" expected an input tensor stored in a 4096 x 1 matrix from layer "convolution_layer_split", but got a 0 x 0 matrix

To work around this, I replaced the definition of x with:

x = lbann.Identity(lbann.Convolution(input, ..., parallel_strategy=<not None>), parallel_strategy=None)

(The parallel_strategy=None is there only to make it explicit that I do NOT want this Identity layer to be DistConv-managed.) This seems to have worked.
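
If this pattern is needed in several places, the workaround could be wrapped in a small helper. The following is only a sketch: `distconv_boundary` is a hypothetical name, not part of LBANN, and it assumes nothing beyond what the snippet above shows, namely that `lbann.Identity` takes a parent layer and a `parallel_strategy` keyword. The Identity constructor is passed in as an argument so the helper itself does not need to import LBANN.

```python
def distconv_boundary(layer, identity_cls):
    """Wrap `layer` in an Identity layer that is explicitly not DistConv-managed.

    Hypothetical helper, not part of LBANN. `identity_cls` is expected to be
    `lbann.Identity`; it is a parameter so that this function has no
    import-time dependency on LBANN.
    """
    # parallel_strategy=None makes it explicit that the wrapper layer opts out
    # of DistConv, so the children attach to an ordinary LBANN layer instead
    # of directly to the DistConv-managed one.
    return identity_cls(layer, parallel_strategy=None)
```

With this, the replacement above would read `x = distconv_boundary(lbann.Convolution(input, ..., parallel_strategy=<not None>), lbann.Identity)`. I have only confirmed that the inline form works in my case; the helper is just a convenience around it.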