Shouldn't this be total number of threads (i.e. batch size) rather than threads per block?

Line 597 in 038d1ca

    
           cudaMalloc(&d_nodeErrors, sizeof(float) * numLayers * maxLayerSize * numBlocks * threadsPerBlock);

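It is already the total number of threads: the allocation multiplies by numBlocks * threadsPerBlock, which is the total thread count, not just the threads per block.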
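To make the arithmetic concrete, here is a minimal host-side sketch of how that size breaks down. The helper names `nodeErrorCount` and `threadSliceOffset` are hypothetical, and the per-thread contiguous layout is an assumption for illustration, not necessarily how the kernel in this repo indexes `d_nodeErrors`.

```cpp
#include <cstddef>

// Hypothetical helper: number of floats in the d_nodeErrors buffer.
// numBlocks * threadsPerBlock is the total thread count across the grid,
// so every launched thread gets numLayers * maxLayerSize floats.
std::size_t nodeErrorCount(std::size_t numLayers, std::size_t maxLayerSize,
                           std::size_t numBlocks, std::size_t threadsPerBlock) {
    std::size_t totalThreads = numBlocks * threadsPerBlock;
    return numLayers * maxLayerSize * totalThreads;
}

// Assumed layout (illustrative only): each thread owns one contiguous slice,
// located by its global index blockIdx * threadsPerBlock + threadIdx.
std::size_t threadSliceOffset(std::size_t numLayers, std::size_t maxLayerSize,
                              std::size_t threadsPerBlock,
                              std::size_t blockIdx, std::size_t threadIdx) {
    std::size_t globalThread = blockIdx * threadsPerBlock + threadIdx;
    return globalThread * numLayers * maxLayerSize;
}
```

With 4 blocks of 32 threads, 3 layers, and a max layer size of 8, the buffer holds 3 * 8 * 128 floats, and under this assumed layout the last thread's slice ends exactly at the end of the buffer. If the batch size is meant to equal the total thread count, as the question suggests, the existing allocation already covers it.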