simonmeister/pysc2-rl-agents

Use NHWC as default

Closed this issue · 4 comments

brean commented

At least on Windows starting run.py fails with:

2018-02-12 12:30:09.600075: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\executor.cc:651] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.

So why not use NHWC as the default? When run with the --nhwc flag, it seems to work fine.

NCHW is generally much more efficient when training on GPU; see https://www.tensorflow.org/performance/performance_guide#data_formats. I can't comment on the Windows issue, as we don't have a Windows machine for training, but we would rather keep NCHW as the standard format, since GPU is what most people will want to use. Which TensorFlow version are you using?
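For readers unfamiliar with the two layouts, here is a minimal sketch in plain Python of how the axis order differs. The helper `permute_shape` and the example batch shape are hypothetical and not taken from pysc2-rl-agents; the two permutation tuples are the axis orders one would typically pass as `perm` to a transpose when converting between the formats.

```python
# Axis permutations between the two layouts.
# N = batch, H = height, W = width, C = channels.
NHWC_TO_NCHW = (0, 3, 1, 2)
NCHW_TO_NHWC = (0, 2, 3, 1)


def permute_shape(shape, perm):
    """Return the shape obtained by reordering axes according to perm."""
    return tuple(shape[axis] for axis in perm)


# Hypothetical batch: 32 screens of 64x64 with 17 feature channels.
nhwc_shape = (32, 64, 64, 17)
nchw_shape = permute_shape(nhwc_shape, NHWC_TO_NCHW)
round_trip = permute_shape(nchw_shape, NCHW_TO_NHWC)
```

Applying one permutation and then the other restores the original shape, which is why a model can be switched between layouts by wrapping its inputs and outputs in transposes.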

brean commented

I am using TensorFlow 1.5.0, installed from PyPI. I guess I should compile it myself, because the log says:
"2018-02-12 15:39:02.576047: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2"

BTW, the documentation says NHWC is the TensorFlow default. But I agree with you: learning takes a long time, and saving time is always good. Keeping NCHW as the default should be fine.

@brean are you training on CPU or GPU when using NCHW?

If you are running on CPU, that might be the reason for your error, since when trying to run on CPU on a MacBook Pro I get:

UnimplementedError (see above for traceback): 
Generic conv implementation only supports NHWC tensor format for now.
[[Node: Conv_13/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", 
dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](transpose_26, Conv_13/weights/read)]]

To make it run on CPU I need to use --nhwc. On our workstation with a GPU we can use NCHW without this exception. So I assume that NCHW might not be implemented for the Conv2D op on CPU.

Since training these agents on CPU does not make sense, we are keeping NCHW as the default for now.
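The trade-off discussed above could be sketched as a small selection helper. This is an illustration only, not the repository's actual logic: `pick_data_format`, `gpu_available`, and `force_nhwc` are hypothetical names, with `force_nhwc` standing in for a command-line override such as --nhwc. In TensorFlow 1.x the `gpu_available` value could plausibly come from `tf.test.is_gpu_available()`, which is an assumption about how one might wire it up.

```python
def pick_data_format(gpu_available, force_nhwc=False):
    """Choose a convolution data format.

    NCHW is preferred on GPU for speed; NHWC is used on CPU, where the
    errors quoted above show NCHW convolutions are not supported, or
    whenever the caller explicitly asks for it.
    """
    if force_nhwc or not gpu_available:
        return "NHWC"
    return "NCHW"


# Hypothetical situations corresponding to the reports in this thread.
gpu_default = pick_data_format(gpu_available=True)
cpu_fallback = pick_data_format(gpu_available=False)
manual_override = pick_data_format(gpu_available=True, force_nhwc=True)
```

A device check alone would not have covered the Windows GPU report that opened this issue, so an explicit override remains necessary in any such scheme.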

brean commented

I run it on GPU (TensorFlow reports that it found and uses my GPU). As I figured out, TensorFlow is not compiled to use AVX2. But because that is not recommended and I don't want to spend too much time on it, I will just stick with NHWC for now.