tensorflow/tensorrt

UnavailableError: Can't provision more than one single cluster at a time

leo-XUKANG opened this issue · 10 comments

my code:

FP32_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"_TFTRT_FP32/1"
!rm -rf $FP32_SAVED_MODEL_DIR
#Now we create the TFTRT FP32 engine
trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=1,
    input_saved_model_dir=SAVED_MODEL_DIR,
    output_saved_model_dir=FP32_SAVED_MODEL_DIR,
    precision_mode="FP32")

benchmark_saved_model(FP32_SAVED_MODEL_DIR, BATCH_SIZE=1)

and i have set:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

When I run it, I get an error:
InvalidArgumentError: Failed to import metagraph, check error log for more info

Then I added this line:
tf.keras.backend.set_learning_phase(0)
That error is gone, but another one is raised:
UnavailableError: Can't provision more than one single cluster at a time

Hmm... I'm only using one GPU, an RTX 2080 Ti.

CUDA: Cuda compilation tools, release 10.0, V10.0.130

Can someone please help me?

@leo-XUKANG for the message InvalidArgumentError: Failed to import metagraph, check error log for more info could you share the error log?

I'm facing the same issue; sample code is here: https://gist.github.com/zyenge/2595f3369e7e6128dcc79b1a30c3e3cd
I tried both a frozen model and a SavedModel; neither works.

@pooyadavoodi have you encountered similar issue before?
Also @bixia1

Hey guys, is there any fix for this, please?

@sanjoy @bixia1 could you help to investigate this?

I think the issue was the GPU memory fraction I allocated.
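If the problem is indeed that the process grabs the whole GPU up front, capping the memory fraction is one thing to try. This is only a sketch (the 0.5 fraction and the use of `allow_growth` are example choices, not from the thread):

```python
import tensorflow as tf

# Cap how much GPU memory this process provisions up front, so a
# half-finished conversion attempt does not hold the whole device.
# The 0.5 fraction below is an arbitrary example value.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
config.gpu_options.allow_growth = True  # allocate lazily instead of all at once

sess = tf.compat.v1.Session(config=config)
```

Sessions created afterwards with this `config` should then coexist more gracefully on a single GPU like the RTX 2080 Ti mentioned above.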

Any update on this?

My issue was fixed by correcting the output node names. I had mistakenly used the output tensor names of another graph. I'd double-check, and see whether you still have issues when setting outputs to something other than None.

For: Can't provision more than one single cluster at a time

I believe this happens when the graph is loaded but the conversion did not complete successfully, so when you rerun the cell in Jupyter the GPU memory has not been released. Check the graph again to verify that the outputs are correct, and restart the Jupyter kernel every time a conversion fails.

For: Failed to import metagraph, check error log for more info
If you use a Jupyter notebook, check the output printed in the terminal console; there will be a hint about which node name you typed incorrectly. I suggest inspecting the whole graph in TensorBoard to get the correct names for the outputs.
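As an alternative to TensorBoard, you can list node names directly from the GraphDef. Below is a sketch with a toy graph (the names "input" and "logits" are made up for illustration); with a real model you would load the GraphDef from your frozen .pb or SavedModel instead:

```python
import tensorflow as tf

# Toy graph standing in for the real model.
graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4], name="input")
    logits = tf.matmul(x, tf.zeros([4, 2]), name="logits")

# Print every node name; the final op(s) are usually the output
# names you should pass to create_inference_graph.
for node in graph.as_graph_def().node:
    print(node.name)
```

Comparing this listing against the names you passed as outputs quickly shows whether they belong to this graph at all.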