aws-deepracer-community/deepracer-core

Can't get running on GCP nvidia since this commit https://github.com/crr0004/deepracer/commit/9ba1fa2fc54336b2a9a99b32df7183933e275bfc

Closed this issue · 5 comments

I'm trying to get this working with the GCP build script I wrote, covered in this gist: https://gist.github.com/fmacrae/c5bfe2e295bf2c3eec638de61e88fd9b but I can't get SageMaker to launch with
cd ~/deepracer/rl_coach; source ./env.sh; source ~/sagemaker_venv/bin/activate; python rl_deepracer_coach_robomaker.py

as normal. I also tried pip installing tensorflow-gpu into the SageMaker virtual env and confirmed it imports fine in Python there, so it seems to be an issue within the container. Any ideas where I should look next? My older script worked great, so much so that I destroyed all my VMs when done training :S
Error running SageMaker when trying to launch image crr0004/sagemaker-rl-tensorflow:nvidia_v1.1:
algo-1-erwt2_1 | The VNC desktop is: 4283fab7a75d:5800
algo-1-erwt2_1 | 08/10/2019 11:30:10 possible alias: 4283fab7a75d::5800
algo-1-erwt2_1 | PORT=5800
algo-1-erwt2_1 | Reporting training FAILURE
algo-1-erwt2_1 | framework error:
algo-1-erwt2_1 | Traceback (most recent call last):
algo-1-erwt2_1 |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 60, in train
algo-1-erwt2_1 |     framework = importlib.import_module(framework_name)
algo-1-erwt2_1 |   File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-erwt2_1 |     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
algo-1-erwt2_1 |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
algo-1-erwt2_1 |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/training.py", line 24, in <module>
algo-1-erwt2_1 |     import tensorflow as tf
algo-1-erwt2_1 | ModuleNotFoundError: No module named 'tensorflow'
algo-1-erwt2_1 |
algo-1-erwt2_1 | No module named 'tensorflow'
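
The traceback paths (/usr/local/lib/python3.6/dist-packages/...) belong to the interpreter inside the container, so installing tensorflow-gpu into the host virtual env would not be expected to change this result. One way to narrow it down is to ask the container's own Python whether it can locate the module. The following is only a minimal sketch of such a check: the helper name `module_available` is hypothetical and is not part of the deepracer or SageMaker code.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the current interpreter can locate module `name`.

    Hypothetical helper for illustration; it only looks the module up
    on the import path and does not import it.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is missing.
        return False


if __name__ == "__main__":
    # Show which interpreter is answering, since the host venv and the
    # container each have their own site-packages/dist-packages.
    print("interpreter:", sys.executable)
    status = "found" if module_available("tensorflow") else "MISSING"
    print("tensorflow:", status)
```

Running the same file once in the host sagemaker_venv and once with the Python 3.6 interpreter inside the nvidia_v1.1 container should show whether only the image is missing the package.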

Created a new script to use the non-v1.1 images for the time being. Just had to swap to the China track: https://raw.githubusercontent.com/fmacrae/AI-Learning/master/GCPDeepracerSetup_China.sh

Which v1.1 image were you using?

Nvidia
image: crr0004/sagemaker-rl-tensorflow:nvidia_v1.1

I can confirm that. It is not a GCP-specific issue; the same happens on AWS. Will revert to the older version.

Regards.

We've moved on to using https://github.com/aws-deepracer-community/deepracer-for-cloud for running DeepRacer in a local environment and on GCP.

Closing