Error reproducing competition results
ndkulkarni opened this issue · 2 comments
I am trying to reproduce the competition results based on the instructions in the README.
- I download and unzip the files from the Kaggle competition into the data/ folder.
- I run the command
  python make_features.py data/vars --add_days=63
  which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl, and the directory vars/ in the data/ folder.
- I run the trainer
  python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
  and receive the following error:
  UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
I am using a p3.2xlarge AWS instance with the Deep Learning AMI, Python 3.6.5, and tensorflow-gpu==1.12.0.
If I downgrade to tensorflow-gpu 1.10, I still get the same error.
How can I resolve this?
Full output from train command
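Before digging into the cuDNN failure itself, it may help to rule out an incomplete feature step. The following is a minimal sketch, not part of the repository, that checks for the outputs listed in the second step. The helper name `missing_feature_outputs` is hypothetical, and the file names are taken only from the description above, so adjust them if your run produces something different.

```python
from pathlib import Path

# Outputs the feature step is reported to create inside data/
# (names copied from the issue text; treat them as assumptions).
EXPECTED_FILES = ["2017-08-15_2017-09-11.pkl", "all.pkl", "train_2.pkl"]
EXPECTED_DIRS = ["vars"]


def missing_feature_outputs(data_dir="data"):
    """Return the expected feature outputs that are absent from data_dir."""
    root = Path(data_dir)
    # Regular files first, in the order they were listed.
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Directories are reported with a trailing slash to tell them apart.
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = missing_feature_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected feature outputs are present.")
```

If the script reports nothing missing, the feature step most likely completed and the problem lies in the GPU setup rather than in the data.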
I have the same problem. Did you figure it out?
Simply restarting with a new instance will work.
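The thread does not establish why a fresh instance helps. One possible explanation, which is only an assumption here, is leftover GPU state such as memory still held by an earlier process. A rough sketch for inspecting GPU memory before launching the trainer is below. It relies on `nvidia-smi` being on the PATH, and the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' rows (MiB, no units) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = [field.strip() for field in line.split(",")]
        gpus.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (requires the NVIDIA driver tools)."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,  # text mode that also works on Python 3.6
    )
    return parse_gpu_memory(output)
```

Calling `query_gpu_memory()` on the instance returns one dictionary per GPU. If a large share of memory is already in use before training starts, another process may be holding the device, which would be one reason a clean instance behaves differently.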