mhauskn/dqn-hfo

unix time error?

Closed this issue · 5 comments

Has anyone seen this error and/or know how to resolve it? After invoking the command:

mkdir state && ./bin/dqn -save state/test -alsologtostderr

And the game starts, I get,
*** Aborted at 1511282572 (unix time) try "date -d @1511282572" if you are using GNU date ***

I did not have this problem running CPU_ONLY. It's only after getting the CUDA and GPU online that I have the problem. The CAFFE runtest passes all tests.

See below.

I1121 16:42:52.013945 26478 dqn_main.cpp:355] [Agent0] Episode 0 reward = 0
EndOfTrial: 0 / 2 205 OUT_OF_TIME
I1121 16:42:52.031786 26478 dqn_main.cpp:355] [Agent0] Episode 1 reward = 0
EndOfTrial: 0 / 3 308 OUT_OF_TIME
I1121 16:42:52.047994 26478 dqn_main.cpp:355] [Agent0] Episode 2 reward = 0
EndOfTrial: 0 / 4 411 OUT_OF_TIME
I1121 16:42:52.064990 26478 dqn_main.cpp:355] [Agent0] Episode 3 reward = 0
EndOfTrial: 0 / 5 514 OUT_OF_TIME
I1121 16:42:52.081012 26478 dqn_main.cpp:355] [Agent0] Episode 4 reward = 0
EndOfTrial: 0 / 6 617 OUT_OF_TIME
I1121 16:42:52.097050 26478 dqn_main.cpp:355] [Agent0] Episode 5 reward = 0
EndOfTrial: 0 / 7 720 OUT_OF_TIME
I1121 16:42:52.113097 26478 dqn_main.cpp:355] [Agent0] Episode 6 reward = 0
EndOfTrial: 0 / 8 823 OUT_OF_TIME
I1121 16:42:52.131078 26478 dqn_main.cpp:355] [Agent0] Episode 7 reward = 0
EndOfTrial: 0 / 9 926 OUT_OF_TIME
I1121 16:42:52.148632 26478 dqn_main.cpp:355] [Agent0] Episode 8 reward = 0
EndOfTrial: 0 / 10 1029 OUT_OF_TIME
I1121 16:42:52.165283 26478 dqn_main.cpp:355] [Agent0] Episode 9 reward = 0
EndOfTrial: 0 / 11 1132 OUT_OF_TIME
I1121 16:42:52.181612 26478 dqn_main.cpp:355] [Agent0] Episode 10 reward = 0
*** Aborted at 1511282572 (unix time) try "date -d @1511282572" if you are using GNU date ***
PC: @     0x7fedcf9a1a5e (unknown)
*** SIGSEGV (@0xb05c40000) received by PID 26475 (TID 0x7fedacbbc700) from PID 96731136; stack trace: ***
    @     0x7fedcf8644b0 (unknown)
    @     0x7fedcf9a1a5e (unknown)
    @           0x440a2d dqn::ZeroGradParameters<>()
    @           0x433339 dqn::DQN::UpdateActorCritic()
    @           0x43b588 dqn::DQN::Update()
    @           0x458cc0 KeepPlayingGames()
    @           0x45ac61 std::thread::_Impl<>::_M_run()
    @     0x7fedd01d0c80 (unknown)
    @     0x7fedd13176ba start_thread
    @     0x7fedcf9363dd clone
    @                0x0 (unknown)
Segmentation fault (core dumped)

I don't know the source of this error, but you could try compiling with the debug flags enabled. This way the stack trace would be more detailed and tell you where the segfault actually happened.

Good thoughts... here is the output with debug enabled on dqn-hfo. I haven't studied this, and I don't really know what I'm looking for. I'll report back if I find something.

Apparently, a bunch of folks have had similar problems, but there doesn't appear to be a common solution, and some doubt it's a problem with caffe.

Starting game
EndOfTrial: 0 / 1 102 OUT_OF_TIME
I1121 22:49:09.749055  3101 dqn_main.cpp:355] [Agent0] Episode 0 reward = 0
EndOfTrial: 0 / 2 205 OUT_OF_TIME
I1121 22:49:09.769755  3101 dqn_main.cpp:355] [Agent0] Episode 1 reward = 0
EndOfTrial: 0 / 3 308 OUT_OF_TIME
I1121 22:49:09.788456  3101 dqn_main.cpp:355] [Agent0] Episode 2 reward = 0
EndOfTrial: 0 / 4 411 OUT_OF_TIME
I1121 22:49:09.811931  3101 dqn_main.cpp:355] [Agent0] Episode 3 reward = 0
EndOfTrial: 0 / 5 514 OUT_OF_TIME
I1121 22:49:09.829836  3101 dqn_main.cpp:355] [Agent0] Episode 4 reward = 0
EndOfTrial: 0 / 6 617 OUT_OF_TIME
I1121 22:49:09.848392  3101 dqn_main.cpp:355] [Agent0] Episode 5 reward = 0
EndOfTrial: 0 / 7 720 OUT_OF_TIME
I1121 22:49:09.868401  3101 dqn_main.cpp:355] [Agent0] Episode 6 reward = 0
EndOfTrial: 0 / 8 823 OUT_OF_TIME
I1121 22:49:09.885887  3101 dqn_main.cpp:355] [Agent0] Episode 7 reward = 0
EndOfTrial: 0 / 9 926 OUT_OF_TIME
I1121 22:49:09.904845  3101 dqn_main.cpp:355] [Agent0] Episode 8 reward = 0
EndOfTrial: 0 / 10 1029 OUT_OF_TIME
I1121 22:49:09.923943  3101 dqn_main.cpp:355] [Agent0] Episode 9 reward = 0
EndOfTrial: 0 / 11 1132 OUT_OF_TIME
I1121 22:49:09.942113  3101 dqn_main.cpp:355] [Agent0] Episode 10 reward = 0
I1121 22:49:09.942322  3101 dqn.cpp:977]  [Forward] CriticClone Through Actor
I1121 22:49:09.942342  3101 dqn.cpp:737]   [Forward] Actor
I1121 22:49:09.957927  3101 net.cpp:663]     [Forward] Layer state_input_layer, top blob states data: 0.650094
I1121 22:49:09.958070  3101 net.cpp:663]     [Forward] Layer state_input_layer, top blob dummy1 data: 0.740436
I1121 22:49:09.960063  3101 net.cpp:663]     [Forward] Layer ip1_layer, top blob ip1 data: 0.045051
I1121 22:49:09.960180  3101 net.cpp:675]     [Forward] Layer ip1_layer, param blob 0 data: 0.0079794
I1121 22:49:09.960276  3101 net.cpp:675]     [Forward] Layer ip1_layer, param blob 1 data: 0
I1121 22:49:09.960538  3101 net.cpp:663]     [Forward] Layer ip1_relu_layer, top blob ip1 data: 0.0225783
I1121 22:49:09.961609  3101 net.cpp:663]     [Forward] Layer ip2_layer, top blob ip2 data: 0.00996837
I1121 22:49:09.961733  3101 net.cpp:675]     [Forward] Layer ip2_layer, param blob 0 data: 0.00798119
I1121 22:49:09.961827  3101 net.cpp:675]     [Forward] Layer ip2_layer, param blob 1 data: 0
I1121 22:49:09.961930  3101 net.cpp:663]     [Forward] Layer ip2_relu_layer, top blob ip2 data: 0.00495125
I1121 22:49:09.962687  3101 net.cpp:663]     [Forward] Layer ip3_layer, top blob ip3 data: 0.0015543
I1121 22:49:09.962797  3101 net.cpp:675]     [Forward] Layer ip3_layer, param blob 0 data: 0.00798144
I1121 22:49:09.962890  3101 net.cpp:675]     [Forward] Layer ip3_layer, param blob 1 data: 0
I1121 22:49:09.962994  3101 net.cpp:663]     [Forward] Layer ip3_relu_layer, top blob ip3 data: 0.000788451
I1121 22:49:09.963294  3101 net.cpp:663]     [Forward] Layer ip4_layer, top blob ip4 data: 0.000192941
I1121 22:49:09.963390  3101 net.cpp:675]     [Forward] Layer ip4_layer, param blob 0 data: 0.00797795
I1121 22:49:09.963531  3101 net.cpp:675]     [Forward] Layer ip4_layer, param blob 1 data: 0
I1121 22:49:09.963636  3101 net.cpp:663]     [Forward] Layer ip4_relu_layer, top blob ip4 data: 0.000109635
I1121 22:49:09.963742  3101 net.cpp:663]     [Forward] Layer ip4_ip4_relu_layer_0_split, top blob ip4_ip4_relu_layer_0_split_0 data: 0.000109635
I1121 22:49:09.963836  3101 net.cpp:663]     [Forward] Layer ip4_ip4_relu_layer_0_split, top blob ip4_ip4_relu_layer_0_split_1 data: 0.000109635
I1121 22:49:09.964131  3101 net.cpp:663]     [Forward] Layer action_layer, top blob actions data: 1.78123e-05
I1121 22:49:09.964236  3101 net.cpp:675]     [Forward] Layer action_layer, param blob 0 data: 0.00757835
I1121 22:49:09.964325  3101 net.cpp:675]     [Forward] Layer action_layer, param blob 1 data: 0
I1121 22:49:09.964586  3101 net.cpp:663]     [Forward] Layer actionpara_layer, top blob action_params data: 1.58027e-05
I1121 22:49:09.964681  3101 net.cpp:675]     [Forward] Layer actionpara_layer, param blob 0 data: 0.00782375
I1121 22:49:09.964772  3101 net.cpp:675]     [Forward] Layer actionpara_layer, param blob 1 data: 0
I1121 22:49:09.965121  3101 dqn.cpp:985]   [Forward] CriticClone
I1121 22:49:09.965328  3101 net.cpp:663]     [Forward] Layer state_input_layer, top blob states data: 0.650094
I1121 22:49:09.965353  3101 net.cpp:663]     [Forward] Layer state_input_layer, top blob dummy1 data: 0.740436
I1121 22:49:09.965402  3101 net.cpp:663]     [Forward] Layer action_input_layer, top blob actions data: 1.78123e-05
I1121 22:49:09.965420  3101 net.cpp:663]     [Forward] Layer action_input_layer, top blob dummy2 data: 1.93005e-05
I1121 22:49:09.965466  3101 net.cpp:663]     [Forward] Layer action_params_input_layer, top blob action_params data: 1.58027e-05
I1121 22:49:09.965489  3101 net.cpp:663]     [Forward] Layer action_params_input_layer, top blob dummy3 data: 1.52097e-05
I1121 22:49:09.965538  3101 net.cpp:663]     [Forward] Layer target_input_layer, top blob target data: 0
I1121 22:49:09.965555  3101 net.cpp:663]     [Forward] Layer target_input_layer, top blob dummy4 data: 0
I1121 22:49:09.965759  3101 net.cpp:663]     [Forward] Layer concat, top blob state_actions data: 0.55588
I1121 22:49:09.966063  3101 net.cpp:663]     [Forward] Layer ip1_layer, top blob ip1 data: 0.0450972
I1121 22:49:09.966161  3101 net.cpp:675]     [Forward] Layer ip1_layer, param blob 0 data: 0.00797473
I1121 22:49:09.966253  3101 net.cpp:675]     [Forward] Layer ip1_layer, param blob 1 data: 0
I1121 22:49:09.966354  3101 net.cpp:663]     [Forward] Layer ip1_relu_layer, top blob ip1 data: 0.0221525
I1121 22:49:09.967411  3101 net.cpp:663]     [Forward] Layer ip2_layer, top blob ip2 data: 0.0100178
I1121 22:49:09.967535  3101 net.cpp:675]     [Forward] Layer ip2_layer, param blob 0 data: 0.00798512
I1121 22:49:09.967627  3101 net.cpp:675]     [Forward] Layer ip2_layer, param blob 1 data: 0
I1121 22:49:09.967730  3101 net.cpp:663]     [Forward] Layer ip2_relu_layer, top blob ip2 data: 0.00490297
I1121 22:49:09.968479  3101 net.cpp:663]     [Forward] Layer ip3_layer, top blob ip3 data: 0.00154114
I1121 22:49:09.968581  3101 net.cpp:675]     [Forward] Layer ip3_layer, param blob 0 data: 0.00795772
I1121 22:49:09.968673  3101 net.cpp:675]     [Forward] Layer ip3_layer, param blob 1 data: 0
I1121 22:49:09.968775  3101 net.cpp:663]     [Forward] Layer ip3_relu_layer, top blob ip3 data: 0.000796483
I1121 22:49:09.969084  3101 net.cpp:663]     [Forward] Layer ip4_layer, top blob ip4 data: 0.000175645
I1121 22:49:09.969187  3101 net.cpp:675]     [Forward] Layer ip4_layer, param blob 0 data: 0.00796696
I1121 22:49:09.969281  3101 net.cpp:675]     [Forward] Layer ip4_layer, param blob 1 data: 0
I1121 22:49:09.969382  3101 net.cpp:663]     [Forward] Layer ip4_relu_layer, top blob ip4 data: 9.5139e-05
I1121 22:49:09.969643  3101 net.cpp:663]     [Forward] Layer q_values_layer, top blob q_values data: 4.8952e-06
I1121 22:49:09.969739  3101 net.cpp:675]     [Forward] Layer q_values_layer, param blob 0 data: 0.00765053
I1121 22:49:09.969830  3101 net.cpp:675]     [Forward] Layer q_values_layer, param blob 1 data: 0
I1121 22:49:09.970237  3101 net.cpp:663]     [Forward] Layer loss, top blob loss data: 2.17407e-11
I1121 22:49:09.970378  3101 dqn.cpp:903]  [Step] Critic
*** Aborted at 1511304549 (unix time) try "date -d @1511304549" if you are using GNU date ***
PC: @     0x7febafeaca5e (unknown)
*** SIGSEGV (@0xb05c40000) received by PID 3098 (TID 0x7feb8d0c7700) from PID 96731136; stack trace: ***
    @     0x7febafd6f4b0 (unknown)
    @     0x7febafeaca5e (unknown)
    @           0x4cc274 dqn::ZeroGradParameters<>()
    @           0x4c089a dqn::DQN::UpdateActorCritic()
    @           0x4befe5 dqn::DQN::Update()
    @           0x4f9769 KeepPlayingGames()
    @           0x505406 _ZNSt12_Bind_simpleIFPFviNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiEiS5_iEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE
    @           0x504efa std::_Bind_simple<>::operator()()
    @           0x504d1e std::thread::_Impl<>::_M_run()
    @     0x7febb06dbc80 (unknown)
    @     0x7febb18226ba start_thread
    @     0x7febafe413dd clone
    @                0x0 (unknown)
Segmentation fault (core dumped)

Doing some additional testing, I've confirmed the fault occurs at this call in dqn.cpp:

caffe::caffe_set(blob->count(), static_cast<Dtype>(0),
                             blob->mutable_gpu_diff());

When I've run dqn with --nogpu, there is no problem despite a very similar call being made:

caffe::caffe_set(blob->count(), static_cast<Dtype>(0),
                         blob->mutable_cpu_diff());

I made it work by changing the call on line 76 of dqn.cpp:

From:

caffe::caffe_set(blob->count(), static_cast<Dtype>(0), blob->mutable_gpu_diff());

To:

caffe::caffe_set(blob->count(), static_cast<Dtype>(0), blob->mutable_cpu_diff());

Does anyone know what the adverse effects of this change will be? It looks like despite this change, something is still being tasked out to my GPU, which is showing 50% utilization on the K80 I'm using on the Google Cloud Platform.

@wbwatkinson I met the same error, followed your way and solved it. Thanks!
Similarly, my TITAN XP GPU is only utilized 30%.