mhauskn/dqn-hfo

server down

Each time I run dqn, it ends with the error below: "Server Down!". It always appears right after an Actor/Critic Iteration log line.
What causes this error, and how can I fix it?

I0711 15:58:31.193724 5373 dqn.cpp:807] [Agent0] Critic Iteration 64000, loss = 0.00130656
I0711 15:58:31.193823 5373 dqn.cpp:813] [Agent0] Actor Iteration 64000, avg_q_value = 0.0267761
base_left 11: [154620, 0] recv error message [(error illegal_command_form)]
base_left 11: waited 5 seconds. server down??
F0711 15:58:36.237735 5373 hfo_game.cpp:114] Server Down!
*** Check failure stack trace: ***
@ 0x7fbacb009daa (unknown)
@ 0x7fbacb009ce4 (unknown)
@ 0x7fbacb0096e6 (unknown)
@ 0x7fbacb00c687 (unknown)
@ 0x458189 HFOGameState::update()
@ 0x42c683 PlayOneEpisode()
@ 0x42e864 KeepPlayingGames()
@ 0x42fd6f std::thread::_Impl<>::_M_run()
@ 0x7fbac9eb6a60 (unknown)
@ 0x7fbaca9b6184 start_thread
@ 0x7fbac961e37d (unknown)
@ (nil) (unknown)
[1] 5370 abort (core dumped) ./dqn -save state/test -alsologtostderr

By the way, to use Caffe in your project, add one more line to CMakeLists.txt:
add_definitions(${Caffe_DEFINITIONS})
This fixes the 'Cannot find cublas_v2.h' error. See https://github.com/BVLC/caffe/pull/1667 for details.

Hi, I've pushed an update to the repo that adds a --verbose flag, which unsuppresses output from the game server. Can you try the code again with the verbose flag and report back what error message the server is dying with?

Also, thanks for the suggestion to include Caffe Definitions - I've added it to the makefile!

It works now! Thanks!
Also, if I want to try train.sh, what should I change to use it? And I don't really understand what "monitor-condor-job" means...

Great! monitor-condor-job is a script that's specific to the computational cluster I'm using. If you wanted to try the experiments in the train.sh file, just comment out the monitor-condor-job line.

Hi, I also get a similar error every time I run it. With the verbose flag set, the error message is the following:

Checking all players connected
Starting game
HFORef using seed: 1477248376
EndOfTrial: 0 / 1 102 OUT_OF_TIME
I1023 11:46:25.505098 743 dqn_main.cpp:355] [Agent0] Episode 0 reward = 0.00820231
syntax error
base_left 11: [105, 0] recv error message [(error illegal_command_form)]
Error parsing >(dash nan)(turn_neck 90)(done)<
base_left 11: waited 5 seconds. server down??
F1023 11:46:30.548648 743 hfo_game.cpp:125] Server Down!
*** Check failure stack trace: ***
@ 0x7fae960025cd google::LogMessage::Fail()
@ 0x7fae96004433 google::LogMessage::SendToLog()
@ 0x7fae9600215b google::LogMessage::Flush()
@ 0x7fae96004e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x45c058 HFOGameState::update()
@ 0x455d29 PlayOneEpisode()
@ 0x458569 KeepPlayingGames()
@ 0x45a341 std::thread::_Impl<>::_M_run()
@ 0x7fae9496ac80 (unknown)
@ 0x7fae95ac36fa start_thread
@ 0x7fae940d0b5d clone
@ (nil) (unknown)
./run.sh: line 6: 740 Aborted (core dumped) ./bin/dqn -save state/test -alsologtostderr

How can I fix this?

Can you try running some of the example C++ HFO agents (located in HFO/example) to determine if the problem is specific to HFO?
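
The verbose log above shows the server rejecting the command `(dash nan)`, so it may also be worth checking whether the dash power is a finite number before it is sent. Below is a minimal sketch of such a guard. `FormatDashCommand` is a hypothetical helper written only for illustration; it is not part of dqn-hfo or HFO, and where the actual command string gets built in those codebases is not shown in this thread.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical helper (an assumption, not dqn-hfo or HFO code).
// Writes "(dash <power>)" into *command and returns true when power is
// finite. Returns false and leaves *command untouched when power is NaN
// or infinite, since such a value would be printed as "nan" or "inf" and
// rejected by the server, as with the "(dash nan)" command in the log.
bool FormatDashCommand(double power, std::string* command) {
  if (!std::isfinite(power)) {
    return false;
  }
  std::ostringstream out;
  out << "(dash " << power << ")";
  *command = out.str();
  return true;
}
```

A caller could log the offending value and skip or replace the action when the function returns false. That would turn the delayed "Server Down!" abort into an error that points at the non-finite value, though it would not explain why the value became NaN in the first place.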