google-deepmind/reverb

Quickstart example deadlocks on cluster

simon-bachhuber opened this issue · 4 comments

When I run the quickstart example from the README on a remote system, it deadlocks.
The example runs fine on my local PC. The remote system is part of the university cluster, and all of its machines run Ubuntu. The following shows an IPython session executed on one of the cluster nodes.

[Image: Screenshot from 2022-06-09 09-20-51]

When executing client.insert(...) the system hangs. I could imagine this might be an issue with ports? But I have quite limited knowledge of this topic, and any pointers would be highly appreciated.
Thanks :)

Hey,

I think the problem here is a sneaky one. You can see in the logs that a checkpoint is loaded. A table restored from a checkpoint uses the configuration of the original table (i.e. the original rate_limiter etc.), and I would guess that this table had a rate_limiter capable of blocking inserts. When you then try to insert, the rate limiter blocks the insert forever (unless you sample concurrently from a different thread).
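To illustrate the kind of blocking described above, here is a toy model written with the standard library only. It is not Reverb's implementation: `ToyRateLimiter` and `max_ahead` are hypothetical names, and the rule "inserts may run at most `max_ahead` items ahead of samples" is an assumption chosen to keep the sketch short.

```python
import threading
from typing import Optional


class ToyRateLimiter:
    """Toy limiter: inserts block once they are `max_ahead` ahead of samples."""

    def __init__(self, max_ahead: int) -> None:
        self._max_ahead = max_ahead
        self._inserts = 0
        self._samples = 0
        self._cond = threading.Condition()

    def insert(self, timeout: Optional[float] = None) -> bool:
        """Wait until an insert is allowed; return False if the wait timed out."""
        with self._cond:
            allowed = self._cond.wait_for(
                lambda: self._inserts - self._samples < self._max_ahead,
                timeout,
            )
            if allowed:
                self._inserts += 1
            return allowed

    def sample(self) -> bool:
        """Record one sample and wake blocked inserters; False if nothing to sample."""
        with self._cond:
            if self._samples >= self._inserts:
                return False
            self._samples += 1
            self._cond.notify_all()
            return True
```

With `timeout=None`, a single-threaded caller that only inserts would wait forever once the limit is reached, which matches the symptom of a hanging insert. A `sample()` call from another thread would release it.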

Some things to try:

  • Delete the checkpoint folder and try again.
  • Use client.server_info() to inspect the state of the table after creating the server and check if it matches your expectations.

Thanks for your response!

I just deleted everything under /tmp/ but no luck. I also gave the Server a Checkpoint with a new path, which didn't work either.
Regarding client.server_info(): I cannot run this! As soon as I run it, the system hangs. It's as if the client cannot communicate with the server.

Could you check whether connecting to the server by specifying its IP address (not localhost) works? Maybe the server somehow listens on a different interface, while the "localhost" loopback is prohibited?
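One way to run this check without involving Reverb at all is a plain TCP probe. The sketch below uses only the standard library. The helper names `can_connect` and `candidate_hosts` are made up for illustration, and the port to probe is whatever port the server was started on.

```python
import socket
from typing import List


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def candidate_hosts() -> List[str]:
    """Addresses worth trying: loopback names, the hostname, and its resolved IP."""
    hostname = socket.gethostname()
    candidates = ["localhost", "127.0.0.1", hostname]
    try:
        candidates.append(socket.gethostbyname(hostname))
    except OSError:
        pass  # The hostname does not resolve; skip the IP candidate.
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))
```

Calling `can_connect(host, port)` for each entry of `candidate_hosts()` on the cluster node shows which addresses reach the listening port. If the hostname works but `localhost` does not, that points at the loopback interface rather than at the table configuration.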

This seems to always work:

import socket
# Replace `localhost` in the server address with the machine's hostname:
hostname = socket.gethostname()