pangeo-data/jupyter-earth

Prevent server from getting stopped during long simulations

JordiBolibar opened this issue · 12 comments

So far I am able to run long simulations as long as I keep my session more or less active (i.e. during my working hours). However, as soon as I leave it running overnight, my server is eventually stopped and the simulations get interrupted. I have configured keep-alive in my SSH config file (.ssh/ssh_config) as follows:

TCPKeepAlive yes
ClientAliveInterval 30
ClientAliveCountMax 240

X11Forwarding yes
X11UseLocalhost no

Is there anything else I should do in order to avoid that? What am I missing? Thanks in advance!

Ah, thanks for reporting this @JordiBolibar; this is a novel situation for me. The automated logic that terminates an "inactive" server is quite intricate, but it doesn't treat open SSH connections as a sign of activity.

  1. I've opened an issue about this in yuvipanda/jupyterhub-ssh, the project that enables SSH connections to our Jupyter servers, so we can get help solving this more thoroughly.
  2. We need a workaround for now.
    I'm quite confident you can work around it by visiting your server via the web UI (hub.jupytearth.org), starting any notebook, and keeping it running with something like:
    import time
    # Sleep for 24 hours. Keep this bounded: a huge value such as
    # 999999999999 could keep a forgotten server running for months
    # and add to the cloud bill.
    time.sleep(3600 * 24)
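
A variant of the same workaround, sketched under assumptions: it blocks the kernel in shorter chunks with a hard upper bound and prints a heartbeat after each chunk, so if the kernel dies early you can see roughly when. The chunking and heartbeat are additions of this sketch, not part of the suggestion above, and there is no claim here that the printed output changes how the server's activity is judged. keep_kernel_busy is a hypothetical helper name.

```python
import time


def keep_kernel_busy(hours=24, chunk_seconds=600, sleep=time.sleep, log=print):
    """Hypothetical helper: block the kernel for at most `hours`.

    Sleeps in chunks of `chunk_seconds` and logs a heartbeat after each
    chunk. The hard upper bound keeps a forgotten notebook from running
    (and adding to the cloud bill) indefinitely.
    Returns the total number of seconds slept.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    total = int(hours * 3600)
    elapsed = 0
    while elapsed < total:
        # The last chunk may be shorter so we never exceed the bound.
        step = min(chunk_seconds, total - elapsed)
        sleep(step)
        elapsed += step
        log(f"still alive: {elapsed / 3600:.2f} h of {hours} h")
    return elapsed
```

In a notebook cell you would call keep_kernel_busy(24). The sleep and log parameters exist only so the loop can be exercised without actually waiting.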

@JordiBolibar try the workaround and let me know if it works. I'm 99% confident it will prevent the server from being stopped automatically.

Hi Erik, thanks for the quick reply. I am in fact connecting through the web UI. I forgot that I'm no longer connecting from desktop VSCode, since there's now a VSCode integrated in the browser. So indeed, the SSH configuration I posted is useless.

And thanks for the workaround. I will implement the same thing in Julia and see if that does the trick.

Oh hmmm

  1. Do you think the bug I reported in yuvipanda/jupyterhub-ssh#67 is correct or incorrect?
  2. Is it correct that you are using visual studio code via the web interface (hub.jupytearth.org)?
  3. Can you clarify what "running a simulation" implies? Are you running something from a terminal opened in the hub.jupytearth.org based visual studio code UI?

> And thanks for the workaround. I will implement the same thing in Julia and see if that does the trick.

The goal is to keep the server from being stopped, and for that it needs to be seen as active. This workaround relies on having a Jupyter kernel running at all times. What matters isn't whether you run it in Julia or Python, but that it is an actual registered kernel that is running. In practice, you can visit https://hub.jupytearth.org/hub/user-redirect/lab, start a Python notebook with its associated kernel running this sleep command, and then go to https://hub.jupytearth.org/hub/user-redirect/vscode and keep working.

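I think you can see what's running from a terminal with jupyter server list. Strictly, that command lists the running Jupyter servers; the kernels of each server are listed by its /api/kernels REST endpoint.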
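
A minimal sketch of querying that endpoint from a terminal or notebook on the same server, to check that the sleeping kernel is still registered. Assumptions not confirmed in this thread: the single-user server listens on 127.0.0.1:8888 under the path in JUPYTERHUB_SERVICE_PREFIX, and it accepts the JUPYTERHUB_API_TOKEN from its own environment. kernels_url and list_kernels are hypothetical helper names.

```python
import json
import os
import urllib.request


def kernels_url(env=os.environ, base="http://127.0.0.1:8888"):
    # Assumption: the server is reachable at `base` and mounted under
    # JUPYTERHUB_SERVICE_PREFIX (e.g. "/user/<name>/").
    prefix = env.get("JUPYTERHUB_SERVICE_PREFIX", "/")
    if not prefix.startswith("/"):
        prefix = "/" + prefix
    if not prefix.endswith("/"):
        prefix += "/"
    return f"{base.rstrip('/')}{prefix}api/kernels"


def list_kernels(env=os.environ, base="http://127.0.0.1:8888"):
    """Return the kernel models from GET /api/kernels (network call)."""
    req = urllib.request.Request(kernels_url(env, base))
    token = env.get("JUPYTERHUB_API_TOKEN")
    if token:
        # Assumption: the single-user server accepts its own hub token.
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, for k in list_kernels(): print(k["name"], k["execution_state"], k["last_activity"]). A kernel stuck in the time.sleep cell should report execution_state "busy"; if it no longer appears at all, the kernel has died.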

  1. It depends on whether code-server connects via SSH. I'm not sure whether it needs to, or whether it runs on the same server.
  2. Yes, I'm using VSCode via code-server directly in the browser, the one present in the launcher.
  3. Yes, I'm running a Julia simulation from the VSCode browser version from the hub UI.

OK, I will stick with the JupyterNotebook trick for now.

I want to come back to this issue that @JordiBolibar opened some time ago.

I had been using the time.sleep() trick for some time now, and I notice that the kernel of that notebook also dies without warning before the set time is up. I wonder if there is a more stable way to keep a server running for several hours, especially for long and computationally expensive simulations. @consideRatio do you have any idea of how to accomplish this? Would running the sleep command from a terminal have the same effect?

Thank you!

I think the way to do this is to:

  1. Write a jupyter server extension that has UI that says 'keep this server alive'
  2. When it is enabled, it'll keep reporting that it is active to the server's API
  3. This will ensure that the multiple killers we have (idle and in-server) don't get to it.

So users can go to this page, and say 'keep alive for 8h, 24h, until turned off' etc
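
A rough sketch of steps 2 and 3, written as a plain Python loop rather than a server extension with a UI; it is not how jupyter-keepalive is implemented. It assumes the JUPYTERHUB_ACTIVITY_URL and JUPYTERHUB_API_TOKEN environment variables that JupyterHub sets in single-user servers, and the body format of JupyterHub's POST /users/{name}/activity endpoint. activity_payload, post_activity and keep_alive are hypothetical names. Reporting to the hub would only keep away a culler that reads the hub's last_activity; it does nothing about timeouts enforced inside the server itself.

```python
import json
import os
import time
import urllib.request
from datetime import datetime, timezone


def activity_payload(server_name, now):
    # Assumed body shape for JupyterHub's POST /users/{name}/activity.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "last_activity": stamp,
        "servers": {server_name: {"last_activity": stamp}},
    }


def post_activity(payload, env=os.environ):
    """Send one activity report to the hub (network call)."""
    req = urllib.request.Request(
        env["JUPYTERHUB_ACTIVITY_URL"],  # assumed to be set by the spawner
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Authorization": f"token {env['JUPYTERHUB_API_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=10):
        pass


def keep_alive(hours=None, interval=300, report=post_activity,
               server_name="", clock=time.monotonic, sleep=time.sleep,
               now=lambda: datetime.now(timezone.utc)):
    """Report activity every `interval` seconds until `hours` have passed.

    hours=None means "until turned off", i.e. until the loop is interrupted.
    Returns the number of reports sent.
    """
    deadline = None if hours is None else clock() + hours * 3600
    sent = 0
    while deadline is None or clock() < deadline:
        report(activity_payload(server_name, now()))
        sent += 1
        sleep(interval)
    return sent
```

keep_alive(hours=8) would cover an 8-hour window and keep_alive() would run until interrupted; since it blocks, it would need its own thread or process. Taking clock, sleep and report as parameters lets the loop be tested without a hub.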

@minrk's work on https://github.com/minrk/jupyter-keepalive addresses this quite well, I think.

keepalive.mov

With #155 I'll add it to the base image.

Hi @consideRatio, I'm still having issues with this. I cannot find the Keep server alive option, and running the notebook with a sleep command still doesn't work. I cannot run long simulations since my server gets disconnected after a short while.

Is it normal that I cannot access the option you displayed above in the video? Thanks a lot in advance!

Ah, I ended up disabling it while trying to resolve a very challenging upgrade of other packages with coupled dependencies.

# https://github.com/minrk/jupyter-keepalive/archive/main.zip \
# This is a jupyter_server extension that is controllable via a
# JupyterLab plugin to keep a server running.
#
# ref: https://github.com/minrk/jupyter-keepalive
#
# NOTE: Disabled as we don't have nodejs installed, making us
# require a pre-built wheel or installation of nodejs.

I'll see if I can upstream a resolution to this by getting the package built and published so that nodejs isn't required.

I opened minrk/jupyter-keepalive#4 @JordiBolibar.

I'll see if I can re-configure something to help you avoid getting shut down.

Awesome, thanks a lot for your help!

@JordiBolibar I've not dropped the ball on this, but I'm swamped with work items. There is progress to getting jupyter-keepalive to help us here, so I'm currently aiming for that as a resolution.