tensorflow/tensorboard

Implement liveness check for notebook extensions

Opened this issue · 9 comments

Currently, each TensorBoard process writes its meta-information to a
file in the shared .tensorboard-info temp directory, and tries to
clean up the file on graceful exit. This has two problems on Windows:

  • The base temporary directory %TMP% is never automatically cleaned,
    even after logout or reboot.
  • It is not possible to gracefully shut down an arbitrary process
    given its PID.

The result is that almost every TensorBoard started by %tensorboard will
leave its info file around forever unless it is manually cleaned up, and
the instructions suggested to the user (“use !kill …”) are not
adequate to effect this cleanup. See #2481.

We can ameliorate this by implementing the liveness check mentioned in a
TODO in manager.py, cleaning up the dead info file on failure:

infos = get_all()
candidates = [info for info in infos if info.cache_key == cache_key]
for candidate in sorted(candidates, key=lambda x: x.port):
  # TODO(@wchargin): Check here that the provided port is still live.
  return candidate
return None
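
One way the liveness check could look is sketched below. This is only an illustration, not the actual manager.py code: the `Info` tuple is a hypothetical stand-in for the real info records, `is_port_live` and `first_live_match` are made-up names, and the `on_dead` callback marks where the stale info file would be cleaned up. It assumes that "live" means "something accepts TCP connections on that port", which cannot tell a TensorBoard apart from an unrelated process that has since taken over the port.

```python
import socket
from collections import namedtuple

# Hypothetical stand-in for the info records returned by get_all().
Info = namedtuple("Info", ["pid", "port", "cache_key"])


def is_port_live(port, host="localhost", timeout=1.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def first_live_match(infos, cache_key, is_live=is_port_live,
                     on_dead=lambda info: None):
    """Return the lowest-port matching info whose port is live, else None.

    `on_dead` is called for each matching info whose port is not live;
    a real implementation might delete the stale info file there.
    """
    candidates = [info for info in infos if info.cache_key == cache_key]
    for candidate in sorted(candidates, key=lambda x: x.port):
        if is_live(candidate.port):
            return candidate
        on_dead(candidate)
    return None
```

A stricter check might also request a known TensorBoard endpoint on the port, at the cost of a slower lookup.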

We can also provide a notebook.kill(pid) function with implementation
something vaguely like

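import os
import signal
import subprocess

def kill(pid):
  if os.name == "nt":
    # Windows cannot gracefully stop a process by PID, so force-kill it...
    subprocess.check_output(["taskkill", "/pid", str(int(pid)), "/f"])
    # ...and remove its info file ourselves (per-PID removal is part of this proposal).
    manager.remove_info_file(pid)
  else:
    os.kill(pid, signal.SIGTERM)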

and then replace the “reusing TensorBoard” user-facing messaging on
Windows with something like

template = (
    "Reusing TensorBoard on port {port} (pid {pid}), started {delta} ago. "
    "To kill it, run `from tensorboard import notebook; notebook.kill({pid})`."
)

though that incantation still is a bit of a mouthful.

Actually, this could “just” be

    "To kill it, run `import tensorboard; tensorboard.notebook.kill({pid})`."

or even

    "To kill it, run `__import__('tensorboard').notebook.kill({pid})`."

because we expose notebook in tensorboard/__init__.py as of #1824.

Hi @wchargin, thanks for posting this. After reading your solution, I still have no clue what I'm supposed to do to get TensorBoard running again. I'm running into the same issue, where Windows did not clean up the temp file after an unclean exit.

A workaround is to delete your %TMP%\.tensorboard-info directory.
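
For anyone who prefers to do this from Python, here is a minimal sketch of the same workaround. It assumes the info directory sits directly under the directory returned by `tempfile.gettempdir()`, which on Windows is typically the same location as %TMP%; the function name `clear_tensorboard_info` and its `base_dir` parameter are made up for this example and are not part of TensorBoard.

```python
import os
import shutil
import tempfile


def clear_tensorboard_info(base_dir=None):
    """Delete the .tensorboard-info directory; return True if it existed.

    `base_dir` defaults to the system temp directory (an assumption about
    where the info files live).
    """
    base = base_dir if base_dir is not None else tempfile.gettempdir()
    info_dir = os.path.join(base, ".tensorboard-info")
    if not os.path.isdir(info_dir):
        return False
    shutil.rmtree(info_dir)
    return True
```

Calling `clear_tensorboard_info()` removes the info files only. Any TensorBoard processes that are still running are left untouched and have to be stopped separately.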

Thank you for your reply. I tried the following things and none worked:

  1. Ran "taskkill /im tensorboard.exe /f" at the command prompt to kill all live TensorBoard processes.
  2. Deleted all the pid-xxxx.info files in the "%TMP%\.tensorboard-info" directory.
  3. Deleted the whole "%TMP%\.tensorboard-info" directory.

However, when I got desperate and passed "--port 6005" on the command line and then opened TensorBoard in Chrome, it worked. My guess is that port 6006 somehow got corrupted?

Could you please open a separate issue for this and make sure to run the
diagnose_tensorboard.py script to dump appropriate debugging data?

thanks. will do

Same issues: TensorBoard is not reliable within Jupyter. Based on this thread I did the following with marginal success. TensorBoard was good for only one time after the cleanup.

Ran "taskkill /im tensorboard.exe /f" at the command prompt to kill all live TensorBoard processes.
Deleted all the pid-xxxx.info files in the "%TMP%\.tensorboard-info" directory.
Deleted the whole "%TMP%\.tensorboard-info" directory.

For those on Windows trying to do this from Jupyter, the following two commands will automate the process. You can just toss them in a cell and run them after you are done with the open TensorBoard instance (or before you need to check a different log file).

!taskkill /IM "tensorboard.exe" /F
!rmdir /S /Q %temp%\.tensorboard-info

this worked on windows. Very helpful!!