Prometheus metrics blocks tornado main thread
dleen opened this issue · 4 comments
Description
A bug was reported in jtpio/jupyterlab-system-monitor#87 about the UI lagging when several kernels are running. The issue was traced to the system monitor extension: disabling that extension while keeping the same load on the system made the UI lag go away.
Reproduce
Create multiple notebooks with contents:
import time
i = 0
while True:
    print(f"i={i}")
    i += 1
    time.sleep(1)
Run 4+ kernels all executing this cell.
Open a terminal and hold down a character key, e.g. "x", to send continuous input to the terminal (this assumes your key repeat speed is high enough). The input should be very smooth: characters appear rapidly and without pauses.
Now relaunch the server with --ResourceUseDisplay.track_cpu_percent=True.
Repeat the process. While holding down a key in the terminal you will notice frequent lags and pauses.
Expected behavior
The UI does not lag with the extension enabled.
Problem
The API handler does the right thing by running the call to psutil on a separate thread: https://github.com/jupyter-server/jupyter-resource-usage/blob/master/jupyter_resource_usage/api.py#L66
However, the Prometheus metrics code uses a different implementation (why?) and performs the same expensive operation on the main tornado thread, which blocks other calls: https://github.com/jupyter-server/jupyter-resource-usage/blob/master/jupyter_resource_usage/metrics.py#L40
You can prove this is the root cause by simply disabling this and the following lines: https://github.com/jupyter-server/jupyter-resource-usage/blob/master/jupyter_resource_usage/server_extension.py#L22
When this callback is removed the UI no longer lags every second.
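The difference between the two code paths can be reproduced in isolation. The following is a minimal sketch, not the extension's actual code: it uses plain asyncio instead of Tornado, and `expensive_sample` is a hypothetical stand-in (a `time.sleep`) for the blocking psutil call. A 10 ms ticker plays the role of the terminal input, and the largest gap between ticks shows how long the event loop was stalled.

```python
import asyncio
import time


def expensive_sample() -> float:
    """Hypothetical stand-in for a blocking psutil call (assumption)."""
    time.sleep(0.2)
    return 1.0


async def sample_on_loop() -> None:
    # Blocking call made directly on the event loop thread.
    expensive_sample()


async def sample_off_loop() -> None:
    # Same call, handed to the default thread pool executor.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, expensive_sample)


async def max_tick_gap(sample) -> float:
    """Run a 10 ms ticker next to `sample` and return the largest tick gap."""
    gaps = []

    async def ticker() -> None:
        last = time.perf_counter()
        while True:
            await asyncio.sleep(0.01)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0.05)  # let the ticker record a few normal ticks
    await sample()
    await asyncio.sleep(0.05)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


blocked = asyncio.run(max_tick_gap(sample_on_loop))
offloaded = asyncio.run(max_tick_gap(sample_off_loop))
print(f"on loop: {blocked:.3f}s, in executor: {offloaded:.3f}s")
```

With the call on the loop, the ticker cannot run until the sample finishes, which is the same kind of stall seen as lag in the terminal. With the executor, the ticker keeps firing while the worker thread does the sampling.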
Posted PR #124, which obviously doesn't solve the underlying issue but at least lets users who aren't using Prometheus avoid it.
A good solution is to also run psutil on a separate thread for Prometheus.
The best solution is to merge the code for getting metrics. For example, the API handler could use the most recent entry in the Prometheus list of metrics.
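One way to picture the merged approach is a single sampler that collects off the event loop and caches the latest result, so that both the Prometheus callback and the API handler read the same cached value. This is only a sketch under assumptions: `SharedSampler`, `Sample`, and `fake_collect` are hypothetical names that do not exist in jupyter-resource-usage, and `fake_collect` stands in for the real psutil calls.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float
    cpu_percent: float


class SharedSampler:
    """Collect metrics in a worker thread and cache the most recent result."""

    def __init__(self, collect: Callable[[], float]) -> None:
        self._collect = collect
        self.latest: Optional[Sample] = None

    async def refresh(self) -> Sample:
        loop = asyncio.get_running_loop()
        # The blocking collection runs in the default thread pool executor.
        value = await loop.run_in_executor(None, self._collect)
        self.latest = Sample(time.time(), value)
        return self.latest

    async def run(self, interval: float) -> None:
        # Periodic refresh; consumers never trigger a collection themselves.
        while True:
            await self.refresh()
            await asyncio.sleep(interval)


def fake_collect() -> float:
    """Hypothetical stand-in for the expensive psutil calls (assumption)."""
    time.sleep(0.05)
    return 12.5


async def demo() -> Optional[Sample]:
    sampler = SharedSampler(fake_collect)
    task = asyncio.create_task(sampler.run(interval=0.1))
    await asyncio.sleep(0.2)
    # Both consumers would read the cached sample instead of calling psutil.
    api_view = sampler.latest
    prometheus_view = sampler.latest
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    assert api_view is prometheus_view
    return api_view


result = asyncio.run(demo())
print(result)
```

The trade-off in this design is that readers may see a value up to one interval old, in exchange for never blocking the event loop and never sampling twice.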
This turned out to be an issue for my deployments. To remedy the problem we ended up completely removing the Prometheus callback by commenting out the appropriate section in server_extension.py.
The periodic Prometheus handler introduced "skipping" and "lag" when using jupyter-server-proxy to connect to a VNC server via a proxied websocket, making it completely unusable.