voila-dashboards/voila

Many widgets fail to render

Closed this issue · 36 comments

cc @aschlaep
E.g.:

import ipywidgets as widgets
widgets.Widget.close_all()
def recursive(n):
    if n == 0:
        return widgets.Button(description='Hi')
    else:
        cls = (widgets.VBox if n % 2 else widgets.HBox)
        return cls([recursive(n-1), recursive(n-1)])
display(recursive(9))
display(len(widgets.Widget.widgets))

This will often fail to render in Voila. I think it's likely a timing/ordering issue in the manager (a message received before an event handler is attached), or messages being dropped (iopub rate limit?).

I think this is a good opportunity to refactor the Voila manager to allow progressive rendering of widgets (create widgets as they come in, and even try to render them as they come in). Also, we could connect the websocket/kernel as early as possible instead of waiting until execution is done.
I assume we can reuse quite a bit from the lab manager; I believe @vidartf and @jasongrout put quite some effort into that.

@maartenbreddels - do you want to do a video call to discuss this? It sounds like a tricky issue.

Notebook 6.0.2 seems to have the same issue: it worked once with n=9, but usually it does not. JupyterLab 1.2.4 also fails to render with n=8. So I don't think this is a Voila-specific issue.

Voila sees the 2558 comms/widgets, requests all their info, but gets ~2300 replies.

@maartenbreddels - do you want to do a video call to discuss this? It sounds like a tricky issue.

I'm on whereby!

I wonder if the problem is overwhelming the websocket buffer. All of these requests are made simultaneously. I don't know what the websocket buffer size is, but perhaps it is overflowing and dropping messages or closing the connection?

From https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/send

The WebSocket.send() method enqueues the specified data to be transmitted to the server over the WebSocket connection, increasing the value of bufferedAmount by the number of bytes needed to contain the data. If the data can't be sent (for example, because it needs to be buffered but the buffer is full), the socket is closed automatically.

Can you check what the websocket bufferedAmount is? See also https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/bufferedAmount

@maartenbreddels and I talked about this more, and it seems that the problem is here:

voila/js/src/manager.js

Lines 117 to 140 in 8557d9d

async _build_models() {
    const comm_ids = await this._get_comm_info();
    const models = {};
    const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {
        const comm = await this._create_comm(this.comm_target_name, comm_id);
        return this._update_comm(comm);
    }));
    await Promise.all(widgets_info.map(async (widget_info) => {
        const state = widget_info.msg.content.data.state;
        const modelPromise = this.new_model({
                model_name: state._model_name,
                model_module: state._model_module,
                model_module_version: state._model_module_version,
                comm: widget_info.comm,
            },
            state
        );
        const model = await modelPromise;
        models[model.model_id] = model;
        return modelPromise;
    }));
    return models;
}

In particular, for his reproduction example, this gets back 2558 comm ids:

const comm_ids = await this._get_comm_info();

But this Promise.all only succeeds for 2300 or so widget state updates, and then just hangs waiting for the rest of the widget state updates:

voila/js/src/manager.js

Lines 120 to 123 in 8557d9d

const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {
    const comm = await this._create_comm(this.comm_target_name, comm_id);
    return this._update_comm(comm);
}));

He can reproduce this sort of issue in the classic notebook and JupyterLab by creating all these widgets, then refreshing the page (which triggers the same sort of logic to request all widget state from the kernel to populate the client widget manager).

I think the key is to first narrow down whether the problem is in getting the update requests to the kernel, in the kernel itself, or in getting the updates back from the kernel. @maartenbreddels decided to instrument the kernel to track these state requests, to see if the kernel is getting all the requests and whether it thinks it sent all the updates. That should narrow down the problem.

One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:

const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {
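
To make the shape of that quick fix concrete, here is the idea sketched in Python with asyncio, purely for illustration (fetch_state is a hypothetical stand-in for _create_comm + _update_comm; the real change would live in the JavaScript manager above): await the state requests in fixed-size chunks so only a bounded number are outstanding at any time.

import asyncio

async def fetch_state(comm_id):
    # Hypothetical stand-in for _create_comm + _update_comm
    await asyncio.sleep(0)
    return {'comm_id': comm_id}

async def fetch_all(comm_ids, chunk_size=500):
    results = []
    # Request the widget states in chunks of chunk_size, waiting for each
    # chunk to resolve before sending the next one.
    for start in range(0, len(comm_ids), chunk_size):
        chunk = comm_ids[start:start + chunk_size]
        results.extend(await asyncio.gather(*(fetch_state(c) for c in chunk)))
    return results

# Example run with 2558 fake comm ids, requested 500 at a time
print(len(asyncio.run(fetch_all(['comm-%d' % i for i in range(2558)]))))  # 2558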

Voila sees the 2558 comms/widgets, requests all their info, but gets ~2300 replies.

Since @maartenbreddels asked - here is where 2558 comes from...

  • 2**9 Buttons
  • 2**8 + 2**7 + 2**6 + ... + 2**0 = 2**9-1 HBoxes and VBoxes
  • 2**9 Style widgets, one for each button
  • 2**9+2**9-1 Layout widgets, one for each hbox, vbox, and button

(2**9 + 2**9 - 1)*2 + 2**9 = 2558 :)
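
As a quick sanity check of that arithmetic, in Python:

# Re-derive the widget count for recursive(9)
n = 9
buttons = 2**n                 # leaf Button widgets
boxes = 2**n - 1               # HBox/VBox containers (internal nodes of the binary tree)
styles = buttons               # one ButtonStyle per button
layouts = buttons + boxes      # one Layout per button and per box
print(buttons + boxes + styles + layouts)  # 2558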

The comms are limited to 1000 msgs/sec by default in the Notebook. This might be the reason for this issue.

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)
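
For reference, both limits are configurable; a minimal sketch of raising them in jupyter_notebook_config.py (the values here are illustrative, and the same options can be passed on the command line, as shown further down):

# jupyter_notebook_config.py -- illustrative values, not a recommendation
c.NotebookApp.iopub_msg_rate_limit = 10000.0  # msgs/sec, default 1000
c.NotebookApp.rate_limit_window = 3.0         # seconds over which the rate is measured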

Since @maartenbreddels asked - here is where 2558 comes from...

Thanks, that helped me sleep :)

Ok, I've narrowed down the issue a bit, by patching jupyter_server like this:

$ git diff jupyter_server/services/kernels/handlers.py
diff --git a/jupyter_server/services/kernels/handlers.py b/jupyter_server/services/kernels/handlers.py
index 23d763a84..97fb88f43 100644
--- a/jupyter_server/services/kernels/handlers.py
+++ b/jupyter_server/services/kernels/handlers.py
@@ -211,6 +211,9 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
         # Allows you to specify that the byte count should be lowered
         # by a delta amount at some point in the future.
         self._iopub_window_byte_queue = []
+        import collections
+        self.counters = collections.Counter()
+        self.request_order = {}

     @gen.coroutine
     def pre_get(self):
@@ -301,6 +304,11 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
         else:
             msg = json.loads(msg)
         channel = msg.pop('channel', None)
+        content = msg.get('content', {})
+        if channel == 'shell' and content.get('data', {}).get('method') == 'request_state':
+            self.request_order[content['comm_id']] = self.counters['request_state']
+            self.counters['request_state'] += 1
+            print(self.counters)
         if channel is None:
             self.log.warning("No channel specified, assuming shell: %s", msg)
             channel = 'shell'
@@ -329,6 +337,13 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
             self.write_message(json.dumps(msg, default=date_default))
         channel = getattr(stream, 'channel', None)
         msg_type = msg['header']['msg_type']
+        content = msg.get('content', {})
+        if channel == 'iopub' and content.get('data', {}).get('method') == 'update':
+            self.counters['update'] += 1
+            print(self.counters)
+            del self.request_order[content['comm_id']]
+            if self.counters['update'] > 1800:
+                print("missing", self.request_order)

         if channel == 'iopub' and msg_type == 'status' and msg['content'].get('execution_state') == 'idle':
             # reset rate limit counter on status=idle,

I can conclude the following:

  • we get 2558 request_state messages, so we get all websocket sends from the browser.
  • we get < 2558 update messages, so either the kernel does not receive them (not the case, see below), or the server does not receive them (either the kernel does not send a reply, or the reply does not get through).

I've analyzed the request_order dict, and in one run the first 1165 request_state calls got answered, the next 130 got dropped, the next 500 (or 501) got answered, then 459 were dropped, and the last 305 were dropped.
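
(For what it's worth, a small helper of the kind that could be used for that analysis; this is a hypothetical snippet, not from the thread. It turns the leftover request_order dict, which maps comm_id to the request index of each still-unanswered request, into contiguous index ranges so the dropped stretches are easy to see.)

def dropped_ranges(request_order):
    # Group the request indices of unanswered requests into contiguous (start, end) ranges
    ranges = []
    for idx in sorted(request_order.values()):
        if ranges and idx == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], idx)
        else:
            ranges.append((idx, idx))
    return ranges

print(dropped_ranges({'a': 1165, 'b': 1166, 'c': 1800}))  # [(1165, 1166), (1800, 1800)]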

from the kernel

I monkey patched ipykernel and the Comm class to see what we do on the kernel side:

Counter({'update': 2558, 'request_state': 2558})

So we get all the requests and we send all the updates, but they don't arrive at the server.
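
The exact instrumentation isn't shown above, but a rough sketch of the kind of kernel-side monkey patch that produces such counters, assuming ipykernel's Comm.send and Comm.handle_msg as the entry points (run inside the kernel, then inspect counters afterwards):

from collections import Counter
from ipykernel.comm import Comm

counters = Counter()
_original_send = Comm.send
_original_handle_msg = Comm.handle_msg

def counting_send(self, data=None, metadata=None, buffers=None):
    # Outgoing comm messages: widget state replies have method == 'update'
    if isinstance(data, dict) and data.get('method') == 'update':
        counters['update'] += 1
    return _original_send(self, data=data, metadata=metadata, buffers=buffers)

def counting_handle_msg(self, msg):
    # Incoming comm messages: the frontend's state requests have method == 'request_state'
    if msg.get('content', {}).get('data', {}).get('method') == 'request_state':
        counters['request_state'] += 1
    return _original_handle_msg(self, msg)

Comm.send = counting_send
Comm.handle_msg = counting_handle_msg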

The comms are limited to 1000 msgs/sec by default in the Notebook. This might be the reason for this issue.

Good idea, but I don't see any warning (it could always end up in /dev/null though). Also, the tracking of the reply happens before the rate limiting (i.e. the limiting is between the frontend and the server, and I already lose the message between the kernel and the server).

The iopub rate limiting is at the server on messages going from the kernel to the browser. That seems to be where you are dropping messages? Can you up the rate limits to see if it solves the issue, just in case?

To up the limit to 10x its default: jupyter lab --LabApp.iopub_msg_rate_limit=10000

I see

  • 2558 request_state messages going from browser->server
  • 2558 request_state messages arriving at the kernel
  • 2558 update calls from the kernel
  • < 2558 update messages received at the server.

I intercept the messages at the server at:
https://github.com/jupyter/jupyter_server/blob/7a6cd8e5dece1a8084f6cf065188c6a004f0a38a/jupyter_server/services/kernels/handlers.py#L317

So I think this is before any rate-limiting is applied.

So I think either the zmq layer does not send the packets, or the server does not receive them.

kernel = get_ipython().kernel
kernel.iopub_socket.io_thread.socket.get_hwm()

This gives us 1000, meaning it will buffer up to 1000 messages before dropping any, as explained at http://api.zeromq.org/2-1:zmq-setsockopt (ZMQ_HWM: set high water mark).

Patching ipykernel, and removing that limit self.iopub_socket.set_hwm(0) will make voila get all the messages 🎉. But, this call needs to happen before binding the socket, so I don't think a notebook user can do it.

ah, nice!

By the way, here is a nice small program for spying on a specific kernel directly, in case you want to do something like count the iopub messages coming directly from the kernel zmq socket. Use it like python ./client.py <kernel-id>

from collections import Counter
from jupyter_client import BlockingKernelClient
from jupyter_core import paths
import os.path as path
import sys

# Locate the connection file for the kernel id given on the command line
connection_file = path.join(paths.jupyter_runtime_dir(), 'kernel-{}.json'.format(sys.argv[1]))
print(connection_file)

client = BlockingKernelClient()
client.load_connection_file(connection_file)
client.start_channels()
counter = Counter()

# Count iopub messages by type as they arrive from the kernel's zmq socket
while True:
    msg = client.get_iopub_msg(timeout=100)
    counter[msg['msg_type']] += 1
    print(counter)

Patching ipykernel, and removing that limit self.iopub_socket.set_hwm(0) will make voila get all the messages 🎉. But, this call needs to happen before binding the socket, so I don't think a notebook user can do it.

Where is the high water mark set? ZMQ apparently defaults to no limit, according to the docs you linked to.

Yeah, I'm puzzled by that as well. ipython/ipython#3304 says v3 sets it to 1000 by default; I am not sure what v4 does, digging into this now.

Patching ipykernel, and removing that limit self.iopub_socket.set_hwm(0) will make voila get all the messages 🎉. But, this call needs to happen before binding the socket, so I don't think a notebook user can do it.

Do we really want to patch this? What was the original reason for limiting to 1000msgs/sec?

If we change this here, I guess we also need to change the iopub_msg_rate_limit to the same value in JupyterLab, Jupyter Notebook and all other front-ends.

Do we really want to patch this? What was the original reason for limiting to 1000msgs/sec?

I think our first fix is just batching our update requests to try to be under this limit, which is a change to the widget manager.

I'm not saying we want to do that, but we want to understand the issue, and maybe have workarounds/configurations for people that need it.

I think our first fix is just batching our update requests to try to be under this limit, which is a change to the widget manager.

I think this is an interesting idea; so you're saying we should change the widget protocol?

No, just that we should do what I mentioned above in #534 (comment)

One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:

const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {

ZMQ_SNDHWM at http://api.zeromq.org/4-3:zmq-setsockopt says 1000, so indeed. Should we discuss this at ipykernel and open an issue there?

One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:

Ah, now I understand, that's a great idea! Or, have at most 500 unanswered requests, using a Semaphore-like mechanism.
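
A semaphore-based variant of the same idea, again sketched in Python with asyncio only to illustrate the mechanism (fetch_state is a hypothetical stand-in; the real change would be in the JavaScript manager), caps the number of unanswered requests without imposing strict batch boundaries:

import asyncio

async def fetch_state(comm_id):
    # Hypothetical stand-in for requesting one widget's state
    await asyncio.sleep(0)
    return {'comm_id': comm_id}

async def fetch_all(comm_ids, max_in_flight=500):
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded_fetch(comm_id):
        # At most max_in_flight requests are unanswered at any time
        async with sem:
            return await fetch_state(comm_id)

    return await asyncio.gather(*(bounded_fetch(c) for c in comm_ids))

print(len(asyncio.run(fetch_all(['comm-%d' % i for i in range(2558)]))))  # 2558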

According to the zmq page:

ØMQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 90% lower depending on the flow of messages on the socket. The socket may even be able to accept more messages than the ZMQ_SNDHWM threshold; a notable example is for sockets using TCP transport; see zmq_tcp(7).

So actually the message queue may be far smaller than 1000. Wow, sounds like we either need to be quite conservative, or raise the limit. Yes, let's start an issue on ipykernel and CC @minrk.

Ah, now I understand, that's a great idea! Or, have at most 500 unanswered requests, using a Semaphore-like mechanism.

I was thinking of just putting it in a loop and awaiting the resolutions in batches, perhaps batches of 100, given that the hwm limit may actually translate to something like 100 messages in certain cases?

Was this issue fixed? I am currently having an issue where a notebook that has multiple tabs, each with many widgets and some plotly plots, will only sometimes render when served with Voila.
If I run voila with --debug it gets hung up here:

[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
Any suggestions?

We're currently fixing this in #546. It was only fixed in the 0.1.18.x branch.

This is now fixed in voila 0.1.21, let us know if you still have issues!

This seems to have made a difference. So far, it has not failed to render.

Hi.

I am posting here to ask for advice.

I seem to be having widget refresh issues similar to the ones described in this ticket; however, the problems are strongly correlated with having started Voila with preheated kernels.

How would you diagnose it?

Using voila 0.3.1.

Hello @tilusnet, can you open another ticket and provide us with a minimal example to reproduce the issue?

Hi @trungleduc

Sadly it's not something consistently reproducible.
And as I said earlier, I don't know what to measure and how. I only notice symptoms of failed widget refresh.

What I can tell you about my setup:

  • the page is rather widget-complex, though with respect to refresh rates, all widgets (like progress bars) are limited to one refresh per 0.5 seconds. I am using tqdm for the progress bars.
  • I can confirm that preheating a page does not involve any progress bar refresh, since these are triggered by button clicks

FYI @trungleduc -
The widget refresh issues I mentioned earlier seem to be strongly linked to using ipywidgets' .capture() decorator.

Someone could test as per the documentation example here:
https://ipywidgets.readthedocs.io/en/stable/examples/Output%20Widget.html#Debugging-errors-in-callbacks-with-the-output-widget

One more factor that seems to contribute: activation of preheated kernels.
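
For anyone who wants to try reproducing that, here is a minimal, self-contained example of the .capture() pattern being referred to (names are illustrative; run it in a notebook or a Voila page):

import ipywidgets as widgets
from IPython.display import display

out = widgets.Output()
button = widgets.Button(description='Run')

@out.capture()
def on_button_clicked(b):
    # Output (and tracebacks) from this callback are routed to the Output widget
    print('Button clicked, doing some work...')

button.on_click(on_button_clicked)
display(widgets.VBox([button, out]))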

#1101 could be it; it certainly is a bug that can cause the page to fail to render.