bybit-exchange/pybit

Websocket: "cannot decode". Error validating utf8

radomir9720 opened this issue · 8 comments

My websocket connection is constantly being cut off.

I turned on debug pybit and websocket logs to see what's wrong.
Here is what i've got:

2024-06-15 16:44:16,150 - websocket - DEBUG - Sending ping
2024-06-15 16:44:36,151 - websocket - DEBUG - Sending ping
2024-06-15 16:44:56,152 - websocket - DEBUG - Sending ping
2024-06-15 16:44:56,725 - websocket - ERROR - error from callback <function _WebSocketManager._connect.<locals>.<lambda> at 0x7fc65f4a56c0>: cannot decode: b'{"success":true,"ret_msg":"pong","conn_id":"1a598878-132b-4e\xf17-8864-8cee95ea1b85","req_id":"","op":"ping"}'
2024-06-15 16:44:56,726 - websocket - INFO - tearing down on exception cannot decode: b'{"success":true,"ret_msg":"pong","conn_id":"1a598878-132b-4e\xf17-8864-8cee95ea1b85","req_id":"","op":"ping"}'
2024-06-15 16:44:56,726 - pybit._websocket_stream - DEBUG - WebSocket Unified V5 closed.

I took a look inside websocket package. The exception is thrown from websocket._abnf.py.continuous_frame.extract() method.

if (
    not self.fire_cont_frame
    and data[0] == ABNF.OPCODE_TEXT
    and not self.skip_utf8_validation
    and not validate_utf8(frame.data)
):
    raise WebSocketPayloadException(f"cannot decode: {repr(frame.data)}")

So, there could be several reasons why the exception is thrown:

Disclaimer: i know that if statement is executed until first True, but, since i'm not familiar with implementation details of the websocket package, i started with the part that i can figure out - utf8 validation:

image

As we can see, the validation fails because of the \x escape sequence.

What can we do with it?

  1. Somehow change the connection id, so that it will not contain special characters
  2. Disable utf8 validation. Especially knowing the fact that it will increase the performance. So, passing skip_utf8_validation=True to WebSocketApp.run_forever() method should solve the issue

P.S. I'm using the lates pybit version - 5.7.0

Forked and made changes described above.

Currently testing. If everything's gonna be ok - i will make a PR.

I don't think there should be an escape sequence in there, it's just a UUID and should be alphanumeric with hyphens. This makes me wonder if the conn_id sent by the websocket server is the problem. Can you consistently reproduce this issue?

You're right. The issue is not the conn_id. I did some investigations, and, i suppose, the problem is the unstable connection. Perhaps, once in a while bytes get lost, that results in a broken message.

Here are some examples:

  1. Instead of kline.1.FTMUSDT got kline.1.DTMUSDT. An exception was raised because there was no such topic in the topics dict.
2024-06-20 10:34:21,623 - websocket - DEBUG - ++Rcv raw: b'\x81~\x01\x16{"topic":"kline.1.DTMUSDT","data":[{"start":1718868840000,"end":1718868899999,"interval":"1","open":"0.5702","close":"0.57","high":"0.5704","low":"0.5695","volume":"139423","turnover":"79456.8583","confirm":false,"timestamp":1718868861523}],"ts":1718868861523,"type":"snapshot"}'
2024-06-20 10:34:21,624 - websocket - DEBUG - ++Rcv decoded: fin=1 opcode=1 data=b'{"topic":"kline.1.DTMUSDT","data":[{"start":1718868840000,"end":1718868899999,"interval":"1","open":"0.5702","close":"0.57","high":"0.5704","low":"0.5695","volume":"139423","turnover":"79456.8583","confirm":false,"timestamp":1718868861523}],"ts":1718868861523,"type":"snapshot"}'
2024-06-20 10:34:21,624 - websocket - ERROR - error from callback <function _WebSocketManager._connect.<locals>.<lambda> at 0x7f7c2c31b910>: 'kline.1.DTMUSDT'
  1. The entire end parameter is messed up. Also there is a missing quote at the end, that caused a JsonDecodeError.

image

Since we don't have options to check the integrity of the message(for example, through check sum), the only thing we can do - ignore broken messages. But. also, we don't have a universal way to check whether the message is broken or not(for cases like message with topic we didn't subscribe to). The only thing we can do - ignore the exceptions, and keep the connection open. This is what i did. I added an argument disconnect_on_exception. If it's set to False, the exception will only be logged, and the connection will be maintained.

I pushed all the changes to my fork, and gonna make a PR with details what i did, and why.

I'll do it tomorrow

I'm not sure how to reproduce this issue, do you think there's a good way to do it for testing purposes?

What do you want to reproduce? A broken message from websocket?
I don't think that there is a good way. I just run the socket stream, and getting broken messages(Therefore, a JSONDecodeError) 1-2 times a day in average. But i'm using my fork, so the connection is maintained.

What is the ultimate goal in reproducing the issue? You can see the logs, that i attached, all you'll get after reproducing the issue - exactly the same logs, and the same result - JSONDecodeError.

I've never had this issue myself and never seen it reported, so I'd like to be able to test the fix before committing to master.

I have used pybit to run tests for days at a time but have never seen this problem. So I just want to be able to reproduce it before approving the PR.

This PR, actually, is not a fix. It just provides a way to handle the error in _on_error callback, instead of automatically raising an exception and closing the connection. To do so, it's necessary to pass skip_utf8_validation=True. Also, the exceptions in _on_error can be ignored by passing disconnect_on_exception=False.

So changes are backward compatible, because by default the behaviour in the PR is completely the same as in current version.

If you want to test the behaviour when a broken message is received from websocket, you can, for example, override/mock/change the _on_message method, and raise any exception in it, because exactly message = json.loads(message) from _on_message method throws an exception when the broken message can't be decoded.