julien-duponchelle/python-mysql-replication

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte`

imysm opened this issue · 8 comments

imysm commented

MySQL:
mysql Ver 8.0.21 for Linux on x86_64 (MySQL Community Server - GPL)
Python:
Python 3.8.5
mysql-replication:
mysql-replication 0.22

Error:
for row in binlogevent.rows:
  File "/usr/local/anaconda3/lib/python3.8/site-packages/pymysqlreplication/row_event.py", line 433, in rows
    self._fetch_rows()
  File "/usr/local/anaconda3/lib/python3.8/site-packages/pymysqlreplication/row_event.py", line 428, in _fetch_rows
    self.__rows.append(self._fetch_one_row())
  File "/usr/local/anaconda3/lib/python3.8/site-packages/pymysqlreplication/row_event.py", line 517, in _fetch_one_row
    row["before_values"] = self._read_column_data(self.columns_present_bitmap)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/pymysqlreplication/row_event.py", line 132, in _read_column_data
    values[name] = self.__read_string(1, column)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/pymysqlreplication/row_event.py", line 224, in __read_string
    string = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is probably due to invalid characters existing in your database, and the character encoding set on your database/table isn't able to process them.
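The failing byte from the traceback can be reproduced outside the library: 0xff can never begin a valid UTF-8 sequence, so decoding arbitrary binary data as UTF-8 fails.

```python
# 0xff is not a valid first byte of any UTF-8 sequence, so decoding
# arbitrary binary data (e.g. a BLOB payload) as UTF-8 fails.
payload = b"\xff\x00some binary data"
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```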

The logic itself is actually pretty clever: it extracts the output encoding from the database table's settings. But it's frustrating that there's no way to manually set the decode format (at least none that I can see).

This is the line that's causing the (or similar) problems:

https://github.com/noplay/python-mysql-replication/blob/main/pymysqlreplication/row_event.py#L249

If there was a way to set a manual encoding value and default to the DB provided value I think it might be a more flexible solution.
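As a sketch of what such an override could look like, a fallback decoder that tries the table's charset first (this helper is hypothetical, not part of the library):

```python
def safe_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical fallback decoder: try the configured charset first,
    then fall back to latin-1, which maps every byte to a character."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(safe_decode(b"plain text"))  # plain text
print(safe_decode(b"\xffabc"))     # ÿabc
```

Falling back to latin-1 never raises, but it may produce mojibake; returning the raw bytes unchanged would be another reasonable policy.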

I was having the same error and was able to work around it by changing the default encoding of the MySQL server (--character-set-server=utf8 --collation-server=utf8_unicode_ci).
I read that the default changed in version 8, so that's worth checking.

The issue happens for me with BLOB columns. BLOBs store arbitrary binary data and shouldn't be decoded; removing the decoding for BLOBs fixes the issue.

diff --git a/pymysqlreplication/row_event.py b/pymysqlreplication/row_event.py
index b371fce..d88e261 100644
--- a/pymysqlreplication/row_event.py
+++ b/pymysqlreplication/row_event.py
@@ -220,7 +220,7 @@ class RowsEvent(BinLogEvent):
         elif column.type == FIELD_TYPE.NEWDECIMAL:
             return self.__read_new_decimal(column)
         elif column.type == FIELD_TYPE.BLOB:
-            return self.__read_string(column.length_size, column)
+            return self.packet.read_length_coded_pascal_string(column.length_size)
         elif column.type == FIELD_TYPE.DATETIME:
             ret = self.__read_datetime()
             if ret is None:
@@ -1114,8 +1114,7 @@ class TableMapEvent(BinLogEvent):
         if column_type in [
             FIELD_TYPE.STRING,
             FIELD_TYPE.VAR_STRING,
-            FIELD_TYPE.VARCHAR,
-            FIELD_TYPE.BLOB,
+            FIELD_TYPE.VARCHAR
         ]:
             return True
         if column_type == FIELD_TYPE.GEOMETRY and dbms == "mariadb":

@YAtOff

diff --git a/pymysqlreplication/row_event.py b/pymysqlreplication/row_event.py
index b371fce..d88e261 100644
--- a/pymysqlreplication/row_event.py
+++ b/pymysqlreplication/row_event.py
@@ -220,7 +220,7 @@ class RowsEvent(BinLogEvent):
         elif column.type == FIELD_TYPE.NEWDECIMAL:
             return self.__read_new_decimal(column)
         elif column.type == FIELD_TYPE.BLOB:
-            return self.__read_string(column.length_size, column)
+            return self.packet.read_length_coded_pascal_string(column.length_size)

Can you give me some data to reproduce this?
If I can reproduce the problem you're having, I think I can fix it as you suggested.

@@ -1114,8 +1114,7 @@ class TableMapEvent(BinLogEvent):
         if column_type in [
             FIELD_TYPE.STRING,
             FIELD_TYPE.VAR_STRING,
-            FIELD_TYPE.VARCHAR,
-            FIELD_TYPE.BLOB,
+            FIELD_TYPE.VARCHAR
         ]:

I think this one is not related to this issue.
This part is used to extract the optional metadata.
This logic exists because the collation_id of the BLOB type is read as 63 (binary) from the optional metadata in the binlog.
We define a character column as one that has a collation.

@sean-k1
I've created a failing test here: https://github.com/YAtOff/python-mysql-replication/blob/7ebd0ee01764bf35d3f42fd6d847c7cba81f1ea1/pymysqlreplication/tests/test_data_type.py#L569
The issue happens when the table has more fields.
I think the issue is in TableMapEvent#_read_default_charset.
For this test case, it shows that the blob (63) is in position 3, but it is actually in position 4.

So the suggestion for blobs I gave above needs to be corrected: there is no issue when the collation is binary, but the charset_collation_list must be correct.
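The position mismatch can be illustrated with a toy version of the mapping (the helper and column layout below are a sketch, not the library's actual code): the binlog's optional metadata lists collations only for "character" columns, so classifying BLOB wrongly shifts every later position.

```python
# Toy model: which table positions does the collation list apply to?
# Suppose a table (INT, VARCHAR, BLOB, VARCHAR) -- assumption for
# illustration; the binary collation id for BLOB is 63.
def charset_positions(column_types, is_character):
    """Return the column positions that consume a collation entry."""
    return [i for i, t in enumerate(column_types) if is_character(t)]

types = ["INT", "VARCHAR", "BLOB", "VARCHAR"]

# Treating BLOB as a character column: collations map to columns 1, 2, 3.
with_blob = charset_positions(types, lambda t: t in ("VARCHAR", "BLOB"))

# Excluding BLOB: the list skips column 2, shifting the mapping.
without_blob = charset_positions(types, lambda t: t == "VARCHAR")

print(with_blob)     # [1, 2, 3]
print(without_blob)  # [1, 3]
```

If the reader and the writer of the metadata disagree on which columns are "character" columns, a collation id ends up attached to the wrong column, which matches the off-by-one seen in the failing test.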

@YAtOff
Thank you for your interest in the project. 🤗

  1. Could you please provide the value of this system variable in your MySQL server?
show variables like 'binlog_row_metadata';

It is recommended to set binlog_row_metadata=FULL.

  2. Could you state the difference between the test_long_blob_arbitrary_bytes test you created and the existing test_blob?

@YAtOff

I tried your test case and got the error below.
I made a PR; can you check whether this PR still produces the error?


self = <pymysql.connections.Connection object at 0x7fe5b0d60760>, command = 3
sql = ('INSERT INTO test (t1, t2, t3, payload) VALUES(%s, %s, %s,  %s)', ('text', 'text', 'text', b'\xb9no\xe1k\xdal\xa7N!$\...d,ZA\xb3\xf8m\xaf\x06\xf8\'\x08\xb7\xf2\x84dj\xc5+`\xba>\xc7bq\x8b\xaaUE\xc5\xfc\xd2O\xd6\xd4\xdf\xaf\xb8\x82`\xd5V('))

    def _execute_command(self, command, sql):
        """
        :raise InterfaceError: If the connection is closed.
        :raise ValueError: If no username was specified.
        """
        if not self._sock:
            raise err.InterfaceError(0, "")
    
        # If the last query was unbuffered, make sure it finishes before
        # sending new commands
        if self._result is not None:
            if self._result.unbuffered_active:
                warnings.warn("Previous unbuffered result was left incomplete")
                self._result._finish_unbuffered_query()
            while self._result.has_next:
                self.next_result()
            self._result = None
    
        if isinstance(sql, str):
            sql = sql.encode(self.encoding)
    
        packet_size = min(MAX_PACKET_LEN, len(sql) + 1)  # +1 is for command
    
        # tiny optimization: build first packet manually instead of
        # calling self..write_packet()
        prelude = struct.pack("<iB", packet_size, command)
>       packet = prelude + sql[: packet_size - 1]
E       TypeError: can't concat tuple to bytes
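This TypeError is separate from the decoding bug: a (query, params) tuple was passed where pymysql expected the SQL string itself, so slicing it yields a tuple and the bytes concatenation in _execute_command fails. A minimal reproduction (values hypothetical):

```python
import struct

# A (query, params) tuple passed where SQL bytes/str were expected --
# mirrors what happens inside pymysql's _execute_command.
sql = ("INSERT INTO test (payload) VALUES (%s)", (b"\xb9no",))
packet_size = 24
prelude = struct.pack("<iB", packet_size, 3)  # COM_QUERY = 3
try:
    packet = prelude + sql[: packet_size - 1]  # tuple slice, not bytes
except TypeError as exc:
    print(exc)  # can't concat tuple to bytes
```

The fix on the caller's side is to pass the query and its parameters as two separate arguments to cursor.execute.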

I've found the root of the issue. I've created a PR #582