julien-duponchelle/python-mysql-replication

Bug - BinLog RowEvent expect column_schema to be dict but tuple is returned in some cases

Mylleranton opened this issue · 0 comments

When testing extracting data from a MariaDB database (v10.5) in Meltano (which uses pipelinewise-tap-mysql that in turn uses this package), I came across this error (using the Meltano preferred cursor pymysql.cursors.SSCursor):

bin/tap-mysql, line 10, in <module> sys.exit(main()) 
lib/python3.9/site-packages/tap_mysql/__init__.py, line 443, in main 
    raise exc 
lib/python3.9/site-packages/tap_mysql/__init__.py, line 440, in main
    main_impl() 
lib/python3.9/site-packages/tap_mysql/__init__.py, line 429, in main_impl
    do_sync(mysql_conn, args.config, args.catalog, state)
lib/python3.9/site-packages/tap_mysql/__init__.py, line 385, in do_sync
    sync_binlog_streams(mysql_conn, binlog_catalog, config, state)
lib/python3.9/site-packages/tap_mysql/__init__.py, line 368, in sync_binlog_streams
    binlog.sync_binlog_stream(mysql_conn, config, binlog_streams_map, state)
lib/python3.9/site-packages/tap_mysql/sync_strategies/binlog.py, line 882, in sync_binlog_stream
    _run_binlog_sync(mysql_conn, reader, binlog_streams_map, state, config, end_log_file, end_log_pos)
lib/python3.9/site-packages/tap_mysql/sync_strategies/binlog.py, line 607, in _run_binlog_sync
    for binlog_event in reader:
lib/python3.9/site-packages/pymysqlreplication/binlogstream.py, line 496, in fetchone
    binlog_event = BinLogPacketWrapper(pkt, self.table_map,
lib/python3.9/site-packages/pymysqlreplication/packet.py, line 136, in __init__
    self.event = event_class(self, event_size_without_header, table_map,
lib/python3.9/site-packages/pymysqlreplication/row_event.py, line 628, in __init__
    if i != (column_schema['ORDINAL_POSITION'] - 1):
    TypeError: tuple indices must be integers or slices, not str

After debugging a bit, I think that the error originates in pymysqlreplication/row_event.py in this piece of logic:

if self.table_id in table_map:
    self.column_schemas = table_map[self.table_id].column_schemas
else:
    self.column_schemas = self._ctl_connection._get_table_information(self.schema, self.table)

From what I gather, if the row event touches a table that we know the schema for (if-logic is true), we use the column schema stored in table_map, which is a dict (originally from pymysqlreplication/table.py):

self.__dict__.update({
    "column_schemas": column_schemas,
    "table_id": table_id,
    "schema": schema,
    "table": table,
    "columns": columns,
    "primary_key": primary_key
})

but, if we do not have any table data on hand (else-logic of block executes), we run into this logic in pymysqlreplication/binlogstream.py:BinLogStreamReader:_get_table_information(...) that fetches that information from the information_schema:

...
cur.execute("""
    SELECT
        COLUMN_NAME, COLLATION_NAME, CHARACTER_SET_NAME,
        COLUMN_COMMENT, COLUMN_TYPE, COLUMN_KEY, ORDINAL_POSITION
    FROM
        information_schema.columns
    WHERE
        table_schema = %s AND table_name = %s
    ORDER BY ORDINAL_POSITION
    """, (schema, table))

return cur.fetchall()
...

but this final call is where things supposedly go south: cur.fetchall() returns a tuple containing only the values from the query - and not the headers & values in a dictlike fashion. What's weird is that the implementation signatures for fetchall() seem to differ between cursor-implementations: Cursor:fetchall() seems to return a dict as expected, whereas SSCursor:fetchall() seems to return the tuple.

Not sure if this inconsistency is something to check for and deal with in this package or in the pymysql-one, raising it here to bring it up for discussion. Given that the different cursors have different purposes (SSCursor to be very performant with large datasets, DictCursor to return dicts etc.), it might make sense to be more explicit in this package on what's expected to be returned - and cast items to the correct type where unexpected formats are found.