julien-duponchelle/python-mysql-replication

Bug: Silently dropping events without errors/logging

Closed this issue · 4 comments

I found a bug in the client: because of how rotate events are handled, it drops events from relevant tables and schemas. There is also no error or log message indicating that an event from a relevant table is being dropped.

Consider the following typical sequence of binlog events:

  1. TableMapEvent table_id: 1, table_name: test_table, schema_name: test_schema
  2. InsertRowEvent table_id:1, .....
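For context, a stream reader filtered to this table would be expected to yield that insert. Below is a minimal consumer sketch; the connection settings, server_id, and filter values are illustrative and not taken from an actual setup:

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import WriteRowsEvent

    # Illustrative connection settings; adjust to your environment.
    MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": ""}

    stream = BinLogStreamReader(
        connection_settings=MYSQL_SETTINGS,
        server_id=100,                  # must be unique among replicas
        only_schemas=["test_schema"],   # schema of interest from the sequence above
        only_tables=["test_table"],     # table of interest from the sequence above
        only_events=[WriteRowsEvent],
        blocking=True,
        resume_stream=True,
    )

    for event in stream:
        # With the bug described below, the insert that follows a reconnect
        # never reaches this loop, and nothing is logged about it.
        event.dump()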

The erroneous library flow:

  1. BinLogStreamReader correctly parses the table map event and updates the table_map entry for the table.
  2. When fetching the next event in fetchone(), a pymysql.OperationalError is raised with one of the expected error codes:
    def fetchone(self):
        while True:
            if self.end_log_pos and self.is_past_end_log_pos:
                return None
            if not self.__connected_stream:
                self.__connect_to_stream()
            if not self.__connected_ctl:
                self.__connect_to_ctl()
            try:
                if pymysql.__version__ < LooseVersion("0.6"):
                    pkt = self._stream_connection.read_packet()
                else:
                    pkt = self._stream_connection._read_packet()
            except pymysql.OperationalError as error:
                code, message = error.args
                if code in MYSQL_EXPECTED_ERROR_CODES:
                    self._stream_connection.close()
                    self.__connected_stream = False
                    continue
                raise
  3. Per the exception handling, the flow continues in the while loop, and on the next iteration a new connection is established.
  4. BinLogStreamReader gets a Fake Rotate Event after establishing the connection: https://mariadb.com/kb/en/fake-rotate_event/#:~:text=When%20a%20slave%20server%20connects,file%20name%20of%20the%20master.
  5. The table_map is invalidated on every Rotate Event, which causes us to forget the Table Map entry we parsed before the connection issue:
    if binlog_event.event_type == ROTATE_EVENT:
        self.log_pos = binlog_event.event.position
        self.log_file = binlog_event.event.next_binlog
        # Table Id in binlog are NOT persistent in MySQL - they are in-memory identifiers
        # that means that when MySQL master restarts, it will reuse same table id for different tables
        # which will cause errors for us since our in-memory map will try to decode row data with
        # wrong table schema.
        # The fix is to rely on the fact that MySQL will also rotate to a new binlog file every time it
        # restarts. That means every rotation we see *could* be a sign of restart and so potentially
        # invalidates all our cached table id to schema mappings. This means we have to load them all
        # again for each logfile which is potentially wasted effort but we can't really do much better
        # without being broken in restart case
        self.table_map = {}
  6. Next we get the InsertRowEvent, but its table id is no longer in the table_map. This causes a KeyError while constructing the event object, and the event is dropped silently despite being of interest. In the __init__ method of the parent class RowsEvent, the KeyError from the table_map lookup is caught and the event is marked as not to be processed (self._processed = False):
    try:
        self.primary_key = table_map[self.table_id].data["primary_key"]
        self.schema = self.table_map[self.table_id].schema
        self.table = self.table_map[self.table_id].table
    except KeyError:  # If we have filter the corresponding TableMap Event
        self._processed = False
        return
  7. One level higher, BinLogPacketWrapper (which invokes the event object construction) sets the event to None:
    self.event = event_class(
        self,
        event_size_without_header,
        table_map,
        ctl_connection,
        mysql_version=mysql_version,
        only_tables=only_tables,
        ignored_tables=ignored_tables,
        only_schemas=only_schemas,
        ignored_schemas=ignored_schemas,
        freeze_schema=freeze_schema,
        ignore_decode_errors=ignore_decode_errors,
        verify_checksum=verify_checksum,
        optional_meta_data=optional_meta_data,
    )
    if not self.event._processed:
        self.event = None
  8. Back in BinLogStreamReader, the event is silently ignored because the reader has no information about it other than the next log position: it advances the log position, skips the event, and moves on.
    elif binlog_event.log_pos:
        self.log_pos = binlog_event.log_pos

    # event is none if we have filter it on packet level
    # we filter also not allowed events
    if binlog_event.event is None or (
        binlog_event.event.__class__ not in self.__allowed_events
    ):
        continue

Potential fix strategies:

  • We could retain the old log position when the last event parsed was a table map event, and use that as the starting point for the new connection, since that context is necessary.
  • We could be smarter about resetting the table_map; detecting whether an event is a real rotate or a fake rotate looks straightforward based on the docs (see the sketch after this list).
  • I may be missing something here, but why not simply avoid resetting the table_map? I don't completely understand the reason for resetting it. We could store the table and schema names in the table_map and use them while processing the TableMapEvent. If the DB remaps tables to different ids after a restart, we would receive new TableMapEvents for them. Shouldn't those events be enough to fix the table_map before the row-based events arrive? Why do we need to reset the table_map at all?
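As a rough illustration of the second strategy, here is a sketch (not the library's current behaviour) that only clears the table_map on real rotates. It assumes a fake rotate can be recognised by its zeroed header fields, since per the docs the server generates the fake event on connect with timestamp 0 and log_pos 0:

    # Sketch only: would replace the ROTATE_EVENT branch in
    # BinLogStreamReader.fetchone(); assumes `import logging` at module level.
    if binlog_event.event_type == ROTATE_EVENT:
        self.log_pos = binlog_event.event.position
        self.log_file = binlog_event.event.next_binlog

        # Assumption: fake rotates carry timestamp == 0 and log_pos == 0
        # in the event header, real rotates do not.
        is_fake_rotate = binlog_event.timestamp == 0 and binlog_event.log_pos == 0
        if is_fake_rotate:
            # Fake rotate sent on (re)connect: the server did not restart,
            # so the cached table id -> schema mappings are still valid.
            logging.info("Fake rotate to %s, keeping table_map", self.log_file)
        else:
            # Real rotate: the master may have restarted and reused table ids,
            # so drop the cached mappings as before.
            self.table_map = {}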

Some other points to note:

  • It would be nice to add some logging to the library so that this behaviour can be easily detected by end users.
  • Establishing connections in an infinite loop without any backoff doesn't seem right either; it might worsen the situation if the failures are caused by load on the database (a backoff sketch follows below).
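As a sketch of the backoff point, the reconnect path could wait with a capped exponential delay instead of retrying immediately. The helper below is illustrative only; connect_to_stream() is a hypothetical stand-in for the library's private __connect_to_stream(), and the delay values are arbitrary:

    import logging
    import random
    import time

    def backoff_delays(base=0.5, cap=30.0):
        """Yield capped exponential backoff delays with a little jitter."""
        attempt = 0
        while True:
            delay = min(base * (2 ** attempt), cap)
            yield delay + random.uniform(0, delay / 10)
            attempt += 1

    # Usage sketch: retry the reconnect with increasing delays rather than
    # spinning in a tight loop against an overloaded server.
    for delay in backoff_delays():
        try:
            connect_to_stream()  # hypothetical stand-in for self.__connect_to_stream()
            break
        except OSError as exc:
            logging.warning("Reconnect failed (%s), retrying in %.1fs", exc, delay)
            time.sleep(delay)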

@megatron10 what is your MySQL version? If your MySQL version is 8.0, check this one — we always update the table_map when the TableMapEvent before an InsertRowEvent occurs.

The issue is not about how we extract information from the TableMapEvent. The table_map does get updated, but when the connection is dropped after a TableMapEvent, the client re-establishes the connection and receives a (fake) RotateEvent. The RotateEvent handling then clears the table_map.

https://github.com/23-OSSCA-python-mysql-replication/python-mysql-replication/blob/9c781b5cedc70ee62dd24c18573746079ec43792/pymysqlreplication/binlogstream.py#L575-L635

After this the client gets the actual row-based replication event (WriteRows/UpdateRows/DeleteRows); this is where the lookup in the table_map fails with a KeyError and the event is marked as not to be processed.

@megatron10
I understand this one now!
Thanks for writing such a detailed issue!

@megatron10
I added logging based on your suggestion.
Finding the table_map entry by name when a fake rotate event occurs is something I need to think about a bit more.
Let me know if there's anything else you need.