Bug: Silently dropping events without errors/logging
Closed this issue · 4 comments
I found a bug in the client. This causes it to drop events from relevant tables and schemas due to handling of rotate events. There is no error logging indicating that an event from a relevant table is being dropped either.
Consider the following usual sequence of binlog file events:
- TableMapEvent table_id: 1, table_name: test_table, schema_name: test_schema
- InsertRowEvent table_id:1, .....
The erroneous library flow:
- BinLogStreamReader correctly parses the table map event, updates the table map entry for the table.
- When fetching the next event in fetchone(), pymysql.OperationalError happens with an expected error code:
python-mysql-replication/pymysqlreplication/binlogstream.py
Lines 573 to 595 in a301596
- As per the exception handling, the flow continues in the while loop. In the next iteration we establish a new connection.
- BinLogStreamReader gets a Fake Rotate Event after establishing the connection: https://mariadb.com/kb/en/fake-rotate_event/#:~:text=When%20a%20slave%20server%20connects,file%20name%20of%20the%20master.
- The
table_map
gets invalidated on every Rotate Event, this causes us to forget the Table Map entry we parsed prior to the connection issue.python-mysql-replication/pymysqlreplication/binlogstream.py
Lines 621 to 633 in a301596
- Next we get the InsertRowEvent, but we don't have the table id available in the
table_map
, this causes a KeyError while constructing the Event object, and the event gets dropped silently, despite being of interest. In the init method of the parent classRowsEvent
, we encounter a KeyError while looking up the table_map, the implementation here marks this event as not to be processed:self._processed = False
.
python-mysql-replication/pymysqlreplication/row_event.py
Lines 32 to 38 in a301596
- One level higher, the BinLogPacketWrapper (this invokes the event object construction) sets the event object to None:
python-mysql-replication/pymysqlreplication/packet.py
Lines 149 to 165 in a301596
- Back to the BinLogStreamReader, it silently ignores the event because it has no information about it, except the next log position. It moves ahead the log position, and then ignores the event and moves forward.
python-mysql-replication/pymysqlreplication/binlogstream.py
Lines 634 to 635 in a301596
python-mysql-replication/pymysqlreplication/binlogstream.py
Lines 679 to 684 in a301596
Potential fix strategies:
- We could retain the old log position in-case the event parsed was a table map event, and use that as the start point for the new connection, as the context is necessary.
- We could be smarter about resetting the
table_map
, detecting whether the event is an actual rotate or fake rotate seems easy based on the docs. - I maybe missing something here, but just don't reset the
table_map
. I don't completely understand the reason for resetting the table map. We could store the table and schema name in the table_map, and use that while processing the TableMapEvent. After DB restart if the DB remaps the tables to different ids, we would get TableMapEvents for it. Shouldn't these events be enough to fix thetable_map
prior to the Row based events? Why do we need to actually reset thetable_map
?
Some other points to note:
- It'd be nice to add some logging to the library so such behaviour can be easily detected by the end users.
- Establishing connections in an infinite loop without any backoff doesn't seem right, might worsen the situation if failures are happening due to load on the database.
@megatron10 what is your mysql Version? If your Mysql version is 8.0,
check this one
we always update table_map when Before InsertRowEvent occured
The issue is not based on how we extract information from the TableMapEvent. The table_map
does get updated, but when the connection is dropped after a TableMapEvent, the client re-establishes the connection and gets a (Fake) RotateEvent. The RotateEvent handling clears the table map.
After this the client gets the actual Row based replication event (WriteRows/UpdateRows/DeleteRows) event, this is where the lookup in the table_map
fails with a KeyError, and we mark the event as not to be processed.
@megatron10
understand this one!
thanks for writing detailed issue problem!
@megatron10
I added logging through your good words.
Finding the table_map by name when a fake rotate event occurs is something I need to think about a bit more.
Let me know if you think there's anything else you need