Problem of resending big tx

Question

Problem of resending big tx

c0494133d4 opened this issue 3 years ago · 4 comments

# (dtle) src发送第一个（index=0）big tx分割
2022-06-29T15:57:11.758+0800 [DEBUG] client.driver_mgr.dtle: splitting big tx: driver=dtle @module=reader index=0 job=big_tx-migration timestamp=2022-06-29T15:57:11.758+0800
2022-06-29T15:57:11.758+0800 [DEBUG] client.driver_mgr.dtle: sendEntry: driver=dtle @module=reader events=223697 gno=10 isBig=true job=big_tx-migration timestamp=2022-06-29T15:57:11.758+0800

# dst 最后一次执行dml
2022-06-29T15:58:28.505+0800 [DEBUG] client.driver_mgr.dtle: a dml event: driver=dtle @module=dtle.applier.ApplyBinlogEvent job=big_tx-migration timestamp=2022-06-29T15:58:28.504+0800

# src收到dst执行完上述分割后的bigtx_ack
2022-06-29T15:58:28.505+0800 [DEBUG] client.driver_mgr.dtle: bigtx_ack: driver=dtle gno=10 index=0 job=big_tx-migration @module=dtle.extractor timestamp=2022-06-29T15:58:28.505+0800

# src读取一些RowsEvent后，发生同文件rotate（可能是断线重连）
2022-06-29T15:58:28.516+0800 [DEBUG] client.driver_mgr.dtle: got rowsEvent: driver=dtle flags=1 gno=10 schema=action_db_1 table=big_tx @module=reader job=big_tx-migration tableFlags=1 timestamp=2022-06-29T15:58:28.516+0800
2022-06-29T15:58:29.981+0800 [DEBUG] client.driver_mgr.dtle: got rowsEvent: driver=dtle flags=1 job=big_tx-migration schema=action_db_1 tableFlags=1 @module=reader gno=10 table=big_tx timestamp=2022-06-29T15:58:29.980+0800
2022-06-29T15:58:30.614+0800 [INFO]  client.driver_mgr.dtle: Rotate to next binlog: driver=dtle @module=reader job=big_tx-migration name=mysql-bin.000002 timestamp=2022-06-29T15:58:30.614+0800

# 这个断线重连似乎是必现的。

# src：重连后会重新从头读取big tx （query=BEGIN）
2022-06-29T15:58:30.615+0800 [DEBUG] client.driver_mgr.dtle: query event: driver=dtle job=big_tx-migration query=BEGIN schema=action_db_1 @module=reader timestamp=2022-06-29T15:58:30.615+0800

# src再次发送index=0分割
2022-06-29T15:59:10.401+0800 [DEBUG] client.driver_mgr.dtle: splitting big tx: driver=dtle @module=reader index=0 job=big_tx-migration timestamp=2022-06-29T15:59:10.401+0800
2022-06-29T15:59:10.401+0800 [DEBUG] client.driver_mgr.dtle: sendEntry: driver=dtle @module=reader events=223697 gno=10 isBig=true job=big_tx-migration timestamp=2022-06-29T15:59:10.401+0800

# dst收到该BinlogEntry，但后续没有处理, 可能是卡在WaitForAllCommitted里了
2022-06-29T15:59:10.968+0800 [DEBUG] client.driver_mgr.dtle: incr. recv a msg.: driver=dtle @module=dtle.applier job=big_tx-migration timestamp=2022-06-29T15:59:10.968+0800
2022-06-29T15:59:11.656+0800 [DEBUG] client.driver_mgr.dtle: a binlogEntry.: driver=dtle gno=10 job=big_tx-migration remaining=0 seq=10 lc=9 @module=dtle.applier timestamp=2022-06-29T15:59:11.655+0800
2022-06-29T15:59:11.660+0800 [DEBUG] client.driver_mgr.dtle: WaitForAllCommitted: driver=dtle lc=9 le=10 timestamp=2022-06-29T15:59:11.660+0800

Answer 1 · 2022-06-29T11:01:30.000Z

现在的大事务处理方案：#779

重点：

源端按行分割发送大事务，并等待目标端ack
目标端执行完分割后，ack源端
源端收到ack后，处理下一个分割
目标端只有执行最后一个分割后，才会commit

需要改进：目标端接收到重发的分割后

合理处理WaitForAllCommitted
rollback当前未commit的大事务, 重新执行

Answer 2 · 2022-06-29T11:02:04.000Z

问题2: slave thread断连的原因是什么？

MySQL send_packet() 后，client必须在net_write_timeout内全部接收完毕
send_packet()的时机和数据量，待研究
原有的在超时前读取一个Event的办法无效。现添加新job参数 SlaveNetWriteTimeout调整复制线程的超时时间
- MySQL默认值60，dtle该参数默认值28800（8小时）

Answer 3 · 2022-07-04T09:32:27.000Z

Case 3

dtle对同时执行的大事务有限制（配置项 big_tx_max_jobs）.
处于大事务状态中的job，被删除时，遗漏了计数器-1的操作。导致后续job处理缓慢。
fixed in 97644c7

Answer 4 · 2022-09-13T09:55:36.000Z

4

		// src is resending an earlier BinlogEntry
		if a.dbs[0].Tx != nil {
			_ = a.dbs[0].Tx.Rollback()
		}

rollback后未将Tx置空, 可能引起问题TODO