actiontech/dtle

Problem of resending big tx

c0494133d4 opened this issue · 4 comments

# (dtle) src发送第一个(index=0)big tx分割
2022-06-29T15:57:11.758+0800 [DEBUG] client.driver_mgr.dtle: splitting big tx: driver=dtle @module=reader index=0 job=big_tx-migration timestamp=2022-06-29T15:57:11.758+0800
2022-06-29T15:57:11.758+0800 [DEBUG] client.driver_mgr.dtle: sendEntry: driver=dtle @module=reader events=223697 gno=10 isBig=true job=big_tx-migration timestamp=2022-06-29T15:57:11.758+0800

# dst 最后一次执行dml
2022-06-29T15:58:28.505+0800 [DEBUG] client.driver_mgr.dtle: a dml event: driver=dtle @module=dtle.applier.ApplyBinlogEvent job=big_tx-migration timestamp=2022-06-29T15:58:28.504+0800

# src收到dst执行完上述分割后的bigtx_ack
2022-06-29T15:58:28.505+0800 [DEBUG] client.driver_mgr.dtle: bigtx_ack: driver=dtle gno=10 index=0 job=big_tx-migration @module=dtle.extractor timestamp=2022-06-29T15:58:28.505+0800

# src读取一些RowsEvent后,发生同文件rotate(可能是断线重连)
2022-06-29T15:58:28.516+0800 [DEBUG] client.driver_mgr.dtle: got rowsEvent: driver=dtle flags=1 gno=10 schema=action_db_1 table=big_tx @module=reader job=big_tx-migration tableFlags=1 timestamp=2022-06-29T15:58:28.516+0800
2022-06-29T15:58:29.981+0800 [DEBUG] client.driver_mgr.dtle: got rowsEvent: driver=dtle flags=1 job=big_tx-migration schema=action_db_1 tableFlags=1 @module=reader gno=10 table=big_tx timestamp=2022-06-29T15:58:29.980+0800
2022-06-29T15:58:30.614+0800 [INFO]  client.driver_mgr.dtle: Rotate to next binlog: driver=dtle @module=reader job=big_tx-migration name=mysql-bin.000002 timestamp=2022-06-29T15:58:30.614+0800

# 这个断线重连似乎是必现的。

# src:重连后会重新从头读取big tx (query=BEGIN)
2022-06-29T15:58:30.615+0800 [DEBUG] client.driver_mgr.dtle: query event: driver=dtle job=big_tx-migration query=BEGIN schema=action_db_1 @module=reader timestamp=2022-06-29T15:58:30.615+0800

# src再次发送index=0分割
2022-06-29T15:59:10.401+0800 [DEBUG] client.driver_mgr.dtle: splitting big tx: driver=dtle @module=reader index=0 job=big_tx-migration timestamp=2022-06-29T15:59:10.401+0800
2022-06-29T15:59:10.401+0800 [DEBUG] client.driver_mgr.dtle: sendEntry: driver=dtle @module=reader events=223697 gno=10 isBig=true job=big_tx-migration timestamp=2022-06-29T15:59:10.401+0800

# dst收到该BinlogEntry,但后续没有处理, 可能是卡在WaitForAllCommitted里了
2022-06-29T15:59:10.968+0800 [DEBUG] client.driver_mgr.dtle: incr. recv a msg.: driver=dtle @module=dtle.applier job=big_tx-migration timestamp=2022-06-29T15:59:10.968+0800
2022-06-29T15:59:11.656+0800 [DEBUG] client.driver_mgr.dtle: a binlogEntry.: driver=dtle gno=10 job=big_tx-migration remaining=0 seq=10 lc=9 @module=dtle.applier timestamp=2022-06-29T15:59:11.655+0800
2022-06-29T15:59:11.660+0800 [DEBUG] client.driver_mgr.dtle: WaitForAllCommitted: driver=dtle lc=9 le=10 timestamp=2022-06-29T15:59:11.660+0800

现在的大事务处理方案:#779

重点:

  1. 源端按行分割发送大事务,并等待目标端ack
  2. 目标端执行完分割后,ack源端
  3. 源端收到ack后,处理下一个分割
  4. 目标端只有执行最后一个分割后,才会commit

需要改进:目标端接收到重发的分割后

  • 合理处理WaitForAllCommitted
  • rollback当前未commit的大事务, 重新执行

问题2: slave thread断连的原因是什么?

  • MySQL send_packet() 后,client必须在net_write_timeout内全部接收完毕
  • send_packet()的时机和数据量,待研究
  • 原有的在超时前读取一个Event的办法无效。现添加新job参数 SlaveNetWriteTimeout调整复制线程的超时时间
    • MySQL默认值60,dtle该参数默认值28800(8小时)

Case 3

  • dtle对同时执行的大事务有限制(配置项 big_tx_max_jobs).
  • 处于大事务状态中的job,被删除时,遗漏了计数器-1的操作。导致后续job处理缓慢。
  • fixed in 97644c7

4

		// src is resending an earlier BinlogEntry
		if a.dbs[0].Tx != nil {
			_ = a.dbs[0].Tx.Rollback()
		}

rollback后未将Tx置空, 可能引起问题TODO