apache/incubator-horaedb-meta

Shard version mismatch

ZuLiangWang opened this issue · 1 comment

Describe this problem
After the cluster has been running for a long time, the ShardVersion held by some CeresDB nodes no longer matches the ShardVersion recorded in CeresMeta.

Steps to reproduce
No reliable reproduction is known. The mismatch has only been observed after the cluster has been running for a long time.

Expected behavior
The shard versions held by CeresDB and CeresMeta are consistent.

Additional Information

2023-07-19T14:46:32.114+0800    error   procedure/manager_impl.go:161   procedure start failed  {"clusterName": "defaultCluster", "error": "send eventPrepare: dispatch create table on shard: create table on shard: (#500)event dispatch failed, cause:create table on shard, addr:11.39.8.171:8831, request:{{{46 1 1111} 1110} {109314 MMM_2198193666_INFLUENCE_PRE_SANDBOX_OUTPUT_TABLE 0 public {<nil>}} [0 10 10 10 4 116 115 105 100 16 5 32 1 10 12 10 6 112 101 114 105 111 100 16 1 32 2 10 23 10 13 103 114 111 117 112 98 121 73 110 100 101 120 48 16 4 24 1 32 3 40 1 10 17 10 9 108 111 103 83 97 109 112 108 101 16 4 24 1 32 4 10 17 10 7 95 114 101 115 117 108 116 16 4 24 1 32 5 40 1 10 16 10 6 115 101 114 118 101 114 16 4 24 1 32 6 40 1 10 13 10 3 105 100 99 16 4 24 1 32 7 40 1 10 13 10 3 108 100 99 16 4 24 1 32 8 40 1 10 21 10 11 97 112 112 108 105 99 97 116 105 111 110 16 4 24 1 32 9 40 1 16 1 24 2 34 2 1 2] Analytic false map[enable_ttl:true ttl:3d update_mode:APPEND write_buffer_size:33554432]}, err:fail to create table on shard in cluster, req:CreateTableOnShardRequest { update_shard_info: Some(UpdateShardInfo { curr_shard_info: Some(ShardInfo { id: 46, role: Leader, version: 1111 }), prev_version: 1110 }), table_info: Some(TableInfo { id: 109314, name: \"MMM_2198193666_INFLUENCE_PRE_SANDBOX_OUTPUT_TABLE\", schema_id: 0, schema_name: \"public\", partition_info: None }), encoded_schema: [0, 10, 10, 10, 4, 116, 115, 105, 100, 16, 5, 32, 1, 10, 12, 10, 6, 112, 101, 114, 105, 111, 100, 16, 1, 32, 2, 10, 23, 10, 13, 103, 114, 111, 117, 112, 98, 121, 73, 110, 100, 101, 120, 48, 16, 4, 24, 1, 32, 3, 40, 1, 10, 17, 10, 9, 108, 111, 103, 83, 97, 109, 112, 108, 101, 16, 4, 24, 1, 32, 4, 10, 17, 10, 7, 95, 114, 101, 115, 117, 108, 116, 16, 4, 24, 1, 32, 5, 40, 1, 10, 16, 10, 6, 115, 101, 114, 118, 101, 114, 16, 4, 24, 1, 32, 6, 40, 1, 10, 13, 10, 3, 105, 100, 99, 16, 4, 24, 1, 32, 7, 40, 1, 10, 13, 10, 3, 108, 100, 99, 16, 4, 24, 1, 32, 8, 40, 1, 10, 21, 10, 11, 97, 112, 112, 
108, 105, 99, 97, 116, 105, 111, 110, 16, 4, 24, 1, 32, 9, 40, 1, 16, 1, 24, 2, 34, 2, 1, 2], engine: \"Analytic\", create_if_not_exist: false, options: {\"enable_ttl\": \"true\", \"write_buffer_size\": \"33554432\", \"update_mode\": \"APPEND\", \"ttl\": \"3d\"} }. Caused by: Shard version mismatch, shard_info:ShardInfo { id: 46, role: Leader, version: 1111 }, expect version:1110.\ngithub.com/CeresDB/ceresmeta/pkg/coderr.(*codeError).WithCausef\n\t/Users/zulliangwang/code/ceres/ceresmeta/pkg/coderr/error.go:73\ngithub.com/CeresDB/ceresmeta/server/coordinator/eventdispatch.(*DispatchImpl).CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/eventdispatch/dispatch_impl.go:71\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl.CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/common_util.go:60\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.prepareCallback\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:90\ngithub.com/looplab/fsm.(*FSM).afterEventCallbacks\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:435\ngithub.com/looplab/fsm.(*FSM).Event.func1\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:330\ngithub.com/looplab/fsm.transitionerStruct.transition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:375\ngithub.com/looplab/fsm.(*FSM).doTransition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:360\ngithub.com/looplab/fsm.(*FSM).Event\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:343\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.(*Procedure).Start\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:213\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure.(*ManagerImpl).startProcedureWorker.func1\n\t/U
sers/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/manager_impl.go:159\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ngithub.com/CeresDB/ceresmeta/pkg/coderr.(*codeError).WithCausef\n\t/Users/zulliangwang/code/ceres/ceresmeta/pkg/coderr/error.go:73\ngithub.com/CeresDB/ceresmeta/server/coordinator/eventdispatch.(*DispatchImpl).CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/eventdispatch/dispatch_impl.go:71\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl.CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/common_util.go:60\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.prepareCallback\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:90\ngithub.com/looplab/fsm.(*FSM).afterEventCallbacks\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:435\ngithub.com/looplab/fsm.(*FSM).Event.func1\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:330\ngithub.com/looplab/fsm.transitionerStruct.transition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:375\ngithub.com/looplab/fsm.(*FSM).doTransition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:360\ngithub.com/looplab/fsm.(*FSM).Event\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:343\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.(*Procedure).Start\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:213\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure.(*ManagerImpl).startProcedureWorker.func1\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/manager_impl.go:159\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594", "errorVerbose": "(#500)event dispatch failed, cause:create table on shard, 
addr:11.39.8.171:8831, request:{{{46 1 1111} 1110} {109314 MMM_2198193666_INFLUENCE_PRE_SANDBOX_OUTPUT_TABLE 0 public {<nil>}} [0 10 10 10 4 116 115 105 100 16 5 32 1 10 12 10 6 112 101 114 105 111 100 16 1 32 2 10 23 10 13 103 114 111 117 112 98 121 73 110 100 101 120 48 16 4 24 1 32 3 40 1 10 17 10 9 108 111 103 83 97 109 112 108 101 16 4 24 1 32 4 10 17 10 7 95 114 101 115 117 108 116 16 4 24 1 32 5 40 1 10 16 10 6 115 101 114 118 101 114 16 4 24 1 32 6 40 1 10 13 10 3 105 100 99 16 4 24 1 32 7 40 1 10 13 10 3 108 100 99 16 4 24 1 32 8 40 1 10 21 10 11 97 112 112 108 105 99 97 116 105 111 110 16 4 24 1 32 9 40 1 16 1 24 2 34 2 1 2] Analytic false map[enable_ttl:true ttl:3d update_mode:APPEND write_buffer_size:33554432]}, err:fail to create table on shard in cluster, req:CreateTableOnShardRequest { update_shard_info: Some(UpdateShardInfo { curr_shard_info: Some(ShardInfo { id: 46, role: Leader, version: 1111 }), prev_version: 1110 }), table_info: Some(TableInfo { id: 109314, name: \"MMM_2198193666_INFLUENCE_PRE_SANDBOX_OUTPUT_TABLE\", schema_id: 0, schema_name: \"public\", partition_info: None }), encoded_schema: [0, 10, 10, 10, 4, 116, 115, 105, 100, 16, 5, 32, 1, 10, 12, 10, 6, 112, 101, 114, 105, 111, 100, 16, 1, 32, 2, 10, 23, 10, 13, 103, 114, 111, 117, 112, 98, 121, 73, 110, 100, 101, 120, 48, 16, 4, 24, 1, 32, 3, 40, 1, 10, 17, 10, 9, 108, 111, 103, 83, 97, 109, 112, 108, 101, 16, 4, 24, 1, 32, 4, 10, 17, 10, 7, 95, 114, 101, 115, 117, 108, 116, 16, 4, 24, 1, 32, 5, 40, 1, 10, 16, 10, 6, 115, 101, 114, 118, 101, 114, 16, 4, 24, 1, 32, 6, 40, 1, 10, 13, 10, 3, 105, 100, 99, 16, 4, 24, 1, 32, 7, 40, 1, 10, 13, 10, 3, 108, 100, 99, 16, 4, 24, 1, 32, 8, 40, 1, 10, 21, 10, 11, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 16, 4, 24, 1, 32, 9, 40, 1, 16, 1, 24, 2, 34, 2, 1, 2], engine: \"Analytic\", create_if_not_exist: false, options: {\"enable_ttl\": \"true\", \"write_buffer_size\": \"33554432\", \"update_mode\": \"APPEND\", \"ttl\": \"3d\"} }. 
Caused by: Shard version mismatch, shard_info:ShardInfo { id: 46, role: Leader, version: 1111 }, expect version:1110.\ngithub.com/CeresDB/ceresmeta/pkg/coderr.(*codeError).WithCausef\n\t/Users/zulliangwang/code/ceres/ceresmeta/pkg/coderr/error.go:73\ngithub.com/CeresDB/ceresmeta/server/coordinator/eventdispatch.(*DispatchImpl).CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/eventdispatch/dispatch_impl.go:71\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl.CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/common_util.go:60\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.prepareCallback\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:90\ngithub.com/looplab/fsm.(*FSM).afterEventCallbacks\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:435\ngithub.com/looplab/fsm.(*FSM).Event.func1\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:330\ngithub.com/looplab/fsm.transitionerStruct.transition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:375\ngithub.com/looplab/fsm.(*FSM).doTransition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:360\ngithub.com/looplab/fsm.(*FSM).Event\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:343\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.(*Procedure).Start\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:213\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure.(*ManagerImpl).startProcedureWorker.func1\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/manager_impl.go:159\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ngithub.com/CeresDB/ceresmeta/pkg/coderr.(*codeError).WithCausef\n\t/Users/zulliangwang/code/ceres/ceresmeta/pkg/cod
err/error.go:73\ngithub.com/CeresDB/ceresmeta/server/coordinator/eventdispatch.(*DispatchImpl).CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/eventdispatch/dispatch_impl.go:71\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl.CreateTableOnShard\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/common_util.go:60\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.prepareCallback\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:90\ngithub.com/looplab/fsm.(*FSM).afterEventCallbacks\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:435\ngithub.com/looplab/fsm.(*FSM).Event.func1\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:330\ngithub.com/looplab/fsm.transitionerStruct.transition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:375\ngithub.com/looplab/fsm.(*FSM).doTransition\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:360\ngithub.com/looplab/fsm.(*FSM).Event\n\t/Users/zulliangwang/go/pkg/mod/github.com/looplab/fsm@v0.3.0/fsm.go:343\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure/ddl/createtable.(*Procedure).Start\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/ddl/createtable/create_table.go:213\ngithub.com/CeresDB/ceresmeta/server/coordinator/procedure.(*ManagerImpl).startProcedureWorker.func1\n\t/Users/zulliangwang/code/ceres/ceresmeta/server/coordinator/procedure/manager_impl.go:159\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncreate table on shard\ndispatch create table on shard\nsend eventPrepare"}

The root cause is that the shard version is currently determined by ceresmeta, whereas it should be decided by ceresdb. My proposal is to upgrade the meta_event_service protocol so that ceresdb decides the shard version.