test_topology_streaming_failure is flaky depending on where the topology coordinator runs
tgrabiec opened this issue · 1 comment
tgrabiec commented
=================================== FAILURES ===================================
_______________________ test_topology_streaming_failure ________________________
request = <FixtureRequest for <Function test_topology_streaming_failure>>
manager = <test.pylib.manager_client.ManagerClient object at 0x7f0422e2c710>
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_topology_streaming_failure(request, manager: ManagerClient):
    """Fail streaming while doing a topology operation"""
    # decommission failure
    servers = await manager.running_servers()
    logs = [await manager.server_open_log(srv.server_id) for srv in servers]
    marks = [await log.mark() for log in logs]
    await manager.api.enable_injection(servers[2].ip_addr, 'stream_ranges_fail', one_shot=True)
    await manager.decommission_node(servers[2].server_id, expected_error="Decommission failed. See earlier errors")
    servers = await manager.running_servers()
    assert len(servers) == 3
    matches = [await log.grep("raft_topology - rollback.*after decommissioning failure, moving transition state to rollback to normal",
                              from_mark=mark) for log, mark in zip(logs, marks)]
    assert sum(len(x) for x in matches) == 1
    # bootstrap failure
    marks = [await log.mark() for log in logs]
    servers = await manager.running_servers()
    s = await manager.server_add(start=False, config={
        'error_injections_at_startup': ['stream_ranges_fail']
    })
    await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")
    servers = await manager.running_servers()
    assert s not in servers
    matches = [await log.grep("raft_topology - rollback.*after bootstrapping failure, moving transition state to left token ring",
                              from_mark=mark) for log, mark in zip(logs, marks)]
    assert sum(len(x) for x in matches) == 1
    # bootstrap failure in raft barrier
    marks = [await log.mark() for log in logs]
    servers = await manager.running_servers()
    s = await manager.server_add(start=False)
    await manager.api.enable_injection(servers[1].ip_addr, 'raft_topology_barrier_fail', one_shot=True)
>   await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")
The test fails because it expects bootstrap to fail due to the raft_topology_barrier_fail
injection armed on servers[1]. However, if servers[1] happens to be the topology coordinator,
it never executes the barrier command, because the coordinator excludes itself from the
global command, probably here:
guard = co_await exec_global_command(std::move(guard),
                                     raft_topology_cmd{raft_topology_cmd::command::barrier},
                                     {_raft.id()},
                                     drop_guard_and_retake::no);
servers[1] log: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8667/artifact/testlog/x86_64/dev/scylla-808.log
We can see:
INFO 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - executing global topology command barrier, excluded nodes: {3a5c178f-20c3-4fce-bf56-4b0cf2c183c0}
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to ae87b773-97a6-4b15-a93c-4b540b541267/127.136.35.46
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to 4e47fd61-56a2-4030-9e95-83a807d781e2/127.136.35.33
INFO 2024-05-09 17:35:44,650 [shard 0:strm] raft_topology - updating topology state: committed new CDC generation, ID: (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05)
DEBUG 2024-05-09 17:35:44,652 [shard 0:strm] raft_topology - reload raft topology state
INFO 2024-05-09 17:35:44,661 [shard 0:strm] cdc - Started using generation (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05).
3a5c178f-20c3-4fce-bf56-4b0cf2c183c0 is the host id of servers[1].
So this is a test problem: the test implicitly assumes that servers[1] is not the topology coordinator.
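One way to make the test robust would be to arm the injection on a node that is known not to be the topology coordinator. Below is a minimal sketch of just the selection logic, using plain dicts in place of the real ManagerClient server objects; the `pick_non_coordinator` helper and the idea of obtaining the coordinator's host id up front (e.g. from logs or system tables) are assumptions for illustration, not part of the test framework:

```python
# Hypothetical helper: choose an injection target that is not the topology
# coordinator, so the armed injection is guaranteed to fire when the
# coordinator sends the barrier command to the other nodes.

def pick_non_coordinator(servers, coordinator_host_id):
    """Return the first server whose host id differs from the coordinator's.

    `servers` stands in for the result of manager.running_servers();
    here each entry is a plain dict with 'server_id' and 'host_id' keys.
    """
    for srv in servers:
        if srv["host_id"] != coordinator_host_id:
            return srv
    raise RuntimeError("all running servers appear to be the coordinator")

# Toy data standing in for the three running servers from the test.
servers = [
    {"server_id": 1, "host_id": "aaaa"},
    {"server_id": 2, "host_id": "3a5c178f"},  # suppose this one is the coordinator
    {"server_id": 3, "host_id": "cccc"},
]

target = pick_non_coordinator(servers, coordinator_host_id="3a5c178f")
assert target["host_id"] != "3a5c178f"
```

The real fix would still need a reliable way for the test to learn the current coordinator's host id through the ManagerClient API before arming raft_topology_barrier_fail.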