test_topology_streaming_failure is flaky depending on where the topology coordinator runs

Question

tgrabiec opened this issue a month ago · 1 comment

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8667/testReport/junit/(root)/non-boost%20tests/Tests___Unit_Tests___topology_test_topology_failure_recovery_dev_1/

=================================== FAILURES ===================================
_______________________ test_topology_streaming_failure ________________________

request = <FixtureRequest for <Function test_topology_streaming_failure>>
manager = <test.pylib.manager_client.ManagerClient object at 0x7f0422e2c710>

    @pytest.mark.asyncio
    @skip_mode('release', 'error injections are not supported in release mode')
    async def test_topology_streaming_failure(request, manager: ManagerClient):
        """Fail streaming while doing a topology operation"""
        # decommission failure
        servers = await manager.running_servers()
        logs = [await manager.server_open_log(srv.server_id) for srv in servers]
        marks = [await log.mark() for log in logs]
        await manager.api.enable_injection(servers[2].ip_addr, 'stream_ranges_fail', one_shot=True)
        await manager.decommission_node(servers[2].server_id, expected_error="Decommission failed. See earlier errors")
        servers = await manager.running_servers()
        assert len(servers) == 3
        matches = [await log.grep("raft_topology - rollback.*after decommissioning failure, moving transition state to rollback to normal",
                   from_mark=mark) for log, mark in zip(logs, marks)]
        assert sum(len(x) for x in matches) == 1
        # bootstrap failure
        marks = [await log.mark() for log in logs]
        servers = await manager.running_servers()
        s = await manager.server_add(start=False, config={
            'error_injections_at_startup': ['stream_ranges_fail']
        })
        await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")
        servers = await manager.running_servers()
        assert s not in servers
        matches = [await log.grep("raft_topology - rollback.*after bootstrapping failure, moving transition state to left token ring",
                   from_mark=mark) for log, mark in zip(logs, marks)]
        assert sum(len(x) for x in matches) == 1
        # bootstrap failure in raft barrier
        marks = [await log.mark() for log in logs]
        servers = await manager.running_servers()
        s = await manager.server_add(start=False)
        await manager.api.enable_injection(servers[1].ip_addr, 'raft_topology_barrier_fail', one_shot=True)
>       await manager.server_start(s.server_id, expected_error="Bootstrap failed. See earlier errors")

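The test expects the bootstrap to fail because it arms raft_topology_barrier_fail on servers[1]. However, if servers[1] happens to be the topology coordinator, it does not execute the barrier command itself, so the injection presumably never triggers and the bootstrap does not fail as expected. The coordinator excludes itself from the global command, probably here: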

        guard = co_await exec_global_command(std::move(guard),
                raft_topology_cmd{raft_topology_cmd::command::barrier},
                {_raft.id()},
                drop_guard_and_retake::no);

servers[1] log: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8667/artifact/testlog/x86_64/dev/scylla-808.log

We can see:

INFO  2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - executing global topology command barrier, excluded nodes: {3a5c178f-20c3-4fce-bf56-4b0cf2c183c0}
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to ae87b773-97a6-4b15-a93c-4b540b541267/127.136.35.46
DEBUG 2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - send barrier command with term 1 and index 25 to 4e47fd61-56a2-4030-9e95-83a807d781e2/127.136.35.33
INFO  2024-05-09 17:35:44,650 [shard 0:strm] raft_topology - updating topology state: committed new CDC generation, ID: (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05)
DEBUG 2024-05-09 17:35:44,652 [shard 0:strm] raft_topology - reload raft topology state
INFO  2024-05-09 17:35:44,661 [shard 0:strm] cdc - Started using generation (2024/05/09 14:35:44, 6b2046a0-0e11-11ef-2257-1d37ae5a1c05).

3a5c178f-20c3-4fce-bf56-4b0cf2c183c0 is the host id of servers[1].
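
When checking other failing runs, the same comparison can be done mechanically. The following is a minimal sketch that extracts the excluded host IDs from the INFO line quoted above. The regular expression is derived only from that single log line, and the comma separator for multiple IDs is an assumption; `excluded_nodes` is a hypothetical helper, not part of the test framework.

```python
import re
from typing import Optional, Set, Tuple

# Pattern derived from the single INFO line in this report.
# Assumption: multiple excluded IDs, if present, are comma-separated.
EXCLUDED_RE = re.compile(
    r"executing global topology command (\w+), excluded nodes: \{([^}]*)\}"
)


def excluded_nodes(log_line: str) -> Optional[Tuple[str, Set[str]]]:
    """Return (command, excluded host IDs) for a matching line, else None."""
    m = EXCLUDED_RE.search(log_line)
    if m is None:
        return None
    ids = {part.strip() for part in m.group(2).split(",") if part.strip()}
    return m.group(1), ids


# The line from the servers[1] log quoted above.
line = (
    "INFO  2024-05-09 17:35:44,643 [shard 0:strm] raft_topology - "
    "executing global topology command barrier, excluded nodes: "
    "{3a5c178f-20c3-4fce-bf56-4b0cf2c183c0}"
)
command, excluded = excluded_nodes(line)
```

If the host ID of the server that had the injection armed appears in `excluded` for the `barrier` command, the run hit the case described in this issue.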

So this is a test bug: the test assumes that servers[1] is not the topology coordinator.
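
One possible direction for a fix is to arm the injection on a node that is known not to be the coordinator instead of hard-coding servers[1]. The sketch below shows only the selection step under stated assumptions: `Server` is a hypothetical stand-in for the framework's server record, `pick_non_coordinator` is a hypothetical helper, and how the test learns the coordinator's host ID is not covered here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Server:
    """Hypothetical stand-in for the test framework's server record."""
    server_id: int
    ip_addr: str
    host_id: str


def pick_non_coordinator(servers: Sequence[Server], coordinator_host_id: str) -> Server:
    """Return the first server whose host ID differs from the coordinator's."""
    for srv in servers:
        if srv.host_id != coordinator_host_id:
            return srv
    raise ValueError("no running server other than the coordinator")


# Illustrative values only; these are not taken from the failing run.
servers = [
    Server(1, "10.0.0.1", "host-a"),
    Server(2, "10.0.0.2", "host-b"),
    Server(3, "10.0.0.3", "host-c"),
]
target = pick_non_coordinator(servers, coordinator_host_id="host-a")
```

The test would then arm raft_topology_barrier_fail on `target.ip_addr`, so the outcome no longer depends on which node runs the coordinator.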

Answer 1 · 2024-05-10T15:59:35.000Z

scylla-815.log
scylla-817.log
scylla-809.log
scylla-808.log
scylla-810.log
topology.test_topology_failure_recovery.1.log