[dtest] alternator_tests.TesterAlternator.test_cluster_traces failed with IndexError: list index out of range

Question

[dtest] alternator_tests.TesterAlternator.test_cluster_traces failed with IndexError: list index out of range

Closed this issue 22 days ago · 1 comments

Logs are here

Stacktrace:

self = <alternator_tests.TesterAlternator object at 0x7f80b10583d0>

    def test_cluster_traces(self):  # pylint:disable=too-many-locals,too-many-statements  # noqa: PLR0915
        """
        The test inserts items of different types and checks the traces for each of the following actions: "PutItem",
         "GetItem", "UpdateItem", and "DeleteItem".
        Also, for each action taken, the test checks for the following:
         1. For each action, the test checks a list of trace messages that should exist.
         2. The order of the traces should be according to the order of the expected traces variable (In other words,
          the test checks for the first message, and once is found, the test continue to find for the next message.
          Until the test found all the messages in that order).
         3. Some of the traces are supposed to appear under each node.
        """
        table_name = TABLE_NAME
        num_of_items = 10
        self.prepare_dynamodb_cluster(num_of_nodes=3)
        nodes = self.cluster.nodelist()
        self.create_table(node=nodes[0], table_name=table_name)
        data_generator = AlternatorDataGenerator(primary_key=self._table_primary_key, primary_key_format=self._table_primary_key_format)
        items = data_generator.create_multiple_items(num_of_items=num_of_items, mode=TypeMode.MIXED)
        # This dictionary contains the "traces" messages for each method. Also, the order of messages is important!
        # During the test, we expect to find the messages in the order they are displayed in the list.
        # For example: For the "put_item" method, the test expects to see first the "PutItem" message.
        # The next message should be the "accept_proposal: send accept proposal" message.
        # Finally, the test expects to see "CAS successful" message (if one of those messages does not appear in
        #  this order, the test will fail).
        expected_messages_dict = {
            # Last thing, the numbers that appear next to a message indicate the number of different messages we are
            #  looking for. That is, if the number is not 1, we expect to see the message from several different nodes.
            "put": [("PutItem", 1), ("prepare_ballot: sending prepare", len(nodes) - 1), ("accept_proposal: send accept proposal", len(nodes) - 1), ("CAS successful", 1)],
            "get": [("GetItem", 1), ("Creating read executor for token", 1), ("Querying is done", 1)],
            "update": [("UpdateItem", 1), ("accept_proposal: send accept proposal", len(nodes) - 1), ("prune: send prune of", len(nodes) - 1), ("CAS successful", 1)],
            "delete": [("DeleteItem", 1), ("accept_proposal: send accept proposal", len(nodes) - 1), ("prune: send prune of", len(nodes) - 1), ("CAS successful", 1)],
        }
        method_name_by_method_idx = {method_idx: method_name for method_idx, method_name in enumerate(expected_messages_dict)}
        expected_traces_number = len(expected_messages_dict)
    
        set_trace_probability(nodes=nodes, probability_value=1.0)
        for item_idx, item in enumerate(items):
            node = nodes[item_idx % len(nodes)]
            item_key = {self._table_primary_key: item[self._table_primary_key]}
            for method_name in expected_messages_dict:
                if method_name == "put":
                    self.put_item(node=node, item=item, table_name=table_name)
                elif method_name == "get":
                    self.get_item(node=node, item_key=item_key, table_name=table_name, consistent_read=True)
                elif method_name == "update":
                    self.update_item(node=node, item_key=item_key, table_name=table_name)
                elif method_name == "delete":
                    self.delete_item(node=node, item_key=item_key, table_name=table_name)
                else:
                    raise KeyError(f'The following "{method_name}" method name not supported!')
    
        set_trace_probability(nodes=nodes, probability_value=0.0)
    
        @retrying(num_attempts=10, sleep_time=1, allowed_exceptions=(AssertionError,))
        def try_checking_all_events():
            all_traces_events = self.get_all_traces_events()
            node_ips = {get_ip_from_node(node) for node in nodes}
            # The following action order for each item is: "PutItem", "GetItem", "UpdateItem", and "DeleteIte" (this
            #  order is order of "expected_messages_dict" keys).
            # Therefore, for each action, we need to get a list with "len(items)" cells.
            events_by_action = {}
            for events_idx, events in enumerate(all_traces_events):
                events_by_action.setdefault(method_name_by_method_idx[events_idx % expected_traces_number], []).append(events)
    
            def verify_traces_messages(method_name):  # pylint:disable=too-many-locals
                all_traces = events_by_action.get(method_name)
                expected_messages = expected_messages_dict.get(method_name)
                assert all_traces, f"traces for {method_name} wasn't found"
                assert expected_messages, f"expected_messages for {method_name} wasn't found"
    
                # The "all_traces" variable contains the list of traces in the order of action ("get_item" or "pu_item")
                #  we did. The test enters multiple items. Thus, need over on the traces for each item entered.
                for action_idx, traces in enumerate(all_traces):
                    source = get_ip_from_node(nodes[action_idx % len(nodes)])
                    trace_idx = 0
                    # This loop goes over the messages that should appear within the traces in the order of the
                    # "expected_messages".
                    for expected_message_details in expected_messages:
                        result = []
                        expected_message, expected_message_number = expected_message_details
                        count = expected_message_number
                        is_message_found = False
                        # This loop goes through all the traces once and tries to find the expected messages.
                        # Therefore, if the message not found or the while loop ends, this is means that a test failed
                        #  because the messages order was incorrect or the expected message changed.
                        while trace_idx < len(traces):
                            trace = traces[trace_idx]
                            trace_idx += 1
                            event_msg = trace["activity"]
                            if expected_message in event_msg and source == trace["source"]:
                                if expected_message != event_msg:
                                    result.append(event_msg)
                                count -= 1
                                if count == 0:
                                    is_message_found = True
                                    break
                        assert is_message_found, f'The following "{expected_message}" trace message not found from "{source}" source!'
                        if result:
                            if method_name in ["put", "update", "delete"]:
                                ips = {msg.rsplit(" ", maxsplit=1)[1] for msg in result}
                                _diff = node_ips ^ ips
                                assert _diff == {source}, f'The "{expected_message}" message not sent from "{expected_message_number}" ' f'nodes (The message in missing in the following "{_diff}" IPs'
                            elif method_name == "get":
                                for msg in result:
                                    ips = set(node_ip.strip() for node_ip in msg.split("{", maxsplit=1)[1].split("}", maxsplit=1)[0].split(","))
                                    _diff = node_ips ^ ips
                                    assert not _diff, f'The following node ips "{_diff}" not found in "get_item" event'
                            else:
                                raise AssertionError(f'The following "{method_name}" method name not supported!')
    
            for method_name in expected_messages_dict:
                logger.info(f'Verifying all traces of "{method_name}_item" method name')
                verify_traces_messages(method_name=method_name)
    
>       try_checking_all_events()

alternator_tests.py:974: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tools/retrying.py:45: in inner
    return func(*args, **kwargs)
alternator_tests.py:972: in try_checking_all_events
    verify_traces_messages(method_name=method_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

method_name = 'get'

    def verify_traces_messages(method_name):  # pylint:disable=too-many-locals
        all_traces = events_by_action.get(method_name)
        expected_messages = expected_messages_dict.get(method_name)
        assert all_traces, f"traces for {method_name} wasn't found"
        assert expected_messages, f"expected_messages for {method_name} wasn't found"
    
        # The "all_traces" variable contains the list of traces in the order of action ("get_item" or "pu_item")
        #  we did. The test enters multiple items. Thus, need over on the traces for each item entered.
        for action_idx, traces in enumerate(all_traces):
            source = get_ip_from_node(nodes[action_idx % len(nodes)])
            trace_idx = 0
            # This loop goes over the messages that should appear within the traces in the order of the
            # "expected_messages".
            for expected_message_details in expected_messages:
                result = []
                expected_message, expected_message_number = expected_message_details
                count = expected_message_number
                is_message_found = False
                # This loop goes through all the traces once and tries to find the expected messages.
                # Therefore, if the message not found or the while loop ends, this is means that a test failed
                #  because the messages order was incorrect or the expected message changed.
                while trace_idx < len(traces):
                    trace = traces[trace_idx]
                    trace_idx += 1
                    event_msg = trace["activity"]
                    if expected_message in event_msg and source == trace["source"]:
                        if expected_message != event_msg:
                            result.append(event_msg)
                        count -= 1
                        if count == 0:
                            is_message_found = True
                            break
                assert is_message_found, f'The following "{expected_message}" trace message not found from "{source}" source!'
                if result:
                    if method_name in ["put", "update", "delete"]:
                        ips = {msg.rsplit(" ", maxsplit=1)[1] for msg in result}
                        _diff = node_ips ^ ips
                        assert _diff == {source}, f'The "{expected_message}" message not sent from "{expected_message_number}" ' f'nodes (The message in missing in the following "{_diff}" IPs'
                    elif method_name == "get":
                        for msg in result:
>                           ips = set(node_ip.strip() for node_ip in msg.split("{", maxsplit=1)[1].split("}", maxsplit=1)[0].split(","))
E                           IndexError: list index out of range

alternator_tests.py:964: IndexError

Happened since build https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/548/ - Scylla Sha - 87b08c9

Answer 1 · 2024-05-05T21:17:08.000Z

Not surprising this test was broken by that patch: The test code

ips = set(node_ip.strip() for node_ip in msg.split("{", maxsplit=1)[1].split("}", maxsplit=1)[0].split(","))

assumes that a list of IP addresses was formatted in a very specific way in the tracing message, but the way we print lists changed in the fmt patch by @tchaikov.

This test should be fixed to accept the new message format. That's all that is to it. It has nothing to do with Alternator. But nevertheless, I'll prepare a patch for the test now.