IC_queue is getting huge
domwoe opened this issue · 8 comments
We are experiencing an issue with a huge IC_queue: our validator_info response is around 7 MB in size. The issue started in March, when lots of view changes were triggered before any view change could complete. The view_no to be voted for increased from 8044 to 10931. Then a node voted for a view change to 8045 again, which was successful. However, every node still carries the huge IC_queue. Restarting does not help, since the IC_queue is persisted.
IC - Instance Change
I've attached (parts of) the log from the view_change_trigger_service.py.
extract.txt
Looking at the code it seems that this is the only point where instance change messages get removed.
Is there a way to flush the IC queue without deleting the indy node's data directory? About 10 stewards would need to do this in this case.
Can we do something to prevent such a build up of (unsuccessful) view changes?
I've reported this issue/asked the questions also on Rocketchat.
Thanks for taking the time to look into this!
Network Details
13 validator nodes in March. Today 15.
Software
"Software": {
"indy-node": "1.12.4",
"sovrin": "unknown",
"Indy_packages": [
"hi indy-node 1.12.4 amd64 Indy node",
"hi indy-plenum 1.12.4 amd64 Plenum Byzantine Fault Tolerant Protocol",
"hi libindy-crypto 0.4.5 amd64 This is the shared crypto library for Hyperledger Indy components.",
"hi python3-indy-crypto 0.4.5 amd64 This is the official wrapper for Hyperledger Indy Crypto library (https://www.hyperledger.org/projects).",
""
],
"OS_version": "Linux-4.15.0-1092-azure-x86_64-with-Ubuntu-16.04-xenial",
"Installed_packages": [
"jsonpickle 0.9.6",
"intervaltree 2.1.0",
"pyzmq 18.1.0",
"packaging 19.0",
"ioflo 1.5.4",
"base58 1.0.0",
"python-rocksdb 0.6.9",
"orderedset 2.0",
"indy-crypto 0.4.5-23",
"sha3 0.2.1",
"portalocker 0.5.7",
"distro 1.3.0",
"setuptools 38.5.2",
"rlp 0.5.1",
"psutil 5.4.3",
"indy-plenum 1.12.4",
"python-dateutil 2.6.1",
"semver 2.7.9",
"Pympler 0.5",
"sortedcontainers 1.5.7",
"Pygments 2.2.0",
"libnacl 1.6.1",
"timeout-decorator 0.4.0",
"six 1.11.0",
"indy-node 1.12.4"
]
},
There is some variation in the exact os_version among the validator nodes.
@esplinr I added the info on the software version we are using. If more logs or information are needed, please ask.
It'll take me a while to find someone with the right expertise to triage this and provide some direction.
Small update:
I've written a small script to remove the dangling queue entries from the node_status_db directly.
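For reference, the cleanup boils down to filtering persisted InstanceChange votes by view number. Here is a minimal sketch of the idea; a plain dict stands in for the real node_status_db (the actual store is RocksDB with indy-node's internal key/value encoding, so every name below is a hypothetical placeholder, not the real script):

```python
# Hypothetical sketch: drop persisted InstanceChange votes whose view_no
# is far beyond the view the pool actually agreed on. A plain dict
# stands in for node_status_db (the real store is RocksDB).

def flush_dangling_votes(status_db, current_view_no):
    """Remove queue entries voting for view numbers above current_view_no."""
    dangling = [key for key, view_no in status_db.items()
                if view_no > current_view_no]
    for key in dangling:
        del status_db[key]
    return len(dangling)

db = {"vote-A": 8053, "vote-B": 10931, "vote-C": 9000}
removed = flush_dangling_votes(db, current_view_no=8053)
print(removed)     # → 2
print(sorted(db))  # → ['vote-A']
```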
Still, would be good to have more insights on why this happened and I think in general it's not a good idea to allow that such a large queue is built up.
After a small investigation of the logs and validator-info, I have the following results:
- IC_queue cleaning is working, because there are a lot of lines like: `InstanceChangeProvider: Discard InstanceChange from **** for ViewNo 8045 because it is out of date (was received 7200sec ago)`
- From validator-info: there were a lot of restarts, and the number of "the oldest" nodes is 8, which is less than needed for quorum (at least 10). As I remember, it's possible to divide the network into 2 subnetworks by rebooting more than `f` nodes. Also, from validator-info we can see that only 9 IC messages are received for each propagated viewNo, which is not enough.
- Removing information from `node_status_db` or other internal storage is not a good idea at all, since it may lead to unpredictable behaviour.
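The discard message quoted above reflects a time-to-live check: a stored vote is dropped once it is older than a configured interval (7200 seconds in that log line). A minimal sketch of that check follows; the function name and the exact interval are assumptions based only on the log line, not the actual plenum code:

```python
import time

OUTDATED_AFTER_SEC = 7200  # interval seen in the log line above (assumed)

def is_outdated(received_at, now=None):
    """True if an InstanceChange vote is older than the allowed interval."""
    now = time.time() if now is None else now
    return (now - received_at) > OUTDATED_AFTER_SEC

now = 1_000_000.0
print(is_outdated(now - 7300, now))  # → True: vote would be discarded
print(is_outdated(now - 100, now))   # → False: still counted toward quorum
```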
The main recommendation from my point of view:
- Try to send a "pool restart" command to the whole pool. You need to synchronize the start of all the nodes.
Thanks @anikitinDSR for looking into this.
- The issue with the IC_queue cleaning is that the cleaning of the queue for a particular view_no seems to happen only when a new vote is received for exactly that view_no. Since we are currently at view_no 8053, but there are votes in the queue up to view_no 10931, those entries might stay there forever.
- What we tried already:
- One pool restart, which restarted most of the nodes except 3, because they run on Docker and don't have the node control tool.
- A pool restart synchronised with a manual restart of 2 of the Docker nodes.
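To illustrate the first point above: if the outdated-entry check only runs when a fresh vote arrives for the same view_no, entries under view numbers that nobody votes for again are never revisited. A toy model of that behaviour (all names are hypothetical, this is not the plenum implementation):

```python
TTL = 7200  # seconds before a vote counts as outdated (from the logs)

class ICQueue:
    def __init__(self):
        self.votes = {}  # view_no -> list of receive timestamps

    def add_vote(self, view_no, now):
        # Expiry is checked only for the view_no being voted on, so
        # stale entries under any other view number are never touched.
        kept = [t for t in self.votes.get(view_no, []) if now - t <= TTL]
        kept.append(now)
        self.votes[view_no] = kept

q = ICQueue()
q.add_vote(10931, now=0)         # old burst of failed view changes
q.add_vote(8053, now=1_000_000)  # much later, the pool votes for 8053
# The 10931 entry is long past its TTL but is still in the queue:
print(10931 in q.votes)  # → True
```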
I struggle to understand your second point.
My interpretation of @anikitinDSR 's comment is that the logs suggest the network is partitioned, which can happen if more than f nodes are rebooted at the same time. That could be why some nodes are seeing an older view change than the others.
My suggestion is that you rebuild just the nodes that are seeing the old view change, and let them catch up to have a consistent view with the rest of the network.
If that isn't the problem, you'll have to trace the system to figure out what exactly is going on.
The final results and suggestions after investigation and discussion.
The current size of the IC_queue is not critical, but it is a problem in general. To prevent this behaviour, I suggest the following plan:
- Create tests checking that after a successful view_change process, the IC_queue is cleaned of outdated IC messages. The test should check that messages with view numbers < successViewNo, == successViewNo and > successViewNo are discarded. One node from `txnPoolNodeSet` with different inserted IC messages can be used here.
- Create a fix that discards all outdated IC messages after view_change_done. I think that calling the `_update_vote` method for each element in the IC_queue could be enough.
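The proposed fix, sketched: after a view change completes, sweep the whole queue rather than only the adopted view_no. A hypothetical version of such a sweep is below; the real fix would call `_update_vote` per entry inside plenum, so the names here are placeholders:

```python
TTL = 7200  # seconds before a vote counts as outdated (from the logs)

def sweep_after_view_change(votes, now):
    """Drop every outdated vote under every view_no, not just the adopted one."""
    for view_no in list(votes):
        fresh = [t for t in votes[view_no] if now - t <= TTL]
        if fresh:
            votes[view_no] = fresh
        else:
            del votes[view_no]  # every vote for this view_no was outdated
    return votes

votes = {10931: [0.0], 8053: [999_000.0]}
sweep_after_view_change(votes, now=1_000_000.0)
print(sorted(votes))  # → [8053]
```

This keeps the existing per-vote TTL semantics; the only change is that the sweep visits every view_no after view_change_done instead of waiting for a new vote on each one.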