hyperledger/indy-plenum

plenum/test/node_catchup/test_node_catchup_with_connection_problem hanging on Ubuntu 20.04

WadeBarnes opened this issue · 13 comments

When running on Ubuntu 20.04, the following tests hang and never complete:

plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_ledger_status

  • When all four iterations are run, the test hangs on the fourth iteration.
  • When the fourth iteration (lost_count=4) is run on its own, the test passes.
  • Details of the investigation below.

plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_first_consistency_proofs

  • On the first iteration.
  • Cause has not been investigated.

plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_cancel_request_cp_and_ls_after_catchup

  • On the first iteration.
  • Cause has not been investigated.

Investigation into the hang in plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_ledger_status

The tests are hanging on this line:

Thinking that it could be an issue with RocksDB or the Python wrapper, I tried building the wrapper straight from source (git+https://github.com/twmht/python-rocksdb.git#egg=python-rocksdb) to get the new close method, which is not included in the released PyPI version. The latest code causes a seg fault on close, so I also tried git+https://github.com/alexreg/python-rocksdb.git@fix_close_segfault#egg=python-rocksdb, which fixes the seg fault. My thought was that the RocksDB instances were not getting closed/disposed properly. None of this made any difference; the tests still hung.

If you modify the code to only run 3 iterations, rather than the 4, you avoid the hang and the tests pass. If you modify the code to run just the 4th iteration the tests pass.
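To exercise individual iterations without editing the test, one option is to pass pytest node IDs on the command line. The sketch below only builds those IDs and the matching command. It assumes the parametrized IDs are the lost_count values 1 through 4 (only [4] appears in the log output in this issue), and the helper names node_ids and pytest_command are hypothetical.

```python
# Path and test name as reported in this issue.
TEST_FILE = "plenum/test/node_catchup/test_node_catchup_with_connection_problem.py"
TEST_NAME = "test_catchup_with_lost_ledger_status"


def node_ids(lost_counts):
    """Return one pytest node ID per requested lost_count value.

    Assumption: the parametrized ID is simply the lost_count value, e.g. "[4]".
    """
    return [f"{TEST_FILE}::{TEST_NAME}[{n}]" for n in lost_counts]


def pytest_command(lost_counts):
    """Build the argv that runs only the selected iterations."""
    base = [
        "python3", "-m", "pytest", "-l", "-v",
        "--log-cli-level=WARNING", "--disable-warnings",
    ]
    return base + node_ids(lost_counts)


if __name__ == "__main__":
    # First three iterations only, then the fourth on its own.
    print(" ".join(pytest_command([1, 2, 3])))
    print(" ".join(pytest_command([4])))
```

Running the two printed commands separately would correspond to the "only 3 iterations" and "just the 4th iteration" cases described above.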

Steps to reproduce:

Using https://github.com/WadeBarnes/indy-plenum/blob/20.04-test-debugging

MINGW64 /c/indy-plenum (20.04-test-debugging)
$ docker build -t plenum-build:2004 -f .github/workflows/build/Dockerfile.ubuntu-2004 .
MINGW64 /c/indy-plenum (20.04-test-debugging)
$ docker build -t indy-plenum-test:2004 -f .github/workflows/build/Dockerfile.test-2004 .
MINGW64 /c/indy-plenum (20.04-test-debugging)
$ docker run --rm -it --name plenum-testing --volume='//c/indy-plenum:/home/indy/indy-plenum:Z' indy-plenum-test:2004 bash
root@cd83db811641:/home/indy/indy-plenum# python3 -m pytest -l -v --log-cli-level=WARNING --disable-warnings plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_ledger_status

Result:
On the fourth iteration of plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_ledger_status it will hang on root:kv_store_rocksdb.py:30 Init KeyValueStorageRocksdb -> open

WARNING  root:compact_merkle_tree.py:57 <- _update
PASSED [ 75%]
plenum/test/node_catchup/test_node_catchup_with_connection_problem.py::test_catchup_with_lost_ledger_status[4]
---------------------------------------------------------------------------------------------------- live log call -----------------------------------------------------------------------------------------------------
WARNING  root:test_node_catchup_with_connection_problem.py:44 lost_count: 4
WARNING  root:test_node_catchup_with_connection_problem.py:45 txnPoolNodeSet: [Alpha, Beta, Gamma, Delta]
WARNING  root:test_node_catchup_with_connection_problem.py:46 looper: <stp_core.loop.looper.Looper object at 0x7f6629910220>
WARNING  root:test_node_catchup_with_connection_problem.py:47 sdk_pool_handle: 2
WARNING  root:test_node_catchup_with_connection_problem.py:48 sdk_wallet_steward: (5, 'MSjKTWkPLtYoPEaTF1TUDb')
WARNING  root:test_node_catchup_with_connection_problem.py:49 tconf: <module 'indy_config.py' from '/tmp/pytest-of-root/pytest-1/tmp0/etc/indy/indy_config.py'>
WARNING  root:test_node_catchup_with_connection_problem.py:50 tdir: /tmp/pytest-of-root/pytest-1/tmp0
WARNING  root:test_node_catchup_with_connection_problem.py:51 allPluginsPath: ['/home/indy/indy-plenum/plenum/test/plugin/stats_consumer']
WARNING  root:test_node_catchup_with_connection_problem.py:52 monkeypatch: <_pytest.monkeypatch.MonkeyPatch object at 0x7f662817fca0>

...

WARNING  root:kv_store_rocksdb_int_keys.py:23 -> Init KeyValueStorageRocksdbIntKeys
WARNING  root:kv_store_rocksdb.py:20 -> Init KeyValueStorageRocksdb
WARNING  root:kv_store_rocksdb.py:30 Init KeyValueStorageRocksdb -> open

About this issue: we have to copy the deb package from the xenial repo to bionic and install rocksdb=5.8.8 instead of librocksdb5.17.
Alternatively, of course, we could start migrating from rocksdb 5.8 to 5.17, but that could take a lot of effort.

I set up a VSCode remote container environment for indy-plenum (https://github.com/WadeBarnes/indy-plenum/tree/ubuntu-20.04-dev-container) to debug this issue further. So far it appears the issue is not with RocksDB at all. When setting breakpoints and stepping through the code, it gets well past the point indicated above and ends up hanging here: https://github.com/WadeBarnes/indy-plenum/blob/ubuntu-20.04-dev-container/ledger/ledger.py#L65. I'm hoping to be able to dig into this more today.
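As a complement to stepping through with breakpoints, the standard library's faulthandler module can dump the tracebacks of all threads when a block of code runs longer than expected, which can show where a hang is sitting. The sketch below is one way to wrap that; the context manager name dump_stacks_if_stuck and the timeout value are assumptions, not something used in the branches linked in this issue.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s, stream=sys.stderr):
    """Dump the tracebacks of all threads if the block exceeds timeout_s.

    The stream must be a real file object with a file descriptor,
    because faulthandler writes to it at a low level.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream, exit=False)
    try:
        yield
    finally:
        # Cancel the watchdog if the block finished (or failed) in time.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a call that hangs: the dump is written after 1 second.
    with dump_stacks_if_stuck(1.0):
        time.sleep(2.0)
```

Because exit=False is passed, the process keeps running after the dump, so the same run can still be inspected in a debugger.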

@anikitinDSR, are you saying you've tested it with RocksDB 5.8.8 and you don't experience the hanging issue with the tests?

Exactly. You can remove the 5.17 version of rocksdb inside the container and install rocksdb 5.8 from the repo.sovrin.org xenial repo instead of bionic. Also, please revert the self._db.close() call. I mean this one:
https://github.com/WadeBarnes/indy-plenum/blob/45b163d056c3d6b9411771a693bd3bfbb45f3569/storage/kv_store_rocksdb.py#L157

@anikitinDSR, which of the RocksDB 5.8.8 packages did you use? I'm getting errors trying to install the one from deb https://repo.sovrin.org/lib/apt xenial stable

I think this one can be useful:
deb https://repo.sovrin.org/deb xenial master

You can try this:
WadeBarnes#3

But please note that this only demonstrates that the tests work with rocksdb 5.8; it cannot be the fix.

From my point of view, we just need to copy the rocksdb 5.8 .deb package from the xenial repo to the bionic repo and set it up as in the PR.

bionic isn't really the right place for it either, since we're targeting focal.

It's OK if you want to use another repo. The main point is that you have to use rocksdb version 5.8, because our source code expects exactly the API from this version.
Using another version of rocksdb would require changes to the source code.
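Since the requirement is a specific RocksDB series, a small guard can make a mismatch obvious before the tests run. The following is a rough sketch only: the function names are hypothetical, and it assumes the installed version is available as a string such as "5.8.8" (for example from the package manager). How that string is obtained is left out.

```python
import re

# Series the source code is said to expect in this thread.
EXPECTED_SERIES = (5, 8)


def parse_major_minor(version_text):
    """Extract (major, minor) from strings like "5.8.8" or "5.17"."""
    match = re.match(r"\s*(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def is_expected_rocksdb(version_text, expected=EXPECTED_SERIES):
    """Return True when the version belongs to the expected major.minor series."""
    return parse_major_minor(version_text) == expected


if __name__ == "__main__":
    for candidate in ("5.8.8", "5.17"):
        print(candidate, is_expected_rocksdb(candidate))
```

Comparing parsed integers rather than raw strings avoids treating a version like 5.80 as if it belonged to the 5.8 series.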

I updated the https://github.com/WadeBarnes/indy-plenum/tree/20.04-test-debugging code following your recommendations to create a PoC that runs the tests via GHA (https://github.com/WadeBarnes/indy-plenum/actions/runs/1031135484) to prove all the tests pass. The one test that fails in that run is an unrelated issue.

rocksdb_5.8.8_amd64.deb has been published to the Hyperledger Indy repository and registered as supporting focal, bionic, and xenial: rocksdb_5.8.8_amd64.deb

@udosson, @anikitinDSR, I've updated the test branch with the new repository information; https://github.com/WadeBarnes/indy-plenum/blob/20.04-test-debugging/.github/workflows/build/Dockerfile.ubuntu-2004#L11-L15

Successful test run here; https://github.com/WadeBarnes/indy-plenum/actions/runs/1049822050

I've confirmed RocksDB gets picked up from the Hyperledger repository. @udosson, you can go ahead with integrating these changes into your PR, and then we can close this ticket.

This fix has been integrated into PR #1545