Sudden drop in TPS after total 14k transactions
drandreaskrueger opened this issue · 30 comments
Hello geth team, perhaps you have ideas where this sudden drop in TPS in quorum might come from:
Consensys/quorum#479 (comment)
System information
Geth version: quorum Geth/v1.7.2-stable-3f1817ea/linux-amd64/go1.10.1
OS & Version: Linux
Expected behaviour
constant TPS
Actual behaviour
For the first 14k transactions, we see averages of ~465 TPS, but then it suddenly goes down considerably, to an average of ~270 TPS.
Steps to reproduce the behaviour
See https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md
We've made 14 releases of go-ethereum since then, and I've no idea what modifications quorum made to our code. I'd gladly debug Ethereum-related issues, but I can't help debug a codebase that's a derivative of ours.
Thx.
Not asking for debugging, but asking for ideas about what else to try out.
Not a single idea crossed your mind when you saw that?
Would be nice to see how it behaves on top of the latest go-ethereum code.
Yes.
For that I am looking for something like the quorum 7nodes example or the blk.io crux-quorum-4nodes-docker-compose.
Perhaps this works: https://medium.com/@javahippie/building-a-local-ethereum-network-with-docker-and-geth-5b9326b85f37
Any better suggestions?
@karalabe please reopen
I am seeing the exact same problem in geth v1.8.13-stable-225171a4
I will document it now.
It happens not only in quorum, but also in geth v1.8.13:
Details here:
Oh, and I asked: yes, Quorum is going to catch up soon.
@drandreaskrueger It wouldn't drop if you send transactions via the mining node (aka the signer).
Here are the live testnet stats (https://test-stats.eos-classic.io) for the eosclassic PoA testnet (https://github.com/eosclassic/eosclassic/commit/7d76e6c86f5b0385755e492240eea4b3fba8d60d),
with a 1 s block time and a 100 million gas limit.
Explorer: https://test-explorer.eos-classic.io/home
Current throughput is around 200 TPS.
if you send transactions via the mining node (aka the signer)
I am.
In geth, I am submitting them to :8545 in this network: https://github.com/javahippie/geth-dev
In quorum, I am submitting them to :22001 in this network: https://github.com/drandreaskrueger/crux/commits/master
See https://gitlab.com/electronDLT/chainhammer/blob/master/geth.md and https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md#crux-docker-4nodes
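For reference, a minimal web3.py sketch to check which node the transactions go to and whether that node is mining/signing (a sketch only; the endpoints are assumptions about the two local networks above, and the snake_case calls assume a recent web3.py):

```python
# Sketch only: endpoints are assumptions about the two local networks above.
from web3 import Web3

for url in ("http://localhost:8545",     # geth-dev network
            "http://localhost:22001"):   # quorum crux IBFT, node 1
    w3 = Web3(Web3.HTTPProvider(url))
    if w3.is_connected():
        print(url, "block:", w3.eth.block_number, "mining:", w3.eth.mining)
    else:
        print(url, "not reachable")
```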
So ... the question is still open: Why the drop?
eosclassic POA Testnet
Interesting.
Does it make sense to also benchmark that one?
It is a public testnet, right? How can I get testnet coins? How do I add my node?
Is it a unique PoA algorithm, or one of those that I have already benchmarked?
Sorry for the perhaps naive question, I will hopefully be able to look deeper when I am back in the office on Monday.
eosclassic POA Testnet
Is there perhaps also a dockerized/vagrant-virtualbox version of such a network with a private chain, like the 7nodes example of quorum or the above-mentioned crux 4nodes (both of which run the complete network, with all nodes on my one machine)? If yes, then I would rather benchmark that. (My tobalaba.md benchmarking has shown that public PoA chains can lead to very unstable results.)
@drandreaskrueger It is a public testnet with the clique engine (and some modifications for zero-fee transactions).
There is a testnet faucet running, so you can download the node here (https://github.com/eosclassic/eosclassic/releases/tag/v2.1.1) and earn some testnet coins here (https://faucet.eos-classic.io/).
Right now there is no dockerized/vagrant-virtualbox version of our testnet, but we are planning to add AWS/Azure template support soon.
The purpose of our benchmarking is to improve the existing consensus algorithm for public use and to implement a modified version of DPoS in an Ethereum environment; the DPoS integration will be done after its specification is complete.
So feel free to use our environment; I will list our testnet/development resources here.
EOS Classic POA Testnet Resources
Explorer: https://test-explorer.eos-classic.io/
Faucet: https://faucet.eos-classic.io/
Wallet: https://wallet.eos-classic.io/
RPC-Node: https://test-node.eos-classic.io/
Network stats: https://test-stats.eos-classic.io
EOS - see
- https://gitlab.com/electronDLT/chainhammer/blob/master/eos.md and
- https://github.com/eosclassic/eosclassic/issues/19
but please keep this thread focused on answering my geth question!
the question is still open: Why the drop?
I might have time for it next week; please share any ideas, thanks.
Any new ideas about that?
You can now super-easily reproduce my results, in less than 10 minutes, with my Amazon AMI image:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami
@drandreaskrueger perhaps a silly question, but one that hasn't been asked: have you eliminated the possibility that this is an artefact of using a docker/VM-based network? Not directly related, but I have in the past had problems when developing on fabric due to VM memory limitations after a buildup of transactions. It probably isn't the problem, but it's something to check, which is what you originally asked for.
@jdparkinson93 no, I haven't. Not a silly question. It is good to eliminate all possible factors.
Please send the exact command line(s) for creating a self-sufficient node/network without docker.
Then the benchmarking can be run on that, to see whether this replicates or not.
Thanks.
And do reproduce my results, in less than 10 minutes, with my Amazon AMI image:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami
or scroll up in the same file for instructions on how to do it on any Linux system.
Then you can do experiments yourself.
Please send the exact command line(s) for creating a self-sufficient node/network without docker.
I attempted to do this in the laziest possible way by leveraging ganache, but submitting that many transactions at once caused it problems. However, drip-feeding them in batches of 100, I was able to reach and exceed 14k txs (a rough sketch of the approach is below).
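A minimal sketch of that drip-feed approach, using trivial value-0 self-transfers rather than the non-trivial contract calls actually used here (endpoint, account, and batch count are assumptions, and the snake_case calls assume a recent web3.py):

```python
# Sketch only: simple self-transfers, not the contract calls used in the
# experiment described above.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # local ganache
sender = w3.eth.accounts[0]                             # an unlocked ganache account

BATCH = 100
for batch in range(150):                                # 15,000 txs in total
    tx_hashes = [
        w3.eth.send_transaction({"from": sender, "to": sender,
                                 "value": 0, "gas": 21000})
        for _ in range(BATCH)
    ]
    # wait for the whole batch to be mined before sending the next one
    for h in tx_hashes:
        w3.eth.wait_for_transaction_receipt(h)
```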
Time (as reported by the `time` unix command) to run the first few batches of 100 txs:
15.02 real 3.55 user 0.28 sys
15.19 real 3.61 user 0.28 sys
15.51 real 3.69 user 0.29 sys
15.65 real 3.57 user 0.27 sys
15.96 real 3.65 user 0.28 sys
16.32 real 3.76 user 0.33 sys
16.29 real 3.53 user 0.27 sys
16.62 real 3.56 user 0.27 sys
16.15 real 3.70 user 0.28 sys
15.99 real 3.53 user 0.27 sys
16.22 real 3.51 user 0.27 sys
After 14000 txs:
21.64 real 3.63 user 0.29 sys
21.19 real 3.62 user 0.28 sys
20.97 real 3.65 user 0.27 sys
20.70 real 3.62 user 0.29 sys
23.41 real 3.67 user 0.29 sys
22.51 real 3.81 user 0.30 sys
21.03 real 3.64 user 0.27 sys
21.76 real 3.77 user 0.30 sys
21.38 real 3.76 user 0.28 sys
21.69 real 3.75 user 0.30 sys
21.67 real 3.65 user 0.28 sys
20.89 real 3.75 user 0.30 sys
21.12 real 3.68 user 0.29 sys
So as we can see, the time does increase somewhat... but I'd argue that's expected in this case because the transactions I sent in were non-trivial and the execution time per transaction is expected to grow as there are more transactions (since the use-case involves matching related things, which in turn involves some iteration).
Apologies that I didn't use simple Ether/ERC20 transfers for the test, as in hindsight that would have been a better comparison. Nevertheless, I'm not convinced I am seeing a massive (unexpected) slowdown of the magnitude your graphs suggest.
Thanks a lot, @jdparkinson93 - great that you try to reproduce that dropoff with different means.
Please also do reproduce my results, in less than 10 minutes, with my Amazon AMI image:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami
Or with a tiny bit more time ... on any linux system
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md
Thanks.
Today, I did a quorum run on a really fast AWS machine:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#quorum-crux-ibft
And got these results:
https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md#amazon-aws-results
with a mild drop of TPS in the last third.
@drandreaskrueger There isn't really any drop in your latest test - if anything, it appeared to incrementally speed up and then reach a ceiling. What are your thoughts regarding these new results?
After 14000 txs ... So as we can see, the time does increase somewhat
Yes, interesting. Thanks a lot, @jdparkinson93 !
You see a (21-15) / 15 = 40% increase in "real time".
I see a (350 - 220) / 350 ~40% drop in TPS.
Could be the same signal, no?
Could you plot your first column over the whole range of 20k transactions?
I would like to see whether you also see a rather sudden drop, like in most of my earlier experiments.
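As a rough way to compare the two signals, the `real` times above can be converted into a per-batch TPS series (a sketch, assuming 100 transactions per batch as described):

```python
# Rough conversion of the `real` times reported above into transactions
# per second, assuming 100 txs per batch.
early = [15.02, 15.19, 15.51, 15.65, 15.96, 16.32, 16.29, 16.62, 16.15, 15.99, 16.22]
late  = [21.64, 21.19, 20.97, 20.70, 23.41, 22.51, 21.03, 21.76, 21.38, 21.69, 21.67, 20.89, 21.12]

tps = lambda batches: 100 * len(batches) / sum(batches)
print(f"early: {tps(early):.1f} tx/s, after 14k: {tps(late):.1f} tx/s")
# roughly 6.3 tx/s early vs 4.6 tx/s after 14k txs, for this heavier workload
```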
didn't use simple Ether/ERC20 transfers for the test
I am always overwriting one uint in a smart contract. Easiest to understand here, but later optimized here.
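In essence the load looks like the following sketch (placeholder address and ABI, not the actual chainhammer contract; recent web3.py assumed):

```python
# Illustrative only: placeholder address/ABI, not the actual chainhammer contract.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
w3.eth.default_account = w3.eth.accounts[0]

abi = [{"type": "function", "name": "set", "stateMutability": "nonpayable",
        "inputs": [{"name": "x", "type": "uint256"}], "outputs": []}]
address = Web3.to_checksum_address("0x" + "00" * 20)   # replace with the deployed address
storage = w3.eth.contract(address=address, abi=abi)

for i in range(20_000):
    storage.functions.set(i).transact()                # one uint overwrite per transaction
```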
I have new information for you guys.
I ran the same thing again, BUT this time with 50k transactions (not only 20k).
Have a look at this:
https://github.com/drandreaskrueger/chainhammer/blob/eb97301c6f054cbe6e6278c91a138d10fa04a779/reader/img/geth-clique-50kTx_t2xlarge_tps-bt-bs-gas_blks12-98.png
Isn't that odd?
I would not have expected that.
TPS drops considerably after ~14k transactions, but then it actually recovers. Rinse, repeat.
More info, the log of the whole experiment, etc.:
https://github.com/drandreaskrueger/chainhammer/blob/master/geth.md#run-2
EDIT: sorry guys, after massive reorg, some paths had changed. Repaired that now, image is visible again.
sorry guys, after massive reorg, some paths had changed. Repaired that now, image is visible again.
Thank you for putting so much work into debugging this issue.
Thank you for putting so much work into debugging this issue.
Happy to help. I don't speak Go yet, and I am not debugging it myself ... but I provide you with a tool that makes debugging this problem ... much easier.
chainhammer - I have just published a brand new version v55:
https://github.com/drandreaskrueger/chainhammer/#quickstart
Instead of installing everything on your main work computer, better use (a virtualbox Debian/Ubuntu installation or) my Amazon AMI to spin up a t2.medium machine; see docs/cloud.md#readymade-amazon-ami.
Then all you need is this line:
CH_TXS=50000 CH_THREADING="threaded2 20" ./run.sh "YourNaming-Geth" geth-clique
and afterwards check results/runs/ to find an autogenerated results page with time series diagrams.
Hope that helps! Keep me posted please.
@drandreaskrueger We recently made some changes in the txpool so that local transactions are not stalled for that long, plus some improvements to the memory management. Previously we would need to reheap the txpool whenever a new tx was sent; this created a lot of pressure on the garbage collector. I don't know how to use AWS and have no account there. Could you maybe rerun your chainhammer once again with the newest release? Thank you very much for all your work!
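For intuition only (this is not geth's actual txpool code, just a generic sketch of why reheaping a pool on every incoming transaction is costly compared to incremental inserts):

```python
# Illustration, not geth code: rebuilding a priority structure on every
# insert is O(n) per transaction; an incremental heap push is O(log n).
import heapq
import random
import time

N = 20_000
prices = [random.randint(1, 10**9) for _ in range(N)]

t0 = time.perf_counter()
pool = []
for p in prices:            # "reheap the pool whenever a new tx is sent"
    pool.append(p)
    heapq.heapify(pool)
t1 = time.perf_counter()

heap = []
for p in prices:            # incremental insert instead
    heapq.heappush(heap, p)
t2 = time.perf_counter()

print(f"reheap per insert: {t1 - t0:.2f}s, incremental push: {t2 - t1:.2f}s")
```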
Thank you very much for your message, @MariusVanDerWijden. Happy to be of use.
Right now it's not my (paid) focus, honestly, but I intended to rerun 'everything chainhammer' with the then-newest versions of all available Ethereum clients at some point in the future anyway. I could also imagine publishing all the updated results in a paper, so this is good to know. Great. Thanks for getting back to me. Much appreciated.
Cool that you made some changes here, that those changes might have something to do with these symptoms, and that you improved the client. Great.
Oh, and AWS is really not important here. If you want to, try it yourself now. I recommend first watching the (boring) 15-minute video; these step-by-step instructions have also been really helpful for other users.
It shouldn't be difficult for you to get new geth results using chainhammer. It could also be a useful tool for you to identify and fix further bottlenecks, especially because it is so simple.
This ticket/investigation is getting quite old, and I suspect it is no longer relevant with the current code. I'm going to close this for now.