stellar/stellar-core

Consider deprecating sleeping at apply time for simulation purposes and implement an alternative

marta-lokhova opened this issue · 9 comments

OP_APPLY_SLEEP_TIME_DURATION_FOR_TESTING and OP_APPLY_SLEEP_TIME_WEIGHT_FOR_TESTING configs seem to be obsolete now:

  • Latest core defaults to BucketListDB, so we can't rely on in-memory SQL for "instant" application
  • With the introduction of Soroban, it's unclear whether these flags actually work as expected because of the env overhead. As a result, applying Soroban transactions is also not instant, so application time might be well beyond the desired duration (OP_APPLY_SLEEP_TIME_DURATION_FOR_TESTING)
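For context, this is roughly how these knobs get wired up in a simulation config today, together with the in-memory SQL backend the approach assumed. The exact units and distribution semantics here are from memory (I believe the durations are sampled per applied operation according to the weights), so treat the values as an illustrative sketch rather than authoritative docs:

    # Illustrative sketch only -- double-check units/semantics against Config.h
    DATABASE = "sqlite3://:memory:"   # the "instant apply" assumption relied on this
    # candidate sleep durations and the weights of the discrete distribution over them
    OP_APPLY_SLEEP_TIME_DURATION_FOR_TESTING = [500, 2000]
    OP_APPLY_SLEEP_TIME_WEIGHT_FOR_TESTING = [80, 20]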

cc @bboston7 @SirTyson for input

No objections, sounds like a good idea

So what is the alternative? Those settings were there to allow testing overlay without actually having real state, while still having a distribution for transaction apply time that maps to the simulation scenario.

I think this is less important / is achievable via the Soroban loadgen now. Previously, we didn't have a good way of ramping up the apply time complexity with just payments, so we had to sleep after the application to achieve the desired time distribution. This is not the case with Soroban, where we can specify an arbitrary amount of IO and CPU. From the overlay perspective it doesn't matter if the apply time is spent sleeping in a classic op or just spinning CPU in a Soroban op. I think we can achieve the same time delays for overlay testing, but with time spent in the actual env, which seems to accomplish the same goal while being more realistic.
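To make that concrete, here's the kind of loadgen shaping I have in mind. The knob names below are from memory and may not match current Config.h exactly, so read this as a sketch of the idea rather than exact syntax:

    # Sketch: make apply time come from real env work instead of sleeping.
    # Knob names are approximate; check Config.h for the exact spelling.
    LOADGEN_INSTRUCTIONS_FOR_TESTING = [1000000, 50000000]       # CPU burned per invoke
    LOADGEN_INSTRUCTIONS_DISTRIBUTION_FOR_TESTING = [90, 10]
    LOADGEN_IO_KILOBYTES_FOR_TESTING = [1, 64]                   # IO touched per invoke
    LOADGEN_IO_KILOBYTES_DISTRIBUTION_FOR_TESTING = [90, 10]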

Kind of. You cannot sleep in Soroban, only burn CPU. The reason we have a sleep here instead of a busy loop is that it allows running fairly complex topologies without having to dedicate the "real hardware" necessary to perform the simulation (ie: kubernetes workers only need to be powerful enough to handle overlay). For tests where we want to push thousands of tx/s this makes a very big difference (and this will continue to be true, as we won't need ledger execution cores even with ledger apply done in the background)

So what is the alternative? Those settings were there to allow testing overlay without actually having real state, while still having a distribution for transaction apply time that maps to the simulation scenario.

This isn't quite right; the trick was to still have the state, but use in-memory SQL and sleep on top. We're moving towards BucketListDB support only, and dropping SQL completely. So we need to figure out what to do with this simulation mode, and whether it's worth supporting and figuring out some kind of in-memory BucketListDB solution.

Looking at the current usage, we use OP_APPLY_SLEEP_TIME_DURATION_FOR_TESTING in two places:

  • Pubnet simulation, where we set these numbers to something super low. Honestly, I don't think it's worth keeping it there. The numbers are so outdated that the mission is not realistic. We should just switch to a normal DB backend to test something more real.
  • Max TPS tests, where we set the config to 0 to test "ideal conditions". This one I think is worth fighting for, since using a disk-backed backend will yield much lower perf numbers. But I'm not certain how well the current simulation approach works, with "sleeping on top of apply time". My understanding is that invoking Soroban ops creates a lot of overhead even with an in-memory database, so application is not "instant". Is that accurate?

To be clear, I'm definitely in favor of simulating apply time for perf measurement purposes. We've seen anomalies from DB backends in the past (specifically, with Postgres), which created noise in overlay measurements. What I'm saying in this issue is that the current implementation doesn't work with BucketListDB, and might not work well with Soroban traffic (even with SQL backend).

Maybe we need to move towards being actually stateless in overlay simulations. It does require some changes though (make loadgen generate well-formed but invalid transactions while still supporting other modes, simulate validation time somehow, etc). Of course we could also just keep in-memory SQL for simulation purposes only, but that's super annoying as we'd have to keep and maintain the SQL code, so I'd prefer not to go that route.

With the introduction of Soroban, it's unclear whether these flags actually work as expected because of the env overhead. As a result, applying Soroban transactions is also not instant, so application time might be well beyond the desired duration (OP_APPLY_SLEEP_TIME_DURATION_FOR_TESTING)

This is exactly what I saw in practice when using this while developing the new mixed mode max TPS mission. There is no impact on max TPS when using the values from the pubnet simulation. It appears this is due to those values being so low, but it would take a bit more testing to confirm that. I could have raised the sleep time, but doing so would also slow down pay transactions. If we want to use this approach for mixed mode, I think we need different distributions for different transaction types.

Overall though, I think removing this is the right move, given that hitting BucketListDB is now necessary and provides a more realistic representation of apply time anyway.

I think we're mixing two different use cases:

  • using loadgen for testing on "real hardware" (ie: answer questions like "how many tps can testnet do"). Here high fidelity is needed, so using BucketListDB etc. is the way to go. In this kind of setup, sleeping is not what we want to do, as we want to test "end-to-end" using synthetic traffic
  • using loadgen for testing the impact of changes in core outside of the transaction subsystem (overlay and consensus are the main ones). Those tests are typically done on larger deployments (supercluster missions) where pods use shared workers, and in this context resource constraints matter: storage/IO performance is a fraction of real performance, and we can't use realistic ledgers because, when collocated, they use up too much space (which also causes slow setup times); CPU capacity is much lower than "real instance types" as well, because cores are shared. This is where bypassing as much of the disk/CPU as possible is beneficial.

I want to clarify that the intent of this issue is to focus exclusively on the simulate-apply feature in the context of overlay perf testing. Real hardware testing is outside the scope of this work (I agree with Nicolas that we should just use BucketListDB as the default).

The question is what to do for overlay perf testing. We need to be able to tweak ledger close time in a controlled way in order to avoid noise in measurements. Our current solution in master is to use an in-memory SQL DB, and sleep on top. The assumption here was that in-memory application should be "nearly instant", allowing us to control total close time by sleeping.

This assumption is no longer valid: in-memory SQL does nothing when BucketListDB is enabled, and we go to disk anyway. In addition, it's not clear if Soroban operations cause too much env overhead (for instance, @bboston7 is seeing ledger close take several seconds, which is way beyond the configured sleep time).

I suggest we revisit our simulation mode and move towards stateless simulations: this means we can skip tx application in ledger close completely. The tradeoff is that we'll end up with garbage blocks and need to simulate block/tx validation, but that doesn't seem too bad as long as loadgen generates more or less realistic, well-formed transactions.
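To make the proposal a bit more concrete, a stateless simulation mode could be driven by config knobs along these lines. None of this exists today; the names are purely hypothetical and just here to illustrate the shape of the feature:

    # Hypothetical knobs, for discussion only -- nothing below exists in core today
    ARTIFICIALLY_SKIP_TRANSACTION_APPLY_FOR_TESTING = true   # produce "garbage" blocks, no state changes
    # distribution of simulated per-ledger apply/validation delay, replacing the op-level sleeps
    SIMULATE_APPLY_TIME_DURATION_FOR_TESTING = [50, 200]
    SIMULATE_APPLY_TIME_WEIGHT_FOR_TESTING = [70, 30]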

Thoughts?