[BUG]: On testnet 4, unknown previous block errors during sync
Is there an existing issue for this?
- I have searched the existing issues
Current behavior
When attempting to sync Testnet 4, I got quite a few of these:
p2p_1 | 2022-08-23 20:26:19.075986 (p2p.Koinos) [p2p/error_handler.go:81] <info>: Encountered peer error: QmcGiTpSm6YrmYo3rWoqrCPez2aJY4VdraBQsGsZKwFRuG, block application failed: local RPC error ApplyBlock, chain rpc error, unknown previous block. Current error score: 104841
The error score increased until the peer was disconnected, and I ended up with no peers.
Expected behavior
This error shouldn't occur (unless a peer is doing something weird, like running modified, possibly malicious p2p code, or sending blocks out of order).
Steps to reproduce
- Clone https://github.com/koinos/koinos
- Start with docker-compose up
Environment
- OS: Ubuntu 22.04 (x86-64)
Anything else?
I have a guess about what's causing the error. I haven't checked the code; it's just a hypothesis that seems to fit my observations.
Normally, parallelizing requests A, B, C is good for performance, but it causes a race for block submissions because the requests are dependent. If A = submit block 101, B = submit block 102, C = submit block 103, you can't parallelize A, B, and C. And that kind of rapid-fire submission of a chain of dependent blocks is exactly what happens during a sync.
I think that somewhere, somehow, requests are getting improperly parallelized or reordered. Where could this be happening? Here are my guesses:
- The p2p code bulk requests blocks, then there's a handler that submits each block to the chain as it comes in. *There might be* some async or threading stuff going on in the p2p code that lets later blocks be submitted before earlier block submissions are finished processing.
- Somewhere on the submission reception side, perhaps in koinos-chain, *there might be* some async or threading stuff that lets request handling happen in parallel. It might be buried in koinos-mq-cpp or even lower down.
- Somewhere in the RabbitMQ docs, *there might be* a paragraph that says "Messages are not guaranteed to be received in the order they are sent. If program P submits A, B, C, then program Q might read A, C, B."
The words *there might be* are in italics to mean: "This is something that, if it's actually this way, would explain the bug. But I haven't looked to see if it's actually this way, so it might not actually be this way."
So if this is our problem, two solutions immediately come to mind:
Simple solution: Add a lock to p2p so only one block submission can be in flight at a time. This may have severe performance penalties for sync.
Complicated solution: Create an API that allows submitting multiple blocks in a single batch message. This guarantees the messages aren't reordered by putting them into a single application-level message. The p2p code for properly using such an API would be moderately complicated:
- Have each peer handler put the blocks it wants to submit into a single process-wide Go channel
- Have a single process-wide goroutine that drains the channel
- When draining the channel, have a loop that checks if a non-blocking queue read would retrieve an additional block
- Accumulate those immediately available blocks into a batch message until some size or count threshold is reached (or you run out of immediately available blocks)
- Rather than submitting a block to the queue, the peer handler actually submits an object that contains the block and a channel; the channel acts as a future for the chain's response to the block (which may be an error).
- The peer handler awaits the future and assigns naughty points as appropriate if the chain errored.
The idea of the complicated solution is this: The simple solution has a performance penalty for sync because the interprocess round-trip time dominates the time required to process a block. You can't pipeline and are forced to wait for that round-trip, because some thread or async thing (or possibly RabbitMQ semantics allowing reordering of messages) causes a race if you try.
The complicated solution recognizes that you still have a restriction of one block submission at a time, but if you make that submission a batch, you can amortize the round-trip time over the batch size and improve performance-per-block.