blur-network/blur

Incorrect block hash appears periodically during sync

Closed this issue · 21 comments

Errors like this one pop up periodically while syncing if full verification is enabled

src/cryptonote_protocol/cryptonote_protocol_handler.inl:1031 [xxx.xxx.xx.xxx:52541 OUT] Got block <c45c2622079c1abf1400000000000000697e28464e7f0000140a000000000000> with unknown parent <fd365d62f979f8651708a6eb7e543730c32ff261684e3a337ffce8dd6c633be7> which was not requested - querying block hashes

The parent block exists... but obviously the block with cleared bits does not. The temporary stall in verification resolves itself, if the daemon is just left to its own devices.

Not sure if this is happening due to a lack of proper locking during sync, or what. The fact that it resolves itself seems to imply that verification is either being disabled temporarily to move through the failure, or that the block is being popped and sync is beginning again from the rewind point.

leto commented

I am seeing this on Ubuntu 16.04 as well, just figured I should mention

Maybe a bit presumptuous, but this appears to be the area of code that is likely causing the issue: https://github.com/blur-network/blur/blob/master/src/cryptonote_protocol/cryptonote_protocol_handler.inl#L965-L1186

Noting here to come back to later, when I have proper time. This is based solely on the correlation between the m_sync_lock failure, and the incorrect block hashes. If locking is not the issue here, then this is likely incorrect.

leto commented

Just giving another example here in case it's useful:


2019-07-29 03:13:12.419 [P2P6]  ERROR   verify  src/cryptonote_core/blockchain.cpp:3249 Block with id: <eb2ae0123ec8349a534d56e44d71e20dc6d4dd64e4b66d1ba38eaf14f24b456e>
does not have enough proof of work: <1c94bdc1942cfd934f58a717e555bf4ca34ab029d94effddf71fd7ff5490c86b>
unexpected difficulty: 12537449

which happens on the latest master commit 9f8a0a4

@leto you're saying that commit directly affected a change in block verification? Or something else?

These are separate bugs, I think.

leto commented

@who-biz sorry, I thought those were the same error message. Please disregard

@leto No, they are indeed the same error message. However, based on the structure of the invalid PoW hash, there appears a different catalyst behind them. Are you able to discern if that commit precisely caused the behavior you're seeing?

Either way, I think opening a separate issue report for the bug you're seeing is likely the best course of action. It appears that the hash with zeroed bits in my original issue report is being caused by over-scanning of the data stream in the database, or something like that. Possibly the txn being committed on an incorrect bit, as two strings of adjacent zeroes (in the amount of 13-15) in a legitimately calculated hash would be incredibly unlikely.

In contrast, yours appears to be an error in PoW hash calculation resulting in a legitimately incorrect hash. This would imply that an erroneous arithmetic operation within the function get_block_longhash or in the file slow-hash.c is the cause behind it.

@leto The padded hash that I was encountering is fixed in 5f5f886. However, I am now seeing the incorrect hash appearing as well, with an error that reads:

[P2P5]	ERROR	net	contrib/epee/include/net/abstract_tcp_server2.inl:352	Exception at [connection<t_protocol_handler>::handle_read], what=basic_string::_M_construct null not valid

printed to stdout afterward(presumably directly related). This is after changing the code that writes the block hash to stdout to use ostringstream.

So, you were correct in thinking these arise from the same underlying issue -- which seems to be translation of the block hash from binary to human-readable text... The 5f5f886 changes fix the looping that was occurring, when this is encountered, thereby speeding up sync... but doesn't fix the underlying issue. Any chance you have some insight as to a certain behavior or operation that triggers this?

May be related to this code: 2072f15#diff-fca43d50a482c1d6cfb80eee7fcf688dL172 and its effect, prior to removal. I thought that it may resolve the error, but apparently not.

I'll continue debugging.

leto commented

@who-biz I haven't tried to use the latest code on any rigs lately, but I will be doing some rig maintenance and will let you know what I find

For purposes of fully documenting the error messages:

[P2P6]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:3461 +++++ BLOCK SUCCESSFULLY ADDED
id:     <263faa92511dc9ff404dd50a10818050865d7d2dd29ae9081dd0685f1b22f83f>
PoW:    <c1aa6e627512f661b862712fe27a3a54a5e6cdfb34dc27834e947c1026010000>
HEIGHT 307603, difficulty:      4021419
block reward: 6.303681354997(6.303681354997 + 0.000000000000), coinbase_blob_size: 85, cumulative size: 85, 7(0/7)ms
[P2P6]  ERROR   verify  src/cryptonote_core/blockchain.cpp:3247 Block with id: <393dab3e65025386bcf8331e57b96bf6306908ae47b1ddef4ac99af747d18eea>
does not have enough proof of work: <4b01ce4976e77fd1d749ae33b660baacf99621e8983ccd320a9a65ba99c85add>
unexpected difficulty: 4091548
[P2P6]  INFO    net.cn  src/cryptonote_protocol/cryptonote_protocol_handler.inl:1107    [yyy.yy.yy.yyy:52541 OUT] Block verification failed, dropping connection
[P2P6]  INFO    net.p2p src/p2p/net_node.inl:1617       [yyy.yy.yy.yyy:52541 yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy OUT] CLOSE CONNECTION
[P2P7]  INFO    net.p2p.msg     src/cryptonote_protocol/cryptonote_protocol_handler.inl:841     [xx.xx.xxx.xx:52541 OUT] Received NOTIFY_RESPONSE_GET_OBJECTS (20 blocks, 0 txes)
[P2P7]  ERROR   net     contrib/epee/include/net/abstract_tcp_server2.inl:352   Exception at [connection<t_protocol_handler>::handle_read], what=basic_string::_M_construct null not valid
[P2P7]  INFO    net.p2p src/p2p/net_node.inl:1617       [xx.xx.xxx.xx:52541 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx OUT] CLOSE CONNECTION
[P2P4]  INFO    net.p2p.msg     src/cryptonote_protocol/cryptonote_protocol_handler.inl:841     [66.70.189.131:52541 OUT] Received NOTIFY_RESPONSE_GET_OBJECTS (20 blocks, 0 txes)
[P2P4]  ERROR   net     contrib/epee/include/net/abstract_tcp_server2.inl:352   Exception at [connection<t_protocol_handler>::handle_read], what=basic_string::_M_construct null not valid

xs denote one node, ys another. Other IP was left in, since it is a seed node address, anyway.

Really not sure whats happening here, still. Particularly because that difficulty doesn't appear to match that PoW hash, at first glance. Edit: Yeah, that's because that's checking what diff should be against that incorrect PoW hash. So difficulty should be correct there, and PoW hash is not...

Now to figure out why the longhash calculation isn't correct on these blocks... need to determine if issue is arising from peer communications sending too much data, or something on blockchain level. Peer comms being the problem would be much more pleasant

leto commented

@who-biz I am relatively sure that figuring out exactly why this happens will clear things up:

what=basic_string::_M_construct null not valid

I have seen that _M_construct null error before in Bitcoin/Zcash C++ codebase, and usually means some kind of bad data marshalling, such as an empty string attempting to be used as an object

I have seen that _M_construct null error before in Bitcoin/Zcash C++ codebase, and usually means some kind of bad data marshalling, such as an empty string attempting to be used as an object

This makes sense, because this portion of code has comments stating that the last block transmitted by a peer may be empty due to x circumstances.

The part that is concerning me, is that the node syncing drops the connection on this single invalid block. So, a peer sending invalid blocks seems to cause the block streams from other peers to be considered invalid as well, and all affected connections appear to be dropped. Sync resumes when a new connection is made, and a new block queue is populated.

Prior to these changes: 5f5f886, that null block was trying to validate in a recursive loop that wouldn’t quit for a fair amount of time. So this seems to be better... but I’ll have to dig some more to figure out a better way to handle the issue, if it can’t be fixed altogether.

Disregard all of the commits above. They've been reverted as they did not solve the issue.

Auto-closed due to commit title, re-opening pending further investigation.

The _M_string::construct_null error no longer appears during sync. Still, proof of work appears to fail verification for a given block, occasionally. We are able to add the block if another peer has the same block with a valid PoW hash.

I recognize this is probably somewhat confusing... but see error log below for what I mean

2019-09-13 06:04:45.284	[P2P8]	ERROR	verify	src/cryptonote_core/blockchain.cpp:3247	Block with id: <16213e933c0a1cbbbddd1cebada77f58dca3889a05ffc93bb2d6464f6e3f3304>
does not have enough proof of work: <3f0a3ac7c1fc123820dcfa8959c21c50699f05120a3d0265ccc55a6155d7d072>
unexpected difficulty: 17094228
2019-09-13 06:04:45.284	[P2P8]	INFO 	net.cn	src/cryptonote_protocol/cryptonote_protocol_handler.inl:1105	[66.70.188.178:52541 OUT] Block verification failed, dropping connection
2019-09-13 06:04:45.296	[P2P5]	INFO 	blockchain	src/cryptonote_core/blockchain.cpp:3461	+++++ BLOCK SUCCESSFULLY ADDED
id:	<16213e933c0a1cbbbddd1cebada77f58dca3889a05ffc93bb2d6464f6e3f3304>
PoW:	<de01e195916023f1a6f8de185e413464ff13c007f9b6bba5802ba33779000000>
HEIGHT 127779, difficulty:	17094228
block reward: 7.482952322303(7.482952322303 + 0.000000000000), coinbase_blob_size: 85, cumulative size: 85, 10(0/10)ms

So what happens, is if the data we retrieve from one node causes us to incorrectly compute a given block's PoW hash, we pick up the sync at the same spot from a different peer, rather than the prior behavior of dropping that block for all peers.

Now, the final step in resolving the title of this issue "Incorrect block hash appears during sync"... is actually figuring out what is causing that calculated longhash value to deviate from those on other nodes' ledgers.

Good news: is that we now prevent sync stalling when the error is encountered. Only that single node's connection is dropped, and we can reconnect later during regular idle handshakes.

leto commented

@who-biz looks like you are getting closer to the entire fix.

If some nodes are calculating a hash differently than other nodes, that leads me to think that some OS-specific or compiler-specific issue is causing the calculation to be different, leading to a consensus change and the nodes forking.

Sometimes this is caused by data types meaning different things or having different sizes or bit alignments, such as unsigned int being 32bit or 64bit on different systems.

Do we have any data about this problem occurring only on certain OS's or perhaps only with release binaries vs compiled from source?

@leto

Something that I found out recently: all types suffixed with _t are reserved as macros on POSIX systems. Maybe related?

But yes. Reports seem to show that systems running off on their own forks tends to occur on Windows more frequently.

leto commented

@who-biz these kinds of bugs happen in Bitcoin/Zcash world in the internals of GMP, which have bugs on Windows. Most coins have their own patches for GMP to fix it. They have to do with memory-alignment assumptions which are sometimes false.

@who-biz these kinds of bugs happen in Bitcoin/Zcash world in the internals of GMP, which have bugs on Windows. Most coins have their own patches for GMP to fix it. They have to do with memory-alignment assumptions which are sometimes false.

I'll look into this as soon as I can. It sounds a bit more involved, but it may be somewhat easy to port BTC's changes, similar to what was done with the depends build system.

Any idea if this happens on MacOS typically as well? There are two places that I see alignment macros used: 1.) Within a warning I get compiling for MacOS, but this seems to be a redefinition of existing alignment, and doesn't appear to have an effect. 2.) Within the database, where misalignment appears to be allowed (It doesn't seem like that would be at all a good thing...)

leto commented

@who-biz yes, if you look at the source code of KMD, there are various places where GMP patches are applied to fix bugs in Windows and I think also mac. All Zcash forks have these patches to make stuff work correctly across platforms, not sure all of it has ever gotten upstream to Zcash itself nor GMP.

4e2d8b9 appears to have resolved this. Just removed the problematic area of code, since all it was doing was a check against a block queue, which really isn't being employed properly anyway.

For whatever reasons, whoever wrote this code appear to have intentionally crafted this into a synchronous process, which means a queue is not really necessary at all. As a result, users are crippled into relying on the ordering of blocks in the "queue," and are effectively DDoSed if something goes awry.

All said, I really can't see what the point of this block of code is, if not simply to cause the issues we are encountering... Or, maybe I'm missing something.

Will leave open for visibility, until this becomes a confirmed solution.