relab/hotstuff

liveness problem when encountering view changes

Closed this issue · 2 comments

Dear authors,

Hi, I think this repository is excellent work. However, I have encountered a liveness problem when using the round-robin mode.

With the default settings, I run a 4-node cluster and a client. Normally, the CPU usage of each hotstuff server is about 200% per instance.
But after running hotstuff for some time, the CPU usage drops below 10%, and the hotstuff servers stop serving requests.

I think this is because a view-change is happening: some blocks cannot get enough certifications, and the blockchain diverges from one server to another.

Setting the view-timeout to a lower value (e.g. 10 ms) increases the probability of encountering this liveness problem.

This has been troubling me a lot, and I am looking forward to hearing from you. Thanks!

That is a deadlock problem. I fixed it by changing the structure of data.Block to use the mutex of the HotStuffCore for locking, rather than its own mutex.

A later block should wait until all of its parent blocks have arrived.

// The caller must hold hs.Mut (acquired via hs.Mut.Lock()).
func (hs *HotStuffCore) PutBlock(b *data.Block) {
	for {
		if _, ok := hs.Blocks.Get(b.ParentHash); ok { // store only after its parent has been stored
			hs.Blocks.Put(b)
			hs.waitProposal.Broadcast()
			return
		}
		hs.waitProposal.Wait() // releases hs.Mut while waiting
	}
}

// The caller must hold hs.Mut (acquired via hs.Mut.Lock()).
func (hs *HotStuffCore) GetBlock(hash data.BlockHash) *data.Block {
	for {
		if block, ok := hs.Blocks.Get(hash); ok {
			return block
		}
		hs.waitProposal.Wait() // releases hs.Mut while waiting
	}
}

This modification works in normal operation without the liveness problem, but it might introduce a safety problem in Byzantine cases.

Likely fixed in the new version.