hashicorp/raft

OverrideNotifyBool concurrency issues

fananchong opened this issue · 2 comments

code 1 :

// runLeader runs the FSM for a leader. Do the setup here and drop into
// the leaderLoop for the hot loop.
func (r *Raft) runLeader() {
	r.logger.Info("entering leader state", "leader", r)
	metrics.IncrCounter([]string{"raft", "state", "leader"}, 1)

	// Notify that we are the leader
	overrideNotifyBool(r.leaderCh, true)

code 2 , my code :

for leader := range myRaft.LeaderCh() {
	isLeader = leader
}

LeaderCh is a channel with 1 element
There is 1 concurrency here:
goroutine 1. select check condition: ch <- v, <-ch
goroutine 2. range myRaft.LeaderCh()

In the following sequence, when ch has 1 element, runLeader will be stuck:

goroutine 1 goroutine 2 Description
ch <- v [select] check condition is blocking
range myRaft.LeaderCh() takes away the elements in ch
<-ch [select] check condition is blocking
runLeader stuck [select] both conditions are blocking, execution block
banks commented

Thanks for creating this issue. We had a couple of questions to help us understand what you are seeing.

  1. Did you actually observe any incorrect behaviour in this code? If you did can you reproduce it?
  2. Your analysis seems to describe a race if the Chan is read (goroutine 2) in between the two cases of the select statement being evaluated in overrideNotifyBool here:

    raft/util.go

    Lines 104 to 116 in 3cb47c5

    func overrideNotifyBool(ch chan bool, v bool) {
    select {
    case ch <- v:
    // value sent, all done
    case <-ch:
    // channel had an old value
    select {
    case ch <- v:
    default:
    panic("race: channel was sent concurrently")
    }
    }
    }
    Is that correct?

If we understood #2 correctly, I don't think it is a real race because of how select is implemented in Go. goselect will take a lock on every chan involved in the statement in order before evaluating each branch condition for a "ready" branch. In this case there is only one chan so it's just one lock, but both cases will be evaluated for readiness before the lock is released. So it's not possible for the recv in goroutine 2 in your table to happen "in between" since the chan would have been locked the whole time.

@dnephin made a simple test that shows that this doesn't block in any of our runs which we would expect if there were a race here (running for 10 seconds a number of times and with the race detector on) https://github.com/hashicorp/raft/compare/dnephin/show-no-race

Finally, the behaviour of range over a channel is a little unexpected for some people - the loop you wrote in the code will actually block until the chan is closed which will only happen when the Raft instance is shut down. So you should see the bool change each time the leader state changes, but the loop will block otherwise.

Does that help?

thanks