Collected Concurrency Bugs in Our ASPLOS Paper

Data Set for "Understanding Real-World Concurrency Bugs in Go" in ASPLOS'2019

Abstract

In this paper, we perform the first systematic study of concurrency bugs in real Go programs. We analyzed 171 concurrency bugs in total from six popular Go applications, and found that more than half of them are caused by non-traditional, Go-specific problems. Apart from the root causes of these bugs, we also studied their fixes, performed experiments to reproduce them, and evaluated them with two publicly available Go bug detectors. Overall, our study provides a better understanding of Go's concurrency models and can guide future researchers and practitioners in writing better, more reliable Go software and in developing debugging and diagnosis tools for Go.

Study Methodology

Go Applications

We selected six representative, real-world applications written in Go: two container systems (Docker and Kubernetes), one key-value store (etcd), two databases (CockroachDB and BoltDB), and one RPC library (gRPC-go). These are open-source projects that are widely used in datacenter environments. The following table summarizes the selected applications.

Application   Stars   Commits   Contributors   LOC     Dev History
Docker        48975   35149     1767           786K    4.2 Years
Kubernetes    36581   65684     1679           2297K   3.9 Years
etcd          18417   14101     436            441K    4.9 Years
CockroachDB   13461   29485     197            520K    4.2 Years
gRPC-go       5594    2528      148            53K     3.3 Years
BoltDB        8530    816       98             9K      4.4 Years

Bug Taxonomy

We propose a new method to categorize Go concurrency bugs along two orthogonal dimensions. The first dimension is the behavior of a bug. If one or more goroutines are unintentionally stuck in their execution and cannot move forward, we call the issue a blocking bug. If instead all goroutines can finish their tasks but their behavior is not the desired one, we call it a non-blocking bug. The second dimension is the root cause of a bug: misuse of shared memory or misuse of message passing. The following table shows the breakdown of bug categories for each application.

              Behavior                  Root Cause
Application   Blocking   Non-Blocking   Shared Memory   Message Passing
Docker        21         23             28              16
Kubernetes    17         17             20              14
etcd          21         16             18              19
CockroachDB   12         16             23              5
gRPC-go       11         12             12              11
BoltDB        3          2              4               1
Total         85         86             105             66

Blocking Bugs

Overall, we found that around 42% of the blocking bugs are caused by errors in protecting shared memory, and around 58% are caused by errors in message passing. Since shared memory primitives are used more frequently than message passing ones, this means that message passing operations are even more likely to cause blocking bugs than the raw percentages suggest.

Shared Memory

For example, Docker#25384 involves a shared variable of type WaitGroup, as shown in the following figure. The Wait() at line 7 can only be unblocked when Done() at line 5 has been invoked len(pm.plugins) times, since len(pm.plugins) is passed as the parameter to Add() at line 2. However, Wait() is called inside the loop, so it blocks goroutine creation at line 4 in later iterations, which in turn prevents the Done() calls inside those goroutines from ever running. The fix is to move the invocation of Wait() out of the loop.

1 var group sync.WaitGroup
2 group.Add(len(pm.plugins))
3 for _, p := range pm.plugins {
4   go func(p *plugin) {
5     defer group.Done()
6   }(p)
7 - group.Wait()
8 }
9 + group.Wait()
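
As a rough illustration of the fixed pattern, the sketch below calls Wait() once, after all goroutines have been started. The plugin type, the restoreAll function, and the counter are hypothetical stand-ins introduced for this example; they are not taken from the Docker source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// plugin is a hypothetical stand-in for the plugin type in the figure.
type plugin struct{ name string }

// restoreAll starts one goroutine per plugin and returns how many of them ran.
// Wait is called once, after the loop, as in the fixed version of the bug.
func restoreAll(plugins []*plugin) int64 {
	var restored int64
	var group sync.WaitGroup
	group.Add(len(plugins))
	for _, p := range plugins {
		go func(p *plugin) {
			defer group.Done()
			atomic.AddInt64(&restored, 1) // placeholder for per-plugin work
		}(p)
	}
	group.Wait() // outside the loop: every goroutine has already been created
	return restored
}

func main() {
	plugins := []*plugin{{"a"}, {"b"}, {"c"}}
	fmt.Println(restoreAll(plugins))
}
```

Moving group.Wait() back inside the loop body would reproduce the hang whenever there is more than one plugin, because the first iteration would wait for Done() calls from goroutines that have not been created yet.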

Message Passing

The following bug is caused by errors in message passing. The finishReq function creates a child goroutine using an anonymous function at line 4 to handle a request, which is a common practice in Go server programs. The child goroutine executes fn() and sends the result back to the parent goroutine through channel ch at line 6. The child will block at line 6 until the parent pulls the result from ch at line 9. Meanwhile, the parent will block at the select until either the child sends the result to ch (line 9) or a timeout happens (line 11). If the timeout happens first, or if the Go runtime (non-deterministically) chooses the case at line 11 when both cases are ready, the parent returns from finishReq() at line 12, and no one can pull the result from ch any more, so the child is blocked forever. The fix is to make ch a buffered channel with capacity 1 (line 3), so that the child can always complete its send.

1 func finishReq(timeout time.Duration) ob {
2 -   ch := make(chan ob)
3 +   ch := make(chan ob, 1)
4   go func() {
5     result := fn()
6     ch <- result // blocks forever if ch is unbuffered and the parent has returned
7   }()
8   select {
9     case result := <-ch:
10       return result
11     case <-time.After(timeout):
12       return nil
13   }
14 }

Non-Blocking Bugs

We found that around 80% of the non-blocking bugs we collected are due to unprotected or wrongly protected shared memory accesses, and around 20% are caused by errors in message passing.

Shared Memory

One example from Docker is shown in the following figure. Local variable i is shared between the parent goroutine and the goroutines it creates at line 2. The developer intends each child goroutine to use a distinct value of i to initialize the string apiVersion at line 4. However, the values of apiVersion are non-deterministic in the buggy program. For example, if the child goroutines begin only after the parent's loop has finished, every apiVersion equals 'v1.22', because i is 22 when the loop exits. The buggy program produces the desired result only when each child goroutine initializes apiVersion immediately after its creation, before i is assigned a new value. The fix passes i to each goroutine as an argument (lines 3 and 7), so that every child works on its own copy.

1 for i := 17; i <= 21; i++ { // write
2 -   go func() { /* Create a new goroutine */
3 +   go func(i int) {
4       apiVersion := fmt.Sprintf("v1.%d", i) // read
5       ...
6 -   }()
7 +   }(i)
8 }
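
A self-contained sketch of the fixed pattern follows. The apiVersions helper, the mutex-protected slice, and the sorting step are assumptions added so the result can be printed deterministically; they are not part of the Docker code. Note also that since Go 1.22 each loop iteration gets its own copy of the loop variable, so the buggy form behaves differently when built with that language version or later; passing the value explicitly stays correct under both semantics.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// apiVersions builds one version string per goroutine. The loop variable is
// passed as an argument, so each goroutine reads its own copy of i.
func apiVersions(from, to int) []string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		versions []string
	)
	for i := from; i <= to; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			v := fmt.Sprintf("v1.%d", i)
			mu.Lock() // the shared slice needs its own protection
			versions = append(versions, v)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	sort.Strings(versions) // goroutines finish in arbitrary order
	return versions
}

func main() {
	fmt.Println(apiVersions(17, 21)) // prints [v1.17 v1.18 v1.19 v1.20 v1.21]
}
```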

Message Passing

Docker#24007, shown in the following figure, is caused by violating the rule that a channel can be closed only once. When multiple goroutines execute this piece of code, more than one of them can take the default clause and try to close the channel at line 5, causing a runtime panic in Go. The fix wraps the close in Once.Do (lines 4 to 6), so that it executes at most once.

1 - select {
2 -   case <- c.closed:
3 -   default:
4 +     Once.Do(func() {
5         close(c.closed)
6 +     })
7 - }
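
The close-once idiom can be sketched as follows. The conn type, its fields, and newConn are hypothetical names chosen for this example, and the sketch assumes that Once in the figure is a sync.Once.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical type that owns a channel used to signal closure.
type conn struct {
	closed chan struct{}
	once   sync.Once
}

func newConn() *conn {
	return &conn{closed: make(chan struct{})}
}

// Close may be called concurrently from many goroutines. sync.Once guarantees
// that the channel is closed exactly once, so no "close of closed channel"
// panic can occur.
func (c *conn) Close() {
	c.once.Do(func() {
		close(c.closed)
	})
}

func main() {
	c := newConn()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Close()
		}()
	}
	wg.Wait()
	<-c.closed // a closed channel is always ready to receive from
	fmt.Println("closed without panic")
}
```

The select-with-default form in the buggy code checks and closes in two separate steps, so two goroutines can both observe the channel as open before either closes it. sync.Once makes the check and the close a single atomic step.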

Papers

Understanding Real-World Concurrency Bugs in Go. Tengfei Tu, Xiaoyu Liu, Linhai Song, Yiying Zhang. To Appear at the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19).

Citation

@inproceedings{go-study-asplos,
    author = {Tu, Tengfei and Liu, Xiaoyu and Song, Linhai and Zhang, Yiying},
    title = {Understanding Real-World Concurrency Bugs in Go},
    booktitle = {ASPLOS},
    year = {2019},
}

Forum

  1. https://golangnews.org/2019/03/understanding-real-world-concurrency-bugs-in-go/
  2. https://lobste.rs/s/wan3io/understanding_real_world_concurrency
  3. https://news.ycombinator.com/item?id=19280927
  4. http://taint.org/2019/03/02/235801a.html
  5. https://golangweekly.com/issues/251
  6. https://www.jtolio.com/2016/03/go-channels-are-bad-and-you-should-feel-bad/
  7. https://www.reddit.com/r/golang/comments/awjf2b/understanding_realworld_concurrency_bugs_in_go_pdf/?ref=readnext
  8. https://www.bilibili.com/video/av45087132/
  9. https://youtu.be/ClVrJcTM-lA
  10. https://www.jexia.com/en/blog/golang-error-proneness-message-passing/
  11. https://blog.acolyer.org/2019/05/17/understanding-real-world-concurrency-bugs-in-go/