golang/go

proposal: add container/queue

christianrpetrin opened this issue · 95 comments

Proposal: Built in support for high performance unbounded queue

Author: Christian Petrin.

Last updated: November 26, 2018

Discussion at: #27935

Design document at https://github.com/golang/proposal/blob/master/design/27935-unbounded-queue-package.md

Abstract

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

Queues in computer science are a very old, well-established and well-known concept, yet Go doesn't provide a specialized, safe-to-use, performant and issue-free unbounded queue implementation.

Buffered channels provide an excellent option for use as a queue, but buffered channels are bounded and so don't scale to support very large data sets. The same applies to the standard ring package.

The standard list package can be used as the underlying data structure for building unbounded queues, but the performance yielded by this linked list based implementation is not optimal.

Implementing a queue using slices as suggested here is a feasible approach, but the performance yielded by this implementation can be abysmal in some high load scenarios.

Background

Queues that grow dynamically have many uses. As an example, I'm working on a logging system called CloudLogger that sends all logged data to external log management systems, such as Stackdriver and Cloudwatch. External logging systems typically rate limit how much data their service will accept for a given account and time frame. So in a scenario where the hosting application is logging more data than the log management system will accept at a given moment, CloudLogger has to queue the extra logs and send them to the log management system at a pace the system will accept. As there's no telling how much data will have to be queued, since it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to use. Buffered channels are not ideal in this scenario as they have a limit on how much data they will accept, and once that limit has been reached, the producers (goroutines adding to the channel) start to block, making the add operation an "eventually" synchronous one. A fully asynchronous operation is highly desirable here, as logging should not significantly slow down the hosting application.

The problem above is one that, potentially, every system that calls another system faces. And in the cloud and microservices era, this is an extremely common scenario.

Due to the lack of support for built in unbounded queues in Go, Go engineers are left to either:

  1. Research and use external packages, or
  2. Build their own queue implementation

Both approaches are riddled with pitfalls.

Using external packages, especially in enterprise-level software, requires a lot of care, as using external, potentially untested and hard-to-understand code can have unwanted consequences. This problem is made much worse by the fact that, currently, there's no well-established and widely disseminated open source Go queue implementation, according to this stackoverflow discussion, this github search for Go queues and Awesome Go.

Building a queue, on the other hand, might sound like a compelling alternative, but building an efficient, high-performance, bug-free unbounded queue is a hard job that requires a pretty solid computer science foundation, as well as a good deal of time to research different design approaches, test different implementations, make sure the code is free of bugs and memory leaks, etc.

In the end, what Go engineers have been doing up to this point is building their own queues, which are for the most part inefficient and can have disastrous, yet hidden, performance and memory issues. As examples of poorly designed and/or implemented queues, the approaches suggested here and here (among many others) require a linear copy of the internal slice for resizing purposes. Some implementations also have memory issues, such as an ever-expanding internal slice, and memory leaks.

Proposal

I propose to add a new package, "container/queue", to the standard library to support in-memory unbounded queues. The proposed queue implementation offers excellent performance and very low memory consumption when compared to three promising open source implementations (gammazero, phf and juju), to Go channels used as a queue, to the standard list package used as a queue, and to six other experimental queue implementations.

The proposed queue implementation offers the most balanced approach to performance under different loads, being significantly faster while still using less memory than every other queue implementation in the tests.

The closest data structure Go offers for building dynamically growing queues over large data sets is the standard list package. When comparing the proposed solution to using the list package as an unbounded queue (refer to "BenchmarkList"), the proposed solution is consistently faster, while also displaying a much lower memory footprint.

Reasoning

There are two well-accepted approaches to implementing queues when it comes to the underlying data structure:

  1. Using a linked list
  2. Using an array

A linked list as the underlying data structure for an unbounded queue has the advantage of scaling efficiently when the data structure needs to grow to accommodate more values. This is because existing elements don't need to be repositioned or copied when the queue grows.

However, there are a few concerns with this approach:

  1. The use of prev/next pointers for each value requires a good deal of extra memory
  2. Because each "node" in the linked list can be allocated far away from the previous one, navigating through the list can be slow due to its poor memory locality
  3. Adding new values always requires new memory allocations and pointers to be set, hindering performance

On the other hand, using a slice as the underlying data structure for unbounded queues has the advantage of very good memory locality, making retrieval of values faster compared to linked lists. An "allocate more than is needed right now" approach can also easily be implemented with slices.

However, when the slice needs to expand to accommodate new values, a well-adopted strategy is to allocate a new, larger slice, copy all elements from the previous slice into the new one, and use the new one to add the new elements.

The problem with this approach is the obvious need to copy all the values from the older, smaller slice into the new one, yielding poor performance when the number of values that need copying is fairly large.

Another potential problem is a theoretically lower ceiling on how much data slices can hold: slices, like arrays, have to allocate their positions in sequential memory addresses, so the maximum number of items the queue could ever hold is the maximum size a slice can be allocated on that particular system at any given moment. Due to modern memory management techniques such as virtual memory and paging, this is a very hard scenario to corroborate through practical testing.

Nonetheless, this approach doesn't scale well with large data sets.

Having said that, there's a third, newer approach to implementing unbounded queues: use fixed size linked slices as the underlying data structure.

The fixed-size linked slices approach is a hybrid of the first two, providing the good memory locality of arrays alongside the efficient growing mechanism linked lists offer. It is also not limited by the maximum size a slice can be allocated, so it can hold and deal efficiently with a theoretically much larger amount of data than pure slice-based implementations.

Rationale

Research

A first implementation of the new design was built.

The benchmark tests showed the new design was very promising, so I decided to research other possible queue designs and implementations with the goal of improving the first design and implementation.

As part of the research to identify the best possible queue designs and implementations, I implemented and probed a total of 7 experimental queue implementations. Below are a few of the most interesting ones.

  • queueimpl1: custom queue implementation that stores the values in a simple slice. Pop removes the first slice element. This is a slice based implementation that tests this suggestion.
  • queueimpl5: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when storing the "next" pointer as part of the values slice instead of having it as a separate "next" field. The next element is stored in the last position of the internal slices, which is a reserved position.
  • queueimpl7: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when performing lazy creation of the internal slice as well as starting with a 1-sized slice, allowing it to grow up to 16 by using the built in append function. Subsequent slices are created with 128 fixed size.

As part of the research, I also investigated and probed the open source queue implementations below.

  • phf: this is a slice, ring based queue implementation. Interesting to note the author did a pretty good job researching and probing other queue implementations as well.
  • gammazero: the deque implemented in this package is also a slice, ring based queue implementation.
  • juju: the deque implemented in this package uses a linked list based approach, similar to other experimental implementations in this package such as queueimpl3. The biggest difference between this implementation and the experimental ones is that this queue uses the standard list package as its linked list. The standard list package implements a doubly linked list, while the experimental implementations implement their own singly linked lists.

The standard list package as well as buffered channels were probed as well.

Benchmark Results

Add and remove 100 items
[charts: performance in ns/op and memory in B/op]

Add and remove 100k items
[charts: performance in ns/op and memory in B/op]

Aggregated Results
[charts: performance in ns/op and memory in B/op]

Detailed, curated results can be found here

Aggregated, curated results can be found here

Given the above results, queueimpl7, henceforth just "impl7", proved to be the most balanced implementation, being either faster than or very competitive with every other implementation in all test scenarios, from both a performance and a memory perspective.

Refer here for more details about the tests.

The benchmark tests can be found here.

Impl7 Design and Implementation

Impl7 was the result of the observation that some slice-based implementations, such as queueimpl1 and queueimpl2, offer phenomenal performance when the queue is used with small data sets.

For instance, comparing queueimpl3 (a very simple linked-slice implementation) with queueimpl1 (a very simple slice-based implementation), the results at adding 0 (init time only), 1 and 10 items are very favorable to impl1, from both a performance and a memory perspective.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%  472.53ns ± 7%   +6821.99%  (p=0.000 n=20+17)
/1-4         48.1ns ± 6%   492.4ns ± 5%    +924.66%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     695ns ± 8%     +30.57%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    2.50µs ± 4%     -21.69%  (p=0.000 n=18+19)
/1000-4      24.5µs ± 3%    23.6µs ± 2%      -3.33%  (p=0.000 n=19+19)
/10000-4      322µs ± 4%     238µs ± 1%     -26.02%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.3ms ±13%     -79.32%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B       2080.00B ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%   2080.0B ± 0%  +12900.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%     2152B ± 0%    +278.87%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    2.87kB ± 0%     -34.13%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    24.6kB ± 0%     -39.54%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     244kB ± 0%     -67.27%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%     -75.85%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           2.00 ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%    +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      11.0 ± 0%     -21.43%  (p=0.000 n=20+20)
/100-4          108 ± 0%       101 ± 0%      -6.48%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.01k ± 0%      +0.50%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%      +1.35%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%      +1.53%  (p=0.000 n=20+20)

Impl7 is a hybrid experiment: it uses a simple slice-based queue for small data sets and the fixed-size linked slices approach for large data sets, an approach that scales well and offers very good performance for both small and large data sets.

The implementation starts by lazily creating the first slice to hold the first values added to the queue.

const (
    // firstSliceSize holds the size of the first slice.
    firstSliceSize = 1

    // maxFirstSliceSize holds the maximum size of the first slice.
    maxFirstSliceSize = 16

    // maxInternalSliceSize holds the maximum size of each internal slice.
    maxInternalSliceSize = 128
)

...

// Push adds a value to the queue.
// The complexity is amortized O(1).
func (q *Queueimpl7) Push(v interface{}) {
    if q.head == nil {
        h := newNode(firstSliceSize) // Returns a 1-sized slice.
        q.head = h
        q.tail = h
        q.lastSliceSize = maxFirstSliceSize
    } else if len(q.tail.v) >= q.lastSliceSize {
        n := newNode(maxInternalSliceSize) // Returns a 128-sized slice.
        q.tail.n = n
        q.tail = n
        q.lastSliceSize = maxInternalSliceSize
    }

    q.tail.v = append(q.tail.v, v)
    q.len++
}

...

// newNode returns an initialized node.
func newNode(capacity int) *Node {
    return &Node{
        v: make([]interface{}, 0, capacity),
    }
}

The very first slice is created with capacity 1. The implementation lets the builtin append function dynamically resize the slice up to 16 (maxFirstSliceSize) positions. After that, it reverts to creating fixed-size, 128-position slices, which offers the best performance for data sets above 16 items.

16 items was chosen as the cutoff because it seems to provide the best balanced performance for small and large data sets according to the array size benchmark tests. Above 16 items, growing the slice means allocating a new, larger one and copying all 16 elements from the previous slice into the new one. The append function's phenomenal performance can only compensate for the added copying of elements if the data set is very small: no more than 8 items in the benchmark tests. Above 8 items, the fixed-size slice approach is consistently faster and uses less memory; from then on, 128-sized slices are allocated and linked together whenever the data structure needs to grow to accommodate new values.

Why 16? Why not 15 or 14?

The builtin append function, as of "go1.11 darwin/amd64", seems to double the slice size every time it needs to allocate a new one.

ts := make([]int, 0, 1)

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 1 item; output: 1

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 2 items; output: 2

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 3 items; output: 4

ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 5 items; output: 8

ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 9 items; output: 16

Since the append function will resize the slice from 8 to 16 positions, it makes sense to use all 16 already allocated positions before switching to the fixed size slices approach.

Design Considerations

Impl7 uses linked slices as its underlying data structure.

The reason for the choice comes from two main observations of slice based queues:

  1. When the queue needs to expand to accommodate new values, a new, larger slice needs to be allocated and used
  2. Allocating and managing large slices is expensive, especially in an overloaded system with little available physical memory

To help clarify the scenario, below is what happens when a slice-based queue that already holds, say, 1 billion items needs to expand to accommodate a new item.

Slice based implementation

  • Allocate a new slice twice the size of the previous one, say 2 billion positions
  • Copy all 1 billion items from the previous slice into the new one
  • Add the new value into the first unused position in the new slice (index 1,000,000,000, counting from zero)

The same scenario for impl7 plays out like below.

Impl7

  • Allocate a new 128 size slice
  • Set the next pointer
  • Add the value into the first position of the new slice, position 0

Impl7 never copies data around, but slice-based implementations do; and if the data set is large, it doesn't matter how fast the copying algorithm is. The copying has to be done and will take time.

The decision to use linked slices was also the result of the observation that slices go to great lengths to provide predictable, indexed positions. A hash table, for instance, absolutely needs this property, but a queue doesn't. So impl7 completely gives up this property and focuses on what really matters: add to the end, retrieve from the head. No copying or repositioning of elements is needed for that. When a slice goes to great lengths to provide that functionality, all the work of allocating new arrays and copying data around is wasted. None of it is necessary, and this work costs dearly for large data sets, as observed in the tests.

Impl7 Benchmark Results

Below compares impl7 with a few selected implementations.

The test names are formatted as follows.

  • Benchmark/N-4: benchmarks a queue implementation where N denotes the number of items added to and removed from the queue; 4 is the number of CPU cores on the host machine.

Examples:

  • Benchmark/0-4: benchmarks the queue by creating a new instance of it. This tests initialization time only.
  • Benchmark/100-4: benchmarks the queue by creating a new instance of it and adding and removing 100 items to/from the queue.

Standard list used as a FIFO queue vs impl7.

benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         34.9ns ± 1%     1.2ns ± 3%   -96.64%  (p=0.000 n=19+20)
/1-4         77.0ns ± 1%    68.3ns ± 1%   -11.21%  (p=0.000 n=20+20)
/10-4         574ns ± 0%     578ns ± 0%    +0.59%  (p=0.000 n=18+20)
/100-4       5.94µs ± 1%    3.07µs ± 0%   -48.28%  (p=0.000 n=19+18)
/1000-4      56.0µs ± 1%    25.8µs ± 1%   -53.92%  (p=0.000 n=20+20)
/10000-4      618µs ± 1%     260µs ± 1%   -57.99%  (p=0.000 n=20+18)
/100000-4    13.1ms ± 6%     3.1ms ± 3%   -76.50%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          48.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          96.0B ± 0%     48.0B ± 0%   -50.00%  (p=0.000 n=20+20)
/10-4          600B ± 0%      600B ± 0%      ~     (all equal)
/100-4       5.64kB ± 0%    3.40kB ± 0%   -39.72%  (p=0.000 n=20+20)
/1000-4      56.0kB ± 0%    25.2kB ± 0%   -55.10%  (p=0.000 n=20+20)
/10000-4      560kB ± 0%     243kB ± 0%   -56.65%  (p=0.000 n=20+20)
/100000-4    5.60MB ± 0%    2.43MB ± 0%   -56.66%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          20.0 ± 0%      15.0 ± 0%   -25.00%  (p=0.000 n=20+20)
/100-4          200 ± 0%       107 ± 0%   -46.50%  (p=0.000 n=20+20)
/1000-4       2.00k ± 0%     1.02k ± 0%   -48.95%  (p=0.000 n=20+20)
/10000-4      20.0k ± 0%     10.2k ± 0%   -49.20%  (p=0.000 n=20+20)
/100000-4      200k ± 0%      102k ± 0%   -49.22%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~29x faster (1.2ns vs 34.9ns) than list package for init time (0 items)
  • Up to ~4x faster (3.1ms vs 13.1ms) than list package for 100k items
  • Uses ~1/2 memory (2.43MB vs 5.60MB) than list package for 100k items

impl1 (simple slice-based queue implementation) vs impl7.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%    1.18ns ± 3%   -82.79%  (p=0.000 n=20+20)
/1-4         48.1ns ± 6%    68.3ns ± 1%   +42.23%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     578ns ± 0%    +8.55%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    3.07µs ± 0%    -3.74%  (p=0.000 n=18+18)
/1000-4      24.5µs ± 3%    25.8µs ± 1%    +5.51%  (p=0.000 n=19+20)
/10000-4      322µs ± 4%     260µs ± 1%   -19.23%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.1ms ± 3%   -80.60%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%      600B ± 0%    +5.63%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    3.40kB ± 0%   -22.02%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    25.2kB ± 0%   -38.25%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     243kB ± 0%   -67.47%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%   -75.84%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=20+20)
/100-4          108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~5x faster (1.18ns vs 6.83ns) than impl1 for init time (0 items)
  • Up to ~5x faster (3.1ms vs 15.8ms) than impl1 for 100k items
  • Uses ~1/4 memory (2.4MB vs 10MB) than impl1 for 100k items

It's important to note that impl7's performance and memory gains grow roughly exponentially with the size of the data set, because slice-based implementations don't scale well, paying a higher and higher price, performance- and memory-wise, every time they need to grow to accommodate an ever-expanding data set.


phf (slice, ring based FIFO queue implementation) vs impl7.

benchstat rawresults/bench-phf.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         28.1ns ± 1%     1.2ns ± 3%   -95.83%  (p=0.000 n=20+20)
/1-4         42.5ns ± 1%    68.3ns ± 1%   +60.80%  (p=0.000 n=20+20)
/10-4         681ns ± 1%     578ns ± 0%   -15.11%  (p=0.000 n=18+20)
/100-4       4.55µs ± 1%    3.07µs ± 0%   -32.45%  (p=0.000 n=19+18)
/1000-4      35.5µs ± 1%    25.8µs ± 1%   -27.32%  (p=0.000 n=18+20)
/10000-4      349µs ± 2%     260µs ± 1%   -25.67%  (p=0.000 n=20+18)
/100000-4    11.7ms ±11%     3.1ms ± 3%   -73.77%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          16.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          696B ± 0%      600B ± 0%   -13.79%  (p=0.000 n=20+20)
/100-4       6.79kB ± 0%    3.40kB ± 0%   -49.94%  (p=0.000 n=20+20)
/1000-4      57.0kB ± 0%    25.2kB ± 0%   -55.86%  (p=0.000 n=20+20)
/10000-4      473kB ± 0%     243kB ± 0%   -48.68%  (p=0.000 n=20+20)
/100000-4    7.09MB ± 0%    2.43MB ± 0%   -65.77%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          15.0 ± 0%      15.0 ± 0%      ~     (all equal)
/100-4          111 ± 0%       107 ± 0%    -3.60%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%    +0.39%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.38%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~23x faster (1.2ns vs 28.1ns) than phf for init time (0 items)
  • Up to ~3x faster (3.1ms vs 11.7ms) than phf for 100k items
  • Uses ~1/2 memory (2.43MB vs 7.09MB) than phf for 100k items

Buffered channel vs impl7.

benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         30.2ns ± 1%     1.2ns ± 3%   -96.12%  (p=0.000 n=19+20)
/1-4         87.6ns ± 1%    68.3ns ± 1%   -22.00%  (p=0.000 n=19+20)
/10-4         704ns ± 1%     578ns ± 0%   -17.90%  (p=0.000 n=20+20)
/100-4       6.78µs ± 1%    3.07µs ± 0%   -54.70%  (p=0.000 n=20+18)
/1000-4      67.3µs ± 1%    25.8µs ± 1%   -61.65%  (p=0.000 n=20+20)
/10000-4      672µs ± 1%     260µs ± 1%   -61.36%  (p=0.000 n=19+18)
/100000-4    6.76ms ± 1%    3.07ms ± 3%   -54.61%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          96.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4           112B ± 0%       48B ± 0%   -57.14%  (p=0.000 n=20+20)
/10-4          248B ± 0%      600B ± 0%  +141.94%  (p=0.000 n=20+20)
/100-4       1.69kB ± 0%    3.40kB ± 0%  +101.42%  (p=0.000 n=20+20)
/1000-4      16.2kB ± 0%    25.2kB ± 0%   +55.46%  (p=0.000 n=20+20)
/10000-4      162kB ± 0%     243kB ± 0%   +49.93%  (p=0.000 n=20+20)
/100000-4    1.60MB ± 0%    2.43MB ± 0%   +51.43%  (p=0.000 n=16+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          10.0 ± 0%      15.0 ± 0%   +50.00%  (p=0.000 n=20+20)
/100-4          100 ± 0%       107 ± 0%    +7.00%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.10%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.61%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.57%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~25x faster (1.2ns vs 30.2ns) than channels for init time (0 items)
  • Up to ~2x faster (3.07ms vs 6.76ms) than channels for 100k items
  • Uses ~50% MORE memory (2.43MB vs 1.60MB) than channels for 100k items

The above is not really a fair comparison, as standard buffered channels don't scale (at all) and are meant for goroutine synchronization. Nonetheless, they make for an excellent bounded FIFO queue option. Still, impl7 is consistently faster than channels across the board, though it uses considerably more memory than channels.


Given its excellent performance under all scenarios, the hybrid approach impl7 seems to be the ideal candidate for a high performance, low memory footprint general purpose FIFO queue.

For the above reasons, I propose to port impl7 to the standard library.

All raw benchmark results can be found here.

Internal Slice Size

Impl7 uses linked slices as its underlying data structure.

The size of the internal slice does influence performance and memory consumption significantly.

According to the internal slice size bench tests, larger internal slice sizes yield better performance and a lower memory footprint. However, the gains diminish dramatically as the slice size increases.

Below are a few interesting results from the benchmark tests.

BenchmarkMaxSubsequentSliceSize/1-4                20000         76836 ns/op       53967 B/op       2752 allocs/op
BenchmarkMaxSubsequentSliceSize/2-4                30000         59811 ns/op       40015 B/op       1880 allocs/op
BenchmarkMaxSubsequentSliceSize/4-4                30000         42925 ns/op       33039 B/op       1444 allocs/op
BenchmarkMaxSubsequentSliceSize/8-4                50000         36946 ns/op       29551 B/op       1226 allocs/op
BenchmarkMaxSubsequentSliceSize/16-4               50000         30597 ns/op       27951 B/op       1118 allocs/op
BenchmarkMaxSubsequentSliceSize/32-4               50000         28273 ns/op       27343 B/op       1064 allocs/op
BenchmarkMaxSubsequentSliceSize/64-4               50000         26969 ns/op       26895 B/op       1036 allocs/op
BenchmarkMaxSubsequentSliceSize/128-4              50000         27316 ns/op       26671 B/op       1022 allocs/op
BenchmarkMaxSubsequentSliceSize/256-4              50000         26221 ns/op       28623 B/op       1016 allocs/op
BenchmarkMaxSubsequentSliceSize/512-4              50000         25882 ns/op       28559 B/op       1012 allocs/op
BenchmarkMaxSubsequentSliceSize/1024-4             50000         25674 ns/op       28527 B/op       1010 allocs/op

Given that larger internal slices also mean potentially more unused memory in some scenarios, 128 seems to be the perfect balance between performance and worst-case memory footprint.

Full results can be found here.

API

Impl7 implements the API methods below.

Operation      Method
Add            func (q *Queueimpl7) Push(v interface{})
Remove         func (q *Queueimpl7) Pop() (interface{}, bool)
Size           func (q *Queueimpl7) Len() int
Return First   func (q *Queueimpl7) Front() (interface{}, bool)

As nil values are considered valid queue values, similarly to the map data structure, "Front" and "Pop" return a second bool value to indicate whether the returned value is valid, i.e. whether the queue is empty or not.

The reason for the above method names and signatures is the need to keep compatibility with existing Go data structures such as the list, ring and heap packages.

Below are the method names used by the existing list, ring and heap Go data structures, as well as the new proposed queue.

Operation      list                ring    heap   queue
Add            PushFront/PushBack  Link    Push   Push
Remove         Remove              Unlink  Pop    Pop
Size           Len                 Len     -      Len
Return First   Front               -       -      Front

For comparison, below are the method names used by the queue implementations of C++, Java and C#.

Operation      C++    Java         C#
Add            push   add/offer    Enqueue
Remove         pop    remove/poll  Dequeue
Size           size   -            Count
Return First   front  peek         Peek

Drawbacks

The biggest drawback of the proposed implementation is the extra memory that may be allocated but never used in its head and tail slices.

This scenario arises when exactly 17 items are added to the queue, causing the creation of a full-sized internal slice of 128 positions. Initially only the first element in this new slice is used to store the added value; the other 127 elements are allocated but unused.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}
// Push 1 extra item that causes the creation of a new 128 sized slice to store this value.
q.Push(17)

// Pops the first 16 items to release the first slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes

The worst case arises when exactly 145 items are added to the queue and 143 are removed. The queue struct then holds a 128-sized slice as its head slice with only its last element actually used, and a separate 128-sized slice as its tail slice with only its first position used.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}

// Push an additional 128 items to fill the first full sized slice (sized 128).
for i := 1; i <= 128; i++ {
   q.Push(i)
}

// Push 1 extra item that causes the creation of a new 128 sized slice to store this value,
// adding a total of 145 items to the queue.
q.Push(1)

// Pops the first 143 items to release the first dynamic slice (sized 16) and
// 127 items from the first full sized slice (sized 128).
for i := 1; i <= 143; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */)))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes

The above code was run with Go version "go1.11 darwin/amd64".

Open Questions/Issues

Should this be a deque (double-ended queue) implementation instead? The deque could also be used as a stack, but it would make more sense to have separate queue and stack implementations (as most mainstream languages do) rather than a deque that doubles as a stack, which could be confusing. The stack is a very important computer science data structure as well, so I believe Go should have a specialized implementation for it too (provided the specialized implementation offers real value to users and not just a nicely named interface and methods).

Should "Pop" and "Front" return only the value instead of the value plus a second bool (which indicates whether the queue is empty)? The implication of this change is that adding nil values would no longer be valid, so "Pop" and "Front" would return nil when the queue is empty. Panicking should be avoided in libraries.

The memory footprint for a 128-sized internal slice causes, in the worst-case scenario, 2040 bytes of memory to be allocated (on a 64-bit system) but not used. Switching to 64 means roughly half that memory would be used, with a slight ~2.89% performance drop (252813ns vs 260137ns). One could reasonably argue that the small performance gain is not worth the extra memory footprint. Should we change this value to 64, or maybe make it configurable?

Should we also provide a safe for concurrent use implementation? A specialized implementation that would rely on atomic operations to update its internal indices and length could offer a much better performance when comparing to a similar implementation that relies on a mutex.

With the impending release of generics, should we wait to release the new queue package until the new generics framework is released?

Should we implement support for the range keyword for the new queue? It could be done in a generic way so other data structures could also benefit from this feature. For now, IMO, this is a topic for another proposal/discussion.

Summary

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

I feel strongly this proposal should be accepted due to below reasons.

  1. The proposed solution was well researched and probed, being dramatically and consistently faster than 6 other experimental queue implementations as well as 3 promising open source queue implementations, the standard list package, and buffered channels; it also consumes considerably less memory than every other queue implementation tested, except for buffered channels
  2. The proposed solution uses a new, unique approach to building queues, yet its implementation is clean and extremely simple. The two main methods, "Push" and "Pop", are composed of only 16 and 19 lines of code (total), respectively. The proposed implementation also has proper tests with 100% test coverage and should require minimal maintenance moving forward
  3. I'll implement any changes the Go community feels are needed for the proposed solution to be worthy of the standard library and the Go community

Update 11/26/2018.

Due to many suggestions to make the queue a deque and to deploy it as a proper external package, the deque package was built and deployed here. The proposal now is to add the deque package to the standard library instead of impl7. Refer here for details.

cznic commented

Unbound queues don't fit well on machines with finite memory.

@cznic I get your point, but I disagree with it for a number of reasons.

  1. Modern devices, even mobile ones, come with gigabytes of memory available. Efficient data structures that allow engineers to use that much memory efficiently are very welcome
  2. Go already offers a few unbounded data structures such as list, map and others. I don't see why Go should stop now
  3. In modern cloud computing, it is fairly easy to grow memory usage by distributing the growth across more than one device. The technique I'm using on CloudLogger, for instance, is to allow for infinite memory growth in a single VM, but configure Kubernetes to auto-scale based on memory. So when a node reaches a certain limit, say 50% of its memory used, Kubernetes will deploy a new pod and start to round-robin the requests (which cause the memory usage to go up) between the old and new pods. This way I can effectively (and automatically, with no code changes) scale one device's memory into many others. Techniques like this are the bread and butter of modern cloud computing. An amazing language like Go should not restrict engineers from using such techniques

Maybe the term "unbounded" is a bit confusing. Maybe we should change from "unbounded queue" to "dynamically growing queue". Dynamically growing, which in the queue scenario means the same thing as unbounded, would probably be less confusing. Thoughts?

Agreed that the stdlib or golang.org/x repo would benefit from containers like queue and btree. It may become easier to land new containers once generics lands, tho that's a long way off yet.

Dynamic or "elastic" channel buffers have been proposed (and declined) in two forms: a channel buffer pool #20868, and unlimited channel buffer #20352. A workable alternative is one-goroutine-plus-two-channels as in this elastic channel example. It might be useful to include that in your benchmark set.

Container structures are typically unbounded/elastic, so that's unlikely to be a recurring objection here :-)

@networkimprov, thanks for posting the other channel proposals, and agreeing with the queue proposal! :-)

I looked into https://github.com/kylelemons/iq and both queue designs implemented here, a slice, ring based design as well as the dynamic channel.

I have benchmarked two high quality open source, similarly slice ring based implementations: phf and gammazero.

I have also benchmarked channels as queue.

Impl7 is dramatically faster than all ring slice based implementations tested as well as channels.

Having said that, I tried to benchmark both implementations, but the design for both queues is so "unique" I had trouble getting both implementations to fit my benchmark tests.

I couldn't help but to notice the "Infinite Queue implementations in Go. These are bad news, you should always use bounded queues, but they are here for educational purposes." title on the repo.

Blanket statements like this that are based on faulty research really don't help. I wish the author had researched other design and implementation options more thoroughly before making such statements.

cznic commented

Blanket statements like this that are based on faulty research really don't help.

The above is a blanket statement because it claims some undefined research is faulty without providing any evidence ;-)

phf commented

Thanks for noticing my queue. I obviously agree that a queue should be in the standard library. Personally I believe that a deque is what you want in the standard library. It's usually not much more work to implement (or much more complex code-wise) but it's twice as useful and not (in principle) harder to make fast. The proposal sounds good and I wish it success. Oh, and congrats on beating my queue! :-)

wsc1 commented

Nice queue!

However, my feeling is that the state of affairs of containers in the standard library is such that a global assessment of what the standard library should provide for containers is needed before considering inclusion based on the merits of a single implementation of a single data structure.

Re benchmarking kylelemons/iq, its two channels simulate a single standard channel (one is input, the other output). But you already knew that, so the issue must be the goroutine?

Re its claim, "you should always use bounded queues," that means channel queues. As channels are principally an "IGrC" mechanism, if the consumer goroutine stops, the producer/s must block on c <- x (iow feel "backpressure").

I gather the API isn't thread safe? Have you benchmarked a concurrent scenario, e.g. multiple pushers & single popper? That's where a comparison with kylelemons/iq is most relevant. (And that's how I'd use your queue.)

Related: @ianlancetaylor once voiced support for the idea of allocating a bounded channel buffer as-needed, and your list-of-slices technique might apply there. There's not a separate proposal for that as yet.

I remember https://github.com/karalabe/cookiejar/tree/master/collections having a decent implementation.

A good concurrent queue needs several things, e.g. lock-freedom on the happy path and non-spinning when blocking. For different concurrent situations (MPMC, SPMC, MPSC, SPSC) and for high performance, the code would also look different, since not having to manage concurrency at a particular producer/consumer helps to speed things up.

As for tests, you should also test the slowly increasing cases (i.e. 2 x Push, 1 x Pop or 3 x Push, 2 x Pop) and slowly decreasing cases (i.e. fill the queue, then 1 x Push, 2 x Pop or 2 x Push, 3 x Pop). This document is also not including the stable cases (fill 1e6, or larger depending on internal block size, then repeat for N iterations: 1 x Push, 2 x Pop). Also missing are large cases with 10e6 items. And the average might not be the best measurement for these; look at percentiles as well.

Should this be a deque (double-ended queue) implementation instead?

I believe that the two-array deque is conventional in language libraries. It tends to provide better cache behavior than a linked-list approach, and a lower cost (particularly in the face of mixed push-front/push-back usage) than a single-array queue.

Good concurrent queue will need several things i.e. lock-freedom in the happy bath, non-spinning when blocking.

“Lock-freedom” is not particularly meaningful for programs that run on multiple cores. The CPU's cache coherence protocol has many of the same properties as a lock.

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

I agree that when used for concurrency, unbounded queues are a bad fit. However, unbounded queues are useful for sequential work e.g. when doing a breadth first search. We use slices that can grow arbitrarily, maps that can grow arbitrarily, and stacks that can grow arbitrarily, why can't the programmer be trusted to use them when appropriate?

I don't understand what the push back is on the idea (though I agree that it should wait on generics). An unbounded queue is not an unbounded channel.

Thanks @egonelbre for pointing out this https://github.com/karalabe/cookiejar/blob/master/collections/queue/queue.go queue implementation. This is an interesting design based on a dynamically growing circular slice of blocks.

I have added this queue to the bench tests. Feel free to clone the repo and run the tests to validate below results by yourself.

benchstat rawresults/bench-cookiejar.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         9.47µs ± 0%    0.00µs ± 3%   -99.99%  (p=0.000 n=19+20)
/1-4         9.52µs ± 1%    0.07µs ± 1%   -99.28%  (p=0.000 n=18+20)
/10-4        9.76µs ± 1%    0.58µs ± 0%   -94.08%  (p=0.000 n=19+20)
/100-4       11.6µs ± 1%     3.1µs ± 0%   -73.54%  (p=0.000 n=20+18)
/1000-4      29.6µs ± 2%    25.8µs ± 1%   -12.71%  (p=0.000 n=20+20)
/10000-4      231µs ± 1%     260µs ± 1%   +12.56%  (p=0.000 n=19+18)
/100000-4    3.21ms ±13%    3.07ms ± 3%    -4.41%  (p=0.000 n=18+20)

name       old alloc/op   new alloc/op   delta
/0-4         65.6kB ± 0%     0.0kB       -100.00%  (p=0.000 n=20+20)
/1-4         65.6kB ± 0%     0.0kB ± 0%   -99.93%  (p=0.000 n=20+20)
/10-4        65.6kB ± 0%     0.6kB ± 0%   -99.09%  (p=0.000 n=20+20)
/100-4       66.4kB ± 0%     3.4kB ± 0%   -94.88%  (p=0.000 n=20+20)
/1000-4      73.6kB ± 0%    25.2kB ± 0%   -65.80%  (p=0.000 n=20+20)
/10000-4      277kB ± 0%     243kB ± 0%   -12.29%  (p=0.000 n=20+20)
/100000-4    2.45MB ± 0%    2.43MB ± 0%    -0.79%  (p=0.000 n=20+20)

name       old allocs/op  new allocs/op  delta
/0-4           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          11.0 ± 0%      15.0 ± 0%   +36.36%  (p=0.000 n=20+20)
/100-4          101 ± 0%       107 ± 0%    +5.94%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.00%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.52%  (p=0.000 n=20+20)

This queue does have good performance for fairly large data sets, but its performance for small data sets, both speed and memory wise, is not very good.

This is a very good implementation, I really like it, but:

  1. Due to its poor performance with small data sets, I would not regard this as a well balanced, general purpose queue
  2. I have concerns about memory leaks in this queue: I don't see anywhere in the code where the just-released position in the internal slice gets set to nil
  3. This queue suffers from the same problem as every other pure slice-based queue: copying of existing elements when the queue needs to grow

@networkimprov the kylelemons/iq is a very unique design which seems to have been built to prove a point: that channels do not make good unbounded queues.

I can't agree more with this. Channels are meant for routine synchronization, not as a sequential data store the way a traditional queue is, which, by the way, is a core use case for queues (if you can't process fast enough, store). That's why I was actually adamant about including buffered channels in the tests: they are not meant to be used as queues. If people really want to use them as queues, then a separate proposal with a separate discussion should follow. This is not a "dynamic channel" proposal. Please keep that in mind.

Regarding the claim, "you should always use bounded queues", I understood (by reading the code) that the author meant channels. My point was that the statement is not clear and bounded enough, so readers might think that no unbounded queue, regardless of its design, implementation, performance, etc., is good. I can't agree with that.

Impl7 is not safe for concurrent use. As the map and all stdlib containers are not safe for concurrent use, the new queue should follow the same pattern. Not all users will need a concurrent-safe queue, so IMO a concurrent-safe version is a topic for another proposal/discussion, just as we have a separate sync.Map for a concurrent-safe map implementation.

I do talk a bit about the subject on the design document though.

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

That's sort of the plan. Take a look at this response. I can't apply flow control all the way to the clients. This would make the hosting system slower, affecting its clients (the human beings). Non-critical infrastructure such as logging should not affect the hosting system significantly. This is a problem that has no solution other than to either queue or discard the logs. Discard should be the very last approach, only used when queueing is no longer possible. FWIW, CloudLogger already treats important logs in a very different way than regular logs, so even if one of the backend servers dies and a whole bunch of logs get lost, that's by design. All the important logs are not long term cached and will not be lost (in large numbers, at least).

Also, I get your concerns about using too much memory, but think of this queue as a data structure that is also highly efficient in dealing with large amounts of data, gigabytes of data. It has to be unbounded because each device has a certain amount of memory available at any given time. It should be up to the engineer using the queue to decide how much memory he/she wants to use. It's the job of the language designers to provide tools that allow engineers to handle their needs with the maximum possible efficiency, not to dictate how much memory their programs should use.

If we can provide to all Go engineers a data structure that is flexible enough that they know they will get really good performance no matter their needs, I think that data structure brings real value. Go engineers have been building their own queues for a long time now, often with pretty bad results (https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). That's also my point here. Go engineers are already doing "the bad" thing of using a lot of memory. The proposed solution just allows them to keep doing that in a much more efficient way.

Again, I'm not proposing a super crazy, specialized data structure here. Coming from Java and C#, I was super surprised Go doesn't offer a traditional queue implementation. Having a traditional queue implementation, especially a super flexible and fast one, also helps bring all those Java, C#, etc devs to the Go world.

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@bcmills please consider below.

  1. The queue proposed here is dramatically faster than every other queue tested. If you know of another queue implementation you feel might be a better design/implementation than the one proposed here, please let me know and I'll be extremely happy to validate it. If we find a faster, balanced queue implementation, I'm more than happy to root for that implementation to make it into the stdlib. As a matter of fact, I'd definitely use it in all my projects
  2. @networkimprov mentioned generics are a long way off. Should we wait and hold off improvements until we finally decide what we want to do with generics? How long have we been talking about generics? How much longer will it take? IMO we should keep on with an agile/SCRUM-like approach to Go development, where small, incremental improvements are constantly built and released. That's one of the reasons why Go is beating every other mainstream language out there. We should allow and foster organic growth and constant small improvements to the language. High level discussions do have their place and should exist, but they can take place in parallel and should not hold off development for very long.
  3. I'm very happy to implement changes to the proposed queue to make it better. All suggestions and criticism are most welcome. Especially those who are thumbing down my proposal: I'm not seeing many comments on why the proposal is not good. Please let us know why you disagree and reply to my comments, so we can have a constructive conversation.
cznic commented

Slices can be used as queues, but often they're not. Sometimes a program collects items and sorts them, for example. A bounded slice is not helpful in this case. If the sorting requires more memory than is available, there's no other option, to a first approximation, than to crash.

Maps serve a different purpose, but the argument about bound/unbound is the same.

There are many other general data structures with different big-O characteristics of their particular operations. One usually carefully selects the proper one and the argument about bound/unbound is still the same.

Channels typically serve as a queue between concurrently executing producers and consumers. They have a limit that blocks producers when reached and provides useful back pressure. Making a channel unlimited loses that property.

A non-channel queue (FIFO) offers a rather limited set of operations: Put and Get (or whatever names are chosen). If it is to be used in a scenario where a slice, map or other generic data structure can be used, then there's no real need to have or use it.

The remaining scenarios for using a FIFO are IMO mostly - if not only - where the producers and consumers execute concurrently. If a FIFO is used effectively as a channel there are two important questions to answer:

  • Why not use a channel in the first place?
  • Why is the back pressure property useful for channels but not useful for a FIFO used as a channel?

Also note that slices and maps that exist today in the language are type generic. The proposed unbound queue seems to use interface{} typed items. That adds memory overhead and hurts cache locality. I haven't really studied the code, but in a proper benchmark, []interface{} cannot beat []int in any of those two aspects. Similarly for []interface{} vs a buffered chan int.

If you know of another queue implementation you feel might be a better design/implementation than the one proposed here, please let me know and I'll be extremely happy to validate it.

See https://golang.org/cl/94137 (which I ought to dust off and submit one of these days).

Also, to help validate my point about the many problematic Go queues out there, I was able to confirm the cookiejar queue actually has a memory leak. I just ran some bench tests locally with large numbers of items added.

go test -benchmem -count 1 -benchtime 60s -timeout 60m -bench=".*Cookiejar" -run=^$
goos: darwin
goarch: amd64
pkg: github.com/christianrpetrin/queue-tests
BenchmarkCookiejar/1000000-4         	    2000	  44099280 ns/op	24830036 B/op	 1000489 allocs/op
BenchmarkCookiejar/10000000-4        	     200	 464993059 ns/op	317408985 B/op	10004883 allocs/op
BenchmarkCookiejar/100000000-4       	      10	7453667009 ns/op	9649322720 B/op	100048829 allocs/op
^Csignal: interrupt
FAIL	github.com/christianrpetrin/queue-tests	1510.242s

The memory growth is not linear in the growth of work. Also, the last 1-billion-item test took so long (5+ minutes) that I had to abort it. Impl7 doesn't suffer from these problems. I would most certainly not recommend the cookiejar queue implementation until these issues are fixed.

This is a worthy proposal, but, since generics are an active topic of discussion (https://go.googlesource.com/proposal/+/master/design/go2draft.md), I agree with @bcmills that we should not pursue this further until that discussion has settled. It will inevitably affect the API, and possibly the implementation, of any new container types in the standard library.

Let me add that a queue implementation is just as good outside the standard library as in. There is no rush to adding queues to the standard library; people can use them effectively as an external package.

Especially those who are thumbing down my proposal: I'm not seeing many comments on why the proposal is not good.

Go philosophy takes a "when in doubt, leave it out" stance, so some ppl thumb-down stuff they hardly/never use. For example, notice the thumbs on this proposal to drop complex number support: #19921. From what I've read, the maintainers don't make decisions based on thumb-votes.

@ianlancetaylor agreed, the generics discussion should move forward. My only concern is how much longer it will take until a final decision is made and deployed. There's a clear need for a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there. Can we get some sort of timeline on when the decision on generics will be made?

I also agree a queue could be deployed to any open source repo, but please read the background section of my proposal to understand why I think there's a serious need for a high-performance, issue-free queue in the Go stdlib. To highlight the point:

Using external packages has to be done with great care. Packages in the stdlib, however, are supposed to be really well tested and performant, as they are validated by the whole Go community. A decision to use a stdlib package is an easy one, but using an open source, not widely disseminated implementation is a much more complicated decision to make. The other big problem is: how do people find out which are the best, safe-to-use queue implementations? As pointed out by @egonelbre, this one looks like a really interesting queue, but it has hidden memory leaks. Most Go engineers don't have the time to validate and properly probe these queues. That's the biggest value we can provide to the whole community: a safe, performant and very easy to use queue.

@ianlancetaylor how about landing a queue in golang.org/x in the interim?

There is no specific timeline for generics support.

The argument about what should be in the standard library is a large one that should not take place on this issue. I'll just mention the FAQ: https://golang.org/doc/faq#x_in_std .

cznic commented

There's a clear need for a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there.

There is an unknown, but rather big, number of things that are not found in the stdlib yet have several open source implementations. By the logic quoted above, all of them should go into the stdlib. However, I think that conclusion is not true.

I still maintain the stance, that

  • an unbound queue/FIFO is a bad idea in the first place
  • no new interface{} or []interface{} based container should go into the stdlib.

Reasons were given earlier.

This is to a certain extent similar to why the stdlib does not provide a backtracking regular expression engine. Anyone is free to use one, but wrong solutions do not belong in the stdlib.

edit: typo

@bcmills I ported your implementation to here. Please let me know if you are fine with it. Otherwise I'll remove it.

I did comment out the routine synchronization code; otherwise comparing it to the proposed queue wouldn't be fair. I also needed to change the NextEvent() method a bit so it would return nil instead of blocking when there are no more items in the queue. This way I can get it to fit the tests better. Please let me know if this port looks good. I'll implement any changes you like and re-run the tests.

Results:

benchstat rawresults/bench-bcmills.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         2.90ns ± 1%    1.18ns ± 3%   -59.47%  (p=0.000 n=19+20)
/1-4         46.4ns ± 0%    68.3ns ± 1%   +47.27%  (p=0.000 n=18+20)
/10-4         897ns ± 1%     578ns ± 0%   -35.59%  (p=0.000 n=20+20)
/100-4       3.89µs ± 1%    3.07µs ± 0%   -21.00%  (p=0.000 n=19+18)
/1000-4      34.9µs ± 1%    25.8µs ± 1%   -26.15%  (p=0.000 n=19+20)
/10000-4      470µs ± 1%     260µs ± 1%   -44.75%  (p=0.000 n=18+18)
/100000-4    28.2ms ± 9%     3.1ms ± 3%   -89.13%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4        1.06kB ± 0%    0.60kB ± 0%   -43.61%  (p=0.000 n=20+20)
/100-4       4.86kB ± 0%    3.40kB ± 0%   -29.98%  (p=0.000 n=20+20)
/1000-4      73.5kB ± 0%    25.2kB ± 0%   -65.77%  (p=0.000 n=20+20)
/10000-4     1.03MB ± 0%    0.24MB ± 0%   -76.37%  (p=0.000 n=20+20)
/100000-4    19.3MB ± 0%     2.4MB ± 0%   -87.42%  (p=0.000 n=13+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          19.0 ± 0%      15.0 ± 0%   -21.05%  (p=0.000 n=20+20)
/100-4          113 ± 0%       107 ± 0%    -5.31%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%      ~     (all equal)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.26%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.51%  (p=0.000 n=20+20)

As compared to pretty much every other queue implementation tested, impl7 is very competitive or faster with small data sets and performs dramatically better with large data sets, both performance and memory wise. Impl7 is ~9x faster for 100k items (3.1ms vs 28.2ms) and uses only ~1/8 of the memory (2.4MB vs 19.3MB). This is pretty significant.

Please also consider the performance of the queue in your decision on whether to include the queue in the stdlib or not.

@networkimprov The argument about generics applies to golang.org/x just as much as to the standard library.

There's a clear need for a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there.

There is an unknown, but rather big, number of things that are not found in the stdlib yet have several open source implementations. By the logic quoted above, all of them should go into the stdlib. However, I think that conclusion is not true.

I still maintain the stance, that

  • an unbound queue/FIFO is a bad idea in the first place
  • no new interface{} or []interface{} based container should go into the stdlib.

Reasons were given earlier.

This is to a certain extent similar to why the stdlib does not provide a backtracking regular expression engine. Anyone is free to use one, but wrong solutions do not belong in the stdlib.

edit: typo

If you consider just the first part of my statement and not the last, your logic certainly makes sense, but you are taking my comments out of context. Please quote the entire phrase, not just the first part. In the quoted comment, I further state in the third paragraph why I believe the first sentence is true. Please don't leave that out of your quotes.

If you truly believe unbounded data structures are a bad idea, then we should propose to remove the slice, map, sync.Map and list data structures from the stdlib.

@christianrpetrin, note that your benchmarks are testing only one very specific mode of usage: one in which test.count items are pushed (possibly with ⅓ of the items popped in the process), then the remaining items are popped, then the container is thrown away.

That is certainly one way to use a queue, but it is by no means the only way. In other modes of use (for example, in time- or batch-based streams) a queue will be repeatedly emptied and refilled over time.

I think you will find that the performance characteristics of the various implementations change significantly with the mode of use.
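The two modes of use can be made concrete. Below is a sketch contrasting them; the trivial slice-backed FIFO here is only for illustration (the thread's actual benchmarks use the candidate implementations, not this type):

```go
package main

import "fmt"

// sliceQueue is a trivial slice-backed FIFO used only to illustrate
// the two access patterns discussed above.
type sliceQueue struct{ s []int }

func (q *sliceQueue) Push(v int) { q.s = append(q.s, v) }
func (q *sliceQueue) Pop() (int, bool) {
	if len(q.s) == 0 {
		return 0, false
	}
	v := q.s[0]
	q.s = q.s[1:]
	return v, true
}

func main() {
	// Mode 1: fill once, then drain once, then throw the container away
	// (the pattern the proposal's benchmarks mostly exercise).
	q := &sliceQueue{}
	for i := 0; i < 5; i++ {
		q.Push(i)
	}
	for v, ok := q.Pop(); ok; v, ok = q.Pop() {
		fmt.Print(v, " ")
	}
	fmt.Println()

	// Mode 2: repeated small bursts, as in time- or batch-based streams,
	// where the queue is emptied and refilled many times over its life.
	q2 := &sliceQueue{}
	for batch := 0; batch < 3; batch++ {
		for i := 0; i < 2; i++ {
			q2.Push(batch*10 + i)
		}
		for v, ok := q2.Pop(); ok; v, ok = q2.Pop() {
			fmt.Print(v, " ")
		}
	}
	fmt.Println()
}
```

The relative cost of allocation, copying, and GC pressure differs sharply between these two patterns, which is why benchmarking only one of them can mislead.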

@bcmills yep, I realized that could affect the results as well. That's why the tests that add 100 and 10k items to the queues add 2 items, remove 1; add another 2, remove 1; and so on. Please review the test code here.

I have been running bench tests full time for two weeks now. I have run all sorts of tests, including super long ones; ones that add and remove items randomly; etc. I have yet to find a faster, balanced queue than the one proposed here in virtually any scenario.

cznic commented

If you consider just the first part of my statement and not the last, your logic certainly makes sense, but you are taking my comments out of context.

I've quoted in full what was claimed and the proof of the same.

If you truly believe unbounded data structures are a bad idea, then we should propose to remove slice, map, sync.Map and the list data structures from the stdlib.

I've argued in #27935 (comment) exactly the other way around. I wonder why you suggest otherwise.

@bcmills Yup. I don't trust a benchmark that says []interface{} can work faster than []int and/or use less memory. I don't know where the problem in the benchmark is, but except for some special cases it's an impossible result.

I'll leave the challenge here: if anyone can point out any significant issues either in the queue implementation or the tests, please let me know and I'll fix it right away.

Also if anyone thinks he/she has a better queue implementation, let me know and I'll benchmark against impl7 gladly.

But please, people: I quit my full-time job just to work on things that I absolutely love. And Go and open source software feature very high on my "things I love" list. I have spent two weeks full time working on this proposal. I see a lot of negative comments from people who apparently didn't even read the proposal entirely. Please take your time and read and analyze the whole proposal, not just one aspect of it. Again, my only interest here is to help, and I'm extremely happy to see where the holes in my research are, so please leave constructive comments.

cznic commented

Please note that this proposal is about extending the API of the stdlib with a particular functionality. The code itself is not that important at this moment, even though having an initial implementation is nice when a proposal eventually gets adopted.

The much more important things are deciding whether the proposed functionality should or should not go in and getting the API as right as possible. Then any suitable implementation can be adopted: yours, or possibly someone else's in the future if it's better.

But please, people: I quit my full-time job just to work on things that I absolutely love. And Go and open source software feature very high on my "things I love" list. I have spent two weeks full time working on this proposal. I see a lot of negative comments from people who apparently didn't even read the proposal entirely. Please take your time and read and analyze the whole proposal, not just one aspect of it. Again, my only interest here is to help, and I'm extremely happy to see where the holes in my research are, so please leave constructive comments.

You may have got involved more than necessary, but that's not anyone else's fault. Please try to remove yourself from the equation. If someone thinks something is technically correct/incorrect, he/she should not care how much time someone spent preparing the proposal and/or writing the code. That's not a technically valid criterion for adopting or rejecting a proposal.

Also questions were raised in this thread that still seem rather open.

@cznic agree with everything :-)

My rant was about people complaining about the proposal with arguments already clearly answered in the proposal.

The only open question I'm aware of is if we should wait for generics or not.

There are some comments about more testing. I will wholeheartedly implement all the suggested tests, but they should not be a blocker for whether the proposal is accepted or not.

Are you aware of any other open questions that are blockers?

@ianlancetaylor given a "worthy proposal," can you describe the rationale for restraining work in golang.org/x while generics is under construction? Because you'd maintain the pre-generic version indefinitely? If so, is that worth it in some cases?

@christianrpetrin, it might help to find a way to count usage of queue packages on Github. Regarding repeated comments that reject your core assumptions, silence can be a powerful argument :-)

I know the feeling of setting other work aside for weeks to draft a Go proposal. (More on my profile log.)

@networkimprov Given that generics is being actively discussed, it is my opinion that we should wait for that to be finalized before the Go team takes on any responsibility for any additional container packages of any type. (I understand that @christianrpetrin is volunteering to maintain this code, but if it's in the standard library or golang.org/x then it's the Go team's responsibility.)

My first thought when seeing the benchmarks was that you should probably benchmark against a non-generic version of Impl1 (i.e. don't use interface{}, but int). On the surface it's not an apples-to-apples comparison. But given that Impl1 is trivial to implement correctly and Impl7 isn't, it still gives a valuable data point to see if the performance gained by the generic version is worth the boilerplate of using a trivial hand-rolled one.

FWIW, I quickly whipped up a non-generic one and it hits the ball out of the park over Impl7, with an at most linear (2x) memory overhead:

BenchmarkImpl1/0-4  	200000000	         7.86 ns/op	       0 B/op	       0 allocs/op
BenchmarkImpl1/1-4  	20000000	        57.2 ns/op	      16 B/op	       1 allocs/op
BenchmarkImpl1/10-4 	 2000000	       744 ns/op	     568 B/op	      14 allocs/op
BenchmarkImpl1/100-4         	  300000	      4808 ns/op	    4360 B/op	     108 allocs/op
BenchmarkImpl1/1000-4        	   30000	     39089 ns/op	   40744 B/op	    1010 allocs/op
BenchmarkImpl1/10000-4       	    3000	    449350 ns/op	  746344 B/op	   10022 allocs/op
BenchmarkImpl1/100000-4      	     100	  17871414 ns/op	10047344 B/op	  100029 allocs/op
BenchmarkImpl1x/0-4          	2000000000	         1.26 ns/op	       0 B/op	       0 allocs/op
BenchmarkImpl1x/1-4          	50000000	        29.5 ns/op	       8 B/op	       1 allocs/op
BenchmarkImpl1x/10-4         	 5000000	       307 ns/op	     248 B/op	       5 allocs/op
BenchmarkImpl1x/100-4        	 1000000	      1440 ns/op	    1688 B/op	       9 allocs/op
BenchmarkImpl1x/1000-4       	  200000	     10972 ns/op	   16376 B/op	      11 allocs/op
BenchmarkImpl1x/10000-4      	   10000	    152085 ns/op	  351896 B/op	      23 allocs/op
BenchmarkImpl1x/100000-4     	    1000	   1452986 ns/op	 4654334 B/op	      30 allocs/op
BenchmarkImpl7/0-4           	2000000000	         1.27 ns/op	       0 B/op	       0 allocs/op
BenchmarkImpl7/1-4           	20000000	        95.5 ns/op	      48 B/op	       2 allocs/op
BenchmarkImpl7/10-4          	 2000000	       845 ns/op	     600 B/op	      15 allocs/op
BenchmarkImpl7/100-4         	  300000	      4609 ns/op	    3400 B/op	     107 allocs/op
BenchmarkImpl7/1000-4        	   50000	     37287 ns/op	   25160 B/op	    1021 allocs/op
BenchmarkImpl7/10000-4       	    3000	    416314 ns/op	  242760 B/op	   10161 allocs/op
BenchmarkImpl7/100000-4      	     300	   4243316 ns/op	 2427085 B/op	  101569 allocs/op

(BenchmarkImpl1x is the specialized one). Note in particular that the proposal's claim that copying slices is prohibitively expensive for very large data sets doesn't hold up to scrutiny when pitched against the far greater allocation costs that the generic implementation requires. Also note that the O(n) allocations a generic implementation needs for n pushes slow down the whole program because of increased GC pressure, whereas the non-generic version doesn't really affect the program as a whole, as the number of pointers/allocations isn't significant.

Generics might be able to improve all of that (it's hard to say without knowing how they will be implemented), but at least for now, TBH, I'd still recommend that anyone stick with a simple slice over the proposal. ISTM that the central argument of the proposal - that a performant, correct queue implementation is hard to write - just isn't really that strong. Yes, the trivial implementation uses more memory. But not that much, and its performance is significantly better.

@Merovius thanks for spending some time analyzing the proposed solution.

Now if you are going to compare something against something else, it's absolutely critical to compare apples to apples. Otherwise you'll get biased results.

So I quickly changed impl1 and impl7 to use int instead of "interface{}" (locally only). Just the internal data type was changed, nothing else. I also added a few more large data set tests so we can get a better picture of how much it actually cost to copy data around for large data sets.

Ran the tests locally with below commands.

go test -benchmem -count 10 -benchtime 3s -timeout 60m -bench=".*Impl1X" -run=^$ > bench-impl1x.txt
go test -benchmem -count 10 -benchtime 3s -timeout 60m -bench=".*Impl7X" -run=^$ > bench-impl7x.txt

Results

benchstat bench-impl1x.txt bench-impl7x.txt
name           old time/op    new time/op    delta
/0-4             6.87ns ± 1%    1.16ns ± 0%   -83.11%  (p=0.000 n=10+7)
/1-4             31.6ns ± 2%    56.4ns ± 5%   +78.56%  (p=0.000 n=10+10)
/10-4             322ns ± 1%     391ns ± 4%   +21.49%  (p=0.000 n=9+9)
/100-4           2.28µs ± 2%    2.37µs ± 0%    +4.20%  (p=0.000 n=10+8)
/1000-4          19.3µs ± 2%    21.0µs ± 0%    +8.81%  (p=0.000 n=9+9)
/10000-4          213µs ± 3%     214µs ± 1%      ~     (p=0.579 n=10+10)
/100000-4        2.13ms ± 1%    2.09ms ± 0%    -1.75%  (p=0.000 n=10+8)
/1000000-4       20.7ms ± 2%    23.5ms ± 4%   +13.16%  (p=0.000 n=10+10)
/10000000-4       205ms ± 2%     224ms ± 4%    +9.13%  (p=0.000 n=10+10)
/100000000-4      2.12s ± 2%     2.28s ± 4%    +7.48%  (p=0.000 n=9+10)
/1000000000-4      145s ±35%       45s ±20%   -69.08%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
/0-4              0.00B          0.00B           ~     (all equal)
/1-4              8.00B ± 0%    40.00B ± 0%  +400.00%  (p=0.000 n=10+10)
/10-4              320B ± 0%      352B ± 0%   +10.00%  (p=0.000 n=10+10)
/100-4           2.48kB ± 0%    2.13kB ± 0%   -14.19%  (p=0.000 n=10+10)
/1000-4          24.4kB ± 0%    16.7kB ± 0%   -31.39%  (p=0.000 n=10+10)
/10000-4          432kB ± 0%     163kB ± 0%   -62.34%  (p=0.000 n=10+10)
/100000-4        5.45MB ± 0%    1.63MB ± 0%   -70.19%  (p=0.000 n=10+10)
/1000000-4       53.2MB ± 0%    16.3MB ± 0%   -69.45%  (p=0.000 n=10+10)
/10000000-4       504MB ± 0%     163MB ± 0%   -67.73%  (p=0.000 n=10+10)
/100000000-4     5.74GB ± 0%    1.63GB ± 0%   -71.67%  (p=0.000 n=10+10)
/1000000000-4    54.0GB ± 0%    16.3GB ± 0%   -69.89%  (p=0.000 n=9+10)

name           old allocs/op  new allocs/op  delta
/0-4               0.00           0.00           ~     (all equal)
/1-4               1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/10-4              14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=10+10)
/100-4              108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=10+10)
/1000-4           1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=10+10)
/10000-4          10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=10+10)
/100000-4          100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=10+10)
/1000000-4        1.00M ± 0%     1.02M ± 0%    +1.56%  (p=0.000 n=10+10)
/10000000-4       10.0M ± 0%     10.2M ± 0%    +1.56%  (p=0.000 n=10+10)
/100000000-4       100M ± 0%      102M ± 0%    +1.56%  (p=0.000 n=10+10)
/1000000000-4     1.00G ± 0%     1.02G ± 0%    +1.56%  (p=0.000 n=9+10)

Now comparing apples to apples, we can see the results are actually (I'm a bit surprised) very similar for all tests except the last one. Pay attention to that one. That's the 1bi items test. The slice-based one takes ~3x as long (145s vs 45s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items). I strongly encourage you guys to clone the repo and run the tests with large data sets as well to help validate the results. All that is needed is to add more items to the "tests" variable here.

Having said that, I really like your findings here. I think this test results provide an excellent case study for the design of generics.

But still, impl7 performs really well for small and mid-sized data sets, and crushes, for lack of a more dramatic word, pure slice-based implementations for large data sets. And this will be true regardless of the internal data type because of the queue's unique design, not the implementation.

I mean, this is just pure logic. Impl7 never copies data around. Slice-based implementations will copy, and if the data set is large, it doesn't matter how fast the copying algorithm is. The copying has to be done and will take some time. Also, managing and expanding already gigantic slices in an overloaded system is much slower than managing many small slices.

To help clarify how impl7 is much faster than any slice based approach, here's what happens when you have 1 billion items in a queue and need to expand to accommodate more items.

Slice based implementation

  • Allocate a new slice twice the size of the previously allocated one: say, 2 billion positions
  • Copy over all 1 billion items from the previous slice into the new one
  • Add the new value into the first unused position in the new slice, position 1000000001.

Impl7

  • Allocate a new 128 size slice
  • Set the next pointer
  • Add the value into the first position of the new slice, position 0
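The expansion behavior described above can be sketched roughly as follows. This is a simplified illustration of the linked-list-of-fixed-size-slices idea, not the actual impl7 code:

```go
package main

import "fmt"

// nodeSize is the fixed capacity of each internal node (impl7 uses 128).
const nodeSize = 128

type node struct {
	vals [nodeSize]interface{}
	next *node
}

// Queue is a FIFO built as a linked list of small fixed-size arrays.
// Growing it allocates one new node and links it; existing items are
// never copied or moved, no matter how large the queue already is.
type Queue struct {
	head, tail *node
	hpos, tpos int // read position in head, write position in tail
	length     int
}

func (q *Queue) Len() int { return q.length }

func (q *Queue) Push(v interface{}) {
	if q.tail == nil {
		n := &node{}
		q.head, q.tail = n, n
	} else if q.tpos == nodeSize {
		// Expansion: allocate one small node and set the next pointer.
		// No copying of existing items, regardless of queue size.
		n := &node{}
		q.tail.next = n
		q.tail = n
		q.tpos = 0
	}
	q.tail.vals[q.tpos] = v
	q.tpos++
	q.length++
}

func (q *Queue) Pop() (interface{}, bool) {
	if q.length == 0 {
		return nil, false
	}
	v := q.head.vals[q.hpos]
	q.head.vals[q.hpos] = nil // release the reference for the GC
	q.hpos++
	q.length--
	if q.hpos == nodeSize {
		q.head = q.head.next // drop the exhausted node entirely
		q.hpos = 0
		if q.head == nil {
			q.tail, q.tpos = nil, 0
		}
	}
	return v, true
}

func main() {
	q := &Queue{}
	for i := 0; i < 300; i++ { // spans three internal nodes
		q.Push(i)
	}
	v, _ := q.Pop()
	fmt.Println(v, q.Len()) // 0 299
}
```

Note how Push is O(1) with a small constant: the worst case is one 128-slot allocation, never a copy of the existing contents.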

And I still stand by my statement that building high performance queues is not an easy job. Building a well-balanced queue that will perform well in all scenarios requires coming up with design ideas and testing and probing different implementations. As proved again by the above bench results, pure slice-based implementations don't fit the bill.

The uniqueness of the design is in the observation that slices go to great lengths to provide predictable, indexed positions. A hash table, for instance, absolutely needs this property, but a queue doesn't. So impl7 completely gives up this property and focuses on what really matters: add to the end, retrieve from the front. No copying or repositioning is needed for that. So when a slice goes to great lengths to provide indexed access, all the work of allocating new arrays and copying data around is wasted. None of it is necessary, and it costs dearly for large data sets.

Also, it's important to highlight that in most cases the queue will likely be used with reference types (most likely user-defined structs), not primitive data types such as int. With reference types, impl7 is much faster than impl1 in all scenarios, not just the high-load ones. So your test with int is a very interesting one, but not a very common one in the real world.

And regarding your recommendation for people to keep building their own slice-based implementations, you absolutely need to point out that this recommendation only shows good results if the queue's data type is a primitive one. So please be careful with your statements. They have to be well qualified if you don't want to mislead people.

And a bit more on the claim that it's easy to build a well-engineered queue: I'm absolutely sure the author of this queue (which, by the way, looks very good and has a very interesting design) is a highly competent engineer. However, his queue has some pretty serious memory leaks. Now imagine less experienced engineers trying to build high performance queues. Memory leaks and terrible performance in some scenarios are likely to follow.

Just the internal data type was changed, nothing else.

Just for the record: That is biasing your results towards the more complicated code. I wouldn't wrap an []int into a generic, interface{} based API when using it as a queue, it's useless complexity. But, of course, a more complicated codebase makes more sense to wrap because you have to amortize the implementation cost via reuse.

As I said, the point of my comparison was to pitch a complicated generic implementation against a trivial non-generic one, as the argument was specifically that it's hard to write a performant, correct queue. And while comparing apples to apples makes sense if you want to bake apple pie, if all you really want is a tasty dessert, there's no harm in considering a wider choice of fruit :)

I also added a few more large data set tests so we can get a better picture of how much it actually cost to copy data around for large data sets.

Out of interest, can you provide some actual numbers from your practical application? It doesn't make sense to compare queues with billions of items, if actual queues never get that large in practice. Except that it again biases your tests towards a more complicated solution, as complicated solutions tend to do better asymptotically, but have larger constants.

Now, of course, biasing a test towards larger numbers isn't inherently less valid than biasing towards smaller numbers. Which is why the question above about what numbers are realistic is so important. We should start from what actual, realistic requirements are and then see what tests make sense to run.

The slice-based one takes ~3x as long (145s vs 45s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Note that this is exclusively because of the generic API. A non-generic slice solution would allocate between 8 and 16GB. And to reiterate, my point was specifically that I don't think the cost of a generic solution (which is exactly this memory and allocation overhead) justifies its benefits.

I mean, this is just pure logic. Impl7 never copies data around. Slice based will copy, and if the data set is large, it doesn't matter how fast the copying algorithm is.

It actually isn't that simple. I'm not contesting that your implementation has - ideally - better asymptotic cumulative behavior. I'm not as sure as you are that it does, but I don't think it's worth discussing here. These copies can be amortized, and the generally slower implementation can eat up the wins.

FTR, on my laptop the tipping point with non-generic implementations is around 10M elements (I can't test with significantly more though, because it's a laptop after all :) ). 10M elements per process is a pretty big queue.

As proved again by the above bench results, pure slice-based implementations don't fit the bill.

We obviously disagree on what has and has not been proven.

However, his queue has some pretty serious memory leaks.

FTR, that's not a memory leak. These are generally impossible in Go code that doesn't use unsafe. And in any realistic system (where both pushes and pops happen), the memory will eventually (or, realistically, pretty soon) be released. Note that your actual arguments for needing the queue to be unbounded already create the circumstances in which no value put in the queue would be kept around very long.

But also, that queue is another example of an, IMO, unnecessarily complicated queue. I am, still, pitching all of these against using a simple, concretely typed slice and using append for push and q = q[1:] for pop. No extra API. No interface{}. The most simple implementation of a slice-based queue possible. You don't need to be a terribly experienced engineer to implement that bug-free and you don't need to implement it generically, because there are no implementation costs to amortize.

If that performs well enough for pretty much any realistic workload, that is IMO reason enough to leave FIFO-queues out of the standard library.

@Merovius, I like your sense of humor! :-)

Now with real-world use cases, meaning using the queue with a reference type, impl7 beats every other queue by far in terms of performance and memory. Your targeted example doesn't reflect reality. But still, even in a narrow case impl7 is very competitive performance-wise, and it still uses only 1/3 of the memory of the slice-based implementation.

Out of interest, can you provide some actual numbers from your practical application? It doesn't make sense to compare queues with billions of items, if actual queues never get that large in practice. Except that it again biases your tests towards a more complicated solution, as complicated solutions tend to do better asymptotically, but have larger constants.

Unfortunately CloudLogger is not ready yet, but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data, is highly desirable. Please make sure to read and understand the entire proposal. There's a lot more to the proposal and the queue than meets the eye. You absolutely need to read it thoroughly if you really want to come up with proper counterarguments on the queue.

Really, you find this code complex?

And a bit more on the claim that it's easy to build a well-engineered queue: I'm absolutely sure the author of this queue (which, by the way, looks very good and has a very interesting design) is a highly competent engineer. However, his queue has some pretty serious memory leaks.

I don't quite understand why you consider that a memory leak. The cause of that case is repeatedly filling and emptying the queue. When you compare repeated filling of the queue (1000x per queue), you can see these numbers:

BenchmarkImpl7/0-8           1000000         1297 ns/op            0 B/op          0 allocs/op
BenchmarkCookiejar/0-8        100000        18722 ns/op        65568 B/op          2 allocs/op
BenchmarkImpl7/1-8             10000       105496 ns/op        48000 B/op       2000 allocs/op
BenchmarkCookiejar/1-8         50000        33237 ns/op        65568 B/op          2 allocs/op
BenchmarkImpl7/10-8             2000       892179 ns/op       600001 B/op      15000 allocs/op
BenchmarkCookiejar/10-8         5000       373064 ns/op       203152 B/op       9004 allocs/op
BenchmarkImpl7/100-8             300      4675674 ns/op      3400008 B/op     107000 allocs/op
BenchmarkCookiejar/100-8         500      3531698 ns/op       923153 B/op      99004 allocs/op
BenchmarkImpl7/1000-8             30     39630023 ns/op     25160058 B/op    1021000 allocs/op
BenchmarkCookiejar/1000-8         50     34455852 ns/op      8123226 B/op     999004 allocs/op
BenchmarkImpl7/10000-8             3    410618766 ns/op    242760405 B/op   10161000 allocs/op
BenchmarkCookiejar/10000-8         3    346521066 ns/op     80188901 B/op    9999006 allocs/op
BenchmarkImpl7/100000-8            1   4582102500 ns/op   2427089840 B/op  101569003 allocs/op
BenchmarkCookiejar/100000-8        1   4849534700 ns/op    801706848 B/op   99999052 allocs/op

Although, I suspect usually it's not used with interface{}, but rather specialized for a particular type. But as you can see, it's 2x faster in certain scenarios. I don't know which data sizes and values he optimized for, but yeah.

As for a faster one for larger numbers of items, here's a quick and dirty version:

BenchmarkImpl7/0-8     5000000000         1.29 ns/op       0 B/op       0 allocs/op
BenchmarkEgon/0-8      5000000000         1.93 ns/op       0 B/op       0 allocs/op
BenchmarkImpl7/1-8       50000000       108 ns/op         48 B/op       2 allocs/op
BenchmarkEgon/1-8         1000000      6626 ns/op      18432 B/op       1 allocs/op
BenchmarkImpl7/10-8       5000000       892 ns/op        600 B/op      15 allocs/op
BenchmarkEgon/10-8        1000000      6882 ns/op      18504 B/op      10 allocs/op
BenchmarkImpl7/100-8      1000000      4708 ns/op       3400 B/op     107 allocs/op
BenchmarkEgon/100-8        500000      8936 ns/op      19224 B/op     100 allocs/op
---
BenchmarkImpl7/1000-8      100000     39952 ns/op      25160 B/op    1021 allocs/op
BenchmarkEgon/1000-8       200000     30922 ns/op      26424 B/op    1000 allocs/op
BenchmarkImpl7/10000-8      10000    409220 ns/op     242760 B/op   10161 allocs/op
BenchmarkEgon/10000-8       10000    330812 ns/op     264312 B/op   10009 allocs/op
BenchmarkImpl7/100000-8      1000   4358965 ns/op    2427087 B/op  101569 allocs/op
BenchmarkEgon/100000-8       1000   3937140 ns/op    2606335 B/op  100097 allocs/op

For counts <1000 it's slower because it's mostly measuring construction time rather than actual Pop/Push time. For repeated fills (1000x) you should be able to see these results:

BenchmarkImpl7/0-8       3000000          1291 ns/op            0 B/op           0 allocs/op
BenchmarkEgon/0-8        2000000          1921 ns/op            0 B/op           0 allocs/op
BenchmarkImpl7/1-8         50000        108040 ns/op        48000 B/op        2000 allocs/op
BenchmarkEgon/1-8         300000         14591 ns/op        18432 B/op           1 allocs/op
BenchmarkImpl7/10-8         5000        881587 ns/op       600001 B/op       15000 allocs/op
BenchmarkEgon/10-8         20000        278252 ns/op       256320 B/op        9010 allocs/op
BenchmarkImpl7/100-8        1000       4685733 ns/op      3400007 B/op      107000 allocs/op
BenchmarkEgon/100-8         2000       2983277 ns/op      2598344 B/op       99098 allocs/op
BenchmarkImpl7/1000-8        100      40053001 ns/op     25160064 B/op     1021000 allocs/op
BenchmarkEgon/1000-8         200      28794169 ns/op     26000129 B/op      999977 allocs/op
BenchmarkImpl7/10000-8        10     407284450 ns/op    242760470 B/op    10161000 allocs/op
BenchmarkEgon/10000-8         20     299871275 ns/op    259999664 B/op    10008766 allocs/op
BenchmarkImpl7/100000-8        1    4402977600 ns/op   2427088160 B/op   101569000 allocs/op
BenchmarkEgon/100000-8         2    2822037900 ns/op   2600012728 B/op   100096660 allocs/op

I don't think my implementation is particularly good, I suspect there are bunch of ways to improve it, e.g. free blocks for stable queue size. Of course, this would slow down the simpler cases.

Note I changed the test such that it uses Pop until false instead of checking Len every time. The quick & dirty version has a slower Len to make Push and Pop faster.

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

the results are actually (I'm a bit surprised) very similar for all tests, except the last one. Pay attention to that one. That's the 1bi items test. The slice based takes ~3x as much (45s vs 145s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Again, I think you are focusing on too narrow a set of benchmarks. You have a solution that performs well for very large queues that are filled and then drained in one long burst, under the assumption that peak memory usage for very large queues is what matters.

It is not at all obvious to me that very large queues filled and drained once are the dominant use-case for queues in general, or that the memory usage of the queue itself is significant (compared to the memory transitively pointed to by the queue's contents). I would want to see evidence to support both of those assumptions before adding such a specifically-optimized implementation to the standard library.

(In contrast, compare sync.Map: it is also specifically optimized for certain modes of use, but those modes of use appear repeatedly within the standard library. If we didn't have standard-library call sites that needed it, sync.Map would be more appropriate as a standalone package.)

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

  1. Quoting the proposal: "As there's no telling how much data will have to be queued as it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to be used."
    This means the memory used is linear to the number of requests (i.e. active users) the hosting application is handling. Of course, even if techniques such as this are
    used to distribute the memory usage among many devices, under a DDoS attack, for instance, the cost incurred to auto-scale to a large number of nodes in order to hold enormous amounts of data (terabytes? petabytes?) is prohibitive. So what CloudLogger will do is state this problem very clearly and recommend that users specify a hard limit on how much memory they want each CloudLogger backend server to hold, as well as how many extra nodes they want to auto-scale to. After those user-defined, system-specific numbers are set, CloudLogger will diligently cache the rate-limited logs up to the specified limits. After that, it starts to discard the logs.

  2. I think the point you are missing here is that impl7 will perform really well no matter how
    much memory it has to use because, again, of its design, not the implementation. So in a way, impl7 is future-proof. If, a few years from now, we have computers with exabytes of memory and some "crazy" engineers decide to build systems that use that much memory, impl7 will provide the efficiency the engineers need to handle that much data.

  3. I think what you are trying to say is that this code is slightly more complicated than this one. But in no way, shape, or form would I call this code complex. It reads to me as incredibly simple code that does an amazing job for what it is.
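The capping behavior described in point 1 could be sketched as a thin wrapper over any FIFO. The names here, and the use of an item-count cap rather than a byte cap, are illustrative assumptions, not CloudLogger's actual API:

```go
package main

import "fmt"

// boundedQueue wraps a slice FIFO with a user-defined hard cap,
// discarding new entries once the cap is reached, as described for
// CloudLogger above. Counting items instead of bytes is a
// simplification for this sketch.
type boundedQueue struct {
	items   []string
	max     int
	dropped int
}

// Push enqueues v, or discards it (without blocking) if the queue
// is at its configured limit. It reports whether v was accepted.
func (q *boundedQueue) Push(v string) bool {
	if len(q.items) >= q.max {
		q.dropped++ // over the limit: discard rather than block
		return false
	}
	q.items = append(q.items, v)
	return true
}

func (q *boundedQueue) Pop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v, true
}

func main() {
	q := &boundedQueue{max: 2}
	for _, s := range []string{"log a", "log b", "log c"} {
		q.Push(s)
	}
	fmt.Println(len(q.items), q.dropped) // 2 1
}
```

The point of the sketch is that the unbounded queue and the discard policy are separable concerns: the hard limit lives in the wrapper, not in the queue itself.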

It might make sense to resume this discussion after generics lands...

It might make sense to resume this discussion after generics lands...

Ideally, yes. I completely agree that if we had generics in place, we'd be in a better place now. My only concern is how much longer generics will take to land. IMO, if it's going to take a couple of years from now, I'd like to see this proposal make it into the stdlib before that. People will be using it for years before we need to make any changes to it, which, btw, I expect would be simple changes anyway.

But if you guys feel like we're really close and should be able to land generics, in say, a couple of months from now, I'm completely happy to wait.

#27935 (comment)

The Go gods are very deliberate.

I think what you are trying to say is that this code is slightly more complicated than this one.

No, that's not what I am trying to say. I'm saying that both of them are unnecessarily complicated. Just declare a queue as var q []T, use q = append(q, v) to push to the queue, v, q = q[0], q[1:] to pop from it and len(q) to get its length. It's a correct, easy to understand, performant implementation of a FIFO queue.

Obviously I won't convince you of that. I still think that your benchmarks should at least include that approach to give a fair and complete picture of whether the cost of this abstraction are worth it, though.

The Go gods are very deliberate.

I get the point, but I think it's imperative we have some sort of timeline on generics. Who can guarantee generics will land in the next, say, two years? What if there are so many proposals and disagreements that the Go gods have trouble agreeing on one? I understand this is a super hard process. There's so much involved, and so many different use cases that need to be addressed in an efficient way. It's a really hard job, and I'm afraid that if we don't set a hard timeline on the decision, a couple of years from now we could still be discussing generics. That would greatly frustrate anyone who is willing to contribute because, well, their contributions will not be accepted.

As for a faster version for larger numbers of items, here's a quick and dirty one:

BenchmarkImpl7/0-8     5000000000         1.29 ns/op       0 B/op       0 allocs/op
BenchmarkEgon/0-8      5000000000         1.93 ns/op       0 B/op       0 allocs/op
BenchmarkImpl7/1-8       50000000       108 ns/op         48 B/op       2 allocs/op
BenchmarkEgon/1-8         1000000      6626 ns/op      18432 B/op       1 allocs/op
BenchmarkImpl7/10-8       5000000       892 ns/op        600 B/op      15 allocs/op
BenchmarkEgon/10-8        1000000      6882 ns/op      18504 B/op      10 allocs/op
BenchmarkImpl7/100-8      1000000      4708 ns/op       3400 B/op     107 allocs/op
BenchmarkEgon/100-8        500000      8936 ns/op      19224 B/op     100 allocs/op
---
BenchmarkImpl7/1000-8      100000     39952 ns/op      25160 B/op    1021 allocs/op
BenchmarkEgon/1000-8       200000     30922 ns/op      26424 B/op    1000 allocs/op
BenchmarkImpl7/10000-8      10000    409220 ns/op     242760 B/op   10161 allocs/op
BenchmarkEgon/10000-8       10000    330812 ns/op     264312 B/op   10009 allocs/op
BenchmarkImpl7/100000-8      1000   4358965 ns/op    2427087 B/op  101569 allocs/op
BenchmarkEgon/100000-8       1000   3937140 ns/op    2606335 B/op  100097 allocs/op

For counts <1000 it's slower because there it's measuring construction time, rather than actual Pop/Push time. For repeated fills 1000x you should be able to see these results:

BenchmarkImpl7/0-8       3000000          1291 ns/op            0 B/op           0 allocs/op
BenchmarkEgon/0-8        2000000          1921 ns/op            0 B/op           0 allocs/op
BenchmarkImpl7/1-8         50000        108040 ns/op        48000 B/op        2000 allocs/op
BenchmarkEgon/1-8         300000         14591 ns/op        18432 B/op           1 allocs/op
BenchmarkImpl7/10-8         5000        881587 ns/op       600001 B/op       15000 allocs/op
BenchmarkEgon/10-8         20000        278252 ns/op       256320 B/op        9010 allocs/op
BenchmarkImpl7/100-8        1000       4685733 ns/op      3400007 B/op      107000 allocs/op
BenchmarkEgon/100-8         2000       2983277 ns/op      2598344 B/op       99098 allocs/op
BenchmarkImpl7/1000-8        100      40053001 ns/op     25160064 B/op     1021000 allocs/op
BenchmarkEgon/1000-8         200      28794169 ns/op     26000129 B/op      999977 allocs/op
BenchmarkImpl7/10000-8        10     407284450 ns/op    242760470 B/op    10161000 allocs/op
BenchmarkEgon/10000-8         20     299871275 ns/op    259999664 B/op    10008766 allocs/op
BenchmarkImpl7/100000-8        1    4402977600 ns/op   2427088160 B/op   101569000 allocs/op
BenchmarkEgon/100000-8         2    2822037900 ns/op   2600012728 B/op   100096660 allocs/op

I don't think my implementation is particularly good; I suspect there are a bunch of ways to improve it, e.g. free blocks for stable queue sizes. Of course, this would slow down the simpler cases.

Note I changed the test so that it uses Pop until it returns false, instead of checking Len every time. The quick & dirty version has a slower Len to make Push and Pop faster.

@egonelbre fantastic insight! I really appreciate your time looking into this!

Below are my thoughts.

  1. Wonderful insight in testing using head == nil instead of len for an empty queue. I actually changed impl7 to use your recommendation, and initial tests seem to improve performance a little bit for large data sets. For small ones, there's a small slowdown. Not sure if this is significant yet.
benchstat bench1.txt bench2.txt
name            old time/op    new time/op    delta
Impl7/0-4         1.45ns ± 1%    1.48ns ± 3%  +2.29%  (p=0.000 n=20+17)
Impl7/1-4         68.6ns ± 2%    70.1ns ± 3%  +2.13%  (p=0.000 n=18+20)
Impl7/10-4         582ns ± 4%     593ns ± 5%  +1.92%  (p=0.000 n=19+18)
Impl7/100-4       3.15µs ± 5%    3.18µs ± 8%    ~     (p=0.280 n=20+19)
Impl7/1000-4      26.1µs ± 4%    26.5µs ± 7%    ~     (p=0.104 n=18+18)
Impl7/10000-4      268µs ± 3%     274µs ± 9%    ~     (p=0.496 n=19+20)
Impl7/100000-4    3.29ms ± 7%    3.17ms ± 7%  -3.53%  (p=0.009 n=16+20)

name            old alloc/op   new alloc/op   delta
Impl7/0-4          0.00B          0.00B         ~     (all equal)
Impl7/1-4          48.0B ± 0%     48.0B ± 0%    ~     (all equal)
Impl7/10-4          600B ± 0%      600B ± 0%    ~     (all equal)
Impl7/100-4       3.40kB ± 0%    3.40kB ± 0%    ~     (all equal)
Impl7/1000-4      25.2kB ± 0%    25.2kB ± 0%    ~     (all equal)
Impl7/10000-4      243kB ± 0%     243kB ± 0%    ~     (all equal)
Impl7/100000-4    2.43MB ± 0%    2.43MB ± 0%    ~     (p=0.751 n=20+20)

name            old allocs/op  new allocs/op  delta
Impl7/0-4           0.00           0.00         ~     (all equal)
Impl7/1-4           2.00 ± 0%      2.00 ± 0%    ~     (all equal)
Impl7/10-4          15.0 ± 0%      15.0 ± 0%    ~     (all equal)
Impl7/100-4          107 ± 0%       107 ± 0%    ~     (all equal)
Impl7/1000-4       1.02k ± 0%     1.02k ± 0%    ~     (all equal)
Impl7/10000-4      10.2k ± 0%     10.2k ± 0%    ~     (all equal)
Impl7/100000-4      102k ± 0%      102k ± 0%    ~     (all equal)

For the fun part, why would checking for head instead of len be faster?

I would assume using head is faster because of the high rate of reads and writes to the len variable: the constant need to sync its value from the CPU registers and hardware caches (https://en.wikipedia.org/wiki/Cache_(computing)#Examples_of_hardware_caches) back to main memory causes some slowdown. Head suffers from the same problem, but because its value changes only once every 129 items removed, the rate at which its value is synced back to main memory is n/129, where 129 is the number of items that must be removed from the queue before head is updated (to the queue's next slice). The write rate for the len variable is linear in the number of items removed from the queue.

Still, your suggested change has a lot of theoretical potential to improve performance in some high load scenarios, so I checked it in already, but I'll keep testing to try to better understand its implications.

  2. Can you please share the test code you used for the "repeated fills 1000x" test?

  3. I see some strange memory patterns with your solution. Impl7's memory grows almost perfectly linearly with the amount of work, no matter the load. I think this might be happening because your Pop doesn't set the next pointer to nil after a slice has been used up. Take a look at line 137 of impl7 here. Not setting the next pointer to nil before "releasing" the slice (losing the reference to it) leaves it pointing to the new head slice, which I suspect makes the GC think that memory is still in use. I suspect if you fix your code to set next to nil, you'll see a performance drop in your benchmarks.

  4. I see you are using fairly large internal slice sizes (1024). Based on the slice size tests, I was also able to validate that the larger the internal slice size, the better the results for large data sets. The problem with larger slice sizes is the potential for larger amounts of allocated but unused memory (see "Drawbacks" in the proposal). So 128 is a compromise between best performance for very large data sets and wasted memory.

the results are actually (I'm a bit surprised) very similar for all tests, except the last one. Pay attention to that one: that's the 1 billion items test. The slice based implementation takes ~3x as long (145s vs 45s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Again, I think you are focusing on too narrow a set of benchmarks. You have a solution that performs well for very large queues that are filled and then drained in one long burst, under the assumption that peak memory usage for very large queues is what matters.

It is not at all obvious to me that very large queues filled and drained once are the dominant use-case for queues in general, or that the memory usage of the queue itself is significant (compared to the memory transitively pointed to by the queue's contents). I would want to see evidence to support both of those assumptions before adding such a specifically-optimized implementation to the standard library.

(In contrast, compare sync.Map: it is also specifically optimized for certain modes of use, but those modes of use appear repeatedly within the standard library. If we didn't have standard-library call sites that needed it, sync.Map would be more appropriate as a standalone package.)

@bcmills, it makes a lot of sense to me! I'm also highly interested in finding out any scenarios where impl7 would not perform well, so I'll wholeheartedly implement any test suggestions.

As of right now I have a feeling there's room for improvement on repeated refills, but we'll see.

What about the tests below? Would they be enough to help you understand the queue's performance?

Test 1: On each test iteration:

  • Pick a random number between 1 and 10k
  • Add that number of items to the queue
  • Remove all the items from the queue
  • Repeat 10 times, reusing the same queue instance

Test 2: On each test iteration:

  • Pick a random number between 1 and 10k
  • Add that number of items to the queue
  • Pick another random number between 1 and the initial random number
  • Remove that number of items from the queue
  • Repeat 10 times, reusing the same queue instance

What about the tests below? Would they be enough to help you understand the queue's performance?

No, I don't think that really helps. Tests of random inputs are not necessarily representative of real programs.

If you're talking about a general-purpose queue library, the best way to gauge its performance impact is to find some real programs that use queues in the wild, and benchmark them with various queue implementations (including typed slices, []interface{}, your chunked-list implementation, a two-array deque, and probably a naive linked-list using container/list) substituted in for whatever they're using today.

If the profile numbers don't change with the implementation, then you may as well use whatever is simplest (which, today, is usually the slice approach that @Merovius described). If they do change with the implementation, then you'll be able to see which implementation works best for which kinds of programs.

  1. The nil check is not that important. The small win comes from avoiding len altogether.

  2. I'm using a different computer, so I quickly recreated it: https://gist.github.com/egonelbre/bffa9e02282f6623fc1b331a9515feee. I also added two additional variations where CookieJar might do well. Tracking a single free block should suffice for stable queue sizes. But as @bcmills said, you need real-world benchmarks, not just synthetic ones.

  3. Will review tomorrow.

  4. Hence the reason I called it "quick & dirty": I didn't bother tweaking the parameters or the design for optimality.

Although, I probably won't dwell on this issue too long afterwards.

@bcmills @egonelbre

Here's how I believe such task to gather real world data for queues should be carried out.

  1. Find a few, preferably at least three, real world production Go systems that use queues in some shape or form. Each system should use its queue for a different purpose/use case.
  2. For each one of the systems:
    2.1) Update it to expose metrics about queue usage, such as the number of items added and removed,
    the add-to-removal rate, the minimum and maximum number of items in the queue, the usage pattern, and other possibly useful metrics
    2.2) Update it to use typed slices as a queue
    2.3) Deploy to production
    2.4) Let it run with production traffic for at least a full day
    2.5) Gather and analyze the results
    2.6) Repeat steps 2.2 to 2.5 for:
    • []interface{} queue
    • My chunked-list implementation
    • A two-array deque
    • A naive linked-list using container/list
  3. Aggregate and analyze all results
  4. Write a document explaining all results detailing each use case, special uses, etc

I don't have access to any production system anymore, but even if I had, imagine how hard it would be for me to convince the system owners to let me play around with their system, which would include deploying test code to production. Worse, we know, based on the synthetic tests, that some of those solutions will very likely not perform well, hurting the host system's performance.

Besides the nearly impossible task of finding real world production systems willing to help, the time it would take me to run all the tests and gather and analyze the results is impractical for me. I can't afford to spend an entire month doing this. I have a bunch of ideas besides the queue one which I believe will help make Go a better language. I want to move on and focus on the next proposals and improvements, and not linger on this one for God knows how much longer.

I urge you guys to make your decision based on:

  1. The queue design and the theory behind it
  2. The queue implementation
  3. The (synthetic) benchmark tests

There's so much more to this queue than just the code. I see a lot of challenges to the code, but none about the queue foundation. The theory and design should be the ones mostly scrutinized, way more than the implementation. But I still didn't see a single challenge to the queue foundation.

If you do have any specific concerns regarding the theory, the design or the implementation that you feel like would make the queue not perform competitively, please point out what your concerns are. I need real examples in order to help validate the threat of those concerns. Please point me out to a specific line of code, for instance, in the implementation. State why you feel like that line could pose problems, in which scenario and why.

Another workable solution would be for you to point out exactly which synthetic test cases you would like to see. I can very quickly code those up and run the tests.

Took a look at the nil thing you mentioned, it's being cleared https://github.com/egonelbre/exp/blob/059fd158c18a79639ee31a5f50b1b8c07df72e52/queue/blockqueue/queue.go#L68.

It should be sufficient to find a number of programs that need such a high-performance queue (there are a lot of examples). Demonstrate that a queue is indeed the right thing they need. Show that using this queue improves their performance significantly (above 5%). Using production systems would of course be ideal, but lacking those, local benchmarks should suffice. I have recorded all the instructions and then just played them back on different data structures to see the performance difference.

With regards to implementation my main concerns are about: 1. allocation amount, 2. GC pressure, 3. use interfaces (but this would require generics or unsafe), 4. concurrency (I cannot remember the last time I needed a basic-non-concurrent-unbounded-queue).

With regards to which direction to take this, I would recommend making a proper package out of it. Get feedback on the implementation from the golang-nuts forum and the performance channel in Slack; you won't get a lot of feedback here. I also recommend writing a blog post about the analysis, because it contains good examples of how to analyse an implementation, and you will get more comments on how it can be improved.

This means that:

  1. there's a public queue package that people can start using today,
  2. you can get real world feedback,
  3. when there eventually is a decision whether to include this, something else or no data-structure package at all, then this implementation already has had real-world usage and
  4. the implementation doesn't get stuck in this issue.

Either way, I'm withdrawing from the discussion, as it needs to wait for a decision on whether an unbounded, non-concurrent queue makes sense to include in the Go repositories at all.

There's so much more to this queue than just the code. I see a lot of challenges to the code, but none about the queue foundation. The theory and design should be the ones mostly scrutinized, way more than the implementation. But I still didn't see a single challenge to the queue foundation.

I don't understand how you came to that conclusion. ISTM that the arguments brought against this proposal are a) unbounded queues are a bad idea in general, b) that adding another generic container while generics in Go are actively discussed is not a good idea and c) that there is little benefit to a generic queue, if you can get sufficient (or better) performance using a naive implementation. These seem pretty foundational to me - in particular, I don't see a single comment that is related to the actual code. You might not agree with these arguments, but they are there.

[edit] oh, there's of course also d) that a generic queue is just as good outside the stdlib as inside it. Which also has nothing to do with the code.

cznic commented

On the performance of queue push

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"testing"

	"github.com/christianrpetrin/queue-tests/queueimpl7"
)

var (
	q7 *queueimpl7.Queueimpl7
	q  []int
)

func testQ(n int) {
	q = nil
	for i := 0; i < n; i++ {
		q = append(q, i) // Push
	}
}

func testQ7(n int) {
	q7 = queueimpl7.New()
	for i := 0; i < n; i++ {
		q7.Push(i)
	}
}

func test(f func(int), tag string) {
	var stats0, stats runtime.MemStats
	q = nil
	q7 = nil
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats0)
	result := testing.Benchmark(func(b *testing.B) { f(b.N) })
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats)
	fmt.Printf("%10s %v memory used %10v\n", tag, result, stats.Alloc-stats0.Alloc)
}

func main() {
	test(testQ, "[]int")
	test(testQ7, "Queueimpl7")
}

Playground (Does not run on playground)

16GB machine, i5-4670 CPU @ 3.40GHz × 4:

==== jnml@4670:~/src/github.com/cznic/issue27935> go version
go version go1.11.1 linux/amd64
==== jnml@4670:~/src/github.com/cznic/issue27935> ./issue27935 
     []int 100000000	        12.4 ns/op memory used  987122112
Queueimpl7 50000000	        31.3 ns/op memory used 1212501008
==== jnml@4670:~/src/github.com/cznic/issue27935> 

Wrt the 'memory used' value: the Queueimpl7 run ends up with only half as many items as the []int slice (50M vs 100M pushes).

edit: Added memory note


Nice test @cznic!

  1. You are falling into this trap. This is not an apples to apples comparison.
  2. Unfortunately, a queue where data only gets pushed and nothing gets removed shouldn't have many real world applications. Full lifecycle tests are more representative of real world use cases.

@cznic @Merovius @egonelbre @bcmills You guys are tough! I love tough guys! Your challenges and contributions so far have been amazing! :-)

It seems to me that, in the same way you can't convince me that a pure slice based queue implementation is favorable over the proposed design (see "Design Considerations"), I won't be able to convince you that impl7 deserves to be part of the stdlib. In any case, your opinion is highly respected and appreciated!

Now, out of all the challenges to the proposal and the queue, below are what I believe to be the most relevant points made against the queue proposal and the queue itself.

  1. There are some clear edge cases (here, here) where the queue actually performs worse than pure slice based implementations for small and mid-sized data sets

  2. The queue could perform as well outside of the stdlib, as an independent package, as inside it (here, here)

  3. It's easy to build a performant queue using a simple slice. There's no need for a specialized implementation (here)

  4. There are so many different queue designs and implementations that we should not include a queue in the stdlib (here)

  5. Queue should not be included in the stdlib until generics gets sorted out (here, here, here, here, here)

  6. The queue needs real world data in order to help validate whether, and by how much, it would improve the performance of real world applications (here and here)

  7. Extending, growing the stdlib means more maintenance work (of my own)

  8. There's no real use case for an unbounded queue that is not safe for concurrent use (here)

Below are my thoughts on each point.

  1. Even with edge cases like this, impl7 still consumes ~1/3 of the memory used by the pure slice based implementation. For more real world like cases, impl7 is much faster or very competitive in virtually all scenarios

  2. Completely agree, but please consider these counter arguments

  3. Refer to the same comment here, as well as to how I was able to help @egonelbre fix his queue implementation here

  4. The proposal includes benchmarking the proposed solution against 6 other experimental
    queue designs and implementations, as well as the standard list package and a few external
    queue implementations such as phf, gammazero, and juju. Other suggested queue implementations were tested as well, such as @bcmills's proposed queue implementation
    and cookiejar. The benchmark tests are still very favorable to impl7 when compared to all tested queues.

  5. Completely agree with it, but please refer to this and this regarding my concerns

  6. I don't think this is a practical suggestion. Please refer here for details

  7. Consider that I plan to keep not only using, but also helping to contribute to Go. I plan to propose many other improvements, and I would very much like to get involved in maintaining Go as well, by helping with small bug fixes, validating other proposals, etc. By helping me help you, you'll gain yet another member who is willing to help not only extend, but also maintain Go. So in a way, you are not getting more work; it's less, because I'll also be helping. Please consider the health of the community as well: if you are this conservative in accepting new contributions, current and future contributors will probably think twice before proposing any.

  8. CloudLogger has a very clear case for that. Also, quoting Wikipedia: "Some common ADTs, which have proved useful in a great variety of applications, are", and Queue features prominently on that list. Even if we can't ourselves see many real world use cases for a queue, consider when Thomas Edison invented the light bulb, or when Henry Ford invented the car: initially a lot of people thought "we don't need light bulbs or cars; we're completely fine with our candles and horses!". To me the new design represents a significant step forward in queue designs, one that will last for the foreseeable future, because the design makes no assumption about how much data the struct will hold. The design dictates that the implementation will perform well no matter what,
    unlike pure slice based implementations that only perform well with small to mid-sized data sets (refer to all pure slice based implementations such as "BenchmarkImpl1" and "BenchmarkImpl2"). "Build useful things and people will find all sorts of uses for it."

To make it clear, I truly think the slice is a fantastic data structure. It's extremely flexible and was so brilliantly implemented in Go that I had a really hard time coming up with a design and implementation that could do better for a queue, even though slices, as-is,
are not really the best data structure for a queue. So don't fail to see that impl7 IS a slice based implementation. What all the benchmark tests are testing is, for the most part, how the Go slice implementation performs for small (128) sized slices. There's really no special, brand new directive or data structure in impl7.

Thanks very much, @egonelbre, for helping validate the design of the queue. Your implementation of the proposed design looks very good, and I'm very glad I was able to help you get it right. This kind of contribution, where the actual theory and design is validated through independent implementations, is most welcome! Also thanks very much for the insight into using head vs len. I look forward to working with you on my (and your!) next proposals!

Also a big thanks to @Merovius. Fantastic edge case you pointed out; I think it is a great case study for the design of generics. And a big thanks to @cznic for actively engaging in the discussion.

In my experience, often the biggest critics are also the biggest contributors. You guys did a fantastic job validating the queue. Thanks very much for your contribution to the queue and to the discussion.

As for the people who expressed support for the queue and the proposal, I can't be more appreciative of your time and support. Your support is highly appreciated!

Also a big thanks for the thumbs up, folks! :-)

I really can't afford to spend much more time on this discussion. I'll certainly keep an eye on the issue, but I'd very much like to get CloudLogger out, and due to this proposal I'm already 2 weeks (and counting) late now. I also have a few other ideas that I think have enormous potential. I'll keep investigating them in my spare time, and once I reach a final design and am able to prove it works, I'll submit proposals. Stay tuned!

In any case, I look forward to hearing the final decision on the matter. If the proposal is accepted, of course I'll invest whatever time is needed to get it fully tested and into the stdlib.

I'll keep an eye on the issue, so please keep commenting and expressing your support/concerns. I just don't plan to keep looking at the issue every 5 minutes moving forward. :-)

  1. I didn't say there aren't use cases, just that I haven't seen one lately.
  2. My code didn't have a bug, it was already clearing things properly, I merely pointed out the line where it was being done. Your conclusion about memory leaking was wrong.

CloudLogger has a very clear case for that.

Just to pick this one out (it applies to all the other things too): we simply disagree on this. I think your use case is a traditional example of why unbounded queues are not a good idea. They seem to offer convenience and simplicity on the surface, but in practice lead to cascading failures. Fundamentally, an unbounded queue is only useful if the consumer is slower than the producer, at least at times. But given that an unbounded queue, in particular an in-memory one, is never actually unbounded, if you don't actually handle that situation (e.g. by backpressure or by dropping messages), the only thing they get you is that you invariably end up in a situation where a failure of the consumer leads to a failure of the queue service, which leads to a failure of the producer. I don't think this math can really be changed: either queues can be bounded, or they contribute to cascading failure scenarios.

I didn't actually argue about this so far, because a) I don't think there can be an argument, because ultimately neither the assertion that unbounded queues are useful nor the assertion that they aren't can be verified or falsified and b) because I don't think it really matters - even assuming they are useful, it's IMO still not a good idea to have a generic version in the stdlib (for all the reasons mentioned).

I'm just trying to clarify that I understand that you feel you have good arguments on your side. Which is fair. But that doesn't mean we can't still disagree. Just as you feel you have good counters to all my arguments, I feel to have good counters to all your arguments :)

I really can't afford to spend much more time on this discussion. I'll certainly keep an eye on the issue, but I'd very much like to get CloudLogger out, and due to this proposal I'm already 2 weeks (and counting) late now. I also have a few other ideas that I think have enormous potential. I'll keep investigating them in my spare time, and once I reach a final design and am able to prove it works, I'll submit proposals. Stay tuned!

Looking forward to it. FWIW, I believe you put a bit more work into this proposal than necessary. It's admirable that you took the effort to explain your arguments in detail and benchmark many different implementations against each other. But I, at least, would have found it just as useful to simply present your proposed implementation as an example of a generic implementation that has different asymptotic behavior than a naive slice queue. The actual, detailed implementation questions can be sorted out after general agreement is reached that we want this package in the stdlib.

Not saying that to discourage you, but to maybe help you tune this tradeoff in your future proposals. Making the proposal simpler in the beginning and fleshing it out after the general ideas are accepted can only save you time :)

cznic commented
// Copyright 2018 The Issue27935 Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"testing"

	"github.com/christianrpetrin/queue-tests/queueimpl7"
)

var (
	q7   *queueimpl7.Queueimpl7
	q    []int
	tmpi interface{}
	tmp  int
	tmp2 bool
)

func testQ(b *testing.B, items *int, stats *runtime.MemStats) {
	q = nil
	for n := 0; n < b.N; n++ {
		q = append(q, n) // Push
		if n%3 == 0 {
			tmp = q[0]
			q = q[1:]
		}
	}
	b.StopTimer()
	*items = len(q)
	debug.FreeOSMemory()
	runtime.ReadMemStats(stats)
	b.StartTimer()
	for len(q) != 0 {
		tmp = q[0]
		q = q[1:]
	}
}

func testQ7(b *testing.B, items *int, stats *runtime.MemStats) {
	q7 = queueimpl7.New()
	for n := 0; n < b.N; n++ {
		q7.Push(n)
		if n%3 == 0 {
			tmpi, tmp2 = q7.Pop()
		}
	}
	b.StopTimer()
	*items = q7.Len()
	debug.FreeOSMemory()
	runtime.ReadMemStats(stats)
	b.StartTimer()
	for q7.Len() != 0 {
		tmpi, tmp2 = q7.Pop()
	}
}

func test(f func(*testing.B, *int, *runtime.MemStats), tag string) {
	var stats0, stats runtime.MemStats
	var items int
	q = nil
	q7 = nil
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats0)
	result := testing.Benchmark(func(b *testing.B) { f(b, &items, &stats) })
	memUsed := stats.Alloc - stats0.Alloc
	fmt.Printf("%10s: %9v ops, %4.1f ns/op, memory used %9v, items %7v, %5.2f B/item\n",
		tag, result.N, float64(result.T)/float64(result.N), memUsed, items, float64(memUsed)/float64(items),
	)
}

func main() {
	test(testQ, "[]int")
	test(testQ7, "Queueimpl7")
}
==== jnml@4670:~/src/github.com/cznic/issue27935> ./issue27935 
     []int: 100000000 ops, 15.8 ns/op, memory used 595890688, items 66666666,  8.94 B/item
Queueimpl7:  50000000 ops, 38.0 ns/op, memory used 808335248, items 33333333, 24.25 B/item
==== jnml@4670:~/src/github.com/cznic/issue27935> 

This is the edge case identified and talked about (here and here). We hope generics will fix this. For a more real world like scenario, please refer to the performance of impl1 as impl1 uses "interface{}" as its data type instead of int (much more flexible).

For a more real world like scenario

I disagree with the characterization of interface{} as "more real world like". I don't think there is anything more realistic about pointer-shaped values than non-pointer-shaped ones; on the contrary, actually. I can't think of a good reason why things in a queue should be pointers. ISTM they will, in general, represent plain data.

much more flexible

The point is specifically (and it would be helpful if you'd acknowledge that instead of dismissing it as an "edge case") that this flexibility incurs a cost, in the form of more allocations and indirections. The generic queue implementation is slower. And FTR, I predict that generics won't change that either (though we'll have to see that once they are implemented).
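The allocation cost of interface{} described above can be observed directly with testing.AllocsPerRun. This is a minimal, self-contained sketch (not taken from the proposal's benchmarks); note the Go runtime interns small integers, so values above 255 are used here to make the boxing allocation visible:

```go
package main

import (
	"fmt"
	"testing"
)

// allocsPushInt counts heap allocations when appending ints to a []int
// that already has capacity: values are stored inline, so none occur.
func allocsPushInt() float64 {
	q := make([]int, 0, 8)
	n := 1000
	return testing.AllocsPerRun(100, func() {
		n++
		q = append(q[:0], n)
	})
}

// allocsPushIface counts heap allocations when appending the same ints to
// a []interface{}: each int must first be boxed onto the heap.
func allocsPushIface() float64 {
	q := make([]interface{}, 0, 8)
	n := 1000
	return testing.AllocsPerRun(100, func() {
		n++
		q = append(q[:0], n)
	})
}

func main() {
	fmt.Printf("[]int:         %.0f allocs per push\n", allocsPushInt())
	fmt.Printf("[]interface{}: %.0f allocs per push\n", allocsPushIface())
}
```

On typical Go versions the []int case reports zero allocations per push while the []interface{} case reports one, which is exactly the extra cost being discussed.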

rsc commented

At the least, this proposal should be marked as "on hold for generics". But really I think this should be done outside the standard library and prove its necessity first. We have not been creating new packages from scratch in the standard library for quite some time.

@rsc I agree we should wait for the generics discussion to take place. In the meantime, I'll try to validate the proposed solution as much as possible by pinging golang-nuts, making a proper external package, etc. The biggest problem about exposing the queue as an external package only is how to bring awareness about it. Marketing is a gigantic problem that I can't solve by myself. It will probably take years before the package really takes off and people start using it and report their experiences.

In any case, I'll keep an eye and try to join the generics discussion as well.

Thanks for taking a look at the proposal!

Given the many suggestions, I have built a new queue and deployed it as a proper package. The new queue is based on impl7 and is actually a deque now. Appreciate the suggestions (@phf, @bcmills) as a deque is indeed a much more flexible data structure that can be used not only as a FIFO queue, but also as LIFO stack and as a deque with mixed push/pop.

The new deque is a considerably more advanced data structure than impl7, both because it is now a deque and because it has a new design.

The new design leverages the linked-slices design by turning it into an auto-shrinking ring. The new design improved impl7's performance and efficiency quite dramatically in most tests, although it is now a bit slower on the fill tests. Refer here for the design details.
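For readers unfamiliar with the idea, here is a minimal sketch of an auto-shrinking ring buffer. It only illustrates the general grow/shrink-on-a-ring concept; the actual deque uses a more elaborate linked-slices ring design, and all names here are made up for illustration:

```go
package main

import "fmt"

// ring is a minimal slice-backed ring buffer that doubles its buffer when
// full and halves it when mostly empty. Illustrative only.
type ring struct {
	buf        []interface{}
	head, size int
}

func newRing() *ring { return &ring{buf: make([]interface{}, 4)} }

func (r *ring) Len() int { return r.size }

func (r *ring) PushBack(v interface{}) {
	if r.size == len(r.buf) {
		r.resize(len(r.buf) * 2) // grow: double the buffer
	}
	r.buf[(r.head+r.size)%len(r.buf)] = v
	r.size++
}

func (r *ring) PopFront() (interface{}, bool) {
	if r.size == 0 {
		return nil, false
	}
	v := r.buf[r.head]
	r.buf[r.head] = nil // let the GC reclaim the popped value
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	// shrink: release memory once the buffer is mostly empty
	if len(r.buf) > 4 && r.size <= len(r.buf)/4 {
		r.resize(len(r.buf) / 2)
	}
	return v, true
}

// resize copies the live elements into a fresh buffer of length n.
func (r *ring) resize(n int) {
	b := make([]interface{}, n)
	for i := 0; i < r.size; i++ {
		b[i] = r.buf[(r.head+i)%len(r.buf)]
	}
	r.buf, r.head = b, 0
}

func main() {
	q := newRing()
	for i := 0; i < 10; i++ {
		q.PushBack(i)
	}
	for q.Len() > 0 {
		v, _ := q.PopFront()
		fmt.Print(v, " ") // prints 0 1 2 3 4 5 6 7 8 9
	}
	fmt.Println()
}
```

The key property over a plain slice queue is that popped slots are reused via the wrapping head index, so steady-state push/pop causes no reallocation at all.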

[benchmark chart: ns/op, deque vs impl7]

I have also implemented many new unit, integration, API and, especially, benchmark tests. @egonelbre, much appreciated for the test suggestions and for suggesting I benchmark cookiejar's queue.

Given the need for real-world usage data, I need to somehow get people to start using it. I'll do my best to raise awareness about the deque, but this is an area where I really need help. If you are using a queue/stack/deque, please consider switching to the new deque. Also, if you know someone who is using a queue/stack/deque, please let them know about it.

Below are a few selected tests of deque vs impl7.

deque vs impl7 - FIFO queue - fill tests

benchstat testdata/BenchmarkFillDequeQueue.txt testdata/BenchmarkFillImpl7Queue.txt
name        old time/op    new time/op    delta
/0-4          39.9ns ± 9%    35.9ns ± 3%   -9.88%  (p=0.000 n=10+9)
/1-4           142ns ± 1%     133ns ± 1%   -6.61%  (p=0.000 n=10+10)
/10-4          636ns ± 1%     764ns ± 7%  +20.26%  (p=0.000 n=9+9)
/100-4        4.74µs ± 3%    4.28µs ± 3%   -9.67%  (p=0.000 n=9+9)
/1000-4       43.0µs ±23%    38.8µs ± 7%   -9.84%  (p=0.004 n=10+10)
/10000-4       450µs ±19%     388µs ± 5%  -13.70%  (p=0.001 n=10+10)
/100000-4     4.24ms ± 4%    3.95ms ± 2%   -6.84%  (p=0.000 n=10+8)
/1000000-4    46.8ms ± 1%    45.9ms ± 4%   -1.87%  (p=0.008 n=8+9)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     48.0B ± 0%  -25.00%  (p=0.000 n=10+10)
/1-4            144B ± 0%      112B ± 0%  -22.22%  (p=0.000 n=10+10)
/10-4           608B ± 0%      736B ± 0%  +21.05%  (p=0.000 n=10+10)
/100-4        6.19kB ± 0%    4.26kB ± 0%  -31.27%  (p=0.000 n=10+10)
/1000-4       33.0kB ± 0%    33.2kB ± 0%   +0.58%  (p=0.000 n=10+10)
/10000-4       322kB ± 0%     323kB ± 0%   +0.23%  (p=0.000 n=10+10)
/100000-4     3.22MB ± 0%    3.23MB ± 0%   +0.20%  (p=0.000 n=10+10)
/1000000-4    32.2MB ± 0%    32.3MB ± 0%   +0.19%  (p=0.000 n=10+9)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      1.00 ± 0%     ~     (all equal)
/1-4            4.00 ± 0%      4.00 ± 0%     ~     (all equal)
/10-4           15.0 ± 0%      17.0 ± 0%  +13.33%  (p=0.000 n=10+10)
/100-4           107 ± 0%       109 ± 0%   +1.87%  (p=0.000 n=10+10)
/1000-4        1.01k ± 0%     1.02k ± 0%   +0.99%  (p=0.000 n=10+10)
/10000-4       10.1k ± 0%     10.2k ± 0%   +0.79%  (p=0.000 n=10+10)
/100000-4       101k ± 0%      102k ± 0%   +0.78%  (p=0.000 n=10+10)
/1000000-4     1.01M ± 0%     1.02M ± 0%   +0.78%  (p=0.000 n=10+10)

deque vs impl7 - FIFO queue - refill tests

benchstat testdata/BenchmarkRefillDequeQueue.txt testdata/BenchmarkRefillImpl7Queue.txt
name       old time/op    new time/op    delta
/1-4         3.79µs ± 1%   10.03µs ± 6%  +165.06%  (p=0.000 n=9+10)
/10-4        37.8µs ± 7%    74.7µs ± 5%   +97.58%  (p=0.000 n=10+10)
/100-4        361µs ± 7%     442µs ± 4%   +22.25%  (p=0.000 n=9+10)
/1000-4      3.75ms ± 4%    3.97ms ± 3%    +6.10%  (p=0.000 n=10+9)
/10000-4     36.5ms ± 3%    39.1ms ± 6%    +7.17%  (p=0.000 n=10+10)
/100000-4     380ms ± 1%     421ms ±24%   +10.67%  (p=0.000 n=10+9)

name       old alloc/op   new alloc/op   delta
/1-4         1.60kB ± 0%    6.40kB ± 0%  +300.00%  (p=0.000 n=10+10)
/10-4        16.0kB ± 0%    68.8kB ± 0%  +330.00%  (p=0.000 n=10+10)
/100-4        160kB ± 0%     421kB ± 0%  +163.00%  (p=0.000 n=8+10)
/1000-4      2.42MB ± 0%    3.32MB ± 0%   +37.30%  (p=0.000 n=10+10)
/10000-4     17.0MB ± 0%    32.3MB ± 0%   +89.36%  (p=0.000 n=8+10)
/100000-4     162MB ± 0%     323MB ± 0%   +99.52%  (p=0.000 n=8+8)

name       old allocs/op  new allocs/op  delta
/1-4            100 ± 0%       300 ± 0%  +200.00%  (p=0.000 n=10+10)
/10-4         1.00k ± 0%     1.60k ± 0%   +60.00%  (p=0.000 n=10+10)
/100-4        10.0k ± 0%     10.8k ± 0%    +8.00%  (p=0.000 n=10+10)
/1000-4        100k ± 0%      102k ± 0%    +1.80%  (p=0.000 n=10+10)
/10000-4      1.00M ± 0%     1.02M ± 0%    +1.57%  (p=0.000 n=8+10)
/100000-4     10.0M ± 0%     10.2M ± 0%    +1.56%  (p=0.000 n=10+10)

Although impl7 marginally outperforms deque on the fill tests, deque outperforms impl7
on all other tests, by large margins in most cases, especially when it comes to memory usage.
The refill and stable tests (below) are just two examples.

deque vs impl7 - FIFO queue - stable tests

benchstat testdata/BenchmarkStableDequeQueue.txt testdata/BenchmarkStableImpl7Queue.txt
name        old time/op    new time/op    delta
/1-4          34.8ns ± 1%    38.1ns ± 5%    +9.48%  (p=0.000 n=10+9)
/10-4          353ns ± 2%     391ns ± 7%   +10.92%  (p=0.000 n=10+10)
/100-4        3.45µs ± 1%    4.13µs ±19%   +19.82%  (p=0.000 n=10+10)
/1000-4       34.5µs ± 1%    39.6µs ±12%   +14.95%  (p=0.000 n=10+10)
/10000-4       346µs ± 2%     385µs ± 4%   +11.40%  (p=0.000 n=10+10)
/100000-4     3.43ms ± 1%    3.85ms ±11%   +12.11%  (p=0.000 n=10+10)
/1000000-4    34.4ms ± 1%    38.3ms ± 6%   +11.13%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/1-4           16.0B ± 0%     32.0B ± 0%  +100.00%  (p=0.000 n=10+10)
/10-4           160B ± 0%      322B ± 0%  +101.25%  (p=0.000 n=10+10)
/100-4        1.60kB ± 0%    3.23kB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000-4       16.0kB ± 0%    32.2kB ± 0%  +101.56%  (p=0.000 n=10+10)
/10000-4       160kB ± 0%     322kB ± 0%  +101.56%  (p=0.000 n=10+10)
/100000-4     1.60MB ± 0%    3.23MB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000000-4    16.0MB ± 0%    32.3MB ± 0%  +101.56%  (p=0.000 n=10+10)

name        old allocs/op  new allocs/op  delta
/1-4            1.00 ± 0%      1.00 ± 0%      ~     (all equal)
/10-4           10.0 ± 0%      10.0 ± 0%      ~     (all equal)
/100-4           100 ± 0%       101 ± 0%    +1.00%  (p=0.000 n=10+10)
/1000-4        1.00k ± 0%     1.01k ± 0%    +1.50%  (p=0.000 n=10+10)
/10000-4       10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=10+10)
/100000-4       100k ± 0%      102k ± 0%    +1.56%  (p=0.000 n=10+10)
/1000000-4     1.00M ± 0%     1.02M ± 0%    +1.56%  (p=0.000 n=10+10)

Refer here for all benchmark tests.

The new package is production ready and I have released its first version. It can be found here.

Regarding the suggestions for more tests (here, here) and to gather real-world data, @bcmills, @egonelbre, do the new benchmark tests help address your concerns?

Regarding the questions about whether a FIFO queue is actually useful (here, here), @egonelbre, @rsc, does a deque help address your concerns? Please consider that a deque can be used not only as a FIFO queue, but also as a LIFO stack, as well as with mixed push/pop.

Regarding the concerns about whether Go needs a specialized queue or whether a simple slice used as a queue is good enough (here), @Merovius, does the new queue's performance vs. a slice help address your concerns? In the wake of all the new benchmark tests and the much-improved deque design, do you still feel that a naive deque implementation that uses a slice, such as CustomSliceQueue, is the best option?

tv42 commented

Marketing is a gigantic problem that I can't solve by myself.

I'm an outsider to this conversation, but please take this point: the Go stdlib (or golang.org/x) should not be a marketing mechanism for your code; especially not for its clever implementation. That line of thinking can only decrease the stdlib's technical quality.

Marketing is a gigantic problem that I can't solve by myself.

I'm an outsider to this conversation, but please take this point: the Go stdlib (or golang.org/x) should not be a marketing mechanism for your code; especially not for its clever implementation. That line of thinking can only decrease the stdlib's technical quality.

@tv42 , thanks for joining the discussion!

I realized (and mentioned) the marketing problem only after it was proposed to build an external queue package and get real, production systems to use it to validate its performance. My point was that I need help raising awareness about the Deque package so people would consider using it. It was never my intention to publish an external package in the first place. The goal was to propose a very efficient queue implementation to be added to the standard library. For that, there should be no need to market the deque in any shape or form other than proposing it here.

Wanted to throw in a use case here as I've been searching for a queue-like solution and none of the existing implementations I can find address it (which has me suspicious I'm thinking about the problem the wrong way, or focused on the wrong solution, but ignoring that for now).
The difficulty I'm encountering is the need for a queue where I can peek at the entire queue.

It might make sense with an example: let's say the app is sending messages to a remote system asynchronously. So it sends a number of messages individually (batching handled transparently), and gets an acknowledgement some time in the future for some, or all, of the messages. The app needs to handle things like losing its connection, so it has to hang on to the messages until it receives that acknowledgement. If it does lose the connection, once it's reestablished, it has to resend them.

At first this seems like a pretty straightforward unbounded circular queue: something that is usually small, but can grow when needed, and can handle frequent push/pop without reallocation. However, that difficulty I mentioned arises when the connection is reestablished. The app needs to resend the messages, so it has to iterate the queue. But it can't remove messages from the queue until they've been acknowledged. All the queue implementations I've seen (including the one proposed) only allow you to peek at the first item, meaning that we'd have to send one message and wait for its ack before proceeding to the next. This can be extremely slow, and without batching, it may never catch back up to real time.

@phemmer I think you could also remove an element and add it again (to the other end). But also, wouldn't something like var unackedMsgs map[msgID]message make far more sense? After all, that allows you to drop the requirement that messages are ack'ed in the same order they are sent. Lastly, if you actually want an ordered queue, then why not use a slice? It allows you to iterate over it and it can grow dynamically.

Removing and adding to the end destroys ordering. The connection could die halfway through the resending, and the message source is still adding its own messages to the queue as well.
Messages are always acked by the remote end in the order that they were sent. It'll never ack message 10 without message 9 having been acked.

I've considered using a map like you suggest, but it would mean that when resending, I'd have to grab the keys and sort them before resending. And as the message source is still adding messages, this may have to be done several times, skipping the ones that are now in flight. Whereas with a queue, I could just iterate until I reach the end. But it's not a terrible idea.

As for why not use a slice: because of the high amount of reallocation, or the manual work to prevent reallocation. As messages are added and shifted off the front, the slice would either have to be reallocated or the contents moved over.

But if we don't think the ability to peek at the whole queue is something the proposed solution should address, then this can be dropped. Don't want to hijack this issue for solving my problem. Just wanted to throw the use case out for consideration.
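For what it's worth, the "peek at the whole queue" requirement can be layered on top of a head-indexed slice with an explicit iteration method. The sketch below is hypothetical (the names msgQueue, Ack, and Pending are made up for this use case and are not part of the proposed API):

```go
package main

import "fmt"

// msgQueue is a hypothetical queue that keeps messages until they are
// acknowledged, while still allowing iteration over everything pending.
type msgQueue struct {
	buf  []string
	head int // index of the oldest unacknowledged message
}

// Push enqueues an outgoing message.
func (q *msgQueue) Push(m string) { q.buf = append(q.buf, m) }

// Ack removes the oldest message once the remote end acknowledges it
// (the remote end acks strictly in send order).
func (q *msgQueue) Ack() {
	if q.head < len(q.buf) {
		q.head++
	}
	// Periodically drop the acknowledged prefix so the backing
	// array can be reused instead of growing forever.
	if q.head > len(q.buf)/2 {
		q.buf = append(q.buf[:0], q.buf[q.head:]...)
		q.head = 0
	}
}

// Pending calls f for every unacknowledged message in FIFO order,
// e.g. to resend them all after a reconnect without popping any.
func (q *msgQueue) Pending(f func(string)) {
	for _, m := range q.buf[q.head:] {
		f(m)
	}
}

func main() {
	var q msgQueue
	q.Push("a")
	q.Push("b")
	q.Push("c")
	q.Ack() // "a" acknowledged
	q.Pending(func(m string) { fmt.Print(m, " ") }) // prints b c
	fmt.Println()
}
```

The acknowledged-prefix compaction keeps amortized cost low, which addresses the reallocation concern above without giving up whole-queue iteration.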

@christianrpetrin If we are going to add container/deque, then it should probably support generics from the very beginning. Perhaps we should also wait for more clarity on how #48287 will affect API design?

P.S. I think your work is great and I’m not asking you to rewrite it. I’m only pointing out that it would be a real shame to implement a pre-generics API for a stdlib container.

my implementation https://github.com/alphadose/ZenQ

also this -> #52652

@alphadose this is about implementing an unbounded queue.

Generics are out. Should we remove the "Hold"?

This issue discusses too many implementation details. If I understand the proposal correctly, with the current generics design, we are likely to propose a set of APIs as follows?

package queue

// Queue implements a FIFO queue.
// The zero value of Queue is an empty queue ready to use.
type Queue[T any] struct

func (q *Queue[T]) Push(v T)
func (q *Queue[T]) Pop() T
func (q *Queue[T]) Len() int
func (q *Queue[T]) Front() T
func (q *Queue[T]) Tail() T
func (q *Queue[T]) Clear()

+1 to the above. We could also create a Go STL, much like the C++ STL.

P.S.: it would be even better if we could make the STL thread-safe by default, which is not the case in C++.

Per #27935 (comment) this should be done outside the standard library first. Has that happened?

It has, since day 1: the proposal also included an implementation that lives here: https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go, although it is not a generic version.

A good candidate is https://github.com/liyue201/gostl.
It covers most data structures and algorithms, but it is not thread-safe and doesn't use generics.
With a bit of modification, we could have a thread-safe, generic Go STL.

I added a generic implementation of Queueimpl7 (fork); here is a PR containing the changes I made. The benchmark results suggest that the generic version is even faster.

Benchmark results
go test -bench=. benchmark_test.go
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkList/0-16          	36096993	        33.10 ns/op
BenchmarkList/1-16          	15740581	        72.43 ns/op
BenchmarkList/10-16         	 2861898	       425.1 ns/op
BenchmarkList/100-16        	  263336	      4220 ns/op
BenchmarkList/1000-16       	   25462	     47118 ns/op
BenchmarkList/10000-16      	    2110	    534454 ns/op
BenchmarkList/100000-16     	     127	   9137029 ns/op
BenchmarkChannel/0-16       	35017080	        31.85 ns/op
BenchmarkChannel/1-16       	15119679	        76.04 ns/op
BenchmarkChannel/10-16      	 2512936	       470.3 ns/op
BenchmarkChannel/100-16     	  249133	      4416 ns/op
BenchmarkChannel/1000-16    	   23700	     49719 ns/op
BenchmarkChannel/10000-16   	    2091	    527633 ns/op
BenchmarkChannel/100000-16  	     219	   5352969 ns/op
BenchmarkGammazero/0-16     	1000000000	         1.158 ns/op
BenchmarkGammazero/1-16     	23397309	        44.38 ns/op
BenchmarkGammazero/10-16    	 9304092	       127.2 ns/op
BenchmarkGammazero/100-16   	  668485	      1562 ns/op
BenchmarkGammazero/1000-16  	   57878	     20585 ns/op
BenchmarkGammazero/10000-16 	    4836	    221882 ns/op
BenchmarkGammazero/100000-16         	     477	   2449230 ns/op
BenchmarkPhf/0-16                    	49410674	        23.23 ns/op
BenchmarkPhf/1-16                    	35873148	        31.50 ns/op
BenchmarkPhf/10-16                   	 2566104	       458.0 ns/op
BenchmarkPhf/100-16                  	  405482	      2672 ns/op
BenchmarkPhf/1000-16                 	   45190	     26028 ns/op
BenchmarkPhf/10000-16                	    4074	    266890 ns/op
BenchmarkPhf/100000-16               	     237	   5075018 ns/op
BenchmarkJuju/0-16                   	 3639322	       324.1 ns/op
BenchmarkJuju/1-16                   	 3553687	       337.7 ns/op
BenchmarkJuju/10-16                  	 2811319	       427.6 ns/op
BenchmarkJuju/100-16                 	  624660	      1837 ns/op
BenchmarkJuju/1000-16                	   56377	     21430 ns/op
BenchmarkJuju/10000-16               	    4329	    241211 ns/op
BenchmarkJuju/100000-16              	     369	   3212147 ns/op
BenchmarkImpl1/0-16                  	286230117	         4.186 ns/op
BenchmarkImpl1/1-16                  	28610796	        36.27 ns/op
BenchmarkImpl1/10-16                 	 3347899	       362.3 ns/op
BenchmarkImpl1/100-16                	  680595	      1635 ns/op
BenchmarkImpl1/1000-16               	   52677	     21847 ns/op
BenchmarkImpl1/10000-16              	    4353	    246534 ns/op
BenchmarkImpl1/100000-16             	     136	   8901747 ns/op
BenchmarkImpl2/0-16                  	278119263	         4.377 ns/op
BenchmarkImpl2/1-16                  	28694433	        36.99 ns/op
BenchmarkImpl2/10-16                 	 3432369	       355.2 ns/op
BenchmarkImpl2/100-16                	  650058	      1645 ns/op
BenchmarkImpl2/1000-16               	   54530	     21765 ns/op
BenchmarkImpl2/10000-16              	    3944	    286462 ns/op
BenchmarkImpl2/100000-16             	     136	   8631269 ns/op
BenchmarkImpl3/0-16                  	 2795673	       422.7 ns/op
BenchmarkImpl3/1-16                  	 2726344	       431.1 ns/op
BenchmarkImpl3/10-16                 	 2419592	       488.0 ns/op
BenchmarkImpl3/100-16                	  999382	      1041 ns/op
BenchmarkImpl3/1000-16               	   70945	     16683 ns/op
BenchmarkImpl3/10000-16              	    6306	    192223 ns/op
BenchmarkImpl3/100000-16             	     532	   2215646 ns/op
BenchmarkImpl4/0-16                  	 3473112	       342.2 ns/op
BenchmarkImpl4/1-16                  	 3378958	       353.7 ns/op
BenchmarkImpl4/10-16                 	 2882436	       416.4 ns/op
BenchmarkImpl4/100-16                	 1219239	       980.0 ns/op
BenchmarkImpl4/1000-16               	   74776	     16138 ns/op
BenchmarkImpl4/10000-16              	    5860	    193688 ns/op
BenchmarkImpl4/100000-16             	     452	   2592877 ns/op
BenchmarkImpl5/0-16                  	 3095726	       383.5 ns/op
BenchmarkImpl5/1-16                  	 2985129	       405.7 ns/op
BenchmarkImpl5/10-16                 	 2607858	       460.0 ns/op
BenchmarkImpl5/100-16                	  999908	      1015 ns/op
BenchmarkImpl5/1000-16               	   69468	     16653 ns/op
BenchmarkImpl5/10000-16              	    5889	    197066 ns/op
BenchmarkImpl5/100000-16             	     415	   2834379 ns/op
BenchmarkImpl6/0-16                  	717184951	         1.692 ns/op
BenchmarkImpl6/1-16                  	17817142	        63.48 ns/op
BenchmarkImpl6/10-16                 	 3145177	       386.9 ns/op
BenchmarkImpl6/100-16                	  650286	      1635 ns/op
BenchmarkImpl6/1000-16               	   62350	     19002 ns/op
BenchmarkImpl6/10000-16              	    5012	    213779 ns/op
BenchmarkImpl6/100000-16             	     499	   2388561 ns/op
BenchmarkImpl7/0-16                  	1000000000	         1.005 ns/op
BenchmarkImpl7/1-16                  	19193989	        59.29 ns/op
BenchmarkImpl7/10-16                 	 3058149	       400.5 ns/op
BenchmarkImpl7/100-16                	  724555	      1444 ns/op
BenchmarkImpl7/1000-16               	   68046	     17667 ns/op
BenchmarkImpl7/10000-16              	    5490	    199761 ns/op
BenchmarkImpl7/100000-16             	     526	   2275840 ns/op
BenchmarkImpl7Generic/0-16           	1000000000	         1.015 ns/op
BenchmarkImpl7Generic/1-16           	21470794	        49.56 ns/op
BenchmarkImpl7Generic/10-16          	 4868738	       242.6 ns/op
BenchmarkImpl7Generic/100-16         	  929288	      1138 ns/op
BenchmarkImpl7Generic/1000-16        	   72534	     16419 ns/op
BenchmarkImpl7Generic/10000-16       	    6132	    188530 ns/op
BenchmarkImpl7Generic/100000-16      	     644	   1865776 ns/op
BenchmarkCookiejar/0-16              	  135667	      8730 ns/op
BenchmarkCookiejar/1-16              	  136633	      8764 ns/op
BenchmarkCookiejar/10-16             	  135256	      8783 ns/op
BenchmarkCookiejar/100-16            	  125581	      9418 ns/op
BenchmarkCookiejar/1000-16           	   49455	     24119 ns/op
BenchmarkCookiejar/10000-16          	    5970	    193347 ns/op
BenchmarkCookiejar/100000-16         	     519	   2263696 ns/op
BenchmarkBcmills/0-16                	527711010	         2.181 ns/op
BenchmarkBcmills/1-16                	30507412	        37.13 ns/op
BenchmarkBcmills/10-16               	 1707492	       706.4 ns/op
BenchmarkBcmills/100-16              	  475962	      2306 ns/op
BenchmarkBcmills/1000-16             	   33268	     35265 ns/op
BenchmarkBcmills/10000-16            	    3345	    331535 ns/op
BenchmarkBcmills/100000-16           	      84	  11991809 ns/op
PASS
ok  	command-line-arguments	146.154s

Completely agree, @sfllaw! Any first Deque version that makes it into the STL should support generics, so I added support for generics in Deque v2.0.0. Check it out! https://github.com/ef-ds/deque

I appreciate the kind words!

@changkun , @ianlancetaylor , regarding your comments here and here.

@changkun, yes, completely agree we should remove the hold and move forward with the proposal.

@ianlancetaylor, yes, a double-ended queue, which is a major improvement over the initially proposed FIFO queue, was implemented and released here in 2018. Currently, the released Deque package has over 200 open source repositories importing it in over 400 packages, including some large open source projects such as go-micro, ziti, flow, etc.! Unfortunately, GitHub doesn't give me info about private repos importing it, but the number of git clones, which sits at almost 150 every single day, suggests Deque is likely being used by thousands of private projects as well. Check out below some of Deque's usage metrics.

[screenshots: Deque usage metrics]

I believe the proposal should be approved for the following main reasons:

  • The many great suggestions the OSS community posted in this proposal (as well as via email exchanges in golang-nuts) have been addressed. A few of the great addressed suggestions are:
    • Added additional tests: here, here (thanks @bcmills, @egonelbre)
    • Make it a double ended queue instead of simple FIFO queue: here (thanks @phf, others)
    • Deque has been released outside of the std library (request) and I believe, given its current usage, it has proven to be useful (thanks @rsc)
  • Deque's implementation has been validated, and I believe the current code has no bugs/issues, as no new issues have been filed since 27/04/2020 (almost 3 years now). The last one was this one
  • Deque's performance is unrivaled, even when tested against the best designed and implemented open source queues available. Not only does it perform incredibly well in all test cases, but it still manages to do so with an unmatched low memory footprint. Refer here for details
  • Generics are out! Deque fully supports generics now and performs even better than before. Refer here for details (btw, great job implementing generics, Go team, especially @ianlancetaylor!)

Generics are out. Should we remove the "Hold"?

This issue discusses too many implementation details. If I understand the proposal correctly, with the current generics design, we are likely to propose a set of APIs as follows?

package queue

// Queue implements a FIFO queue.
// The zero value of Queue is an empty queue ready to use.
type Queue[T any] struct

func (q *Queue[T]) Push(v T)
func (q *Queue[T]) Pop() T
func (q *Queue[T]) Len() int
func (q *Queue[T]) Front() T
func (q *Queue[T]) Tail() T
func (q *Queue[T]) Clear()

@changkun, regarding the API, Deque's current API is as follows.

Should the proposal be approved, I propose we go with this design. However, I'm completely open to reviewing and making changes to the API design and pretty much anywhere else in the source code and documentation, should the community agree with the changes.

type Deque[T any] struct

func New[T any]() *Deque[T]
func (d *Deque[T]) Init() *Deque[T] // Init initializes or clears deque d.
func (d *Deque[T]) Len() int
func (d *Deque[T]) Front() (T, bool)
func (d *Deque[T]) Back() (T, bool)
func (d *Deque[T]) PushFront(v T)
func (d *Deque[T]) PushBack(v T)
func (d *Deque[T]) PopFront() (T, bool)
func (d *Deque[T]) PopBack() (T, bool)

Note: The second, bool result in Front, Back, PopFront and PopBack indicates whether a valid value was returned; if the deque is empty, false is returned. This follows the same idea used by maps (i.e. "value, ok := myMap["key"]").
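To make the (T, bool) convention concrete, here is a hedged usage sketch. The Deque below is a deliberately naive slice-backed stand-in that only mirrors the method signatures of the proposed API; it is not the ef-ds/deque implementation, which uses a far more efficient ring design:

```go
package main

import "fmt"

// Deque is a naive slice-backed stand-in mirroring the proposed API's
// method signatures, for illustrating the (T, bool) return convention.
type Deque[T any] struct{ items []T }

func New[T any]() *Deque[T] { return &Deque[T]{} }

func (d *Deque[T]) Len() int { return len(d.items) }

func (d *Deque[T]) PushBack(v T) { d.items = append(d.items, v) }

func (d *Deque[T]) PushFront(v T) {
	d.items = append([]T{v}, d.items...)
}

func (d *Deque[T]) PopFront() (T, bool) {
	var zero T
	if len(d.items) == 0 {
		return zero, false // empty: ok == false, like a missing map key
	}
	v := d.items[0]
	d.items = d.items[1:]
	return v, true
}

func (d *Deque[T]) PopBack() (T, bool) {
	var zero T
	if len(d.items) == 0 {
		return zero, false
	}
	v := d.items[len(d.items)-1]
	d.items = d.items[:len(d.items)-1]
	return v, true
}

func main() {
	d := New[int]()
	d.PushBack(1)
	d.PushBack(2)
	d.PushFront(0)

	for {
		v, ok := d.PopFront() // same shape as v, ok := m[key]
		if !ok {
			break
		}
		fmt.Print(v, " ") // prints 0 1 2
	}
	fmt.Println()
}
```

The benefit of the (T, bool) shape is that an empty deque is an ordinary, checkable condition rather than a panic or a sentinel value, which matters for a generic container where no universal "nil" exists.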

@alphadose, thanks for highlighting GoSTL's deque. I agree GoSTL is a good deque; I personally really like its segmented design. However, I added GoSTL's deque to Deque's benchmark tests and it doesn't perform very well against Deque. As an example, check out below the results of the Microservice test.

benchstat testdata/BenchmarkMicroserviceDequeQueue.txt testdata/BenchmarkMicroserviceGostlQueue.txt
name        old time/op    new time/op    delta
/0-4          45.9ns ± 0%    87.9ns ± 2%   +91.81%  (p=0.000 n=8+9)
/1-4           431ns ± 1%    1243ns ± 1%  +188.38%  (p=0.000 n=9+9)
/10-4         2.70µs ± 1%    7.55µs ± 1%  +179.42%  (p=0.000 n=9+10)
/100-4        23.8µs ± 2%    68.3µs ± 1%  +187.27%  (p=0.000 n=10+10)
/1000-4        235µs ± 4%     685µs ± 1%  +191.26%  (p=0.000 n=10+9)
/10000-4      2.29ms ± 3%    6.86ms ± 2%  +199.21%  (p=0.000 n=10+9)
/100000-4     26.0ms ± 1%    71.3ms ± 1%  +174.80%  (p=0.000 n=10+10)
/1000000-4     275ms ± 2%     724ms ± 2%  +163.62%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     88.0B ± 0%   +37.50%  (p=0.000 n=10+10)
/1-4            384B ± 0%     1352B ± 0%  +252.08%  (p=0.000 n=10+10)
/10-4         1.90kB ± 0%    2.60kB ± 0%   +36.55%  (p=0.000 n=10+10)
/100-4        16.2kB ± 0%    14.8kB ± 0%    -8.26%  (p=0.000 n=10+10)
/1000-4        123kB ± 0%     156kB ± 0%   +26.23%  (p=0.000 n=10+10)
/10000-4      1.28MB ± 0%    1.54MB ± 0%   +20.67%  (p=0.000 n=10+10)
/100000-4     12.8MB ± 0%    15.4MB ± 0%   +19.80%  (p=0.000 n=10+10)
/1000000-4     128MB ± 0%     154MB ± 0%   +19.71%  (p=0.000 n=9+10)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/1-4            11.0 ± 0%      18.0 ± 0%   +63.64%  (p=0.000 n=10+10)
/10-4           75.0 ± 0%     101.0 ± 0%   +34.67%  (p=0.000 n=10+10)
/100-4           709 ± 0%       911 ± 0%   +28.49%  (p=0.000 n=10+10)
/1000-4        7.01k ± 0%     9.07k ± 0%   +29.32%  (p=0.000 n=10+10)
/10000-4       70.2k ± 0%     90.4k ± 0%   +28.86%  (p=0.000 n=10+10)
/100000-4       702k ± 0%      903k ± 0%   +28.75%  (p=0.000 n=10+10)
/1000000-4     7.02M ± 0%     9.03M ± 0%   +28.73%  (p=0.000 n=10+10)

benchstat testdata/BenchmarkMicroserviceDequeStack.txt testdata/BenchmarkMicroserviceGostlStack.txt
name        old time/op    new time/op    delta
/0-4          45.9ns ± 1%    87.1ns ± 1%   +89.85%  (p=0.000 n=9+8)
/1-4           360ns ± 1%    1307ns ± 3%  +263.31%  (p=0.000 n=9+9)
/10-4         2.53µs ± 0%    8.37µs ± 1%  +231.01%  (p=0.000 n=8+10)
/100-4        23.2µs ± 2%    76.4µs ± 0%  +228.90%  (p=0.000 n=10+8)
/1000-4        229µs ± 1%     770µs ± 0%  +236.41%  (p=0.000 n=9+10)
/10000-4      2.38ms ± 1%    7.72ms ± 1%  +224.42%  (p=0.000 n=9+10)
/100000-4     26.2ms ± 4%    80.2ms ± 1%  +206.81%  (p=0.000 n=10+10)
/1000000-4     274ms ± 2%     809ms ± 1%  +195.78%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     88.0B ± 0%   +37.50%  (p=0.000 n=10+10)
/1-4            256B ± 0%     1352B ± 0%  +428.12%  (p=0.000 n=10+10)
/10-4         1.39kB ± 0%    2.60kB ± 0%   +86.78%  (p=0.000 n=10+10)
/100-4        14.1kB ± 0%    14.8kB ± 0%    +5.40%  (p=0.000 n=10+10)
/1000-4        121kB ± 0%     157kB ± 0%   +29.37%  (p=0.000 n=10+10)
/10000-4      1.27MB ± 0%    1.54MB ± 0%   +20.99%  (p=0.000 n=10+10)
/100000-4     12.8MB ± 0%    15.4MB ± 0%   +19.81%  (p=0.000 n=10+10)
/1000000-4     128MB ± 0%     154MB ± 0%   +19.71%  (p=0.000 n=10+9)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/1-4            10.0 ± 0%      18.0 ± 0%   +80.00%  (p=0.000 n=10+10)
/10-4           74.0 ± 0%     101.0 ± 0%   +36.49%  (p=0.000 n=10+10)
/100-4           707 ± 0%       911 ± 0%   +28.85%  (p=0.000 n=10+10)
/1000-4        7.01k ± 0%     9.08k ± 0%   +29.46%  (p=0.000 n=10+10)
/10000-4       70.2k ± 0%     90.4k ± 0%   +28.87%  (p=0.000 n=10+10)
/100000-4       702k ± 0%      903k ± 0%   +28.75%  (p=0.000 n=10+10)
/1000000-4     7.02M ± 0%     9.03M ± 0%   +28.73%  (p=0.000 n=10+9)

Similar results are observed across all/most tests. Refer here for details.

I added a generic implementation of Queueimpl7 (fork), here is a PR containing the changes I did, the benchmark results suggest that the generic version is even faster.

Benchmark results

@lovromazgon, much appreciated for the contribution to impl7! However, Deque (which is effectively impl7's v2) supports generics now, and we're proposing to add Deque to the STL (instead of impl7).

Note that the next hurdle here for adding container types to the standard library is some plan for iterating over a generic container. We aren't going to add any generic containers to the standard library until that is settled, so that we aren't locked into an unusual iteration approach.

Sounds good, @ianlancetaylor . What's the current state of the conversation? Is there a proposal for it already?

@christianrpetrin Yes, related discussions are in #54245 and #56413.