golang/go

net: add mechanism to wait for readability on a TCPConn

bradfitz opened this issue · 124 comments

EDIT: this proposal has shifted. See #15735 (comment) below.

Old:

The net/http package needs a way to wait for readability on a TCPConn without actually reading from it. (See #15224)

http://golang.org/cl/22031 added such a mechanism, making Read(0 bytes) do a wait for readability, followed by returning (0, nil). But maybe that is strange. Windows already works like that, though. (See new tests in that CL)

Reconsider this for Go 1.8.

Maybe we could add a new method to TCPConn instead, like WaitRead.

CL https://golang.org/cl/23227 mentions this issue.

read(2) with a count of zero may be used to detect errors. The Linux man page confirms this, as does POSIX's read(3p). Mentioning it in case it influences the decision to have Read(0 bytes) not call syscall.Read.

I found a way to do without this in net/http, so punting to Go 1.9.

Actually, the more I think about this, I don't even want my idle HTTP/RPC goroutines to stick around blocked in a read call. In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.

What I'd really like is a way to register a func() to run when my *net.TCPConn is readable (when a Read call wouldn't block). By analogy, I want the time.AfterFunc efficiency of running a func in a goroutine later, rather than running a goroutine just to block in a time.Sleep.
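
For reference, the analogy in concrete form (time.AfterFunc is existing API; doWork is just a placeholder):

package main

import (
	"fmt"
	"time"
)

func doWork() { fmt.Println("fired") }

func main() {
	// Blocking style: a goroutine (and its stack) parks in time.Sleep.
	go func() {
		time.Sleep(100 * time.Millisecond)
		doWork()
	}()

	// time.AfterFunc style: no goroutine exists until the timer fires;
	// the runtime starts one only to run doWork.
	time.AfterFunc(100*time.Millisecond, doWork)

	time.Sleep(200 * time.Millisecond) // wait for both to fire
}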

My new proposal is more like:

package net

// OnReadable runs f in a new goroutine when c is readable;
// that is, when a call to c.Read will not block.
func (c *TCPConn) OnReadable(f func()) {
   // ...
}

Yes, maybe this is getting dangerously into event-based programming land.

Or maybe just the name ("OnWhatever") is offensive. Maybe there's something better.

I would use this in http, http2, and grpc.
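
A hypothetical usage sketch of that proposal (OnReadable does not exist; serverConn, rwc, and serveNextRequest are invented stand-ins for http server internals):

// Sketch only: OnReadable is the proposed API; serverConn, rwc, and
// serveNextRequest are invented names standing in for http server internals.
func (c *serverConn) waitForNextRequest() {
	// Instead of parking a goroutine (plus a read buffer) in c.rwc.Read for the
	// lifetime of an idle keep-alive connection, register a callback and return;
	// the runtime starts a goroutine only when bytes actually arrive.
	c.rwc.OnReadable(func() {
		c.serveNextRequest()
	})
}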

/cc @ianlancetaylor @rsc

Sounds like you are getting close to #15021.

I'm worried that the existence of such a method will encourage people to start writing their code as callbacks rather than as straightforward goroutines.

Yeah. I'm conflicted. I see the benefits and the opportunity for overuse.

If we do OnReadable(f func()), won't we need to fork half of the standard library for async style? The compress, io, tls, etc. readers all assume blocking style and require a blocked goroutine.
I don't see any way to push data asynchronously into e.g. gzip.Reader. Does this mean I have to choose between no blocked goroutine + my own gzip implementation, and a blocked goroutine + the standard library?

Re 0-sized reads.
That should work with level-triggered notifications, but netpoll uses epoll in edge-triggered mode (and kqueue, iirc). I am concerned whether cl/22031 works in more complex cases: waiting for already-ready IO, double wait, wait without completely draining the read buffer first, etc.

@dvyukov, no, we would only use OnReadable in very high-level places, like the http1 and http2 servers where we know the conn is expected to be idle for long periods of time. The rest of the code underneath would remain in the blocking style.

This looks like a half-measure. An http connection can halt in the middle of request...

@dvyukov, but not commonly. This would be an optimization for the common case.

An alternative interface could be to register a channel that will receive readiness notifications. The other camp wants this for packet-processing servers, where starting a goroutine for every packet would be too expensive. However, if at the end you want a goroutine, then the channel introduces unnecessary overhead.
A channel has a problem with overflow handling (netpoll can't block on send; on the other hand, it is not OK to lose notifications).
For completeness, this API should also handle writes.

We need to make sure that this works with Windows IOCP as well.

rsc commented

Not obvious to me why the API has to handle writes. The thing about reads is that until the data is ready for reading, you can use the memory for other work. If you're waiting to write data, that memory is not reusable (otherwise you'd lose the data you are waiting to write).

@rsc If we do just 0-sized reads, then write support is not necessary. However, if we do Brad's "My new proposal is more like": func (c *TCPConn) OnReadable(f func()), then this equally applies to writes as well -- to avoid 2 blocked goroutines per connection.

If memory usage is the concern, is it possible to make a long-parked G use less memory instead of changing the programming style? One main selling point of Go, to me, is high-efficiency network servers without resorting to callbacks.

Something like shrinking the stack, or having the GC move the stack to the heap using some heuristics, would be little different memory-wise from spinning up a new goroutine per callback, and scheduling-wise a callback is not much different from goready(). Also, I assume the liveness change in Go 1.8 could help here too.

For the backing array: if it is a preallocated buffer, then a callback doesn't make much difference compared to Read(); it may make some difference if the buffer is allocated per callback and comes from a pool.

Edit:
Actually, we could have some GC deadline or gopark time in runtime.pollDesc, so we could get a list of long-parked Gs from the poller and then the GC could kick in, but more work is still needed to avoid races and make it fast.

How about an epoll-like interface for net.Listener:

type PollableListener interface {
   net.Listener
   // Poll blocks until at least one connection is ready for read or write.
   // reads and writes are special net.Conns that will not block on EAGAIN.
   Poll() (reads []net.Conn, writes []net.Conn)
}

Then the caller of Poll() can have a small number of goroutines to poll for readiness and handle the reads and writes. This should also work well for packet-processing servers.

Note that this only needs to be implemented in the runtime for those Listeners that are multiplexed in the kernel, like net.TCPListener. Other protocols that multiplex in userspace and aren't attached to the runtime poller directly, like a UDP-based listener or streams multiplexed over a TCP connection, can be implemented outside the runtime. For example, for multiplexing within a TCP connection, we can implement the epoll-like behavior by reading from/writing to buffers and then polling them, or by registering callbacks on buffer-size changes. A consumer of the proposed interface is sketched below.
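
A hypothetical consumer of that interface could look like the following sketch (PollableListener is the proposed interface above; handle is an invented application callback passed in by the caller):

func serve(pl PollableListener, handle func(net.Conn, []byte)) {
	buf := make([]byte, 4096)
	for {
		// Blocks until at least one connection is ready.
		reads, _ := pl.Poll()
		for _, c := range reads {
			// Per the proposal, these conns do not block on EAGAIN.
			n, err := c.Read(buf)
			if err != nil {
				c.Close()
				continue
			}
			handle(c, buf[:n])
		}
	}
}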

Edit:
To implement this, we can let users of the runtime poller, like socket and os.File, provide a callback function pointer when opening the poller for an fd, to notify them of I/O readiness. The callback should
look like:

type IOReadyNotify func(mode int32)

And we store this in runtime.pollDesc; then the runtime.netpollready() function should also call this callback if it is not nil, besides handing out the pending goroutine(s).

I'm fairly new to Go, but seeing a callback interface is a little grating given the blocking API exposed everywhere else. Why not expose a public API to the netpoll interfaces?

Go provides no standard public-facing event loop (correct me if I'm wrong please). I need to wait for readability on external FFI socket(s) (given through cgo). It would be nice to re-use the existing netpoll abstraction to also register FFI sockets, rather than having to wrap epoll/IOCP/select. Also, I'm guessing that wrapping e.g. epoll from the sys package does not integrate with the scheduler, which would also be a bummer.

For a number of my use cases, something like this :

package net

// Readable returns a channel which can be read from whenever a call to c.Read
// would not block.
func (c *TCPConn) Readable() <-chan struct{} {
        // ...
}

.. would be nice because I can select on it. I have no idea whether it's practical to implement this though.

Another alternative (for some of my cases at least) might be somehow enabling reads to be canceled by using a context.
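
A sketch of how the two ideas would combine (Readable is the proposed method, not an existing one; handle is a placeholder passed in by the caller):

func readWithContext(ctx context.Context, conn *net.TCPConn, buf []byte, handle func([]byte)) error {
	select {
	case <-conn.Readable(): // proposed API; does not exist today
		n, err := conn.Read(buf) // should not block now
		if err != nil {
			return err
		}
		handle(buf[:n])
		return nil
	case <-ctx.Done():
		return ctx.Err() // the pending read is effectively canceled via the context
	}
}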

type ReadableSig struct {
    C *TCPConn
    I interface{}
}
// Readable marks the connection as waiting for read.
// It will send the connection to the channel when it becomes readable (or is closed by the peer),
// together with the additional information passed as the 'i' argument.
// Note: if the channel blocks, it will create a goroutine.
func (c *TCPConn) Readable(i interface{}, ch chan<- ReadableSig)

Or even better:

type ReadableNotify struct {
    Conn *TCPConn
    Ctx  context.Context
}
// Readable marks the connection as waiting for read and will send the connection
// to the channel when it becomes readable (or closed), together with the context.
// Internally it may spawn a goroutine. It will certainly spawn a goroutine if the
// channel blocks.
// Note: the connection always waits for readability, but if the context is closed then
// the notification may be omitted. If a notification is received, the Context should be
// checked manually.
func (c *TCPConn) Readable(ctx context.Context, ch chan<- ReadableNotify)

btw, can the runtime spawn a utility goroutine with a smaller stack than a user-facing goroutine?

Hi! Great idea 🎆

In our Go server with >2 million live WebSocket connections we used epoll to get the same functionality.

epoll.Handle(conn, EPOLLIN | EPOLLONESHOT, func(events uint) {
    // No more calls will be made for conn until we call epoll.Resume().
    ...
    epoll.Resume(conn)
})

That is, it could be useful to make the callback fire only once after registration.
In our case we used EPOLLONESHOT to make this work.

I want an API for a non-blocking Read, or a non-blocking readability test.
Neither a blocking wait nor a callback satisfies my use case.

My use case is a DB driver (go-sql-driver/mysql).
Basically, it is a request-response protocol. But sometimes the connection is closed by the server or by something in the middle.
If we send a query, we can't retry safely because the query may not be idempotent. So we want to check whether the connection
is closed before sending a query. Pseudo code is here.

func (mc mysqlConn) Query(query string, args ...driver.Value) error {
    if mc.conn.Readable() {
        // Connection is closed from server.
        return mc.handleServerGone()
    }

    mc.sendQuery(query, args...)
    return mc.recvResult()
}

Adding one or more goroutines would make the driver more complicated and slower.

@methane I'm sorry, I don't understand your example: if Readable returns true, I would not expect the connection to be closed. And, even if you omitted a !, if Readable returns false, that could just mean that the other side hasn't written any data; it doesn't mean that the connection is closed. Determining whether it is possible to read from a socket is not the same as determining whether the socket is closed.

@ianlancetaylor Note how we get EOF: we need to call conn.Read().

As I said above, what I want is "nonblocking Read, or nonblocking Readable test."

Won't a write on a closed/broken connection return an error? If yes, then you could just always write the request. If the write succeeded, the request was sent. If the write failed, the request wasn't sent.
This has obvious races, but it's the same as for your code.

@dvyukov No. Even though the remote peer closed the connection, conn.Write() will succeed.
That's because TCP allows "half close". (I haven't tested it on Unix sockets yet.)

If the database does not want to receive requests anymore, shouldn't it close the other half? The one that would make writes fail on the client?

@dvyukov The DB closes the full connection at the socket API level. But that doesn't affect TCP.
Until the client closes the connection (including a half close of the write side), Write() will succeed.

Yup, write fails as expected:

$ go run /tmp/test.go
0 Read 0 (EOF)
0 Written 6 (<nil>)
1 Read 0 (EOF)
1 Written 0 (write tcp 127.0.0.1:42246->127.0.0.1:39591: write: broken pipe)
2 Read 0 (EOF)
2 Written 0 (write tcp 127.0.0.1:42246->127.0.0.1:39591: write: broken pipe)
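
The test program itself is not shown in the thread; a minimal reconstruction that produces similar output (the server closes immediately, then the client alternates reads and writes) might be:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, _ := ln.Accept()
		c.Close() // server fully closes: the client receives a FIN
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // let the FIN arrive

	buf := make([]byte, 16)
	for i := 0; i < 3; i++ {
		n, err := conn.Read(buf)
		fmt.Printf("%d Read %d (%v)\n", i, n, err)
		n, err = conn.Write([]byte("hello\n"))
		fmt.Printf("%d Written %d (%v)\n", i, n, err)
		time.Sleep(100 * time.Millisecond) // give the RST time to come back
	}
}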

I would expect it to be the other way around. A readability test should always check the local read buffer first. So if some data has already been received but the remote end is closed, the readability test will always succeed, but a write will fail.

Yup, write fails as expected:

0 Read 0 (EOF)
0 Written 6 (<nil>)

You send 6 bytes, even after you received EOF.

Why does kernel do this? Is it a kernel bug? The connection is already in CLOSE_WAIT state. This makes read fail, but not write...

Because there is no difference between a half close (CloseWrite) and a full close at the TCP layer.
Both send just one FIN packet.
So the client doesn't know whether the remote peer will still receive data or not.

When the client sends a data packet, the server returns an RST packet.
That's why the second send returns EPIPE. Here is a tcpdump of the sample program.

tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
19:45:09.516488 IP (tos 0x0, ttl 64, id 21231, offset 0, flags [DF], proto TCP (6), length 60)
    localhost.33704 > localhost.11451: Flags [SEW], cksum 0xfe30 (incorrect -> 0x751c), seq 30788114, win 43690, options [mss 65495,sackOK,TS val 4103978601 ecr 0,nop,wscale 7], length 0
19:45:09.516497 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    localhost.11451 > localhost.33704: Flags [S.E], cksum 0xfe30 (incorrect -> 0x0a34), seq 2698581878, ack 30788115, win 43690, options [mss 65495,sackOK,TS val 4103978601 ecr 4103978601,nop,wscale 7], length 0
19:45:09.516505 IP (tos 0x0, ttl 64, id 21232, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.33704 > localhost.11451: Flags [.], cksum 0xfe28 (incorrect -> 0xdcb8), ack 1, win 342, options [nop,nop,TS val 4103978601 ecr 4103978601], length 0
19:45:09.516529 IP (tos 0x0, ttl 64, id 36746, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.11451 > localhost.33704: Flags [F.], cksum 0xfe28 (incorrect -> 0xdcb7), seq 1, ack 1, win 342, options [nop,nop,TS val 4103978601 ecr 4103978601], length 0
19:45:09.616632 IP (tos 0x2,ECT(0), ttl 64, id 21234, offset 0, flags [DF], proto TCP (6), length 58)
    localhost.33704 > localhost.11451: Flags [P.], cksum 0xfe2e (incorrect -> 0x9869), seq 1:7, ack 2, win 342, options [nop,nop,TS val 4103978701 ecr 4103978601], length 6
19:45:09.616641 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    localhost.11451 > localhost.33704: Flags [R], cksum 0x4929 (correct), seq 2698581880, win 0, length 0

Interesting. Thanks.

But readability is also not what you want, right? Readability will say OK if the connection is half closed the other way, or fully closed but with local read data queued.

"Readability" meant "conn.Read() will return without block".
MySQL protocol is basically "request-response" protocol. Until we send a query, server don't send
anything in most case. But there are exceptions: server shut down, server (or Linux or router) closed idle connection, etc.

In such cases, we don't want send query because query may be not idempotent.
That's why I want non-blocking Read.

@DemiMarie It's off topic.

Wow, it takes so long to make such a decision.
For a long-lived-connection TCP server with more than c10k connections, the one-goroutine-per-connection model wastes a lot of memory and CPU.

On #29947, I suggested a variant of make(chan) which takes blockable io.Reader objects and provides the next readable object via <-c. This would work with select, and avoids callback drawbacks.

func f(list ...net.Conn) error {
   c := make(chan, list...)
   buf := make([]byte, kSize)

   for r := range c {
      len, err := r.Read(buf)
      if err != nil { return err }
      if something(buf[:len]) {
         c <- r // continue waiting on this channel
      }
   }
}

What if instead of an explicit callback API we use something channel based? We could add a method NotifyWhenReadable to os.File and net.TCPConn, etc.

func (t *TCPConn) NotifyWhenReadable(c chan<- interface{}, val interface{})

When the connection is readable, val is sent to c. This would happen once. If you wanted another notification, you would have to call NotifyWhenReadable again. Naturally the channel send would be non-blocking, and the channel should be buffered.

This is a little bit like signal.Notify.

This permits a callback API but doesn't guide people toward it. A goroutine could wait for its connection to be readable easily enough via

    ch := make(chan interface{}, 1)
    conn.NotifyWhenReadable(ch, nil)
    <-ch

or you could have a single server goroutine

    for f := range ch {
        go f.(func())()
    }

and each separate connection would

    conn.NotifyWhenReadable(ch, func() { /* do the next thing */ })
    return

Naturally the channel send would be non-blocking

How are lost values supposed to be handled? If we drop a notification, will part of the program stay deadlocked forever?

Here @prattmic says the reason gvisor avoids standard network polling is not memory consumption, but rather latency:
#29734 (comment)
So I think we need to consider this aspect as well and if possible not introduce inherent latency penalty.

Since a notification is sent at most once per call to NotifyWhenReadable, it's easy to give the channel a large enough buffer to never lose any notifications.

I think gVisor is kind of a special case. I think most large servers care more about memory. But of course if we can fix both that would be better.

Since a notification is sent at most once per call to NotifyWhenReadable, it's easy to give the channel a large enough buffer to never lose any notifications.

You mean for the channel-per-connection scheme? If a single channel is used for all connections, it would require knowing some hard upper bound on the number of connections.

If one of the recommended schemes is to use a single channel, I suspect it will become a scalability bottleneck. Of course users can start partitioning the channel into N independent channels, but then it will have a problem with load balancing: either some partitions will be overloaded and/or bad locality and/or preallocataing N times more memory for channel buffers.

I also wonder if we will be able to do this "exactly one non-spurious notification". I think C networking servers will do own polling, but then always followed by non-blocking reads/writes. If we leave reads/writes blocking, then it will bring very unique requirements for the polling mechanism.

E.g. the current linux polling scheme is always armed edge-triggered epoll. It by design generates spurious notifications.

Why not just add a callback interface? That is, with the same semantics @ianlancetaylor suggested – with re-registration of the callback for the next readable event after one occurs. Inside the callback the user could define their own strategy for handling the event, probably dispatching tasks among some free goroutines.

Also regarding this comment from @dvyukov:

If we do OnReadable(f func()), won't we need to fork half of the standard library for async style? The compress, io, tls, etc. readers all assume blocking style and require a blocked goroutine.

I think this may not affect wrappers around a raw conn in general – we could subscribe to the readable event on the conn and read from the wrapper, hoping that it would not block except in split cases, when the available data is not sufficient for the wrapper's decoding/processing.

@gobwas A callback interface is what @bradfitz suggested above. My main concern with a callback interface is, simply, it's not typical Go style. I was hoping to explore alternatives.

And to be clear to others (Ian knows this), in the same comment where I suggested that above, I wrote:

Yes, maybe this is getting dangerously into event-based programming land.

Which is another way of saying "not typical Go style"

Unfortunately, "not typical Go style" is the thing people have to fall into when they want top performance and/or millions of connections per process.
https://medium.freecodecamp.org/million-websockets-and-go-cc58418460bb
https://github.com/eranyanay/1m-go-websockets

Typical Go style is quite convenient for high-level and medium-level logic, and quite inefficient at the low level.

@ianlancetaylor as exploration, maybe exporting a network poller API would be a better idea than a callback? I mean something like this:

import "net/poll"

p, err := poll.Create(...) // epoll_create() on Linux.
if err != nil {
    // handle error
}
go func() {
    for ev := range p.Wait() { // epoll_wait on Linux.
        ev.Conn.Read(...)
        ev.Data // User data bound within Notify() call.
    }
}()
conn, err := net.Dial(...)
if err != nil {
    // handle error
}
if err := p.Notify(poll.Read, conn, data); err != nil { // epoll_ctl on Linux.
    // handle error
}

Thus, we may leave the general poller untouched (and non-blocking), but provide an additional polling mechanism which will block when the Wait() channel stops being drained.

Also the Notify() API may differ, e.g. accepting different channels to write events to, as you suggested, but in a blocking manner – without event loss and with a strong notice in the method comment =)

I also wonder if we will be able to do this "exactly one non-spurious notification". I think C networking servers will do own polling, but then always followed by non-blocking reads/writes. If we leave reads/writes blocking, then it will bring very unique requirements for the polling mechanism.

@dvyukov, would it work to do a non-blocking read of a byte or two internally to verify readability before raising a notification? That adds at least one syscall to the overhead, but for a network or storage situation, maybe that's OK.

In the spirit of epoll, it would be helpful to get notifications for an arbitrary set of io.Reader objects via a single channel, since we cannot select on a runtime-defined set of channels.

would it work to do a non-blocking read of a byte or two internally to verify readability before raising a notification?

This looks dirty and slow and in the end there can be just 1 byte. So it was readable, but not now anymore ;)

In the spirit of epoll, it would be helpful to get notifications for an arbitrary set of io.Reader objects via a single channel, since we cannot select on a runtime-defined set of channels.

I think this is supported by the Ian's proposal, no?

Then let r.Read() be non-blocking after r.NotifyWhenReadable() and possibly return error io.WouldBlock?

@networkimprov JFYI there is an ability to Select() on a runtime-defined set of channels via the reflect API.
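
For completeness, a small example of that reflect-based select over a runtime-defined set of channels (existing API, though it is noticeably slower than a static select):

package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Build select cases for a set of channels whose size is only known at runtime.
	chans := make([]chan string, 3)
	cases := make([]reflect.SelectCase, len(chans))
	for i := range chans {
		chans[i] = make(chan string, 1)
		cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(chans[i])}
	}
	chans[1] <- "notification for conn 1"

	chosen, value, ok := reflect.Select(cases) // blocks until one case is ready
	fmt.Println(chosen, value.Interface(), ok) // 1 notification for conn 1 true
}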

@networkimprov Having a non-blocking Read method amounts to the C API. It's appropriate for single-threaded programs but it's not a good choice for Go.

@gobwas I think that in practice using a poller is going to lead people to write code using callback APIs. It's the most natural way to use a poller.

Do you have a plan to deal with the issue Dmitry raised?

@networkimprov Which issue? If you mean the problem with possibly losing a channel notification, then of course he is quite right, and the channel send must be a blocking send. Depending on the implementation, this may require spinning up a goroutine to do the send.

No, I was referring to

I also wonder if we will be able to do this "exactly one non-spurious notification". I think C networking servers will do own polling, but then always followed by non-blocking reads/writes. If we leave reads/writes blocking, then it will bring very unique requirements for the polling mechanism. E.g. the current linux polling scheme is always armed edge-triggered epoll. It by design generates spurious notifications.

How do you deal with this without letting r.Read() be non-blocking after r.NotifyWhenReadable() and possibly returning error io.WouldBlock?

The problem I was trying to address was "large server with a lot of goroutines burns a lot of memory in buffers waiting to read." For that purpose it's OK if a spurious notification leads to a blocking Read, as long as spurious notifications are rare. But maybe I'm missing something.

Brad was trying to get rid of the idle goroutines: #15735 (comment)

If that's not a requirement, an alternate read API could internally poll, malloc/free, and do non-blocking reads, and return a new buffer on success. An internal buffer pool would avoid thrashing the allocator due to fruitless reads after spurious poll events.

func (r *T) ReadSize(maxLen int) ([]byte, error)

My channel suggestion also permits getting rid of the idle goroutines, for people who want to write a more complex program.

But it does incur a read-buffer + idle goroutine for every spurious poll event, which could be costly.

@dvyukov, any idea as to the proportion of spurious epoll wakeups in Linux?

More on epoll difficulties (from 2y ago):
https://news.ycombinator.com/item?id=13736674

@dvyukov, any idea as to the proportion of spurious epoll wakeups in Linux?

No, I don't.
But if a solution relies on absence of spurious wakeups for correctness/performance, I guess we need these numbers for all OSes before making a decision.

I am trying to understand when a TCP server has closed the connection, as I am not receiving errors on my reads.
So if I understand correctly, I am stuck: I cannot do a 0-byte read to detect whether the connection is closed, and I cannot read 1 byte because my server never writes anything and the read would block.

Is there any reliable way to detect that a TCP connection was closed by the server, without relying on the server writing something on the connection?

Also see https://stackoverflow.com/questions/55544914/cannot-detect-that-a-tcp-service-in-kubernetes-has-no-pods-with-golang-app?answertab=oldest#tab-top

FYI, we implemented this in mysql driver.
https://github.com/go-sql-driver/mysql/blob/master/conncheck.go

Note that it works only on Unix, and it causes several allocations.
I really want this implemented in the stdlib in an efficient (and preferably cross-platform) way.
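
For reference, the core of that check is a non-blocking one-byte read through syscall.RawConn. The sketch below is modeled on the linked conncheck.go (not a verbatim copy) and is Unix-only:

package conncheck // illustrative sketch; Unix-only

import (
	"errors"
	"io"
	"net"
	"syscall"
)

var errUnexpectedRead = errors.New("unexpected read from socket")

// connCheck reports whether conn still looks alive, without blocking.
func connCheck(conn net.Conn) error {
	sysConn, ok := conn.(syscall.Conn)
	if !ok {
		return nil // not a socket we can check
	}
	rawConn, err := sysConn.SyscallConn()
	if err != nil {
		return err
	}
	var checkErr error
	err = rawConn.Read(func(fd uintptr) bool {
		var buf [1]byte
		n, err := syscall.Read(int(fd), buf[:])
		switch {
		case n == 0 && err == nil:
			checkErr = io.EOF // peer closed the connection
		case n > 0:
			checkErr = errUnexpectedRead // unsolicited data: protocol violation
		case err == syscall.EAGAIN || err == syscall.EWOULDBLOCK:
			checkErr = nil // nothing to read: connection looks alive
		default:
			checkErr = err
		}
		return true // do not wait for readability; return immediately
	})
	if err != nil {
		return err
	}
	return checkErr
}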

It's highly useful for reducing overhead when a Go program holds a lot of connections (a connection pool); right now I'm using the epoll mechanisms to do this, which is not idiomatic.

Looking forward to this landing in Go, many thanks :)

@ianlancetaylor + @bradfitz a typical problem I have in an HTTP proxy is that connection spikes can create spikes in memory usage. I think this can be fixed by using epoll, and I hope your approach will cover the problem. We would need to be able to set the max concurrency level for the goroutine calls that read and write from/to sockets.
We have an internal protection mechanism to avoid this problem, but you can see the memory spike in bufio:

[memory usage graph showing the spike in bufio]

xtaci commented

I badly need a readability notification in my project, as I have to pre-allocate a 4KB buffer per connection before conn.Read([]byte), just like io.Copy does:

https://golang.org/src/io/io.go?s=12796:12856#L399

UPDATE:
solved this by RawConn:
https://github.com/xtaci/kcptun/blob/v20191219/generic/rawcopy_unix.go

eloff commented

So there is RawConn now, which has an interface very much like what @bradfitz was proposing here. However, it calls the read callback before waiting for readability. It must do this, as the net poller uses edge-triggered events - they won't fire if there is already data on the socket.

One workaround is to use a small stack buffer for the initial Read, and then, when that reads some data, allocate a real buffer, copy the data into it, and call Read again. That helps, but you'll still have the goroutine's 4KB stack overhead.
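
A sketch of that workaround (process is a stand-in for the real handler; note the second Read still blocks until more bytes arrive, so this fits protocols whose messages are longer than one byte or where framing is handled elsewhere):

// Park in Read with a 1-byte buffer while the connection is idle, and allocate
// the full-size buffer only once data starts arriving.
func readLoop(conn net.Conn, process func([]byte)) error {
	for {
		var first [1]byte
		if _, err := conn.Read(first[:]); err != nil {
			return err // includes io.EOF when the peer closes
		}
		buf := make([]byte, 4096) // allocated only while the connection is active
		buf[0] = first[0]
		n, err := conn.Read(buf[1:]) // continue with the rest of the message
		if err != nil {
			return err
		}
		process(buf[:1+n])
	}
}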

Another option is to use RawConn.Control with unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_DEFER_ACCEPT, 1) for connections like HTTP where the client must send data first: the connection isn't handed to Accept (so no handler goroutine is created) until there's data buffered, but that only helps for the initial Read. It can be combined with the stack buffer approach for long-lived connections.
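
A sketch of setting that option on the listening socket through syscall.RawConn (Linux-only, requires golang.org/x/sys/unix; TCP_DEFER_ACCEPT's value is a timeout in seconds, and it applies to the listener, so accepted connections are only handed to Accept once the client has sent data):

package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	ln, err := net.Listen("tcp", "localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	raw, err := ln.(*net.TCPListener).SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		// Wait up to 1 second for the client to send data before completing the accept.
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_DEFER_ACCEPT, 1)
	}); err != nil {
		log.Fatal(err)
	}
	if sockErr != nil {
		log.Fatal(sockErr)
	}
	// ... Accept loop as usual ...
}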

I'd still like to see an approach (maybe an async OnReadable callback method on RawConn, like bradfitz is proposing here?) to avoid having the goroutine overhead. Mail.ru avoids the net poller entirely and manually manages epoll for precisely this reason, as 4KB per WebSocket connection with millions of open but mostly idle WebSockets waiting for new mail is just too much overhead.

xtaci commented

I'd still like to see an approach (maybe an async OnReadable callback method on RawConn, like bradfitz is proposing here?) to avoid having the goroutine overhead. Mail.ru avoids the net poller entirely and manually manages epoll for precisely this reason, as 4KB per WebSocket connection with millions of open but mostly idle WebSockets waiting for new mail is just too much overhead.

I don't think an OnReadable callback is a good idea. If I have 2 handlers to toggle based on incoming data, then it's not impossible to end up with this kind of nested callbacks which reference each other.

For that reason, in order to write comprehensive logic we have to copy the buffer from the callback out to another goroutine; in that case memory usage will be out of control, as we lose the ability to back-pressure the congestion signal to senders. (Before, we wouldn't start the next net.Read() until we had processed the data.)

Even for a callback like func() { Read(); Process(); Write() }, if Write() blocks, the callback still holds a 4KB buffer waiting to be sent.

In all of these cases, we still have inactive 4KB buffers somewhere.

xtaci commented

I wrote a library based on ideas above, golang async-io

https://github.com/xtaci/gaio

For my particular use case, having a non-blocking Read is highly desirable: a chat service aiming to manage millions of connections per instance. The blocking interface forces an alive goroutine per connection, which makes this goal totally unrealistic.

I wrote a library based on ideas above, golang async-io

https://github.com/xtaci/gaio

And it works great - it uses an epoll event handler and callbacks to solve the readability and writability issues discussed here (if I'm not mistaken ... lol).
I use it instead of a multiplexer for a message proxy component which handles one-to-many client connections for inbound and outbound traffic.
It has a batch style of message processing which requires a message reader and a frame-size-delimited message protocol. That is, the message reader assembles a complete message from one or more partial messages provided in each event result buffer.

I wrote another async-io lib, nbio, to avoid using one or more goroutines per connection and to reduce memory usage.
It supports HTTP 1.x and is basically compatible with net/http. Many net/http-based web frameworks can be easily converted to use the async-io network layer.
It also supports TLS and WebSocket.
I am trying to increase the load capacity of a single piece of hardware; 1m/1000k connections would not be that hard anymore for Go.

https://github.com/lesismal/nbio

In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.

If the problem is the size of the idle goroutine stack, would it make sense to have the runtime recognize goroutines that are likely to block ~immediately and allocate a much smaller initial stack for them?

@bcmills, maybe that could be a new proposal?

In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.

If the problem is the size of the idle goroutine stack, would it make sense to have the runtime recognize goroutines that are likely to block ~immediately and allocate a much smaller initial stack for them?

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

At least, if we had goroutine memory pressure under control, the read buffer issue could be tackled by extending the net.Conn API:

conn := getTLSConn()

_, err := conn.WaitRead() // block until there's available content
if err != nil {
    return err
}
rdBuff := grabReadBuffer()
defer releaseReadBuffer(rdBuff)

n, err := conn.ReadAvailable(rdBuff)
if err != nil {
    return err
}
handleContent(rdBuff[:n])
return nil

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

At least, if we had goroutine memory pressure under control, the read buffer issue could be tackled by extending the net.Conn API: [...]

I wrote an async-io lib, nbio. I also forked the std tls package and rewrote it to support non-blocking use.
Here are examples that save more memory:
lesismal/nbio#62 (comment)

In my simple echo test with 100k WebSocket TLS connections, compared with std-based frameworks:

solution   qps                            cpu          memory
nbio       avg 110-120k, running fine     around 300%  1.3G
std        avg 60-80k, with obvious STW   around 300%  3.3G
CAFxX commented

If I may voice a contrary opinion, I would argue that there are other ways to obtain the same benefits (minimize space wasted on parked goroutine stacks and on buffers that are waiting for data to be written into them) without unleashing callbacks and/or readability signals.

One such way could be to have buffers be decommitted from memory before submitting them for a blocking read (I haven't tested this, but AFAIK it should work). Right now this would require an additional syscall (e.g. madvise) before the read, but once/if #31908 comes around that won't be required anymore (as the madvise and read could be submitted in a single syscall). Given page-aligned buffers of size greater than a page, this should transparently achieve the goal at least for the buffers.
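
A rough sketch of the buffer half of that idea with today's APIs (assumes buf is page-aligned and a multiple of the page size, e.g. allocated via unix.Mmap; Linux-only and untested, mirroring the caveat above):

package decommit // illustrative only; Linux, requires golang.org/x/sys/unix

import (
	"net"

	"golang.org/x/sys/unix"
)

// readDecommitted releases the physical pages backing buf before parking in a
// blocking Read, so an idle connection doesn't pin the buffer's memory.
// Assumes buf is page-aligned and a multiple of the page size (e.g. allocated
// with unix.Mmap); the kernel supplies zero pages again as Read fills buf.
func readDecommitted(conn net.Conn, buf []byte) (int, error) {
	if err := unix.Madvise(buf, unix.MADV_DONTNEED); err != nil {
		return 0, err
	}
	return conn.Read(buf) // may park here for a long time without resident pages
}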

For the goroutine stack, something similarly transparent could also be achieved: as part of GC we already shrink the stacks of goroutines, if they are too big. It is not impossible to imagine extending this mechanism to detect goroutines that have been parked for a while waiting for I/O and "freeze" their stacks, aggressively packing together their contents (without maintaining the minimum 4KB size), and freeing the stacks themselves. When the goroutine needs to be unparked a new stack of the appropriate size would be allocated, the frozen stack contents would be copied into the newly allocated stack and the goroutine would resume. This should transparently achieve the goal for the stacks of goroutines blocked on I/O.

I won't deny that there is a lot of handwaving in this comment: in my defense, I am not trying to provide a fully fleshed-out design. I'm just trying to point out that there may be other possibilities to achieve the same goals besides the mechanisms being discussed in the rest of this issue. Because these mechanisms, as has been pointed out already, are pretty non-go-like I would hope that any alternative that may achieve transparently the desirable goals set out in this issue is fully exhausted before more extreme options are implemented.

There are several potential issues with madvise:

  1. It conflicts with large 2M pages.
  2. The buffers are not necessarily page-aligned.
  3. The contents are not necessarily discardable (for a short read, the rest of the buffer must be left intact).
CAFxX commented

(sorry for being off topic, I will not add further comments in this issue; if needed I will update this comment to address further replies to avoid polluting the discussion with design considerations of a potential counterproposal)


IIUC 1 and 2 have multiple potential solutions (document that only pagesize-aligned buffers can benefit from this optimization, teach the memory allocator to handle pagesize-sized allocations especially, provide an allocator function specialized for this purpose, suggest to use off-go-heap memory, ...) so possibly these are not a showstopper (unless I'm missing something). (update: furthermore, it seems that in practice page-sized allocations are already page-aligned)

3 OTOH could be a showstopper (limited to the buffer part), but I'm confused. The io.Reader docs seem to indicate that it is not the case that the rest of the buffer must be left intact after a short read:

type Reader interface {
	Read(p []byte) (n int, err error)
}

[...] Even if Read returns n < len(p), it may use all of p as scratch space during the call. [...]

It only says that the space beyond n can be used as scratch space, but it does not say whether the contents of p[n:] are restored after the area has been used as scratch space. I always interpreted that to mean that data in p[n:] may not contain the original contents when Read returns, otherwise to be able to use all of p as scratch space Read would always require yet another scratch space to temporarily store a copy of p just in case of a short read (but then what would be the point of using p as the scratch space?). Furthermore, the caller can know n only once Read returns, so the "during" in the doc can not just refer as a means to avoid data races in which both Read and some other goroutine both access the same slice. Did I get this wrong all this time?

[...] Even if Read returns n < len(p), it may use all of p as scratch space during the call. [...]

I think you are right. I just assumed it should not change what's not returned, following the principle of least surprise.

I am surprised to find out that there is no non-blocking way to read network connections in the stdlib, while Go is the most network-centric language out there. How come there has been no progress since 2016 and one has to look for 3rd-party libraries (none of which work on Windows)?

@ivanjaros The Go runtime uses non-blocking I/O internally, so that Go programs don't have to.

Note that this issue is not about non-blocking I/O. It's about a way for a goroutine to wait for a connection to be readable without having to actually read from it.

@ianlancetaylor my point was that this issue has been ongoing for years but we still have to resort to 3rd-party libraries when we want to seek the bleeding edge of performance over DX, through any of:

https://github.com/tidwall/evio
https://github.com/panjf2000/gnet
https://github.com/hslam/netpoll
https://github.com/xtaci/gaio
https://github.com/Allenxuxu/gev
https://github.com/pigogo/netgo
https://github.com/npat-efault/poller
https://github.com/alberliu/gn
https://github.com/smallnest/1m-go-tcp-server
https://github.com/lesismal/nbio

On the other hand, since there are these options, the entire discussion might have become obsolete 🀷‍♂️

For me personally, this was never a problem or a need. This performance is needed only in very niche situations where you have to serve a lot of idling connections: for example an IMAP email server or a chat server where you can have millions of people/devices connected but there is mostly no data transmission going on. There you will feel the pain of goroutines (memory + context switching). But in modern times, these niche examples are becoming more mainstream needs, and so they become an issue. Hence why I think it is worth addressing. Everyone wants to be connected 24/7 and to use multiple devices, from their phone through their computer to their fridge. So devs make various servers, services and applications to serve these needs, and so they might be forced to use other languages like Rust or other programs like HAProxy, Redis... instead of writing what they need in Go and keeping the code base unified.

@ivanjaros
It seems only my lib nbio supports async-io and async parsers for tls/http1.x/websocket so far, but it wasn't in your list and few people know about it...

Note that this issue is not about non-blocking I/O. It's about a way for a goroutine to wait for a connection to be readable without having to actually read from it.

@ianlancetaylor If Go supports a readable callback, conn.Read and conn.Write should not block anymore; otherwise users still need at least one goroutine to handle the Read/Write logic, and then it would be no different from the current std solution in reducing the cost of lots of goroutines.

Sounds like you are getting close to #15021.

I'm worried that the existence of such a method will encourage people to start writing their code as callbacks rather than as straightforward goroutines.

@ianlancetaylor

Definitely agree that async-io needs to be used with callbacks, in which case we also need to consider memory optimization for the half-packet byte cache.
But only the parsing part needs to use callbacks, such as the http parser.

The application-layer users can still keep the synchronous style as std does: async-io and a callback-based parser assemble the complete packet and pass it to a size-limited goroutine pool for business-logic processing.
Actually, my lib has already implemented it like this and is basically compatible with net/http. A simple hello-world server is below:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/lesismal/nbio/nbhttp"
)

// This handler will be executed in a goroutine pool separate from the io goroutine pool.
// It will not block the io logic even if there is some blocking logic in this handler,
// such as database operations.
// Since we can customize the goroutine pool, we can set an appropriate size for it,
// and the program will be able to keep a healthy performance/cost balance
// even if there are 1m connections.
func HelloWorld(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "hello world")
}

func main() {
	router := &http.ServeMux{}
	router.HandleFunc("/", HelloWorld)
	// also can use gin
	// router := gin.New()
	// router.GET("/", ...)
	// or echo
	// router := echo.New()
	// router.GET("/hello", ...)

	svr := nbhttp.NewEngine(nbhttp.Config{
		Network: "tcp",
		Addrs:   []string{"localhost:8080"},
		Handler: router,
	})

	err := svr.Start()
	if err != nil {
		log.Fatalf("nbio.Start failed: %v\n", err)
		return
	}
	defer svr.Stop()

	<-make(chan int)
}

And it can even be used with other std-based frameworks like gin or echo; more examples:

I am sorry if it is not suitable to mention my own lib here, but I think this [async-io] + [callback parsing] + [goroutine pool] solution really allows us both to get the resource cost under control and to write synchronous logic.
We can have our cake and eat it, too.
It has been verified in load testing and in some of my projects.

There is waiting for a TCP connection to have data available, which is this issue. And there is non-blocking I/O. They are not the same thing.

There is waiting for a TCP connection to have data available, which is this issue. And there is non-blocking I/O. They are not the same thing.

We know that it is non-blocking io underneath, but the currently provided interfaces such as conn.Read/Write are blocking interfaces for the application layer.
If non-blocking Read/Write interfaces are not provided together with Readable, then for the application layer it would be the same as the current std.

@lesismal can you provide the data? Maybe it's more convincing to show data.

@ianlancetaylor If we would consider having a fixed-size goroutine pool (as suggested in #15735 (comment)) for handlers, which could be set by some configuration, it would also fix the unbounded memory usage issue in #35407.

@szuecs I've provided some test results in an earlier comment; please check it, or better, try the code in your own environment:
#15735 (comment)

We know that it is non-blocking io underneath, but the currently provided interfaces such as conn.Read/Write are blocking interfaces for the application layer.

As you probably know, they block the goroutine that calls them. They don't block any threads. Goroutines are cheap. Programs that need to avoid blocking goroutines are a special case. And as noted above, it is already possible to use non-blocking I/O in Go.

As you probably know, they block the goroutine that calls them. They don't block any threads. Goroutines are cheap. Programs that need to avoid blocking goroutines are a special case. And as noted above, it is already possible to use non-blocking I/O in Go.

Yes, I know that they don't block any threads.
But when using std with millions of connections, that huge number of goroutines is not as cheap as when there are only thousands of connections. That's why there are many third-party libs doing this, as @ivanjaros listed:
#15735 (comment)

In a 1m-connections test:
Go needs at least 8G of memory (8k * 1m), and the CPU cost of both scheduling and GC is also huge, so we can't deploy this service on a VM/hardware with low-to-middle specifications, or we run into both OOM and STW problems.
Compared to other languages in this test, Go performs even worse than Java-Netty and Node.js, with several times the memory cost, which is quite unexpected.

As a cloud-native language, Go will handle more and more connections in basic frameworks. The more we can reduce the number of goroutines, the more hardware, energy, and money we can save.

Not joking, but sincerely πŸ˜‚:
For environmental protection!
For the earth!

My opinion is that if Readable is provided, non-blocking Read/Write interfaces should be provided together with it; otherwise Readable will not be useful.

When PHP 7 came out, I think a couple of years ago, Rasmus (the PHP author) gave a talk at some PHP conference where he showed how much electricity the better performance of the new version saved, due to lower demand on hardware and hence lower economic costs in general. So this is not as outlandish as it might seem. As pointed out above, the cost savings are already impressive (#15735 (comment)), and with a stdlib implementation we might get even better CPU performance.

d2gr commented

All of the implementations listed above mostly use a reactor pattern, not a proactor pattern. See the proactor pattern. (boost::asio is a proactor lib.)

If I am not mistaken, the proposal here is to add readability notifications, which would be implemented using callbacks and would make Go support some kind of semi-proactor interface. In my opinion this is fine, and many people have been looking forward to this, but I think the main problem is going to be the callbacks.
In a high-frequency environment where we call OnReadable more than 1M times per second, we are going to create a function object every time (because closures are structs), which will be allocated on the heap because it escapes the stack.

I have implemented a library myself that uses a true proactor pattern (not open-sourced yet), and it performs well, but not better than Go, mainly because the heap allocations rise and fall every second (given the callback problem above).
Libraries that implement a proactor pattern might improve TLS & HTTP/2 performance, given that the tls package has locks everywhere and HTTP/2 in Go is full of channels and mutexes (not criticism; both libraries are very well done).

Summary: if this issue goes forward and a PR is presented, are we going to see closures allocated on the stack? In C++, lambdas are just objects (like in Go) but they are moved from stack to stack; afaik they are never allocated on the heap (unlike Go).
I know that callbacks might not always allocate fresh heap memory - if the object is less than 32KiB, mallocgc reuses memory that is available - but it still reduces performance.