gorilla/websocket

[feature] Avoid unnecessary heap allocations

deanm opened this issue · 16 comments

deanm commented

c.messageReader = &messageReader{c}

In our profiles this is the number one source of heap allocations, because a messageReader is allocated for every single message read. A messageReader is nothing more than a *Conn so this really doesn't seem necessary to be heap allocated.

I think a very simple solution without changing the code much would be to have a messageReader value (not pointer) stored on the Conn that can be re-used instead of a heap-allocated object. The messageReader and reader pointers could still exist and would just point to the address of the value on the Conn. This would work fine as long as there are never multiple readers in flight and a reader never needs to outlive its time as .reader/.messageReader; I don't know the code well enough to be sure, but I assume neither is the case.
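Roughly, the idea would look something like this; a minimal sketch with made-up names, not the library's actual internals:

package sketch

import "io"

// Simplified stand-ins for the library's types; field and method names here
// are illustrative, not the real gorilla/websocket internals.
type Conn struct {
	reader        io.Reader      // reader handed out by the last NextReader call
	messageReader *messageReader // identity used to detect a stale reader
	readerValue   messageReader  // reused storage that lives as long as the Conn
}

type messageReader struct{ c *Conn }

func (r *messageReader) Read(p []byte) (int, error) {
	if r.c.messageReader != r {
		return 0, io.EOF // the existing identity check still compiles unchanged
	}
	// ... read the current message payload from the underlying connection ...
	return 0, io.EOF
}

// nextReader resets the inline value and returns its address, so no
// per-message heap allocation is needed.
func (c *Conn) nextReader() io.Reader {
	c.readerValue = messageReader{c: c}
	c.messageReader = &c.readerValue
	c.reader = c.messageReader
	return c.reader
}

The trade-off is that the address is now the same for every message, so the c.messageReader != r comparison can no longer tell an old reader apart from the current one, which is the concern discussed below.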

I'm happy to make a patch for this.

deanm commented

It seems #134 is related, and what I proposed is similar to the discussion there. Seems like nothing was ever done though?

deanm commented

I think part of the concern was caching the value would subvert the purpose of this check:

if c.messageReader != r {

To make sure that a stale reader wasn't used. I don't know enough about the code to know if/when this could actually be possible, or if it's just some defensive programming in case some bug could crop up someday. But completely dominating the allocations in our application, and applying a lot of GC pressure, to address such a scenario is definitely not a good set of tradeoffs, so perhaps an opt-in, NextReaderIKnowWhatImDoing(), would be a possible path. I know it's not pretty but neither is unnecessary GC.

I ran into this writing https://github.com/nhooyr/websocket as well.

My solution was to keep all reader state in the Conn except for the state necessary to check if a stale reader is being used which is a single bit. In total each reader allocates only 16 bytes every time which imo is a negligible amount of allocation. Gorilla could take a similar approach to reduce GC pressure while maintaining the safety of the check.

edit: Gorilla's solution is much more clever and only allocates 8 bytes. I'll probably be using it instead at some point. So ignore my paragraph above please.

@deanm It's defensive programming for sure, but I could easily see someone making this mistake; I've made similar mistakes in the past where I accidentally assigned to one variable and then used a different one. Static typing helps but doesn't always catch such mistakes, especially with Go's := which makes it so easy to declare new variables.

I'm surprised you're seeing this affect you. It's only 8 bytes per Reader, how many messages are you processing a second?

Interesting how the Writer allocates a lot more though and doesn't take the clever approach the Reader does.

type messageWriter struct {

Actually I think you're right, you can't really cause a bug by using the same Reader unless you're using it concurrently. Since every Reader returned by Reader is identical and has no state, they are all capable of reading the exact same data from the connection, so you can use any of them.

The only time you could introduce a bug is if you try to read concurrently from multiple Readers, an old one and a new one. Or you try to read a bit from an old one into a different variable and then some from a new one, which imo is pretty much the same as reading concurrently.

I'm not sure how such a bug could manifest, considering most usages have the Reader locally scoped to the current function, and certainly the current goroutine, so right now I am in favour of making Reader zero alloc. But I still want to sit on it.
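For reference, the common read pattern looks roughly like this (handle is a stand-in for whatever the application does with a message):

package sketch

import (
	"io"

	"github.com/gorilla/websocket"
)

// readLoop shows the usual pattern: each reader returned by NextReader is
// scoped to one loop iteration and never retained, so producing a stale
// reader takes real effort.
func readLoop(conn *websocket.Conn, handle func([]byte)) error {
	for {
		_, r, err := conn.NextReader()
		if err != nil {
			return err
		}
		msg, err := io.ReadAll(r) // r is not kept past this iteration
		if err != nil {
			return err
		}
		handle(msg)
	}
}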

Still curious how it's making such a large impact on your GC though, given it's only 8 bytes an alloc 🤔

deanm commented

It's not just about the size of each allocation, it's also about the number of allocations. I implemented my suggestion above and it reduces allocations in my application by more than 95%, which means a massive decrease in the amount of GC time, pauses, etc, and generally decreased CPU utilization by a decent factor. Yes, the application receives a lot of websocket messages (millions), is that not a use case supported by the library?

deanm commented

Btw, if you really need this stale reader bug prevention thing, there is a very simple solution that still does not require any allocation. NextReader simply returns an io.Reader, which is an interface; interfaces can either be (type, value) or (type, *value), meaning you can still return by value (which isn't an allocation). So you make the messageReader type store an int64 seq number and you return it by value, while the Conn has another seq number that is incremented every time it creates a new messageReader. The messageReader can then, instead of doing a pointer check, perform a seq num check: if r.c.seqnum != r.seqnum { }. Therefore this stale prevention mechanism is done explicitly with a system, instead of delegating it to the heap allocation system, which is indirectly using the fact that you get differing heap pointers as a seq number system.
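A rough sketch of that seq number idea, with hypothetical names rather than library code:

package sketch

import "io"

// Conn stands in for the library's connection; only the field needed for the
// sketch is shown.
type Conn struct {
	seq int64 // incremented every time a new message reader is created
}

// messageReader carries the Conn pointer plus the seq number it was created
// with, and is meant to be returned by value inside an io.Reader.
type messageReader struct {
	c   *Conn
	seq int64
}

func (r messageReader) Read(p []byte) (int, error) {
	if r.seq != r.c.seq {
		return 0, io.ErrClosedPipe // stale reader detected by seq num mismatch
	}
	// ... read the current message payload ...
	return 0, io.EOF
}

// nextReader bumps the seq number and hands out a value carrying it.
func (c *Conn) nextReader() io.Reader {
	c.seq++
	return messageReader{c: c, seq: c.seq}
}

(As the reply below points out, the catch is that this value is two words, so boxing it into the io.Reader interface would itself allocate.)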

Yes, the application receives a lot of websocket messages (millions), is that not a use case supported by the library?

It definitely is. It's just still really surprising to me that even with so many messages, the main bottleneck you're having is that each Reader is 8 bytes. What are you doing with the messages themselves that is generally zero alloc and super fast? Is each connection reading thousands of messages before dropping?

Btw, if you really need this stale reader bug prevention thing, there is a very simple solution that still does not require any allocation. NextReader simply returns an io.Reader, which is an interface; interfaces can either be (type, value) or (type, *value), meaning you can still return by value (which isn't an allocation). So you make the messageReader type store an int64 seq number and you return it by value, while the Conn has another seq number that is incremented every time it creates a new messageReader. The messageReader can then, instead of doing a pointer check, perform a seq num check: if r.c.seqnum != r.seqnum { }. Therefore this stale prevention mechanism is done explicitly with a system, instead of delegating it to the heap allocation system, which is indirectly using the fact that you get differing heap pointers as a seq number system.

Interfaces can only store 8 bytes by value; the value you're proposing would be 16 bytes, which wouldn't fit and would require an allocation. Otherwise it would definitely be preferred.

deanm commented

Yes, the application receives a lot of websocket messages (millions), is that not a use case supported by the library?

It definitely is. It's just still really surprising to me that even with so many messages, the main bottleneck you're having is that each Reader is 8 bytes. What are you doing with the messages themselves that is generally zero alloc and super fast? Is each connection reading thousands of messages before dropping?

Yes, everything else is zero alloc (at least in the vast majority of cases). Yes, messageReader really was more than 95% of total allocations.

GC costs are not linear with the size of memory allocated, the costs involve both the number of allocations and the size of those allocations (and the frequency/timing of the allocations, in terms of lock contention, etc), but in many cases the number of allocations can be a bigger contributor (if I allocate a million objects, whether the size of each object is 8 bytes or 800 bytes probably won't make a big difference in terms of GC overhead, and it definitely won't be 100x).
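A crude way to test that intuition is a pair of benchmarks that differ only in object size, run with go test -bench . -benchmem (names are mine, not from the application being discussed):

package sketch

import "testing"

var sink interface{} // keep results alive so the allocations aren't optimized away

// Same number of allocations per benchmark, very different object sizes.
func BenchmarkAlloc8Bytes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = new([8]byte)
	}
}

func BenchmarkAlloc800Bytes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = new([800]byte)
	}
}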

Btw, if you really need this stale reader bug prevention thing, there is a very simple solution that still does not require any allocation. NextReader simply returns an io.Reader, which is an interface; interfaces can either be (type, value) or (type, *value), meaning you can still return by value (which isn't an allocation). So you make the messageReader type store an int64 seq number and you return it by value, while the Conn has another seq number that is incremented every time it creates a new messageReader. The messageReader can then, instead of doing a pointer check, perform a seq num check: if r.c.seqnum != r.seqnum { }. Therefore this stale prevention mechanism is done explicitly with a system, instead of delegating it to the heap allocation system, which is indirectly using the fact that you get differing heap pointers as a seq number system.

Interfaces can only store 8 bytes by value; the value you're proposing would be 16 bytes, which wouldn't fit and would require an allocation. Otherwise it would definitely be preferred.

Bummer, I had guessed interfaces could be implemented as a header but it makes sense that they have to be fixed size. Just looked at the source code and it looks like they are always two words.
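A quick way to see the two-word layout (16 bytes on a 64-bit platform, whatever the interface holds):

package sketch

import (
	"fmt"
	"io"
	"unsafe"
)

// An interface header is a (type, data) pair of words, regardless of the
// dynamic type stored in it.
func main() {
	var r io.Reader
	var e interface{}
	fmt.Println(unsafe.Sizeof(r), unsafe.Sizeof(e)) // prints "16 16" on 64-bit
}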

Yes, everything else is zero alloc (at least in the vast majority of cases). Yes, messageReader really was more than 95% of total allocations.

GC costs are not linear with the size of memory allocated, the costs involve both the number of allocations and the size of those allocations (and the frequency/timing of the allocations, in terms of lock contention, etc), but in many cases the number of allocations can be a bigger contributor (if I allocate a million objects, whether the size of each object is 8 bytes or 800 bytes probably won't make a big difference in terms of GC overhead, and it definitely won't be 100x).

Fair point, I had not considered that.

Btw, given your scale, you might be interested in gobwas/ws.

https://www.freecodecamp.org/news/million-websockets-and-go-cc58418460bb/

It's a very unidiomatic way of writing Go but will give you the fastest performance and least memory allocation.

Here are some options to mitigate the number of allocations:

  • Create message readers in chunks (make([]messageReader, chunkSize)) and allocate out of those chunks (see the sketch after this list).
  • Alternate between two or more pre-allocated readers. This will probably catch most uses of stale readers.
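A minimal sketch of the chunking option, using hypothetical field names (readerChunk, nextInChunk) that are not actual library fields:

package sketch

// Carve message readers out of a pre-allocated chunk so that a new heap
// allocation happens only once per chunkSize messages.
const chunkSize = 64

type messageReader struct{ c *Conn }

type Conn struct {
	readerChunk []messageReader // current chunk readers are carved from
	nextInChunk int             // index of the next unused slot
}

func (c *Conn) newMessageReader() *messageReader {
	if c.nextInChunk == len(c.readerChunk) {
		c.readerChunk = make([]messageReader, chunkSize) // one allocation per chunkSize readers
		c.nextInChunk = 0
	}
	r := &c.readerChunk[c.nextInChunk]
	c.nextInChunk++
	*r = messageReader{c: c}
	return r
}

Each reader keeps a distinct address until its chunk is reused, so the existing c.messageReader != r identity check would still catch most stale readers.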

This is now available on master in my library; reads now do not allocate any memory.

I didn't add any protection against stale readers, as you can only have one reader open at a time and it's not a concurrent operation, so if you end up with a stale reader something really went wrong: you called Reader again but somehow are still using the old reader. I think a similar approach would work well in gorilla/websocket for readers.

Writers as well, but only if there is no plan to allow concurrent writers, as then you could easily call a method on a stale writer if you double close, since another goroutine might have picked it up in the meantime. That's why writers still allocate in my lib.

Closing out - I think this has run its course.

gorilla/websocket takes a slightly safer stance here - protecting from stale reads (see this comment) - than other libraries may. There are always trade-offs in library design, and we currently believe this is the right one for most users.

@nhooyr
I have seen this problem and solved it in another library of mine.
You can use a lock and a bool variable on the connection; the bool indicates whether a read is in progress. Before reading, check that the bool is false; if it is true, a data race happened, so just panic in the caller (like a Go map does), then set the bool to true. After reading, set the bool back to false.
So you can protect against stale readers (two goroutines calling read at the same time) with no allocation.
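A minimal sketch of that approach, with illustrative names rather than code from either library:

package sketch

import "sync"

// A mutex-guarded flag detects two goroutines reading at once and panics,
// similar to the concurrent access check in Go's map implementation.
type Conn struct {
	mu      sync.Mutex
	reading bool
}

func (c *Conn) beginRead() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.reading {
		panic("websocket: concurrent read detected")
	}
	c.reading = true
}

func (c *Conn) endRead() {
	c.mu.Lock()
	c.reading = false
	c.mu.Unlock()
}

// ReadMessage wraps the real read between the two calls.
func (c *Conn) ReadMessage() ([]byte, error) {
	c.beginRead()
	defer c.endRead()
	// ... perform the actual read ...
	return nil, nil
}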