[Question] CPU cache friendly byte slice

stokito opened this issue 2 years ago · 2 comments

Thank you for your library, articles and talks. Could you please help me?

I have a logger that writes a lot of messages in parallel to stdout. The problem was that messages were written simultaneously and came out interleaved. So I had to add a mutex and lock it before printing:

l.mu.Lock()
fmt.Fprintf(os.Stdout, format, v...)
l.mu.Unlock()

I would like to avoid the locking because I need latency to be as low as possible. I'm fine with occasional pauses, though, and I don't care much about the order of messages.
My server has 24 CPUs, and each has its own cache. My idea is to keep a per-CPU list of byte slices and periodically gather all of them and dump them to the log.
Can this work in practice? I feel like I'm reinventing some existing structure. Could you please recommend an optimal way to do this?

I see that state striping is something that could probably help me.
I also see that the library has a Counter that uses the same principle.

I also asked the question on SO https://stackoverflow.com/questions/74954360/cpu-cache-friendly-byte-slice

stokito commented 2 years ago

Thank you

Answer 1 · 2023-01-04T14:54:14.000Z

First of all, thanks for the feedback!

Notice that any write into stdout/stderr already involves locking, which is why it's quite expensive on its own. The logging framework you're describing is not something new; log4j v2, for example, works in a similar way. The idea is that a log call itself doesn't write into stdout, but instead prepares a message and publishes it into an MPSC (multi-producer, single-consumer) queue. The queue is consumed by a single thread/goroutine, which writes the messages into stdout/stderr, a file, or anything else, such as a network connection.

Your question seems to be related to ways to avoid litter (garbage), i.e. redundant allocations for the message byte arrays/slices. To do that you could either use some kind of pooling (sync.Pool) or use an LMAX Disruptor-style queue instead of something more traditional. The latter means a ring buffer plus a separate sequence API for the producers and for the consumer. The sequences synchronize access to the ring buffer, while the ring buffer itself holds the byte slices. QuestDB uses this approach, but the code base is large and may be too complex to serve as an example:
https://github.com/questdb/questdb/tree/584339afe80d4fda238b18f76ab20d6d8666bc90/core/src/main/java/io/questdb/log