crystal-lang/crystal

Add a "prefill" method to IO::Buffered with read buffering

spuun opened this issue · 15 comments

Feature Request

I haven't found a way to "prefill " a IO::Buffered with read_buffering? == true. I want to fill it's buffer with at least n bytes (n <= buffer_size).

A use case is that I'd like to peek into the buffer to see some bytes to autodetect what protocol a socket client is using. A peek call will trigger a fill, but if not enough data is read I have no way of telling the IO to read more data.

I'd like to do something like

io.prefill_buffer(8) # i must have at least 8 bytes in the buffer
if io.peek[0,8] == "protocol"
  handle_protocol1_client(io)
else if io.peek[0,5] == "proto"
  handle_protocol2_client(io)
else
  handle_default_client(io)
end

Maybe a #peek(n) : Bytes overload? It would make sure the buffer has at least n bytes and could return a slice to the first n bytes instead of the current buffer.

I like the generic prefill method, but looking at IO and IO::Buffered I'm not noticing other methods that could benefit from filling the buffer 🤔

Maybe a #peek(n) : Bytes overload? It would make sure the buffer has at least n bytes and could return a slice to the first n bytes instead of the current buffer.

Agree!

This seems like it could be useful. However I'm wondering about the behaviour outside the happy path.

For example, what happens on EOF? I suppose the result is trimmed down to the size of available data? But then we must make sure to not assume peek(n).size == n.
Also a significant difference from base #peek is that the original performs at most one read operation and returns as soon as some data has arrived. But in order to fill the peek buffer to n we may need to read multiple times. This deviates from the expectations that you would have on #peek. Not sure about the impact and relevance, but it's something to note.

This filled peek also needs a memmove on the peek buffer. I don't see any immediate downside of that except that it can have a performance effect. But that should be neglectable because peek size is usually low.

Another positive effect I can think of is that this could perhaps allow us to simplify some algorithms. IO::Delimited for example has three different implementations for delimiter detection: #read has two different ones, depending on if peek from upstream is available, and #peek. Prefilling the peek buffer would also allow to persist the remainder of a peek read while filling up the rest to read more. I haven't thought this through entirely, but it might be a chance.

Currently I've done https://github.com/cloudamqp/lavinmq/blob/ba4adbddaa76dfadfe8310df0b42e9d1c0d52e7c/src/stdlib/io_buffered.cr in LavinMQ. Maybe I'm missing a case or two, and I know it's only used for sockets.

Yeah, I think there's a bug in your code: to_read = @buffer_size - @in_buffer_rem.size should be to_read = size - @in_buffer_rem.size.

And it does not handle the case where @in_buffer_rem + to_read goes out of bounds on @in_buffer. E.g. this use case:

# This should fill the buffer and consume all but the last byte
io.read(Bytes.new(io.buffer_size - 1))
# Now we want to read one more byte than the buffer's capacity
io.peek(2)

In order to fix this, we need to always move the contents of @in_buffer_rem to the start @in_buffer. And then we fill the remainder up to at least size.

Yeah, I think there's a bug in your code: to_read = @buffer_size - @in_buffer_rem.size should be to_read = size - @in_buffer_rem.size.

Hm. My intention was to try to fill the entire buffer, but at least size. Isn't @buffer_size - @in_buffer_rem.size what's left in in_buffer?

And it does not handle the case where @in_buffer_rem + to_read goes out of bounds on @in_buffer. E.g. this use case:

# This should fill the buffer and consume all but the last byte
io.read(Bytes.new(io.buffer_size - 1))
# Now we want to read one more byte than the buffer's capacity
io.peek(2)

I'm only reading while @in_buffer_rem.size < size, so in this case no read would happen.

What i want #peek(n) to do is to ensure there is n bytes in the buffer/returned slice. Not always read n more bytes.

I have these specs that maybe shows what I want the method to do.

Hm, seems like misread the code, sorry. So to_read determines the remaining capacity.
But in the case @in_buffer_rem.size + to_read < size, we cannot fill the buffer to fit the requested amount. That would mean returning a slice that's smaller than slice, despite having potentially more data to read. That's unergonomic because you cannot expect peek(n) to return n bytes even in normal operation. The only case where that should be true is when the IO closes and there is not more to read.

So I still think we need to move @in_buffer_rem to the start of @in_buffer.

Yeah, i recreate @in_buffer_rem after data has been added to in_buffer. Maybe this must be done in some other way?

I think I understand what you mean now, yes.

I was thinking the algorithm should look something like this (untested):

    if @in_buffer_rem.size < size
      # We need to read more into the buffer

      in_buffer = in_buffer()

      # 1. Move leftover data to the front of the buffer so we have maximum capacity to fill
      @in_buffer_rem.copy_to(in_buffer, @in_buffer_rem.size)

      # 2. Fill the buffer up to at least size
      while @in_buffer_rem.size < size
        bytes_read = unbuffered_read(in_buffer + @in_buffer_rem.size).to_i
        break if bytes_read.zero?
        @in_buffer_rem = Slice.new(@in_buffer, @in_buffer_rem.size + bytes_read)
      end
    end

    @in_buffer_rem[0, size]

I was thinking the algorithm should look something like this (untested)

Exactly. I just did a spec to verify that.

Currently I have https://github.com/cloudamqp/lavinmq/pull/803/files

I think it make sense to raise something if we can't read size bytes. Also added a check to not always move data in in_buffer which I think makes sense.

I'm not too happy about raising. It's costly and not alway necessary.
There could be a variant that raises, but I think the default action should just return a slice with a size smaller than requested. The caller can figure out what to make of this and raise only if appropriate for the use case.

I hear ya!