Add a "prefill" method to IO::Buffered with read buffering
spuun opened this issue · 15 comments
Feature Request
I haven't found a way to "prefill" an `IO::Buffered` with `read_buffering? == true`. I want to fill its buffer with at least `n` bytes (`n <= buffer_size`).
A use case is that I'd like to peek at the first few bytes in the buffer to autodetect which protocol a socket client is using. A `peek` call will trigger a fill, but if not enough data is read I have no way of telling the IO to read more data.
I'd like to do something like:

```crystal
io.prefill_buffer(8) # ensure at least 8 bytes are in the buffer

if io.peek[0, 8] == "protocol".to_slice
  handle_protocol1_client(io)
elsif io.peek[0, 5] == "proto".to_slice
  handle_protocol2_client(io)
else
  handle_default_client(io)
end
```
Maybe a `#peek(n) : Bytes` overload? It would make sure the buffer has at least `n` bytes and could return a slice of the first `n` bytes instead of the current buffer.
I like the generic prefill method, but looking at `IO` and `IO::Buffered` I'm not noticing other methods that could benefit from filling the buffer 🤔
> Maybe a `#peek(n) : Bytes` overload? It would make sure the buffer has at least `n` bytes and could return a slice of the first `n` bytes instead of the current buffer.

Agree!
This seems like it could be useful. However, I'm wondering about the behaviour outside the happy path.

For example, what happens on EOF? I suppose the result is trimmed down to the size of the available data? But then we must make sure not to assume `peek(n).size == n`.
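To make that concrete, a caller of such an overload would need a guard along these lines (a minimal sketch, assuming the proposed `peek(size)` overload; `detect_protocol` is just a placeholder):

```crystal
header = io.peek(8)
if header.size < 8
  # EOF arrived before 8 bytes did; don't index past the end of the slice
  handle_default_client(io)
else
  # Safe: at least 8 bytes are available to inspect
  detect_protocol(header[0, 8])
end
```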
Also, a significant difference from the base `#peek` is that the original performs at most one read operation and returns as soon as some data has arrived, whereas filling the peek buffer to `n` may require reading multiple times. This deviates from the expectations you would have of `#peek`. Not sure about the impact and relevance, but it's something to note.
This filled peek also needs a memmove on the peek buffer. I don't see any immediate downside to that except that it can have a performance effect. But that should be negligible because the peek size is usually low.
Another positive effect I can think of is that this could perhaps allow us to simplify some algorithms. `IO::Delimited`, for example, has three different implementations for delimiter detection: `#read` has two, depending on whether peek from upstream is available, plus `#peek`. Prefilling the peek buffer would also allow persisting the remainder of a peek read while filling up the rest to read more. I haven't thought this through entirely, but it might be an opportunity.
Currently I've done https://github.com/cloudamqp/lavinmq/blob/ba4adbddaa76dfadfe8310df0b42e9d1c0d52e7c/src/stdlib/io_buffered.cr in LavinMQ. Maybe I'm missing a case or two, and I know it's only used for sockets.
Yeah, I think there's a bug in your code: `to_read = @buffer_size - @in_buffer_rem.size` should be `to_read = size - @in_buffer_rem.size`.
And it does not handle the case where `@in_buffer_rem + to_read` goes out of bounds on `@in_buffer`. E.g. this use case:

```crystal
# This should fill the buffer and consume all but the last byte
io.read(Bytes.new(io.buffer_size - 1))
# Now we want to read one more byte than the buffer's capacity
io.peek(2)
```

In order to fix this, we need to always move the contents of `@in_buffer_rem` to the start of `@in_buffer`, and then fill the remainder up to at least `size`.
> Yeah, I think there's a bug in your code: `to_read = @buffer_size - @in_buffer_rem.size` should be `to_read = size - @in_buffer_rem.size`.
Hm. My intention was to try to fill the entire buffer, but at least `size` bytes. Isn't `@buffer_size - @in_buffer_rem.size` what's left in `in_buffer`?
> And it does not handle the case where `@in_buffer_rem + to_read` goes out of bounds on `@in_buffer`. E.g. this use case:
>
> ```crystal
> # This should fill the buffer and consume all but the last byte
> io.read(Bytes.new(io.buffer_size - 1))
> # Now we want to read one more byte than the buffer's capacity
> io.peek(2)
> ```
I'm only reading while `@in_buffer_rem.size < size`, so in this case no read would happen.

What I want `#peek(n)` to do is to ensure there are `n` bytes in the buffer/returned slice, not to always read `n` more bytes.

I have these specs that maybe show what I want the method to do.
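For illustration, the intent could be captured by specs along these lines (a sketch, not the actual specs; `BufferedWrapper` is an assumed test helper like the one in the stdlib's buffered IO specs, and `peek(size)` is the proposed overload):

```crystal
require "spec"

# Assumed minimal test helper: a pass-through IO with read buffering
class BufferedWrapper < IO
  include IO::Buffered

  def initialize(@io : IO)
  end

  def unbuffered_read(slice : Bytes)
    @io.read(slice)
  end

  def unbuffered_write(slice : Bytes) : Nil
    @io.write(slice)
  end

  def unbuffered_flush : Nil
  end

  def unbuffered_close : Nil
  end

  def unbuffered_rewind : Nil
  end
end

describe "IO::Buffered#peek(size)" do
  it "ensures size bytes in the buffer without consuming them" do
    io = BufferedWrapper.new(IO::Memory.new("protocol-v2"))
    io.peek(8).should eq "protocol".to_slice
    # Peeking again returns the same bytes; nothing was consumed
    io.peek(5).should eq "proto".to_slice
  end

  it "returns fewer bytes when EOF is reached first" do
    io = BufferedWrapper.new(IO::Memory.new("hi"))
    io.peek(8).should eq "hi".to_slice
  end
end
```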
Hm, seems like I misread the code, sorry. So `to_read` determines the remaining capacity.
But in the case `@in_buffer_rem.size + to_read < size`, we cannot fill the buffer to fit the requested amount. That would mean returning a slice smaller than `size`, despite potentially having more data to read. That's unergonomic because you could not expect `peek(n)` to return `n` bytes even in normal operation. The only case where that should happen is when the IO closes and there is no more to read.
So I still think we need to move `@in_buffer_rem` to the start of `@in_buffer`.
Yeah, I recreate `@in_buffer_rem` after data has been added to `in_buffer`. Maybe this must be done in some other way?
I think I understand what you mean now, yes.
I was thinking the algorithm should look something like this (untested):

```crystal
if @in_buffer_rem.size < size
  # We need to read more into the buffer
  in_buffer = in_buffer()
  # 1. Move leftover data to the front of the buffer so we have maximum
  #    capacity to fill (move, not copy: the regions can overlap)
  @in_buffer_rem.move_to(in_buffer, @in_buffer_rem.size)
  # 2. Fill the buffer up to at least size
  while @in_buffer_rem.size < size
    read_slice = Slice.new(in_buffer + @in_buffer_rem.size, @buffer_size - @in_buffer_rem.size)
    bytes_read = unbuffered_read(read_slice).to_i
    break if bytes_read.zero? # EOF
    @in_buffer_rem = Slice.new(in_buffer, @in_buffer_rem.size + bytes_read)
  end
end
# Trim to the available size so we don't go out of bounds at EOF
@in_buffer_rem[0, Math.min(size, @in_buffer_rem.size)]
```
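Revisiting the out-of-bounds scenario from above with this sketch (a walkthrough, not tested):

```crystal
# Fill the buffer and consume all but the last byte
io.read(Bytes.new(io.buffer_size - 1))
# The single leftover byte sits at the end of in_buffer. peek(2) now moves
# it to the front, reads more from upstream, and returns the first 2 bytes.
io.peek(2)
```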
> I was thinking the algorithm should look something like this (untested)

Exactly. I just did a spec to verify that.
Currently I have https://github.com/cloudamqp/lavinmq/pull/803/files

I think it makes sense to raise something if we can't read `size` bytes. I also added a check to not always move data in `in_buffer`, which I think makes sense.
I'm not too happy about raising. It's costly and not always necessary.

There could be a variant that raises, but I think the default action should just return a slice with a size smaller than requested. The caller can figure out what to make of this and raise only if appropriate for the use case.
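At the call site that could look something like this (a sketch; the error type is just an example):

```crystal
data = io.peek(8)
# Default behaviour: a short slice signals EOF; the caller decides whether to raise
raise IO::EOFError.new("expected 8 bytes, got #{data.size}") if data.size < 8
```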
I hear ya!