Tracking Issue for RFC 2930 (read-buf)
nikomatsakis opened this issue Β· 51 comments
This is a tracking issue for RFC 2930 (rust-lang/rfcs#2930).
The feature gate for the issue is `#![feature(read_buf)]`.

Note: this is now called `BorrowedBuf` rather than `ReadBuf`.
About tracking issues
Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.
Steps
- Implement the RFC
  - Initial implementation in #81156
  - Vectorized APIs (`read_buf_vectored`, `ReadBufs`)
- Adjust documentation (see instructions on rustc-dev-guide)
- Stabilization PR (see instructions on rustc-dev-guide)
  - Note that this is a use of `rustc_must_implement_one_of`, which is a language-observable thing, and thus needs some amount of oversight/documentation about that as a prerequisite of stabilization.
Unresolved Questions
- Should `read_buf` return the number of bytes read like `read` does, or should the `ReadBuf` track it instead? Some operations, like checking for EOF, are a bit simpler if `read_buf` returns the value, but the confusion around what is and is not trustworthy is worrisome for unsafe code working with `Read` implementations.
- What should `assume_init` be named?
- Should the API use a wrapper around `&mut ReadBuf` to prevent unexpected swapping of the caller-provided `ReadBuf`?
- Resolve soundness issue: #93305
Implementation history
( cc @rust-lang/libs )
I'm interested in working on this.
@rustbot claim
I am slightly confused by the API of `ReadBufs`. What return type should methods like `initialized` have: `&[u8]` or `&[IoSlice]`? If it's the former, how do you know which slices are initialized/filled? If it's the latter, what happens if a slice is partially initialized/filled?
edit: probably I should ask this on zulip instead of github
It would return `&[IoSliceMut]`, and only include the slices that are fully initialized.
> It would return `&[IoSliceMut]`, and only include the slices that are fully initialized.

I would have expected it to return `(&[IoSliceMut], &[u8])`, where it is some number of fully initialized buffers plus the initialized portion of the partially initialized buffer.
Sure, that makes sense.
@drmeepster are you still working on this?
Yeah, I am. Although I'm currently working on #79607 because I needed it for this.
Okay, I can continue working on this now that we have `MaybeUninit::write_slice`.
I have a PR to add an `inner_mut` method to Tokio's implementation of `ReadBuf`: tokio-rs/tokio#3443. As far as I can tell it's a valid use case with no other way to do it short of pointer arithmetic, so it may make sense to have this in upstream Rust's `ReadBuf` as well.
Have we considered extending `ReadBuf` to be generic over the element type, rather than constrained to `u8`? I'm guessing much of `ReadBuf` is generic over the type.

This came up because I'm looking into fixing some UB in rayon, which is passing around an uninitialized `&mut [T]` in its collect iterator. I think the main thing making rayon use this over a `&mut Vec<T>` is that it wants to split the output buffer across threads, but there's no safe way to do this without initializing the slice. I'd like to replace this with a safe abstraction that's probably quite similar to the design proposed for `ReadBuf` (plus a `split_at` method), so I thought maybe there are other people potentially interested in this functionality.

Going further, it would also be interesting to see if we could rewrite `Vec` to sit upon `ReadBuf`.
I made very similar points here:
https://internals.rust-lang.org/t/readbuf-as-part-of-rust-edition-2021/14256/9?u=djc
Note that we now have a `spare_capacity_mut` method on `Vec` which gives you a `&mut [MaybeUninit<T>]` for the uninitialized part of a vector.
The getrandom devs point out that there's not really any reason (aside from maybe std Error) why this could not be in libcore. Given the low-level nature of this concept, libcore seems to me a more natural place for it to live - I could easily see no_std libs being very keen to use this. Thoughts?
> Note that we now have a `spare_capacity_mut` method on `Vec` which gives you a `&mut [MaybeUninit<T>]` for the uninitialized part of a vector.
I wrote a post (internals thread) on what using this for `Read` might look like. It contains a brief introduction to the problem space, then discusses the solution proposed in RFC 2930, and finally shows how `Vec::spare_capacity_mut` might be used both with and without overloading. I'm no expert at this, but I figured it might still be useful for someone to reason through what that might look like. I hope this is a useful contribution!
Examples

An overview of usage examples described in the post.

Example 1: Uninitialized slices

```rust
let mut buf = [MaybeUninit::<u8>::uninit(); 1024];
let mut buf = ReadBuf::uninit(&mut buf);
file.read_buf(&mut buf)?; // a new method which takes `ReadBuf`
let data = buf.filled();
```

Example 2: Uninitialized vectors

```rust
let mut buf = Vec::with_capacity(1024);
let read = file.read_to_capacity(&mut buf)?; // similar to `read_buf` above, but takes `Vec`
let data = &buf[..read];
```

Example 3: Uninitialized vectors with overloading

```rust
let mut buf = Vec::with_capacity(1024);
let read = file.read(&mut buf)?; // specialize if `Vec` is passed
let data = &buf[..read];
```
Note that an initial implementation of this landed in #81156, though without the vectorized APIs (`read_buf_vectored`, `ReadBufs`).
#92164 adds support for a lint that enforces at least one method is implemented and has been merged (relevant here because we would like both read and read_buf to have default implementations but to require at least one to be explicitly implemented).
> Should `read_buf` return the number of bytes read like `read` does, or should the `ReadBuf` track it instead? Some operations, like checking for EOF, are a bit simpler if `read_buf` returns the value, but the confusion around what is and is not trustworthy is worrisome for unsafe code working with `Read` implementations.
We discussed this on Zulip (https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/ReadBuf.20API). We concluded that including the bytes read inside the ReadBuf was preferable to keeping it outside. As well as being less error prone it seems like better encapsulation in data structures and a more Rust-y API.
> Resolve soundness issue: #93305
We also discussed this on Zulip (https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/Read.3A.3Abuf_read.20signature). We did not reach a conclusion re the read-buf API (I'll open an issue for the interesting part), but we (eventually) reached agreement on the following points:
- there is a bug in the implementation of copy_to which makes that function unsound and which should be fixed
- there is no soundness issue with the read_buf API
- there is a footgun in read_buf's API which makes it easy to unknowingly write incorrect (unsound) code: it is possible for a malicious implementation of `Read` to replace the buffer with a new one, so if callers of read_buf rely on the sizes in the returned buffer applying to an underlying buffer, they must check that the buffer has not been replaced (copy_to demonstrates that it is easy to forget this check)
My experience with having a similar API in Tokio is that you have to fix the footgun. You will have a lot of buggy implementations if you don't.
`ReadBuf` doesn't implement the `Write` trait, AFAIK. Are there any reasons against doing that? See #94665 for reasoning.
> The getrandom devs point out that there's not really any reason (aside from maybe std Error) why this could not be in libcore. Given the low-level nature of this concept, libcore seems a more natural place for it to live - I could easily see no_std libs being very keen to use this. Thoughts?
Is there any reason for the `ReadBuf` type and most of its methods to not live in `libcore`? This abstraction would be nice for some OS-level stuff we are working on.
I think the only reason is that core::io doesn't exist.
> Should `read_buf` return the number of bytes read like `read` does, or should the `ReadBuf` track it instead? Some operations, like checking for EOF, are a bit simpler if `read_buf` returns the value, but the confusion around what is and is not trustworthy is worrisome for unsafe code working with `Read` implementations.

> We discussed this on Zulip (https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/ReadBuf.20API). We concluded that including the bytes read inside the ReadBuf was preferable to keeping it outside. As well as being less error prone, it seems like better encapsulation in data structures and a more Rust-y API.
Some more thoughts on this:
- It's actually pretty difficult to get the number of bytes read; there is no convenient length to use, so you have to do the following:

```rust
let old_filled_len = buf.filled_len();
reader.read_buf(buf)?;
let bytes_read = buf.filled_len() - old_filled_len;
```

That is obviously less convenient than `let bytes_read = reader.read_buf(buf)?;` (it's also unsound because `read_buf` might switch out `buf`, but that's something we want to fix).
- The filled part of the buffer may shrink as well as grow (via `clear` or `set_filled`), so the above code may overflow.
- The docs suggest that it is legitimate for a reader to mutate the filled part of the buffer as well as write new data, so the reader could read n bytes, combine them with the filled part of the buffer, and compress the result, writing everything back into the buffer. In that case the above code would show that the filled part of the buffer has grown by m bytes, where m < n, and there is no way for the caller to find the number of bytes actually read.
Should there be an owned variant of this type? As in, one that uses `Box<[MaybeUninit<u8>]>` instead of a slice reference.

@notgull I don't think there's any value in that. This just exists so the `Read` trait can use it.
> Should there be an owned variant of this type? As in, one that uses `Box<[MaybeUninit<u8>]>` instead of a slice reference.
It's something we may want for the async work (where completion IO systems want to take logical ownership of the buffer), but it doesn't need to block anything on the sync side (and we'd probably want a trait for it rather than forcing the user to commit to a specific storage strategy).
Hmm… in regards to the "don't want to be able to move the `ReadBuf` out through the mutable reference" issue, I wonder if it would be possible to pin it, like we do with futures. That way, bytes can be written into it, but the buffer itself can't be moved out.

We can't use pinning because `Pin` allows overwriting the value with a new one.
Just want to confirm: is there an error in read_buf_exact?
https://doc.rust-lang.org/src/std/io/mod.rs.html#821-836
It doesn't handle the Ok(0) case.
The `read_exact` function breaks the loop and throws `UnexpectedEof` if the buffer is not full when `read` returns 0.
> Just want to confirm: is there an error in read_buf_exact? https://doc.rust-lang.org/src/std/io/mod.rs.html#821-836
>
> It doesn't handle the Ok(0) case.
Looks to me that it does handle it by comparing old versus new filled length: https://doc.rust-lang.org/src/std/io/mod.rs.html#830-832
Can we put this struct into the `core` crate? It can be useful for no_std implementations of `getrandom` or other things.

Am I right that this struct never allocates?
> Can we put this struct into the `core` crate?
I'd imagine this would be blocked by #48331.

There's no reason that this isn't in core other than the fact that the module `core::io` doesn't exist yet. It doesn't rely on anything else in `std::io`.
Currently, if you `Write::write()` into a `BorrowedCursor` more data than it can hold, it panics as if you had called `append()`. I'd expect it to write only part of the data for `write()` and to err with `ErrorKind::WriteZero` for `write_all()`, like `impl Write for &mut [u8]` does.
Could we please have a shared interface trait for `Seek::stream_position` and `BorrowedBuf::len`? That would help in adapting library crates such as `zip_next` to use a `BorrowedBuf` (which would make it possible to read back one compressed file while writing another without loading the whole file into memory).
I just want to note that while `rustc_must_implement_one_of` is a prerequisite to stabilizing `Read::read_buf`, `BorrowedBuf`/`BorrowedCursor` are separately useful even without the existence of `Read::read_buf`, so they could be stabilized separately. I happen to be writing some `unsafe` APIs which could be made meaningfully safer with the use of `BorrowedCursor`.
I'm interested in `BorrowedBuf` and `BorrowedCursor` for use in `no_std` environments. Would it be possible to move them into `core` and stabilize them separately from the new `std::io::{Read, Write}` functionality?

The current implementation of those types has no dependency on `std`, and I'd be happy to send out the PRs if the Rust folks are willing to review them.
The `BorrowedCursor` documentation could use some polish. It says some slightly contradictory or misleading things. The docs start with

> A writeable view of the unfilled portion of a BorrowedBuf.

but the very next paragraph says:

> Provides access to the initialized and uninitialized parts of the underlying BorrowedBuf.

Looking at the actual implementations makes it obvious that they mostly pass through and grab slices from the underlying `BorrowedBuf`, totally ignoring the write position (`start`). The write position is only relevant for `written()`.
Does `core_io_borrowed_buf` really need to be a separate feature gate, or could it be rolled into `read_buf`? It's a bit confusing that the `BorrowedBuf`/`BorrowedCursor` docs all point to #117693 rather than this issue, assuming the same types are designed to work in `core`.
Also related to @the8472's comment, docs need examples.
There is something "fun" with `read_buf` as it is defined today, though I am not sure it was done on purpose: unlike `read`, it is possible to return both data and an error.
This is possible because the cursor keeps track of the read bytes itself, so returning an error is not incompatible with reading bytes.
I don't know if this is expected, and it is probably useful, but it might be surprising if some implementations start doing that.
In fact, `read_buf_exact` documents that it has such behavior:

> If this function returns an error, all bytes read will be appended to cursor.
Something more problematic is that it is impossible to write a correct `read` implementation on top of `read_buf` (either the written bytes or the error would have to be discarded).
Tbh, I think `read` is in the wrong here. For example, I/O reads very much depend on the behaviour of the underlying OS, which we have no control over. Currently we are indeed forced to discard any OS errors if any bytes are read.

It would be great to have a final word from libs-team on this:
- Say that this is undesirable and change the API to prevent it
- Say that this is undesirable and document such behavior (and that such `read_buf` implementations are buggy)
- Say that this is fine (or even great), document that this may happen because it is not obvious, and change the few uses of `read_buf` in `std` that don't take this possibility into account
- Some other choice that I didn't think of?
> Currently we are indeed forced to discard any OS errors if any bytes are read.

I don't know about Windows at all, but on Unix at least I don't think this is possible. Can you link such an implementation?
See the post above for why this is nominated.
> > Currently we are indeed forced to discard any OS errors if any bytes are read.
>
> I don't know about Windows at all, but on Unix at least I don't think this is possible. Can you link such an implementation?

Neither the Windows API nor the documentation guarantees that no bytes will be read in an error case. One documented case is pipes in message mode:

> If a named pipe is being read in message mode and the next message is longer than the nNumberOfBytesToRead parameter specifies, ReadFile returns FALSE and GetLastError returns ERROR_MORE_DATA. The remainder of the message can be read by a subsequent call to the ReadFile or PeekNamedPipe function.

To be clear, I'd have to check if we actually do handle this (I suspect not), but if we're going strictly by the `Read` trait documentation then we should test if any bytes are read and return success if so. The other option, of course, would be to weaken the std documentation.
> To be clear, I'd have to check if we actually do handle this (I suspect not), but if we're going strictly by the `Read` trait documentation then we should test if any bytes are read and return success if so. The other option, of course, would be to weaken the std documentation.
Does this case really matter? The `Read` trait is for byte streams, not message streams. For example, `UdpSocket` doesn't implement the `Read` trait (I assume for this reason).

The example you give is very similar to the `MSG_TRUNC` flag in a `recvmsg()` call. But for a byte stream this can never happen. For `Read`, reading a partial message, swallowing the error, and performing another read is the correct thing to do.
The `Read` trait is implemented for `File`, which means it needs to work for anything you can open with `File::open`. The `read` method also has zero context for the read. All this means it's completely oblivious to what it is reading and has to work (for some definition of "work") in all situations, especially when we provide a "guarantee".

Maybe swallowing any error is fine. But I think in an ideal world we'd have some way for the user to access this information.
We discussed this in the libs-api meeting today. We agree that this can be surprising behavior, but don't see a good way to change the API to avoid it. Our conclusion was to update the documentation to say that returning an error after reading some bytes is allowed, but strongly discouraged.
Maybe the return type should be `Result<usize, (usize, io::Error)>` so that the length of the partial read is available?
I disagree that the signature of `read` is to blame here. That method is intended to be called repeatedly in a loop to get all the data you need. Thus, if an implementation wants to return both data and an error, it'll return the data the first time `read` is called and cache the error to return on the second call.

The problem comes in because a default implementation of `read` in terms of `read_buf` cannot implement caching behavior, because it doesn't have access to fields within the reader. Thus, another option would be for the standard library to document that the default implementation of `read` swallows errors if any bytes are produced, and to recommend that if `read_buf` is capable of producing both data and an error, the implementer should also provide a `read` implementation that caches errors.
On internals.r-l.o, keepsimple1 commented that they didn't realise that `ReadBuf` has been renamed to `BorrowedBuf` rather than removed, because it's not obvious from the summary.
Could you update the summary to include something like:

> This is now called `BorrowedBuf` rather than `ReadBuf`.

Thanks!
I've added a note in the summary, hopefully that should help.