Avoid moving bytes to the beginning of the buffer
Would it make sense to use something like the `slice_deque` crate to avoid having to move bytes to the beginning of the buffer?
If that cannot be used due to a problem with the library, I'll be happy to try and fix it. I was looking for crates that could potentially benefit from `SliceDeque`, since I am just trying to improve it for fun.
My use-case for this crate required the data to remain contiguous where possible. I'd be willing to field a separate `DequeBuffer` and `DequeReader`, maybe, to test the pattern. I guess `BufReader` could even be generic over the buffer implementation, but it's already got three type parameters and that makes for a rather unwieldy API.
> My use-case for this crate required the data to remain contiguous where possible.

`SliceDeque<T>`'s data is always contiguous, so if your code is already taking `&[T]` everywhere then you can just pass that code a `SliceDeque<T>` without issues. If your code takes `(&[T], &[T])` to handle discontiguous data, then `SliceDeque` provides the same methods as `VecDeque` to produce these for you, but the second slice will always be empty.
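A minimal sketch of that contiguity guarantee. It assumes the `VecDeque`-style `as_slices()` accessor and the deref-to-slice behaviour described above; only the general shape matters here.

```rust
use slice_deque::SliceDeque;

fn takes_a_slice(bytes: &[u8]) -> usize {
    bytes.len()
}

fn main() {
    let mut deque: SliceDeque<u8> = SliceDeque::new();
    for b in 0u8..10 {
        deque.push_back(b);
    }

    // Existing &[u8]-taking code can be handed the deque directly.
    assert_eq!(takes_a_slice(&deque), 10);

    // Code written against (&[T], &[T]) still works: the second slice is
    // always empty because the data never wraps.
    let (front, back) = deque.as_slices();
    assert_eq!(front.len(), 10);
    assert!(back.is_empty());
}
```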
Ah right, your crate is the one that does the virtual memory tricks to make a contiguous slice. The name doesn't make that very clear by itself (it kind of sounds like a deque built on a boxed slice instead of a vector, as with `VecDeque`).

It could probably be used internally on targets that support it, yeah; then `Buffer.make_room()` would just be a no-op. It'd either be a target-feature or an optional feature though, I don't want to limit `buf_redux` just to the platforms `slice_deque` supports.
> The name doesn't make that very clear by itself (it kind of sounds like a deque built on a boxed slice instead of a vector as with `VecDeque`).

Naming is not one of my strengths; I was hoping that the short description would clarify that, but oh well :D
> It could probably be used internally on targets that support it,

You can check the supported targets in the readme. I don't recall if it's up-to-date, but the travis-ci matrix is: https://travis-ci.org/gnzlbg/slice_deque/builds/351930217

> It'd either be a target-feature or an optional feature though, I don't want to limit buf_redux just to the platforms slice_deque supports.

Makes sense. I think right now it supports all targets that rustc supports except the ARM thumb targets, the NVIDIA nvptx one, wasm, and Emscripten. Of the targets that are coming, I don't think it will support AVR once that lands.
I don't know if cargo lets you enable default features on a per-target basis, but if so you might want to explore that.
I'll admit I didn't initially click the crate link.
I think we can figure out a way to integrate it transparently, probably via target-features. However, what's the minimum allocation size? The default for both std's and this crate's `BufReader` is 8KiB, which is the typical default page size for the big OSes. How much does slice_deque typically overcommit to support what it does?
So I can either build this on top of `slice_deque::Buffer` without any API changes, or I need the following added to `SliceDeque`:

- An unsafe method to get the slice of uninitialized data between the tail and the head. This would be passed to a `Read` impl to fill the buffer, or copied into from a passed-in slice.
- An unsafe method to advance the tail pointer by a given number of elements (e.g. the number of bytes returned from `Read::read()` or the length of the slice that was copied from); it can and probably should panic if the tail pointer would pass the head pointer.
- A safe method like `truncate()` but that drops elements from the front of the buffer, as when dropping written-out data after a `Write::write()` call or to implement `BufRead::consume()`. This can either panic or be a no-op if the given count exceeds the deque's size.

With these additions, `buf_redux::Buffer`'s API can be pared down to a thin wrapper, at least on supported targets; I'll have to keep the existing guts for everything else.
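To make the requested semantics concrete, here is a toy sketch of the three operations on a plain zeroed `Vec<u8>` buffer. This is only an illustration of the semantics asked for above, not buf_redux's or slice_deque's implementation (the types and method names are invented for the example).

```rust
use std::io::{self, Read};

/// Toy contiguous buffer: expose free space, advance the tail after a read,
/// consume from the front. Storage is zero-initialized so no unsafe is needed.
struct ToyBuffer {
    buf: Vec<u8>, // zero-initialized storage
    start: usize, // index of the first occupied byte
    end: usize,   // one past the last occupied byte
}

impl ToyBuffer {
    fn with_capacity(cap: usize) -> Self {
        ToyBuffer { buf: vec![0; cap], start: 0, end: 0 }
    }

    /// 1. The free space between the tail and the end of the allocation;
    ///    this is what a `Read` impl would fill.
    fn free_space(&mut self) -> &mut [u8] {
        &mut self.buf[self.end..]
    }

    /// 2. Commit `n` bytes that a reader just wrote into `free_space()`.
    fn advance_tail(&mut self, n: usize) {
        assert!(self.end + n <= self.buf.len());
        self.end += n;
    }

    /// 3. Drop bytes from the front, as in `BufRead::consume()` or after a
    ///    successful `Write::write()` of the occupied bytes.
    fn consume(&mut self, n: usize) {
        self.start = (self.start + n).min(self.end);
    }

    fn read_from<R: Read>(&mut self, mut reader: R) -> io::Result<usize> {
        let n = reader.read(self.free_space())?;
        self.advance_tail(n);
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut buf = ToyBuffer::with_capacity(8);
    assert_eq!(buf.read_from(&b"hello"[..])?, 5);
    buf.consume(2);
    assert_eq!(&buf.buf[buf.start..buf.end], b"llo");
    Ok(())
}
```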
> So I can either build this on top of `slice_deque::Buffer` without any API changes

I would prefer not to go this way - I consider `Buffer` an implementation detail of `SliceDeque`, and while I have thought about moving it into its own crate, things are not there yet, and I don't know how useful that would be anyways.
> An unsafe method to get the slice of uninitialized data between the tail and the head. This would be passed to a `Read` impl to fill the buffer, or copied into from a passed-in slice.

This can be done.
> An unsafe method to advance the tail pointer by a given number of elements (e.g. the number of bytes returned from `Read::read()` or the length of the slice that was copied from); it can and probably should panic if the tail pointer would pass the head pointer.

These methods already exist and have the semantics that you want. They are called `move_{head,tail}`, and you can advance the head and the tail in both directions by using a signed integer, e.g. `move_head(-1)`. They are just not public yet, but I could make them public.
> A safe method like `truncate()` but that drops elements from the front of the buffer, as when dropping written-out data after a `Write::write()` call or to implement `BufRead::consume()`. This can either panic or be a no-op if the given count exceeds the deque's size.

I actually thought about this while implementing `truncate`. That is, why is it not called `truncate_back` and why isn't there a `truncate_front` method? I will add a `truncate_front` method since this is something that makes sense anyways.
Yeah, if you're all right making those changes I'd be glad to give `SliceDeque` a try in buf_redux.

I may still expose it as a separate buffer/`BufReader`/`BufWriter` implementation, but I do want to give it a try in `multipart`.
So I've a pull-request (gnzlbg/slice_deque#36) that:

- makes `move_{head,tail}` public; these only panic with `-C debug-assertions=1`, that is, in debug builds. Maybe I should call these `move_{head,tail}_unchecked` and make `move_{head,tail}` checked by default?
- adds `truncate_{back,front}` methods (`truncate_back` does the same as `truncate`).
- adds an `uninitialized_mut_slice` method that returns a slice to the uninitialized elements between the tail and the head (`capacity() - size()` elements).
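For context, a sketch of how a buffer-fill path might use these additions. The method names (`uninitialized_mut_slice`, `move_tail`) are taken from the PR description above and could change before merging; the zeroing step is only needed for `Read` impls that can't be trusted, as discussed further down.

```rust
use std::io::{self, Read};

use slice_deque::SliceDeque;

// Sketch only: fill the deque's free space from a reader, then commit the
// bytes that were written by advancing the tail.
fn fill_from<R: Read>(buf: &mut SliceDeque<u8>, reader: &mut R) -> io::Result<usize> {
    unsafe {
        // The capacity() - len() uninitialized bytes after the tail.
        let free = buf.uninitialized_mut_slice();
        // Zero them so an untrusted reader can't observe stale memory.
        for b in free.iter_mut() {
            *b = 0;
        }
        let n = reader.read(free)?;
        // Commit the bytes the reader produced.
        buf.move_tail(n as isize);
        Ok(n)
    }
}
```

Consuming written-out bytes later would map onto the new `truncate_front` (whatever its final argument semantics turn out to be).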
@gnzlbg I've started prototyping on https://github.com/abonander/buf_redux/tree/0.7 but I'm also working on a general API overhaul that this crate has sorely needed.
@gnzlbg There is one additional API I need, though it's mostly an optimization. I'd like to see either `impl From<SliceDeque<T>> for Vec<T>` or an inherent method `fn into_vec(self) -> Vec<T>`. This is to implement `Buffer::into_inner()` without reallocating.

However, I realize that's not safe because `SliceDeque` is dealing with allocations directly from the OS, but `Vec` could be using another allocator like jemalloc. I'll work something out on my own.
> I'd like to see either `impl From<SliceDeque<T>> for Vec<T>` or an inherent method `fn into_vec(self) -> Vec<T>`

That's not really possible. `Buffer::into_inner` should return whatever storage the buffer internally uses. If you use a `Vec<T, CustomAlloc>` you can't return `Vec<T>` without copying the data; you can at best return either `Vec<T, CustomAlloc>` or `Box<&mut [T], CustomAlloc>`. The same applies if you use a `SliceDeque<T>` internally. The only way to map collections from one allocator to another is to copy the data. So if you really need `Buffer::into_inner() -> Vec<T>` with that signature and without performing copies, you need to stick to `Vec<T>` or to anything that uses `Vec<T>` under the hood.
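The portable fallback described here is an explicit copy into an allocator-managed `Vec`. A minimal sketch:

```rust
use slice_deque::SliceDeque;

// O(n) copy from the OS-mapped pages into a Vec owned by the global
// allocator; this is what avoiding is being asked about above.
fn into_vec(deque: SliceDeque<u8>) -> Vec<u8> {
    deque.iter().cloned().collect()
}
```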
I've got a working-ish prototype now; however, I have to rewrite all my tests because they were copied from `libstd/io/buffered.rs` and they assume exact capacity allocation with small sizes, so none of them pass for the `slice-deque`-based buffer.
@gnzlbg As an optimization option, can `slice-deque` allocate pre-zeroed pages, or does it already do that? If you're calling the OS APIs then it should be as simple as setting a bitflag, but it should probably be an optional constructor since most users aren't going to be looking at uninitialized parts of the buffer. I'm actually debating whether I would want to use it or not, as I'd have to wait to allocate the buffer until the user attempts to read into it with a `Read` impl that needs its buffer zeroed (`Read::initializer().should_initialize()` returns true or the user has the `nightly` feature disabled).

I also found that, on Windows 10 64-bit, the minimum allocation appears to be 64KiB, which is significantly bigger than `std::io::BufReader`'s default. I'm not sure how that would ultimately affect performance, as my use-case for `BufReader` handles inputs bigger than that, but it seems like it could be a pessimization for many cases.
After doing some research, I see that your chosen approaches for Windows, Linux and System V should produce memory mappings that are zero-filled by default. I can't verify this for Mac because the `mach_vm_allocate()` function doesn't appear to have non-paywalled documentation.
I am concerned about the raciness of the Windows implementation but I couldn't find a better solution in my brief searches. You can get information on the free pages but that's going to be racy as well, and stopping all other threads for a single allocation call is unacceptable. However, I'm wondering about debug-related APIs that let you see into existing virtual memory allocations; I know Windows has a bunch of stuff like that, maybe something there could be useful.
> can slice-deque allocate pre-zeroed pages or does it already do that?

Even if it currently does this, it is not guaranteed by the API, so one shouldn't rely on it. Why do you need this?
> I also found that, on Windows 10 64-bit, the minimum allocation appears to be 64KiB

Yes, as I mentioned, the size is the platform-dependent allocation granularity, which is the page size on most systems. Windows isn't one of them, as documented in the readme. Also, on any other system, if huge pages are enabled, the minimum allocation might be greater than the usual 4096 bytes (I haven't tested this, but it follows from the page size being OS-dependent).
> I am concerned about the raciness of the Windows implementation

On Windows it isn't possible to write a non-racy `slice_deque` AFAICT (Windows supports POSIX, so maybe there is a way to write a non-racy `slice_deque` using POSIX exclusively, but I haven't found one yet).
> can slice-deque allocate pre-zeroed pages or does it already do that

There is a PR to slice_deque that implements thread-local page caches, intended to make many short-lived slice deques perform well. But on deallocation a `SliceDeque` does not necessarily zero any memory. If the deque's elements need dropping, it will drop them, but deallocating a `SliceDeque` does not touch any uninitialized memory, so if one did not use all of the `capacity`, the memory that the OS returned will be left as-is.
When you pass a buffer to `Read::read()` it needs to be zeroed, as otherwise a poorly behaved implementation could read uninitialized memory in safe code. The unstable `Read::initializer()` method can be overridden by an implementation to declare that it's trustworthy.

`std::io::BufReader` zeroes its buffer on construction, while `Buffer` in this crate only zeroes in `.read_from()` when the `Read` impl can't be determined to be trustworthy.

However, this up-front cost can be avoided when allocating at page-level granularity, as fresh pages are often specified to read out as zeroed until they're written to. This is specified by both `CreateFileMapping` and `memfd_create`, if I remember the documentation correctly.
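A sketch of the zero-only-when-needed pattern described here, using the unstable `Read::initializer()` API (nightly only, behind `feature(read_initializer)`); on stable the buffer has to be zeroed unconditionally. The helper name is invented for the example.

```rust
#![feature(read_initializer)]

use std::io::Read;

// Zero the free space only if the reader doesn't promise to leave it unread.
fn prepare_for<R: Read>(reader: &R, free_space: &mut [u8]) {
    if reader.initializer().should_initialize() {
        for b in free_space.iter_mut() {
            *b = 0;
        }
    }
}
```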
> When you pass a buffer to Read::read() it needs to be zeroed as otherwise a poorly behaved implementation could read uninitialized memory in safe code.

So you mean that when you call `Read::read(.., &mut buf)`, the bytes in `buf` must be properly initialized, right? If so, then that makes sense. I guess you want to pass as `buf` the `tail_head_slice`, which per the `SliceDeque` API is uninitialized memory, and therefore you would need to zero it, right?
> However, this up-front cost can be avoided when allocating at page-level granularity as they're often specified to read out as zeroed until they're written to. This is specified both by CreateFileMapping and memfd_create if I remember the documentation correctly.

I would need to check. I think that all the memory you get from the OS is zeroed, but one can change options in some OSes to alter that. The memory allocation APIs take some config parameters, so one typically has to OR in something like `MEM_ZEROED` or use `calloc` or something like that to get zeroed memory in a reliable way.
The only thing that bugs me is the API of `Read::read`. Why/when does `Read::read` ever need to read from the `&mut buf`? If the answer is "never", shouldn't it take a `(*mut u8, usize)` pair instead? The problem, IMO, is with passing it `&mut [u8]`, which requires the contents to be initialized to something (zero or something else, it doesn't really matter).

An alternative is to use a different argument type that denotes the semantics over the memory, in this case, for example, that the memory will only be written to, but never read. Some people might construct this from `&mut [T]`, but others might construct it from a `(*mut T, usize)` to uninitialized memory.

Thinking about this more, `tail_head_slice` has the wrong signature, and it should probably return a `(*mut T, usize)` instead of a slice.
Yes, we're writing to the tail of the buffer beyond the occupied space, so we need to be sure that uninitialized memory can't be observed without `unsafe`.

I believe `Read::read()` takes a slice because it's the most convenient; that way you don't need unsafe code to implement it. If you only ever write to the buffer inside the implementation, then the zeroing should be optimized out.
@abonander sorry, I've expanded my comment above with more thoughts/questions.
As for the page-zeroing behavior, I'm pretty confident that both Windows and Linux specify that the pages you get from the functions you're calling are zeroed by default. I'm not sure for OS X because the documentation for `mach_vm_allocate()` is blank.

The signature of `tail_head_slice` is fine, as it's unsafe to call and the only memory-safety issue with an uninitialized byte slice is reading what might have been written before.

For `Drop` types it's a different issue, since overwriting an index of a mutable slice causes the previous value to be dropped, properly initialized or not. But that's something for the caller to worry about anyway.
So I can't point you to any official docs, but `mach_vm_allocate` should return zero-filled memory as well by default: http://yoursubtitle.blogspot.de/2009/11/section-86-mach-vm-user-space-interface.html

If I were you I'd add a test to your CI that checks this, and add a travis-ci macOS buildbot that runs the test there.
> For Drop types it's a different issue since overwriting an index of a mutable slice causes the previous value to be dropped, properly initialized or not. But that's something for the caller to worry about anyway.

Yes, so I suppose it only makes sense to call `Read::read` on `u8` buffers, for which none of this matters. But if you are allowed to call it on `Drop` objects, you want the slice of memory you pass to contain live objects, because `Read::read` will drop them on write. If you just pass it uninitialized or even just zeroed memory, you might be calling `Drop` on objects that aren't in a valid state, right?
`Read` isn't polymorphic, it purely deals with `&mut [u8]`, so yeah, it's just about observing uninitialized memory. If it goes straight to a write or a writing syscall, then the optimizer will just drop the zeroing operation as redundant.

If you are concerned about `tail_head_slice`, you could restrict it to `T: Copy` or add a note saying to only use `ptr::write()` when overwriting values, but `unsafe` is usually a pretty good red flag on its own.
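An illustration of why `ptr::write()` matters here (a sketch; the function is invented for the example): a normal assignment into a slot first runs the destructor on whatever bits happen to be there, which is exactly what must not happen for uninitialized `Drop` values.

```rust
use std::ptr;

// Fill possibly-uninitialized slots without reading or dropping the old bits.
// `dst[i] = value;` would instead drop the garbage "previous value" first.
unsafe fn fill_uninit<T>(dst: *mut T, values: Vec<T>) {
    for (i, v) in values.into_iter().enumerate() {
        ptr::write(dst.offset(i as isize), v);
    }
}
```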
> Read isn't polymorphic, it purely deals with &mut [u8], so yeah it's just about observing uninitialized memory.

I think that then it doesn't really matter. All `u8` bit patterns are valid, and `SliceDeque` gets memory from opaque calls to C. The memory we obtain from C is initialized in some way, so reading it is not undefined behavior.

> If you are concerned about tail_head_slice you could restrict it to T: Copy or add a note saying to only use ptr::write() when overwriting values, but unsafe is usually a pretty good red flag on its own.

Yeah, I was concerned about this. I might just add a comment that the memory there is uninitialized in the Rust sense, and that if one wants to write to it, one should make sure that the objects aren't double-dropped and such.
The main concern with "uninitialized" memory is when it may have been cached by a userspace allocator after a free, so you could have confidential information in there that could leak if you fail to zero the buffer and then mistakenly read from it. It's one of those things you never think about until it gets a CVE and then everyone's asking why it happened in the first place.
> The main concern with "uninitialized" memory is when it may have been cached by a userspace allocator after a free,

Aha! This makes much more sense.

This is the thing: unless one uses `tail_head_slice`, which is `unsafe`, and then misuses it to read from the slice (which the current API says contains uninitialized memory, so reading from it is "undefined behavior"), none of this can happen.

What the OS does here is actually irrelevant, because `SliceDeque` does not guarantee that the memory pages backing the `SliceDeque` will come from the OS at all. As mentioned above, I have a PR that improves allocation performance significantly by caching and reusing freed pages in thread-local heaps. I don't zero these pages before giving them to a `SliceDeque`, and that's OK as long as one only uses the safe API. Making everybody pay to make an `unsafe` API a bit safer feels wrong to me.
Also, whether I do this in the allocator or you do it in your own code doesn't really matter much here. I am not going to be able to do it any more efficiently than you can, I think (maybe I could avoid doing it when you get pages from the OS and I know the OS returns zeroed memory).
IMO if `Read::read` isn't supposed to read from `buf`, then it should take a type that doesn't allow it to do so. Also, the `std` library does not guarantee this for `Vec` either. So if you really want to be sure that the memory passed to `Read::read` is zeroed, I don't see a portable way of doing this that does not involve zeroing it yourself.
I'm not asking to make zero-filling the default behavior. I'm asking if you wouldn't mind providing a separate constructor that could guarantee this by going directly to the OS every time.
On Windows, `MapViewOfFileEx()` states that "the initial contents of the pages in a file mapping object backed by the page file [`CreateFileMapping(INVALID_HANDLE_VALUE, ...)`] are 0 (zero)." On Linux, `ftruncate()` specifies that when it is used to extend a file, the new space will read as zeroes. `vm_allocate()` (not `mach_vm_allocate()`, but I've read that their semantics are mostly the same) specifies that new allocations are zero-filled as well.
This would only be necessary when the buffer implementation can't determine that a reader can be trusted not to read buffers passed to it (on stable/beta, or otherwise when the `nightly` feature is turned off; I had previously implemented my own trait, but it required specialization to get the same semantics anyway). I don't expect most users to be allocating and dropping `BufReader` in tight loops, so going to the OS every time isn't as big of an issue; if it is, I can make it easy for them to reuse the same buffer.
> I'm asking if you wouldn't mind providing a separate constructor

One problem is that, beyond the constructor, there are many APIs that allocate memory (e.g. `push_front`/`push_back`, `resize`, `insert`, ...).
> by going directly to the OS every time.

Zeroing the memory manually is way faster than going through the OS if you already have the memory pages.
The alternatives that I can think of are:

- providing new APIs for everything that can reallocate (`new`, `push_..`, `resize`, `reserve`, `insert`, ...) to make sure that the memory is zeroed (e.g. `new_zeroed`).
- providing a global way of customizing the allocation behavior. The PR with thread-local heaps has some global knobs already, so I could add a new `zeroed_memory()` method that, when called, forces all memory allocations to be zeroed.
- guaranteeing that `tail_head_slice` always returns zeroed memory, by making it part of the contract.

I don't know exactly whether I can make the last one work, but I think that would be the better option if it can be done. On `no_std` there are no thread-local heaps, so one always calls the OS directly, and as long as I use those APIs this is already guaranteed. On `std` with thread-local heaps, I just need to track which memory comes from the OS and which doesn't. I might be able to store this inside the allocator somewhere.
One thing that worries me is that if you call `move_tail` and then `tail_head_slice`, some elements might not be zeroed. So maybe I should only offer a weak guarantee.
If you're sure that userspace zeroing is always going to be faster than allocating a new page, then I'll keep the existing zeroing logic in `buf_redux`, which covers both the `SliceDeque` and `Vec` buffer implementations.
@abonander It's only faster if you don't need to get pages from the OS. If you need to get pages from the OS, and the pages are already zeroed, then it's going to be slower.
If you already have non-zero pages, then zeroing them manually is cheaper than freeing them and obtaining already zeroed ones from the OS.
> I'll keep the existing zeroing logic in buf_redux

Keep that for now. I am going to try to see how hard it is to guarantee that `tail_head_slice` is always zeroed. If I can guarantee that easily, even for the cases in which pages are cached, then I'll change its docs to say that "as long as `unsafe` APIs have been used correctly, the memory pointed at by the slice is guaranteed to be zeroed", or something like that.

That would allow you to remove the zeroing logic for `SliceDeque` at least. How much faster things become depends on how much memory the zeroing currently touches.
`Buffer` zeroes the entire uninitialized tail when `.read_from()` is called. It saves the initialized capacity and only re-zeroes if the buffer is reallocated. If it fails to reallocate in place and the buffer is empty, it will free the existing allocation and allocate a new one, to avoid the system moving already-consumed bytes to the new allocation; in this case the buffer may have to be zeroed all over again, but that's faster than a pointless copy.
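A simplified sketch of that bookkeeping (not the actual buf_redux code, and the names are invented): remember how much of the allocation has ever been zeroed so repeated `read_from()` calls don't pay for the memset again, and discount it when the allocation changes.

```rust
// Tracks how many bytes from the start of the allocation are known zeroed.
struct ZeroTracker {
    zeroed: usize,
}

impl ZeroTracker {
    /// Zero only the not-yet-zeroed part of `buf[..needed_end]`.
    fn ensure_zeroed(&mut self, buf: &mut [u8], needed_end: usize) {
        if needed_end > self.zeroed {
            for b in &mut buf[self.zeroed..needed_end] {
                *b = 0;
            }
            self.zeroed = needed_end;
        }
    }

    /// After a reallocation, only the bytes actually copied over are still
    /// known to be initialized; the new tail is unknown again.
    fn reallocated(&mut self, copied_len: usize) {
        self.zeroed = self.zeroed.min(copied_len);
    }
}
```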
@gnzlbg The final-ish API is on the `0.7` branch. The `slice-deque` feature adds the `new_ringbuf()` and `with_capacity_ringbuf()` constructors to both `BufReader` and `Buffer`.
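A usage sketch of those constructors, assuming they mirror `new()`/`with_capacity()` apart from the backing storage being the `SliceDeque`-based ringbuffer (the argument order and default policy parameter are assumptions here, not confirmed by the branch).

```rust
use std::fs::File;
use std::io::Result;

use buf_redux::BufReader;

// Requires the `slice-deque` cargo feature of buf_redux.
fn open_ringbuf_reader(path: &str) -> Result<BufReader<File>> {
    let file = File::open(path)?;
    Ok(BufReader::with_capacity_ringbuf(64 * 1024, file))
}
```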
I originally decided not to add them to `BufWriter` because the default flush behavior would not leave the buffer partially filled, but `WriterPolicy` now supports partial flushes (required to implement `LineWriter` so that it matches the behavior of its `std` counterpart), so a ringbuffer may come in handy; I'll be adding them to both `BufWriter` and `LineWriter`.
I'm not a data structure expert, so it may not be the right term, but I figured calling `SliceDeque` a ringbuffer would concisely explain what it's doing internally. I also elected to use an `enum` internally instead of making it a type parameter or separate type, mostly for ease of use. With the `slice-deque` feature off, the enum only has one variant, so the matches on it should be optimized out.

Actually, since the variant is constant across the lifetime of the buffer/reader/writer, the matches should be optimized out anyway. I could probably throw some `#[inline]`s in to help ensure this.
> I'm not a datastructure expert so it may not be the right term, but I figured calling SliceDeque a ringbuffer would concisely explain what it's doing internally.

The technical term is a double-ended queue - a queue where you can push/pop elements at both ends. This is also the name `std::collections` uses (see `VecDeque`). Ring buffer is also a technical term for this, and much of the literature uses it; I just decided to stick with what Rust's `std` does here.
> The final-ish API is on the 0.7 branch.

Cool! I have some time tomorrow, so I will take a look at it :)
Deques and ringbuffers are very similar concepts, though "deque" is used more in academia. If I had to relate the two, I would consider a ringbuffer to be a deque that is specialized for byte streams and bulk access, which describes my use case quite well. In fact, the Wikipedia article for ringbuffers mentions the exact virtual-memory trick you're using. So I think it would be accurate to say that I'm using `SliceDeque` in the guise of a ringbuffer.