abonander/buf_redux

Avoid moving bytes to the beginning of the buffer

Closed this issue · 40 comments

Would it make sense to use something like the slice_deque crate to avoid having to move bytes to the beginning of the buffer?

If that cannot be used due to a problem with the library, I'll be happy to try and fix it. I was looking for crates that could potentially benefit from SliceDeque since I am just trying to improve it for fun.

My use-case for this crate required the data to remain contiguous where possible. I'd be willing to field a separate DequeBuffer and DequeReader, maybe, to test the pattern. I guess BufReader could even be generic over the buffer implementation but it's already got three type parameters and that makes for a rather unwieldy API.

My use-case for this crate required the data to remain contiguous where possible.

SliceDeque<T>'s data is always contiguous so if your code is already taking &[T] everywhere then you can just pass that code a SliceDeque<T> without issues. If your code takes (&[T], &[T]) to handle discontiguous data, then SliceDeque provides the same methods as VecDeque to produce these for you, but the second slice will always be empty.

Ah right, your crate is the one that does the virtual memory tricks to make a contiguous slice. The name doesn't make that very clear by itself (it kind of sounds like a deque built on a boxed slice instead of a vector as with VecDeque).

It could probably be used internally on targets that support it, yeah, then Buffer.make_room() would just be a no-op. It'd either be a target-feature or an optional feature though, I don't want to limit buf_redux just to the platforms slice_deque supports.

The name doesn't make that very clear by itself (it kind of sounds like a deque built on a boxed slice instead of a vector as with VecDeque).

Naming is not one of my strengths; I was hoping that the short description would clarify that but oh well :D

It could probably be used internally on targets that support it,

You can check the supported targets in the readme; I don't recall if it's up-to-date, but the travis-ci matrix is: https://travis-ci.org/gnzlbg/slice_deque/builds/351930217

It'd either be a target-feature or an optional feature though, I don't want to limit buf_redux just to the platforms slice_deque supports.

Makes sense. I think right now it supports all targets that rustc supports except the ARM thumb targets, the NVIDIA nvptx one, wasm, and emscripten. Of the targets that are coming, I don't think it will support AVR once that lands.

I don't know if cargo lets you enable default features on a per-target basis, but if so you might want to explore that.

I'll admit I didn't initially click the crate link.

I think we can figure out a way to integrate it transparently, probably via target-features. However, what's the minimum allocation size? The default for both std's and this crate's BufReader is 8 KiB, a couple of pages at the typical 4 KiB page size of the big OSes. How much does slice_deque typically overcommit to support what it does?

So I can either build this on top of slice_deque::Buffer without any API changes or I need the following added to SliceDeque:

  • An unsafe method to get the slice of uninitialized data between the tail and the head. This would be passed to a Read impl to fill the buffer, or copied into from a passed-in slice.

  • An unsafe method to advance the tail pointer by a given number of elements (e.g. the number of bytes returned from Read::read() or the length of the slice that was copied from); it can and probably should panic if the tail pointer would pass the head pointer.

  • A safe method like truncate() but that drops elements from the front of the buffer, as when dropping written-out data after a Write::write() call or to implement BufRead::consume(). This can either panic or be a no-op if the given count exceeds the deque's size.

With these additions buf_redux::Buffer's API can be pared down to a thin wrapper, at least on supported targets; I'll have to keep the existing guts for everything else.
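For illustration, a minimal sketch of how the third requested method could back BufRead::consume(); truncate_front is the requested name, and its semantics are assumed here to mirror truncate (keep the last `len` elements instead of the first `len`):

```rust
use slice_deque::SliceDeque;

// Sketch only: `truncate_front` is the method requested above, assumed to
// mirror `truncate` by keeping the last `len` elements instead of the first.
fn consume(deque: &mut SliceDeque<u8>, amt: usize) {
    // Drop `amt` already-read bytes from the front of the buffer, as after a
    // Write::write() call or in BufRead::consume().
    let keep = deque.len().saturating_sub(amt);
    deque.truncate_front(keep);
}
```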

So I can either build this on top of slice_deque::Buffer without any API changes

I would prefer not to go this way - I consider Buffer an implementation detail of SliceDeque, and while I have thought about moving it into its own crate, things are not there yet, and I don't know how useful that would be anyways.

An unsafe method to get the slice of uninitialized data between the tail and the head. This would be passed to a Read impl to fill the buffer, or copied into from a passed-in slice.

This can be done.

An unsafe method to advance the tail pointer by a given number of elements (e.g. the number of bytes returned from Read::read() or the length of the slice that was copied from); it can and probably should panic if the tail pointer would pass the head pointer.

These methods already exist and have the semantics that you want. They are called move_{head,tail} and you can advance the head and the tail in both directions by using a signed integer, e.g., move_head(-1). They are just not public yet, but I could make them public.

A safe method like truncate() but that drops elements from the front of the buffer, as when dropping written-out data after a Write::write() call or to implement BufRead::consume(). This can either panic or be a no-op if the given count exceeds the deque's size.

I actually thought about this while implementing truncate: that is, why it is not called truncate_back, and why there isn't a truncate_front method. I will add a truncate_front method since this is something that makes sense anyway.

Yeah if you're all right making those changes I'd be glad to give SliceDeque a try in buf_redux.

I may still expose it as a separate buffer/BufReader/BufWriter implementation, but I do want to give it a try in multipart.

So I've a pull-request (gnzlbg/slice_deque#36) that:

  • makes move_{head,tail} public; these only panic if -C debug-assertions=1, that is, in debug builds. Maybe I should call these move_{head,tail}_unchecked and make move_{head,tail} checked by default?
  • adds truncate_{back,front} methods (truncate_back does the same as truncate).
  • adds an uninitialized_mut_slice method that returns a slice to the uninitialized elements between the tail and the head (capacity() - size() elements)
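A rough usage sketch of the methods listed above (names and signatures follow the PR description; the final slice_deque API may differ), showing how a buffer could fill itself from a Read impl:

```rust
use std::io::{self, Read};
use slice_deque::SliceDeque;

// Sketch only: method names and signatures follow the PR description above
// and may differ from the final slice_deque API.
fn read_into<R: Read>(deque: &mut SliceDeque<u8>, reader: &mut R) -> io::Result<usize> {
    unsafe {
        // Slice over the uninitialized elements between the tail and the head
        // (capacity() - size() bytes).
        let uninit = deque.uninitialized_mut_slice();
        let n = reader.read(uninit)?;
        // Advance the tail so those `n` bytes become part of the deque.
        deque.move_tail(n as isize);
        Ok(n)
    }
}
```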

@gnzlbg Awesome. I added a couple comments.

@gnzlbg I've started prototyping on https://github.com/abonander/buf_redux/tree/0.7 but I'm also working on a general API overhaul that this crate has sorely needed.

@gnzlbg There is one additional API I need, though it's mostly an optimization. I'd like to see either impl From<SliceDeque<T>> for Vec<T> or an inherent method fn into_vec(self) -> Vec<T>. This is to implement Buffer::into_inner() without reallocating.

However, I realize that's not safe because SliceDeque deals with allocations directly from the OS while Vec could be using another allocator like jemalloc. I'll work something out on my own.

I'd like to see either impl From<SliceDeque> for Vec or an inherent method fn into_vec(self) -> Vec

That's not really possible. Buffer::into_inner should return whatever storage the buffer internally uses. If you use a Vec<T, CustomAlloc> you can't return Vec<T> without copying the data; you can at best return either Vec<T, CustomAlloc> or Box<&mut [T], CustomAlloc>. The same applies if you use a SliceDeque<T> internally. The only way to map collections from one allocator to another is to copy the data. So if you really need Buffer::into_inner() -> Vec<T> with that signature and without performing copies, you need to stick to Vec<T> or to anything that uses Vec<T> under the hood.
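For illustration, the copying fallback is straightforward since SliceDeque's data is contiguous (a sketch, not the crate's API):

```rust
use slice_deque::SliceDeque;

// Sketch of the copying fallback: SliceDeque's backing pages come from its
// own OS-level mapping, so they can't simply be handed to Vec's allocator;
// converting requires copying the contiguous contents out.
fn into_vec(deque: SliceDeque<u8>) -> Vec<u8> {
    // SliceDeque derefs to a contiguous &[u8], so this is a single copy.
    deque.to_vec()
}
```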

I've got a working-ish prototype now; however, I have to rewrite all my tests because they were copied from libstd/io/buffered.rs and they assume exact capacity allocation with small sizes, so none of them pass for the slice-deque based buffer.

@gnzlbg as an optimization option, can slice-deque allocate pre-zeroed pages or does it already do that? If you're calling the OS APIs directly then it should be as simple as setting a bit flag, but it should probably be an optional constructor since most users aren't going to be looking at uninitialized parts of the buffer. I'm actually debating whether I would want to use it, as I'd have to delay allocating the buffer until the user attempts to read into it with a Read impl that needs its buffer zeroed (i.e. Read::initializer().should_initialize() returns true, or the user has the nightly feature disabled).

I also found that, on Windows 10 64-bit, the minimum allocation appears to be 64KiB which is significantly bigger than std::io::BufReader's default. I'm not sure how that would ultimately affect performance as my use-case for BufReader handles inputs bigger than that but it seems like it could be a pessimization for many cases.

After doing some research, I see that your chosen approaches for Windows, Linux and System-V should produce memory mappings that are zero-filled by default. I can't verify this for Mac because the mach_vm_allocate() function doesn't appear to have non-paywalled documentation.

I am concerned about the raciness of the Windows implementation but I couldn't find a better solution in my brief searches. You can get information on the free pages but that's going to be racy as well, and stopping all other threads for a single allocation call is unacceptable. However, I'm wondering about debug-related APIs that let you see into existing virtual memory allocations; I know Windows has a bunch of stuff like that, maybe something there could be useful.

can slice-deque allocate pre-zeroed pages or does it already do that?

Even if it currently does this, it is not guaranteed by the API, so one shouldn't rely on this. Why do you need this?

I also found that, on Windows 10 64-bit, the minimum allocation appears to be 64KiB

Yes, as I mentioned, the size is the platform-dependent allocation granularity, which is the page size on most systems. Windows isn't one of them, as documented in the readme. Also, on any other system, if huge pages are enabled, the minimum allocation might be greater than the usual 4 KiB (haven't tested this, but it follows from the page size being OS dependent).

I am concerned about the raciness of the Windows implementation

On Windows it isn't possible to write a non-racy slice_deque AFAICT (Windows supports POSIX so maybe there is a way to write a non-racy slice_deque using POSIX exclusively, but I haven't found one yet).

can slice-deque allocate pre-zeroed pages or does it already do that

There is a PR to slice_deque that implements thread-local page caches intended to make many short-lived slice deques perform well. But on deallocation a SliceDeque does not necessarily zero any memory. If the deque contains elements that need dropping, it will drop them, but deallocating a SliceDeque does not touch any uninitialized memory, so if one did not use all of the capacity, the memory that the OS returned will be left as is.

When you pass a buffer to Read::read() it needs to be zeroed as otherwise a poorly behaved implementation could read uninitialized memory in safe code. The unstable Read::initializer() method can be overridden by an implementation to declare that it's trustworthy.

std::io::BufReader zeroes its buffer on construction while Buffer in this crate only zeroes in .read_from() when the Read impl can't be determined to be trustworthy.
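As a minimal sketch (not the crate's actual implementation) of the defensive zeroing described here:

```rust
use std::io::{self, Read};

// Minimal sketch (not buf_redux's actual code): zero the unoccupied tail
// before handing it to an untrusted `Read` impl, so a misbehaving
// implementation can't observe stale memory from safe code.
fn read_into_tail<R: Read>(reader: &mut R, buf: &mut [u8], filled: usize) -> io::Result<usize> {
    let tail = &mut buf[filled..];
    for b in tail.iter_mut() {
        *b = 0; // defensive zeroing for untrusted readers
    }
    reader.read(tail)
}
```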

However, this up-front cost can be avoided when allocating at page-level granularity as they're often specified to read out as zeroed until they're written to. This is specified both by CreateViewOfFile and memfd_create if I remember the documentation correctly.

When you pass a buffer to Read::read() it needs to be zeroed as otherwise a poorly behaved implementation could read uninitialized memory in safe code.

So you mean that when you call Read::read(.., &mut buf) the bytes in buf must be properly initialized, right? If so then that makes sense. I guess you want to pass the tail_head_slice as buf, which per the SliceDeque API is uninitialized memory, and therefore you would need to zero it, right?

However, this up-front cost can be avoided when allocating at page-level granularity as they're often specified to read out as zeroed until they're written to. This is specified both by CreateViewOfFile and memfd_create if I remember the documentation correctly.

I would need to check. I think that all the memory you get from the OS is zeroed, but one can change options in some OSes to alter that. The memory allocation APIs take some configuration parameters, so one typically has to OR in MEM_ZEROED or use calloc or something like this to get zeroed memory in a reliable way.

The only thing that bugs me is the API of Read::read. Why/when does Read::read ever need to read from the &mut buf? If the answer is "never", shouldn't it take a (*mut ptr, usize) tuple instead? The problem is IMO with passing it &mut [u8], which requires the content to be initialized to something (zero or something else, it doesn't really matter).

An alternative is to use a different type as an argument that denotes the semantics over memory, in this case, for example, that the memory will only be written to, but never read. Some people might construct this from &mut [T], but others might construct it from a (*mut, usize) to uninitialized memory.

Thinking about this more, tail_head_slice has the wrong signature, and it should probably return a (*mut, usize) instead of a slice.

Yes, we're writing to the tail of the buffer beyond the occupied space, so we need to be sure that uninitialized memory can't be observed without unsafe.

I believe Read::read() takes a slice because it's the most convenient; that way you don't need unsafe code to implement it. If you only ever write to the buffer inside the implementation then the zeroing should be optimized out.
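For example, a well-behaved implementation only ever writes into the caller's buffer (illustrative only, not from the crate):

```rust
// Illustrative only: a `Read` impl that never reads from `buf`. For impls
// like this, the caller's up-front zeroing is redundant and can often be
// optimized away.
struct Zeroes;

impl std::io::Read for Zeroes {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        for b in buf.iter_mut() {
            *b = 0; // write-only access to the caller's buffer
        }
        Ok(buf.len())
    }
}
```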

@abonander sorry, I've expanded my comment above with more thoughts/questions.

As for the page zeroing behavior I'm pretty confident that both Windows and Linux specify that the pages you get from the functions you're calling are zeroed by default. I'm not sure for OS X because the documentation for mach_vm_allocate() is blank.

The signature of tail_head_slice is fine as it's unsafe to call and the only memory safety issue with an uninitialized byte slice is reading what might have been written before.

For Drop types it's a different issue since overwriting an index of a mutable slice causes the previous value to be dropped, properly initialized or not. But that's something for the caller to worry about anyway.

So I can't point you to any official docs, but mach_vm_allocate should return zero-filled memory as well by default: http://yoursubtitle.blogspot.de/2009/11/section-86-mach-vm-user-space-interface.html

If I were you I'd add a test to your ci that checks this, and add a travis-ci macos buildbot that runs the test there.

For Drop types it's a different issue since overwriting an index of a mutable slice causes the previous value to be dropped, properly initialized or not. But that's something for the caller to worry about anyway.

Yes, so I suppose it only makes sense to call Read::read on u8 buffers, for which none of this matters. But if you are allowed to call it on Drop objects, you want the slice of memory you pass to contain live objects, because Read::read will drop them on write. If you just pass it uninitialized or even just zeroed memory, you might be calling Drop on objects that aren't in a valid state, right?

Read isn't polymorphic, it purely deals with &mut [u8], so yeah it's just about observing uninitialized memory. If it goes straight to a write or a writing syscall then the optimizer will just drop the zeroing operation as redundant.

If you are concerned about tail_head_slice you could restrict it to T: Copy or add a note saying to only use ptr::write() when overwriting values, but unsafe is usually a pretty good red flag on its own.
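For illustration, the ptr::write() pattern mentioned here (a hypothetical helper, not part of either crate):

```rust
use std::ptr;

// Hypothetical helper illustrating the `ptr::write()` note above: writing a
// non-Copy value into an uninitialized slot via `ptr::write` avoids dropping
// the garbage "previous value" that a normal assignment would drop.
unsafe fn init_slot<T>(slot: *mut T, value: T) {
    ptr::write(slot, value);
}
```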

Read isn't polymorphic, it purely deals with &mut [u8], so yeah it's just about observing uninitialized memory.

I think that then it doesn't really matter. All u8 bit patterns are valid, and SliceDeque gets memory from opaque calls to C. The memory we obtain from C is initialized in some way, so reading it is not undefined behavior.

If you are concerned about tail_head_slice you could restrict it to T: Copy or add a note saying to only use ptr::write() when overwriting values, but unsafe is usually a pretty good red flag on its own.

Yeah I was concerned about this. I might just add a comment that the memory there is uninitialized in the Rust sense, and that if one wants to write to it, one should make sure that the objects aren't double-dropped and such.

The main concern with "uninitialized" memory is when it may have been cached by a userspace allocator after a free, so you could have confidential information in there that could leak if you fail to zero the buffer and then mistakenly read from it. It's one of those things you never think about until it gets a CVE and then everyone's asking why it happened in the first place.

The main concern with "uninitialized" memory is when it may have been cached by a userspace allocator after a free,

Aha! This makes much more sense.

This is the thing: none of this can happen unless one uses tail_head_slice, which is unsafe, and then misuses it to read from the slice; the current API says that the slice contains uninitialized memory, so reading from it is "undefined behavior".

What the OS does here is actually irrelevant, because SliceDeque does not guarantee that the memory pages that will back the SliceDeque will come from the OS at all. As mentioned above, I have a PR that improves allocation performance significantly by caching and reusing freed pages in thread-local heaps.

I don't zero these pages before giving them to a SliceDeque, and that's OK as long as one only uses the safe API. Making everybody pay to make an unsafe API a bit safer feels wrong to me.

Also, whether I do this in the allocator or you do this in your own code doesn't really matter much here. I am not going to be able to do it any more efficiently than you can, I think (maybe I could avoid doing it when you get pages from the OS and I know the OS returns zeroed memory).

IMO if Read::read isn't supposed to read from buf, then it should take a type that doesn't allow it to do so. Also, the std library does not guarantee this for Vec either. So if you really want to be sure that the memory passed to Read::read is zeroed, I don't see a portable way of doing this that does not involve zeroing it yourself.

I'm not asking to make zero-filling the default behavior. I'm asking if you wouldn't mind providing a separate constructor that could guarantee this by going directly to the OS every time.

On Windows, MapViewOfFileEx() states that "the initial contents of the pages in a file mapping object backed by the page file [CreateFileMapping(INVALID_HANDLE_VALUE, ...)] are 0 (zero)." On Linux, ftruncate() specifies that when it is used to extend a file, the new space will read as zeroes. vm_allocate() (not mach_vm_allocate(), but I've read that their semantics are mostly the same) specifies that new allocations are zero-filled as well.

This would only be necessary when the buffer implementation can't determine that a reader can be trusted not to read buffers passed to it (on stable/beta or otherwise when the nightly feature is turned off; I had previously implemented my own trait but it required specialization to get the same semantics anyway). I don't expect most users to be allocating and dropping BufReader in tight loops so going to the OS every time isn't as big of an issue; if it is, I can make it easy for them to reuse the same buffer.

I'm asking if you wouldn't mind providing a separate constructor

One problem is that beyond the constructor, there are many APIs that allocate memory (e.g. push_front/push_back, resize, insert, ...).

by going directly to the OS every time.

Zeroing the memory manually is way faster than going back to the OS, on all OSes, if you already have the memory pages.


The alternatives that I can think of are:

  • providing new APIs for everything that can reallocate (new, push_.., resize, reserve, insert, ...) to make sure that the memory is zeroed (e.g. new_zeroed).
  • providing a global way of customizing the allocation behavior. The PR with thread local heaps has some global knobs already, so I could add a new zeroed_memory() method that when called forces all memory allocations to be zeroed.
  • guaranteeing that tail_head_slice always returns zeroed memory, by making it part of the contract.

I don't know exactly if I can make the last one work, but I think that would be the better option if it can be done. On no_std there are no thread-local heaps, so one always calls the OS directly, and as long as I use those APIs, this is already guaranteed. On std with thread local heaps, I just need to track which memory comes from the OS and which doesn't. I might be able to store this inside the allocator somewhere.

One thing that worries me is that if you call move_tail and then tail_head_slice some elements might not be zeroed. So maybe I should only offer a weak guarantee.

If you're sure that userspace zeroing is always going to be faster than allocating a new page then I'll keep the existing zeroing logic in buf_redux which covers both SliceDeque and Vec buffer implementations.

@abonander it's only faster if you don't need to get pages from the OS. If you need to get pages from the OS, and the pages are already zeroed, then it's going to be slower.

If you already have non-zero pages, then zeroing them manually is cheaper than freeing them and obtaining already zeroed ones from the OS.

I'll keep the existing zeroing logic in buf_redux

Keep that for now. I am going to try to see how hard it is to guarantee that tail_head_slice is always zeroed. If I can guarantee that easily even for the cases in which pages are cached, then I'll change its docs to say that "as long as unsafe APIs have been used correctly, the memory pointed at by the slice is guaranteed to be zeroed" or something like that.

That would allow you to remove the zeroing logic for SliceDeque at least. How much faster things will become depends on how much memory the zeroing currently touches.

Buffer zeroes the entire uninitialized tail when .read_from() is called. It saves the initialized capacity and only re-zeroes if the buffer is reallocated. If it fails to reallocate in place and the buffer is empty it will free the existing allocation and allocate a new one to avoid the system moving already consumed bytes to the new allocation; in this case the buffer may have to be zeroed all over again but that's faster than a pointless copy.
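A simplified sketch of that bookkeeping (not the actual Buffer internals; field names here are illustrative):

```rust
// Simplified sketch (not buf_redux's actual Buffer): remember how much of the
// allocation has already been zeroed, so only newly exposed space is zeroed,
// and reset the bookkeeping when the allocation is replaced.
struct ZeroTracking {
    buf: Box<[u8]>, // backing allocation
    zeroed: usize,  // bytes from the start known to be zero-filled
}

impl ZeroTracking {
    fn zero_tail(&mut self) {
        if self.zeroed < self.buf.len() {
            for b in &mut self.buf[self.zeroed..] {
                *b = 0;
            }
            self.zeroed = self.buf.len();
        }
    }

    fn reallocated(&mut self, new_buf: Box<[u8]>) {
        // A fresh allocation invalidates the zeroed bookkeeping.
        self.buf = new_buf;
        self.zeroed = 0;
    }
}
```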

@gnzlbg The final-ish API is on the 0.7 branch. The slice-deque feature adds the new_ringbuf() and with_capacity_ringbuf() constructors to both BufReader and Buffer.
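A hedged usage sketch of those constructors, assuming the 0.7-branch API as described (the input file is just a placeholder):

```rust
use std::fs::File;
use buf_redux::BufReader;

// Usage sketch assuming the 0.7-branch API described above; `input.bin` is a
// placeholder path.
fn main() -> std::io::Result<()> {
    let file = File::open("input.bin")?;
    // With the `slice-deque` feature enabled, the buffer is backed by the
    // mirrored-memory ring buffer, so data never has to be moved back to the
    // start of the allocation.
    let reader = BufReader::new_ringbuf(file);
    let _ = reader; // ... use like std::io::BufReader ...
    Ok(())
}
```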

I originally decided not to add them to BufWriter because the default flush behavior would not leave the buffer partially filled, but WriterPolicy now supports partial flushes (required to implement LineWriter so that it matches the behavior of its std counterpart), so a ringbuffer may come in handy; I'll be adding them to both BufWriter and LineWriter.

I'm not a data structure expert so it may not be the right term, but I figured calling SliceDeque a ringbuffer would concisely explain what it's doing internally. I also elected to use an enum internally instead of making it a type parameter or separate type, mostly for ease of use. With the slice-deque feature off, the enum only has one variant and so the matches on it should be optimized out.

Actually since the variant is constant across the lifetime of the buffer/reader/writer the matches should be optimized out anyway. I could probably throw some #[inline]s in to help ensure this.

I'm not a data structure expert so it may not be the right term, but I figured calling SliceDeque a ringbuffer would concisely explain what it's doing internally.

The technical term is a double-ended queue - a queue where you can push/pull elements from both ends. This is also the name std::collections uses (see VecDeque). Ring buffer is also a technical term for this, and much of the literature uses it; I just decided to stick with what Rust std does here.

The final-ish API is on the 0.7 branch.

Cool! I have some time tomorrow so I will take a look at it :)

Deques and ringbuffers are very similar concepts, though deque is used more in academia. If I had to relate the two I would consider a ringbuffer to be a deque that is specialized for byte streams and bulk access, which describes my use case quite well. In fact, the Wikipedia article for ringbuffers mentions the exact virtual memory trick you're using. So I think it would be accurate to say that I'm using SliceDeque in the guise of a ringbuffer.