Switch to an arena allocator
ngoldbaum opened this issue · 5 comments
Arrow's string arrays use arrow's variable-sized binary layout, which stores the string data in a single contiguous buffer per array along with a strided array of integer offsets. This is a really nice design because you only need one heap allocation per array, and the string data are contiguous in memory, so it's much easier to write fast ufuncs and exploit SIMD. The downside is that string arrays need to either be immutable, or there can be pathological cases where enlarging a single array element leads to reallocating the storage for an entire array. I don't think it would be a big downside for many real-world use cases for string arrays to force people to copy into a new array if they need to mutate elements, but it would be a wart in NumPy's API where one data type has restrictions that others don't.
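A toy pure-Python model of that layout, just to make the tradeoff concrete (the function names here are made up for illustration; this is not Arrow's actual API):

```python
# Toy model of Arrow's variable-sized binary layout: one contiguous data
# buffer plus an offsets array with len(strings) + 1 entries.
# Element i lives at data[offsets[i]:offsets[i + 1]].

def build_string_array(strings):
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets

def get_item(data, offsets, i):
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

data, offsets = build_string_array(["numpy", "arrow", "strings"])
print(get_item(data, offsets, 1))  # arrow
```

Note that enlarging element 1 in place would shift every byte after it, which is exactly the pathological reallocation described above.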
Numpy doesn't have a concept of per-array heap storage outside the array buffer, however even if we added facilities for that, the dtype API doesn't allow access to the array object, so we would also need to add a pointer to the per-array storage to the ufunc loops in the dtype API as well. I think a more straightforward approach would be to let the per-array storage buffer live on the descriptor instance and ensure that each array has a unique descriptor object.
A first step towards figuring out how hard it would be to have unique descriptors per array would be to add a flag to the dtype that indicates the dtype is associated with an array. When we set this flag at array creation time, if the flag is already set, we raise an error. I think running the stringdtype tests with this flag turned on would probably catch most of the places in numpy or the dtype implementation that would need to be updated. Once we're confident a per-array descriptor is possible, we get per-array buffer storage for free, and I think it would be relatively straightforward to switch to an arrow-like in-memory representation.
I don't expect it's super complicated. All array creation goes through a single function, so we only need a hook that gets called whenever an array is not a view into existing data ("owns the data"). Some weird call paths might not work, but those would be wonky extension code, and small API additions can fix that when it comes up.
(to some degree we already do things like that: changing "S0" to "S1", ugh, and moving subarray dimensions into the array.)
> All array creation goes through a single function
I looked at this today and for my needs it's a little more complicated than that. While all newly created arrays do go through PyArray_NewFromDescr_int, what I'd really need is a callback that happens after initialization is complete, so we'd need to add a call to a post-initialization API hook in many places.
What I ended up arriving at, which doesn't require any API additions in numpy, is that you can use discover_descriptor_from_pyobject followed by common_instance to figure out the total length of all strings in a newly created numpy string array.
First, discover_descriptor_from_pyobject would create a temporary descriptor that holds the size of a single scalar. Then, common_instance could calculate a running total of the total size of all strings seen so far in the array by adding the sizes of the scalars passed in. We'll know that array initialization has started because the dtype's setitem will be called with a descriptor that has a capacity but uninitialized storage, so we can initialize storage with the capacity and then fill in the storage buffer on subsequent setitem calls. Finally, we'll know we're on the last setitem call because we'd hit the capacity for the storage (and raise an error if setitems happen after that).
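The flow described above can be mocked up in plain Python. The function names mirror the dtype API hooks being discussed, but the classes and signatures here are invented purely for illustration:

```python
# Illustrative mock of the scheme: discovery sums per-scalar sizes into a
# total capacity, then setitem lazily initializes storage and fills it
# until the capacity is exhausted.

class StringDescr:
    def __init__(self, capacity):
        self.capacity = capacity   # total bytes needed for all strings
        self.buffer = None         # storage allocated on the first setitem
        self.used = 0

def discover_descriptor_from_pyobject(scalar):
    # temporary descriptor holding the size of a single scalar
    return StringDescr(len(scalar.encode("utf-8")))

def common_instance(d1, d2):
    # running total of all string sizes seen so far
    return StringDescr(d1.capacity + d2.capacity)

def setitem(descr, value):
    if descr.buffer is None:
        descr.buffer = bytearray(descr.capacity)   # initialize storage
    raw = value.encode("utf-8")
    if descr.used + len(raw) > descr.capacity:
        raise RuntimeError("setitem called past the declared capacity")
    descr.buffer[descr.used:descr.used + len(raw)] = raw
    descr.used += len(raw)
    return descr.used == descr.capacity   # True on the final setitem

values = ["foo", "quux"]
descr = StringDescr(0)
for v in values:                # "dtype discovery" pass
    descr = common_instance(descr, discover_descriptor_from_pyobject(v))
for v in values:                # array initialization pass
    done = setitem(descr, v)
print(bytes(descr.buffer), done)  # b'fooquux' True
```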
So this is doable for newly created arrays without making any changes to NumPy, nice!
Ufuncs and the ufunc-likes in np.char will be harder to deal with. Going through the different categories of functions in order from easiest to hardest:
- add: resolve_descriptors can add the capacities of the two input descriptors
- join: same as add, but with space for the separator
- comparisons, string boolean checks, str_len, count: output dtype is bool or int, so no problem
- unary operations (e.g. center, capitalize): can determine the size of the output storage a priori
- multiply: resolve_descriptors can't help, unfortunately, but we could initialize new storage inside the multiply ufunc loop
- maximum/minimum/partition/strip/pad/expandtabs/other string formatting: no way to do this in one pass over the data. We either need to compute the result element-wise, save the results in a temporary array, and then copy into the final contiguous storage, or do two passes: the first to find the size of the result, with the second pass actually storing into the contiguous storage buffer. Ugly!
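For the hard cases, the two-pass variant might look like this sketch, using strip as the example operation over the offsets-based layout (everything here is illustrative, not stringdtype's actual code):

```python
# Two passes over the input: the first computes only the per-element
# result sizes, so the contiguous output storage can be allocated exactly;
# the second recomputes each result and stores it. This avoids a temporary
# array at the cost of doing the string operation twice.

def strip_two_pass(data, offsets):
    n = len(offsets) - 1
    # pass 1: sizes only, to determine the exact total storage needed
    total = sum(len(data[offsets[i]:offsets[i + 1]].strip())
                for i in range(n))
    out = bytearray()
    out_offsets = [0]
    # pass 2: recompute each result and append it contiguously
    for i in range(n):
        out += data[offsets[i]:offsets[i + 1]].strip()
        out_offsets.append(len(out))
    assert len(out) == total
    return bytes(out), out_offsets

print(strip_two_pass(b"  a b  ", [0, 3, 7]))
```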
So now I guess I'm stuck on how to implement some of these operations without it being very awkward, ideally in one pass and with the minimum amount of memory required to store the string data in a single allocation.
Right, we also have the casting buffers in ufunc code, which do get cleared many times (although currently, when we clear, we know we clear all memory associated with a dtype!).
So we know where we can put the buffer. I wonder if we could actually realloc or use a chunked scheme to grow the owning buffer during the operation (when there is no way to figure out the length ahead of time). But even that may be hard in practice, since a scheme of "allocate empty array and fill it" becomes tricky.
We can in NumPy currently guarantee that a call on a fresh array can be done in a single strided manner, so if we have a finalize() hook that we call in places where it is needed it could actually decide to do a multi-pass call by itself.
You could even do this as a user method like arr.dtype.freeze(), although the dtype needs to know the owning array for that (not a problem) and then it would be very weird if you call it from a view.
I do wonder a bit how much we can achieve without this? Some operations in arrow may work directly on the contiguous buffer, and the offsets are only really an index when you need that index...
But besides that, another thing it gives you is cache locality, and I wonder if we can achieve that by using an arena/heap allocation scheme?
Currently, the code uses malloc, which really tends not to be great at many small allocations. I am not sure whether it gives terrible cache locality, but I would be surprised if it isn't a real performance hit. (A quick try would be to use PyMem_MALLOC/PyMem_FREE, though those require the GIL, or LD_PRELOAD=/path/to/libmimalloc.so if you have that.)
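To make the arena idea concrete, here is a minimal bump-pointer arena in pure Python (not tied to mimalloc, jemalloc, or any real allocator API):

```python
# Bump-pointer arena: all small string allocations come out of one
# preallocated block, so consecutive allocations are adjacent in memory,
# which is the cache-locality property discussed above. Individual frees
# are not supported; the whole arena is released (or compacted) at once.

class Arena:
    def __init__(self, size):
        self.block = bytearray(size)
        self.top = 0   # bump pointer: next free byte

    def alloc(self, payload):
        start = self.top
        end = start + len(payload)
        if end > len(self.block):
            raise MemoryError("arena exhausted")
        self.block[start:end] = payload
        self.top = end
        return start   # offset into the arena, standing in for a pointer

arena = Arena(64)
offs = [arena.alloc(s) for s in (b"cache", b"locality")]
print(offs, arena.top)  # [0, 5] 13
```

With a per-array arena like this, the string data ends up nearly contiguous even without the full frozen Arrow-style representation.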
Maybe @mattip knows if that is nonsense or not (or knows who to ask). mimalloc can only allocate in the creating thread for a heap, but I suspect jemalloc can do this.
Asking because someone (maybe graalpython) in the HPy discussion had a memory ownership scheme where container elements are tied to the container/owner, which sounds like that.
(A question here may be whether a per-array/dtype arena is even useful compared to just a global arena for all string arrays, because fragmentation may be small enough that it doesn't matter too much in practice.)
It sounds like these allocators try to achieve some memory locality, although if what we want is to operate on the contiguous buffer directly we have lost out: We still need a way to create that "frozenstring" representation.
I think I'm going to stop actively implementing this for now until we have a bigger community discussion around the NEP about what we want. I'll have a subsection in the NEP I'm writing describing this situation and offering the possible paths forward, hopefully we'll get some feedback from other devs about what the best course of action is.
> I wonder if we could actually realloc or use a chunked scheme to grow the owning buffer during the operation (when there is no way to figure out the length ahead of time). But even that may be hard in practice, since a scheme of "allocate empty array and fill it" becomes tricky.
I don't think it would be too tricky, since we'd always be going through the descriptor.
If it ends up being hard to have a one-to-one mapping from descriptor instances to string storage buffers we could also add a mapping from the descriptor to memory regions owned by that descriptor, but there would still be a unique mapping from descriptors to memory regions outside the array buffer.
Agreed a finalize hook could do this too. We could also have an initialize hook that lets users bypass the loop inside numpy that calls setitem for each array item and instead lets dtype authors implement initialization however they'd wish using a ufunc-like loop over the array buffer. I think having finalize and initialize hooks might be useful for other dtypes, it's a natural place to put callbacks. Not that I'm excited to implement the hooks immediately...
> It sounds like these allocators try to achieve some memory locality, although if what we want is to operate on the contiguous buffer directly we have lost out: We still need a way to create that "frozenstring" representation.
We want the string data for neighboring array elements to be likely (if not guaranteed) to be located contiguously in memory. Both so that ufunc and casting loops don't cache fault on every string access and to eventually enable string SIMD operations.
This is done now using an exponentially expanding buffer for all operations, both for simplicity's sake and because numpy doesn't give us the information we need. (#76 (comment) indicates we do, but I was wrong: numpy only does that work before __getitem__ is called, during dtype detection, and stringdtype doesn't participate in dtype detection.)
The main uncompleted thing is to resize the arena buffers after the array is filled. This will either require a new dtype API hook in numpy or for numpy to participate in cyclic garbage collection. For the latter option, the dtype could compactify the arena whenever a GC pass touches the dtype instance.