xnd-project/libxnd

What is the use case of xnd over apache arrow?

Opened this issue · 16 comments

Hello, I was wondering what you plan on doing with xnd that isn't well supported by the Arrow format. A GitHub issue is probably not the best place for this discussion, but I couldn't find a mailing list. Thanks.

Ah, OK -- it seems most of the underlying types and structured types (lists, structs, ragged hierarchies) are already well supported in Arrow. Anyway, looking at the docs, it should be pretty cheap to convert from one format to the other through memory copying.

skrah commented

Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

Also, the types that xnd uses (ndt_t) are a standard algebraic datatype that is relatively easy to use for traversing memory, which is needed in gumath. There is no dependency on an external C++ library like flatbuffers.

skrah commented

That said, we might translate Arrow to ndt_t in the future, but it is not an immediate priority (gumath is).

wesm commented

> Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

This isn't quite true, see https://issues.apache.org/jira/browse/ARROW-750 -- support for very large variable-length collections is something we will eventually need to add to the format whenever there is demand for it.

In general, datasets will not be expected to be in a contiguous columnar memory block, but instead split across a collection of smaller chunks. We have discussed the 32- vs. 64-bit issue for encoding collection lengths, and the consensus has been that it is not worth the extra 4 bytes of overhead per value when the "large collection" case represents a very small percentage of use cases.
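The 4-bytes-per-value cost being weighed here can be sketched in pure Python (a toy illustration with a hypothetical `encode_offsets` helper, not Arrow's actual implementation): a variable-length binary column stores one contiguous data buffer plus an offsets array with one more entry than there are values, so widening the offsets from int32 to int64 adds exactly 4 bytes per value.

```python
import struct

def encode_offsets(strings, offset_fmt="i"):
    """Toy sketch of an Arrow-style varbinary layout: one data buffer
    plus len(strings)+1 offsets ("i" = int32, "q" = int64)."""
    data = b"".join(s.encode("utf-8") for s in strings)
    offsets, pos = [0], 0
    for s in strings:
        pos += len(s.encode("utf-8"))
        offsets.append(pos)
    return data, struct.pack(f"<{len(offsets)}{offset_fmt}", *offsets)

values = ["8.8.8.100", "100.2.0.11", "99.101.22.222"]
data32, off32 = encode_offsets(values, "i")   # int32 offsets
data64, off64 = encode_offsets(values, "q")   # int64 offsets

# The data buffer is identical; only the offsets buffer grows,
# by 4 bytes per entry (here: 4 entries for 3 values).
print(len(off64) - len(off32))  # 16
```

The i-th value is recovered as `data[offsets[i]:offsets[i+1]]`, which is why there is one more offset than there are values.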

wesm commented

Additionally, we changed 1-dimensional array sizes to use int64 almost a year ago: apache/arrow@ced9d76#diff-520b20e87eb508faa3cc7aa9855030d7

skrah commented
wesm commented

Makes sense. In our experience, tensors are a different beast and use case from structured columnar data, so we are handling ndarrays / tensors with metadata separately from 1D record batches: https://github.com/apache/arrow/blob/master/format/Tensor.fbs#L35. These use 64-bit shapes and strides. This is actively used by the Ray project.
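The 64-bit shape/strides metadata mentioned here can be illustrated with a small pure-Python sketch (a hypothetical helper, not the Tensor.fbs implementation): for a C-contiguous tensor, the byte stride of each dimension is the product of the item size and the sizes of all trailing dimensions, and storing these as int64 is what allows tensors with more than 2**31 elements.

```python
def row_major_strides(shape, itemsize):
    """Byte strides for a C-contiguous (row-major) tensor.
    Arrow's Tensor metadata stores shape and strides as int64."""
    strides = []
    acc = itemsize
    for dim in reversed(shape):
        strides.insert(0, acc)
        acc *= dim
    return strides

# A small 2 x 3 x 4 float64 tensor (itemsize 8):
print(row_major_strides([2, 3, 4], 8))  # [96, 32, 8]
```

The element at index (i, j, k) then lives at byte offset `96*i + 32*j + 8*k` in the data buffer.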

wesm commented

> There is overlap but the tradeoffs are quite different.

Agreed. We should look for opportunities to share code and infrastructure where possible. Note that the Arrow columnar format is but one type of data structure that we support -- it is a very important one for databases, Spark, pandas, etc. In order to implement zero-overhead memory sharing for structured datasets, many lower levels of platform tooling must be built. I want to make sure we don't miss out on collaboration opportunities just because we have not agreed on a "universal" data structure. The Arrow columnar format was never intended to be a universal data structure.

skrah commented

[Repost because of broken markdown in email replies.]

I agree it would be nice to have a standard low-level data structure. For C, ndtypes is fairly standard: it describes all basic C types (including nested types and pointer types) using a regular algebraic data type. One could use it e.g. for the type part in "Modern Compiler Implementation in C" (Appel et al.) without changes.

The tagged union convention is also the same as in the quoted book, and incidentally also the same as in Python's own compiler (whose author probably also read Appel, given that he used ASDL to describe the AST :).
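The tagged-union style being described can be sketched in Python (hypothetical constructor names; ndt_t's real tags live in libndtypes): each type node carries a constructor tag plus per-tag payload, so traversing memory reduces to a dispatch on the tag, exactly as in Appel's book or an ASDL-generated AST.

```python
from dataclasses import dataclass

# Minimal algebraic-datatype sketch: one class per constructor,
# in the style of Appel's Tree language or Python's own AST.
@dataclass
class Int64:
    pass

@dataclass
class FixedDim:
    shape: int
    type: object   # element type

@dataclass
class Record:
    fields: dict   # field name -> type

def show(t):
    """Traverse a type by dispatching on its constructor (the 'tag')."""
    if isinstance(t, Int64):
        return "int64"
    if isinstance(t, FixedDim):
        return f"{t.shape} * {show(t.type)}"
    if isinstance(t, Record):
        inner = ", ".join(f"{k} : {show(v)}" for k, v in t.fields.items())
        return "{" + inner + "}"
    raise TypeError(t)

t = Record({"session_id": FixedDim(4, Int64())})
print(show(t))  # {session_id : 4 * int64}
```

A memory traverser in gumath's position would dispatch on the same tags, but compute buffer offsets instead of a string.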

I think columnar data can be modeled in ndtypes as a record of arrays. The example from the Arrow home page:

>>> data = {'session_id': [1331247700, 1331247702, 1331247709, 1331247799],
...         'timestamp': [1515529735.4895875, 1515529746.2128427, 1515529756.4485607, 1515529766.2181058],
...         'source_ip': ['8.8.8.100', '100.2.0.11', '99.101.22.222', '12.100.111.200']}
>>> x = xnd(data)
>>> x.type
ndt("{session_id : 4 * int64, timestamp : 4 * float64, source_ip : 4 * string}")

There is categorical data, which is represented as an array of indices into the categories:

>>> levels = ['January', 'August', 'December', None]
>>> x = xnd(['January', 'January', None, 'December', 'August', 'December', 'December'], levels=levels)
>>> x.value
['January', 'January', None, 'December', 'August', 'December', 'December']
>>> x.type
ndt("7 * categorical('January', 'August', 'December', NA)")
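The index representation mentioned above can be mimicked in plain Python (a sketch of the idea, not xnd's actual buffers): the data buffer stores one small integer per value, pointing into the levels table.

```python
levels = ['January', 'August', 'December', None]
values = ['January', 'January', None, 'December', 'August',
          'December', 'December']

# Encode: store an index into `levels` instead of each value itself.
indices = [levels.index(v) for v in values]
print(indices)            # [0, 0, 3, 2, 1, 2, 2]

# Decode: indexing back into `levels` recovers the original values.
decoded = [levels[i] for i in indices]
print(decoded == values)  # True
```

This is the same idea as Arrow's dictionary encoding: the levels are stored once, and repeated values cost only one small integer each.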

There are nested tuples, which are more general than ragged arrays:

>>> unbalanced_tree = (((1.0, 2.0), (3.0)), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x = xnd(unbalanced_tree)
>>> x.value
(((1.0, 2.0), 3.0), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x.type
ndt("(((float64, float64), float64), float64, ((float64, float64, float64), ()))")
>>>
>>> x[0]
xnd(((1.0, 2.0), 3.0), type="((float64, float64), float64)")
>>> x[0][0]
xnd((1.0, 2.0), type="(float64, float64)")

In general, xnd just takes any basic Python value -- nested or not -- and unpacks
it to typed memory.
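That inference step can be sketched in pure Python (a hypothetical `infer` helper for illustration; xnd's real inference lives in the library and handles many more cases): walk the value and emit an ndt-style type string.

```python
def infer(value):
    """Toy sketch of type inference in the spirit of xnd: map a
    nested Python value to an ndt-style type string."""
    if isinstance(value, bool):      # bool before int: bool is an int subtype
        return "bool"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, tuple):
        return "(" + ", ".join(infer(v) for v in value) + ")"
    if isinstance(value, list):
        # assumes a uniform list; real inference must unify all elements
        return f"{len(value)} * {infer(value[0])}"
    if isinstance(value, dict):
        inner = ", ".join(f"{k} : {infer(v)}" for k, v in value.items())
        return "{" + inner + "}"
    raise TypeError(type(value))

print(infer({'session_id': [1, 2, 3, 4]}))
# {session_id : 4 * int64}
print(infer(((1.0, 2.0), 4.0)))
# ((float64, float64), float64)
```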

wesm commented

I am skeptical about the idea of an all-powerful / can-describe-anything data structure. With generalization comes added complexity for computational frameworks and producers/consumers.

@teoliphant stated "I have not seen that arrow is general enough." What does this mean? At this point, the Arrow columnar format is only one part of a much larger project. I think this means "the Arrow columnar format is not a universal data structure", which I agree with, but that was never the goal. I see the work here in libndtypes / xnd as complementary and not in conflict -- there are problems being solved (extending the notion of NumPy's structured dtypes to support things like variable-length cells and pointers) that were never in scope for Arrow's columnar format.

The columnar format was the focus of the project at the outset because that was the most immediate and high value problem to solve around data interoperability and in-memory analytics. The rapid uptake of the project and developer community growth suggests we made a good bet on this.

At this point Arrow is a multi-layered project: memory management, shared memory (Plasma), metadata serialization, IO, streaming messaging, memory formats (including the columnar format), file format interop, computation kernels, etc. The work that is being done here could even become an additional component of Apache Arrow if you wanted to work with a larger developer community. At minimum it would be helpful to have a broader design/architecture discussion about problems and use cases in a public venue.

pearu commented

Ideally we can convert without memory copying.

As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying. However, currently the wrapping does not support null buffers (Arrow) or bitmaps (xnd) because xnd does not expose bitmaps.
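Zero-copy wrapping of this kind is generally built on Python's buffer protocol; a minimal stdlib sketch (not ArrayViews' actual code) shows the key property, namely that the view shares the producer's memory rather than copying it:

```python
# Minimal zero-copy sketch using Python's buffer protocol, the
# mechanism that cross-library wrappers build on.
buf = bytearray([1, 2, 3])   # the producer's memory
view = memoryview(buf)       # the consumer's view: no copy is made

print(view.tolist())         # [1, 2, 3]

# Writes through either side are visible to the other: shared memory.
buf[0] = 9
print(view[0])               # 9
```

Real Arrow/xnd wrapping must additionally agree on the layout of every buffer involved, which is exactly where the null-buffer/bitmap gap mentioned above bites.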

wesm commented

@pearu the cases where the memory is compatible IMHO reflect a minority (and a small minority at that) of real-world uses of Arrow. To suggest "compatible, with some exceptions" will mislead people.

pearu commented

@wesm, I am not sure I follow the meaning of your comment. If you are referring to the fact that xnd does not expose bitmaps, that issue can easily be fixed, since the xnd bitmap is compatible with the Arrow null buffer. I guess the reason xnd does not expose bitmaps is that they are considered an internal structure, whereas in Arrow the null buffer is not.

wesm commented

You stated

> As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying

I think it's worth making a list of different Arrow use cases:

  • Primitive (C-like) arrays with no nulls
  • Primitive arrays with nulls
  • Dictionary-encoded arrays
  • Varbinary / utf8 (variable-length) arrays with nulls
  • Nested type arrays (e.g. list<binary> or list<int32>)
  • Unions

Do you support them all and export all of their semantics in xnd? If the answer is "no", then I think you need to qualify the statement to say that "In certain limited cases, one can wrap Arrow arrays [and expose their semantics], without memory copying"