modularml/mojo

[Feature Request] memoryview builtin and support for python buffer protocol



What is your request?

This enhancement request is to add support for Python's memoryview builtin and for the Python buffer protocol. Here are some ideas about the tasks and level of effort that might be involved:

  • Add a new Mojo trait (Bufferable?) with the dunder methods __buffer__ and __release_buffer__.
  • Add support for Python's builtin memoryview() on Mojo structs. __buffer__ returns a memoryview, so this has to be built into Mojo (not a Python module import).
  • Add support for the C-level interface defined by the Python buffer protocol (https://docs.python.org/3/c-api/buffer.html). This would allow Mojo structs that expose their memory as a Py_buffer to be called from Python. Or maybe they could be wrapped in a PythonObject and returned as a memoryview?
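
As a point of reference for the proposed trait, Python 3.12 (PEP 688) lets ordinary Python classes implement these same two dunder methods. The following is a minimal sketch of the Python side only. The class name Pixels and the helper open_view are made up for illustration, and nothing here is Mojo API.

```python
import sys


class Pixels:
    """Hypothetical container that exposes its storage through __buffer__."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._exports = 0  # number of views currently handed out

    def __buffer__(self, flags: int) -> memoryview:
        # Hand out a view over the internal storage; no bytes are copied.
        self._exports += 1
        return memoryview(self._data)

    def __release_buffer__(self, view: memoryview) -> None:
        # Called by the interpreter (3.12+) when a consumer releases its view.
        self._exports -= 1


def open_view(obj) -> memoryview:
    # memoryview(obj) only honours a Python-level __buffer__ on 3.12+;
    # on older interpreters this sketch falls back to calling it directly.
    if sys.version_info >= (3, 12):
        return memoryview(obj)
    return obj.__buffer__(0)


pixels = Pixels(b"\x01\x02\x03\x04")
view = open_view(pixels)
print(view.nbytes)   # total size of the exposed storage in bytes
pixels._data[0] = 255
print(view[0])       # the view observes the change: same memory
```

A Mojo trait along these lines would presumably give Mojo structs the same two hooks, with memoryview as the common currency between producer and consumer.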

What is your motivation for this change?

Currently Mojo 0.6 has poor (nonexistent?) support for zero-copy shared memory buffers with Python.

For example, in Mojo's documentation the Ray Tracing notebook has an example of raster imagery being copied into a numpy array using MLIR ops. Not only is this an unnecessary memory copy, it is also too verbose, undocumented, and not Pythonic. See def to_numpy_image(self) -> PythonObject: in the source notebook.
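
To make the copy versus zero-copy distinction concrete, here is a small standard-library sketch in plain Python (not Mojo). The variable names are illustrative only.

```python
import array

# Four 8-bit "pixel" values held in a contiguous buffer.
pixels = array.array("B", [0, 64, 128, 255])

shared = memoryview(pixels)  # zero-copy: a window onto the same memory
copied = bytes(pixels)       # copy: an independent snapshot of the bytes

pixels[0] = 9                # mutate the original buffer

print(shared[0])  # reflects the mutation, since the memory is shared
print(copied[0])  # still the old value, since the bytes were duplicated
```

The request is for Mojo structs to be able to play the role of `pixels` here, so that Python consumers get the `shared` behaviour rather than the `copied` one.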

Mojo should enable and encourage interop with existing scientific computing packages and formats in the most efficient manner, for example the Apache Arrow format.

"The Arrow C data interface is inspired by the Python buffer protocol, which has proven immensely successful in allowing various Python libraries exchange numerical data with no knowledge of each other and near-zero adaptation cost." (Arrow Spec)

This enhancement would also lay the groundwork for supporting the Python array API standard.

Any other details?

Related Discussions/Issues:

Reference PEPs:

As a struct, it should be named MemoryView. Please be consistent and avoid Python's mess in naming!

Good suggestion! The naming is a bit confusing: there is the type Py_buffer at the C level, MemoryView in Python land, and the memoryview() constructor, also in Python land. I definitely would not want to add new names or concepts if that can be avoided.

Also, I thought this Python example with comments might help to illustrate the idea a little more:

# made up example (chatbot)
import array

arr = array.array('i', [1, 2, 3, 4, 5]) 

mem_view = memoryview(arr)

# Access properties of the memoryview  
print(mem_view.nbytes)
print(mem_view.itemsize)

# Indexing and slicing, as with a NumPy array
print(mem_view[0])
print(mem_view[-1])
print(mem_view[1:3])  # a slice is itself a memoryview, not a copy

# Iterate through the memoryview
for num in mem_view:
    print(num)

# Get a NumPy array from the memoryview 
import numpy as np
num_arr = np.frombuffer(mem_view, dtype=np.int32)
print(num_arr)

output

20
4
1
5
<memory at 0x1011590c0>
1
2
3
4
5
[1 2 3 4 5]
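
Beyond nbytes and itemsize, a memoryview also exposes the layout fields that Py_buffer carries at the C level (format, ndim, shape, strides, readonly). The sketch below uses only the standard library. It reinterprets the same memory as a 2x3 grid by casting through a byte format, which is what memoryview.cast requires.

```python
import array

arr = array.array("i", [1, 2, 3, 4, 5, 6])
view = memoryview(arr)

# Layout metadata that travels with the buffer.
print(view.format, view.itemsize, view.ndim, view.shape, view.strides, view.readonly)

# Reshape without copying: cast to bytes first, then back to 'i' with a shape.
grid = view.cast("B").cast("i", shape=[2, 3])
print(grid.shape, grid.strides)
print(grid.tolist())

# Still the same memory: a write to the array shows up in the grid.
arr[0] = 10
print(grid[0, 0])
```

A Mojo-side buffer type would presumably need to carry the same metadata for NumPy and similar consumers to interpret the memory correctly.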

I think this enhancement would open up numerous use cases like:

  • Mojo <-> C ABI
  • Mojo <-> Python modules/packages
  • Mojo <-> Python <-> C/Rust/Fortran etc backed packages
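
As a rough illustration of the C ABI case, the standard-library ctypes module can already build a C array on top of any writable Python buffer without copying. This is plain Python rather than Mojo, and the names are illustrative.

```python
import array
import ctypes

arr = array.array("i", [1, 2, 3, 4, 5])

# Build a C int[5] that aliases the array's memory (no copy is made).
CIntArray = ctypes.c_int * len(arr)
c_arr = CIntArray.from_buffer(arr)

# A write through the C view is visible from the Python array.
c_arr[0] = 42
print(arr[0])

# Both objects report the same base address.
print(ctypes.addressof(c_arr) == arr.buffer_info()[0])
```

If Mojo structs exported the buffer protocol, the same from_buffer call could in principle hand their memory to any C function taking a pointer.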

I am aware that I am being quite pedantic, but if Mojo implements this, it would IMHO be better to sacrifice one more character and name this constructor "memory_view". I don't like Python's style of blending words together without any separator. Keeping names strictly synchronized with Python is also not ideal, because it would also require following Python's behaviour exactly, which may be painful in some cases.

If Mojo is to be Python++ rather than a compiled copy of Python, it will gain its own identity, and small improvements like this will be very noticeable.

Linking to a neat related project here: an Arrow implementation in Mojo, https://github.com/kszucs/firebolt
It unlocks the case where Mojo is the consumer of Arrow data structures.