Blosc/python-blosc2

__pack_tensor__ should be at the beginning of the file to avoid seeking through the whole file

dmikushin opened this issue · 4 comments

Hi @FrancescAlted ,

I have another concern about __pack_tensor__. According to hexedit, the __pack_tensor__ entry is located at the end of the .bl2 file. I think this is an inefficient choice for large files. Suppose I have a 10 GiB .bl2 file. I don't want to read it entirely, but knowing its shape is essential for almost any use case. So in order to read the shape, c-blosc2 would need to fseek() to the end of the file. Of course, seeking is much faster than reading the content, but the file I/O would still need to hop over the inodes of the fragmented representation of a big file in the filesystem. So why not eliminate all this extra load on the filesystem by always placing metadata nodes at the beginning of the file? Is there an industry standard or practice that requires metadata to be placed at the end of the file?

Variable-length metadata (aka user metadata) is at the end of the file on purpose. The reason is that it must always be possible to make room for more metadata later, and if it were stored ahead of the actual data, growing it would require rewriting that data.
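To make the trade-off concrete, here is a minimal sketch using a made-up toy layout (a fixed 8-byte header, then the payload, then a JSON trailer). This is not the real Blosc2 frame format, and `write_toy_file`, `update_trailer` and `read_trailer` are hypothetical helpers. It only illustrates that metadata stored after the data can grow by rewriting a handful of bytes, while the payload is left untouched.

```python
import json
import os
import struct
import tempfile

# Toy layout (NOT the Blosc2 frame format): [payload length][payload][JSON trailer]
HEADER = struct.Struct("<Q")  # fixed-size header holding the payload length


def write_toy_file(path, payload, meta):
    """Write header, payload and a variable-length JSON trailer."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(payload)))
        f.write(payload)
        f.write(json.dumps(meta).encode())


def update_trailer(path, meta):
    """Replace the trailer in place; return the number of bytes written."""
    trailer = json.dumps(meta).encode()
    with open(path, "r+b") as f:
        (payload_len,) = HEADER.unpack(f.read(HEADER.size))
        f.seek(HEADER.size + payload_len)  # one seek past the payload
        f.truncate()                       # drop the old trailer
        f.write(trailer)                   # the payload is never touched
    return len(trailer)


def read_trailer(path):
    """Read the trailer with one small read plus one seek."""
    with open(path, "rb") as f:
        (payload_len,) = HEADER.unpack(f.read(HEADER.size))
        f.seek(HEADER.size + payload_len)
        return json.loads(f.read())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    payload = b"\x00" * 1_000_000
    write_toy_file(path, payload, {"shape": [1000, 1000]})
    # Growing the metadata later only rewrites the trailer bytes.
    written = update_trailer(path, {"shape": [1000, 1000], "note": "added later"})
    meta = read_trailer(path)
    with open(path, "rb") as f:
        f.seek(HEADER.size)
        payload_after = f.read(len(payload))
```

Had the metadata been placed before the payload in this toy layout, the same update would have required shifting the whole megabyte of data.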

Now, __pack_tensor__ is really meant to provide quick-and-dirty support for different flavors of tensors (Torch, TF, NumPy). Isn't the b2nd metalayer (which is at the beginning of the file) enough for you? What are you trying to achieve?

> What are you trying to achieve?

I'm just trying to work in C++ with the existing .bl2 files that were created in the following way:

    filename = os.path.join(folder, f'{name}.bl2')
    with open(filename, 'wb') as f:
        blosc2.save_array(mat, filename, mode="w")

So I am handed __pack_tensor__; it's not my choice. Should I try to create the .bl2 files in Python in some different way? Is there a similar out-of-the-box way to use b2nd from Python? I would strongly prefer to create .bl2 files the same way others are likely to, which is why I use the standard approach above that you provide.

OK, __pack_tensor__ is variable-length metadata. But does its size really vary that much? I think its size could be estimated fairly well.

I just wanted to check whether you need to store the kind of tensor, or whether storing and retrieving a multidimensional dataset would be enough. If the latter, you can store your N-dimensional dataset (see e.g. https://www.blosc.org/python-blosc2/getting_started/tutorials/02.ndarray-basics.html) and retrieve it from the C side quite easily too (see e.g. https://github.com/Blosc/c-blosc2/blob/main/examples/b2nd/example_serialize.c).

If you still want to keep the additional info about the kind of tensor you are storing and you don't want to do seeks (although my experience is that they are very effective, and you should not need more speed in most cases), then you can still create your own fixed-length metalayer (e.g. https://www.blosc.org/python-blosc2/getting_started/tutorials/02.ndarray-basics.html#Metalayers-and-variable-length-metalayers) and read it from the C side without the additional seek(s). Mind that this could be a bit too involved for the (small) benefits you can get.

Oh cool, perhaps ndarray is what I need, will go and study it, thanks a lot! 🙏