glencoesoftware/bioformats2raw

bioformats2raw produces a big-endian data type (>u2) from unsigned 16-bit MRC and CZI input


The input data, which are unsigned 16-bit integers, are being written as big-endian unsigned 16-bit integers in the Zarr arrays when converted with bioformats2raw. I am using version 0.6.1 of bioformats2raw, and this occurs on both a Mac M1 and a Linux Intel machine (both little-endian).

$ ~/scratch/bioformats2raw-0.6.1/bin/bioformats2raw --compression null EBOV_VSV_1.mrc test.zarr
$ cat test.zarr//0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : null,
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 1, 35, 4096, 4096 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}

The '>' character in the dtype field indicates the pixel data type is non-native big-endian.
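For reference, a minimal numpy sketch (not part of the original report) of what that dtype string encodes; the commented values assume a little-endian host:

import numpy as np

# ">u2" is numpy notation for a big-endian, 2-byte unsigned integer.
dt = np.dtype(">u2")
print(dt.kind, dt.itemsize, dt.byteorder)   # u 2 >   (non-native on a little-endian host)

# The native uint16 dtype reports "=" for its byte order, so the two differ here.
print(np.dtype(np.uint16).byteorder)        # =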

Conversion from a CZI file also produced big-endian unsigned 16-bit data in the Zarr array.

The expectation was that the byte order would match the system.

This behavior is expected; bioformats2raw does not consider the system's native byte order, so it will produce the same output independently of the system on which it is run.
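As a quick check, something like the following (a sketch using zarr-python, assuming the test.zarr output from the command above) should report the same dtype on any machine:

import zarr

# Open the highest-resolution array in the converted store.
z = zarr.open("test.zarr/0/0", mode="r")
print(z.dtype)             # >u2 on every platform; the stored byte order never changes
print(z[0, 0, 0, :2, :2])  # pixel values still decode correctly when read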

@blowekamp : is this causing a problem, or is it something we just need to better document?

@melissalinkert Thank you for the response. It is something that certainly can be handled.

What was the reasoning behind choosing big-endian over the more common CPU order of little-endian?

P.S. I ran into the issue when numpy/dask checked whether the dtype was np.uint16 and it did not match.
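For example (a sketch, not code from the actual pipeline), the strict dtype comparison fails while a byte-order-insensitive check or an explicit byte swap works:

import numpy as np

arr = np.zeros((4, 4), dtype=">u2")   # stand-in for a chunk read from the converted store

print(arr.dtype == np.uint16)               # False on little-endian hosts: byte order differs
print(arr.dtype.type is np.uint16)          # True: the scalar type ignores byte order
print(np.issubdtype(arr.dtype, np.uint16))  # True

# Byte-swap to the native order when an exact dtype match is required.
native = arr.astype(arr.dtype.newbyteorder("="))
print(native.dtype == np.uint16)            # True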

The choice to write only big-endian data dates back to when bioformats2raw wrote N5, which only supports big-endian data; see f9bfa06 and https://github.com/saalfeldlab/n5#file-system-specification. We've left that in place for simplicity, particularly as Java ByteBuffers default to big-endian.

I'll assume the choice was made at some point to follow the big-endian "network byte order".