rcsb/mmtf-python

Output `mmtf` uses 64-bit floats, which violates the mmtf specification.

zacharyrs opened this issue · 0 comments

The specification defines the float type as 32-bit. Python floats are 64-bit, so when they are packed per the template they are written to the output file as 64-bit values. Other parsers (e.g. mmtf-java) try to load these as 32-bit floats and fail. We can fix this easily by passing `use_single_float=True` to the `msgpack.packb` call.
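To make the mismatch concrete, here is a minimal stdlib-only sketch of how MessagePack tags the two float widths. The helpers `pack_float32` and `pack_float64` are hypothetical names written for illustration and are not part of msgpack-python; the marker bytes `0xca` (float 32) and `0xcb` (float 64) come from the MessagePack format. As far as I understand, `use_single_float=True` makes `msgpack.packb` emit the first form for Python floats.

```python
import struct


def pack_float64(value: float) -> bytes:
    """Encode a float as MessagePack float 64: marker 0xcb + 8 big-endian bytes."""
    return b"\xcb" + struct.pack(">d", value)


def pack_float32(value: float) -> bytes:
    """Encode a float as MessagePack float 32: marker 0xca + 4 big-endian bytes."""
    return b"\xca" + struct.pack(">f", value)


coordinate = 1.5  # exactly representable in both widths
as_double = pack_float64(coordinate)  # what a default pack of a Python float looks like
as_single = pack_float32(coordinate)  # what the mmtf specification expects

# A reader that only accepts the 0xca marker rejects the 0xcb-tagged value.
print(as_double.hex(), as_single.hex())
```

The two encodings differ in both the marker byte and the payload length, so a strict 32-bit reader cannot simply reinterpret the 64-bit form.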

However, it seems mmtf-java also violates the standard and uses doubles (64-bit floats) for the ncsOperatorList, so even with the above change it still can't parse the output. Given mmtf-java is used for the RCSB files, we can assume they won't shift to 32-bit floats, since that would break their parsing for even more files.

Additionally, the msgpack-python implementation does not support selecting doubles for only one field (see msgpack/msgpack-python#326). Instead, you have to pack the biological assemblies list separately and then combine the two results, as in the collapsed snippet below.

Code for packing separately.
# The mmtf standard expects everything as 32bit - hence use_single_float.
# Note the encode_data no longer includes bioAssemblyList.
main = msgpack.packb(self.encode_data(), use_bin_type=True, use_single_float=True)

# Assemblies need to be 64bit for Java compatibility.
assemblies = msgpack.packb(
    {"bioAssemblyList": self.bio_assembly},
    use_bin_type=True,
    use_single_float=False,
)

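# In msgpack, a map with more than 15 elements starts with three bytes, e.g. `\xde\x12\x34`:
# the map16 marker `\xde` followed by the length as a big-endian 16-bit integer (here 0x1234).

# Our `main` map has 30-something elements, so it uses this three-byte header.

# Build the new header: the map marker plus the old length incremented by one,
# re-encoded as two big-endian bytes (this stays correct once the length passes 127).
new_map_length: bytes = b"\xde" + (int.from_bytes(main[1:3], "big") + 1).to_bytes(2, "big")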

# Strip the first three bytes from `main` (the map indicator byte and two bytes for length).
main = main[3:]

# Strip the first byte from `assemblies` (it holds a single entry, so its map header is one byte).
assemblies = assemblies[1:]

# Finally put it all back together.
new_data = new_map_length + main + assemblies
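The same splice can be written without hard-coding the header sizes. The following is a rough stdlib-only sketch, not part of mmtf-python or msgpack-python: `read_map_header`, `write_map_header`, and `merge_packed_maps` are hypothetical helpers, and the demo inputs are hand-built bytes rather than real MMTF data. It assumes both inputs are top-level MessagePack maps with no overlapping keys.

```python
import struct
from typing import Tuple


def read_map_header(packed: bytes) -> Tuple[int, int]:
    """Return (entry_count, header_length) for a packed MessagePack map."""
    first = packed[0]
    if 0x80 <= first <= 0x8F:  # fixmap: count lives in the low 4 bits
        return first & 0x0F, 1
    if first == 0xDE:  # map16: 2-byte big-endian count
        return struct.unpack(">H", packed[1:3])[0], 3
    if first == 0xDF:  # map32: 4-byte big-endian count
        return struct.unpack(">I", packed[1:5])[0], 5
    raise ValueError("input is not a MessagePack map")


def write_map_header(count: int) -> bytes:
    """Encode the smallest map header that can hold `count` entries."""
    if count <= 0x0F:
        return bytes([0x80 | count])
    if count <= 0xFFFF:
        return b"\xde" + struct.pack(">H", count)
    return b"\xdf" + struct.pack(">I", count)


def merge_packed_maps(first: bytes, second: bytes) -> bytes:
    """Concatenate the entries of two packed maps under one new header."""
    first_count, first_skip = read_map_header(first)
    second_count, second_skip = read_map_header(second)
    header = write_map_header(first_count + second_count)
    return header + first[first_skip:] + second[second_skip:]


# Demo: a 16-entry map16 of small ints, plus a 1-entry fixmap {"a": float64 1.5}.
main_map = b"\xde\x00\x10" + bytes(b for i in range(16) for b in (i, i))
extra_map = b"\x81\xa1a\xcb" + struct.pack(">d", 1.5)
merged = merge_packed_maps(main_map, extra_map)
```

Because the header is recomputed from the combined count, this also covers the case where the merged map crosses from a one-byte fixmap header to a three-byte map16 header.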

For reference, I have raised this issue in the mmtf-java repo too: rcsb/mmtf-java#53.