project-gemmi/gemmi

crash with very large mmCIF files (and a large number of data blocks) in gemmi grep

CV-GPhL opened this issue · 11 comments

Looking at SF mmCIF files, e.g. for 5sds (and also 5smj, 5smk), using e.g.

wget https://files.rcsb.org/download/5SDS-sf.cif.gz
gemmi grep "_symmetry.*" -t 5SDS-sf.cif.gz

causes a complete crash (and reboot) of an Ubuntu 22.04 Linux box (64 GB memory), most likely because it first tries to load the whole file, with its 319 data blocks, into memory before doing the grep?

Would it be possible to change this so that it reads and processes each data block in turn? After all, each data block is completely independent of the others and is basically processed as if it were an independent file anyway, right?
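
As a rough illustration of that idea (not of how gemmi grep itself works), the sketch below splits the file into blocks in Python and parses them one at a time, so memory use is bounded by the largest block rather than the whole document. It assumes gemmi's Python cif module (gemmi.cif.read_string, Block.find_value), checks a single _symmetry tag instead of a wildcard pattern, and the naive split on lines starting with data_ ignores the corner case of such a line inside a multi-line text field.

import gzip
import gemmi

def blocks_as_text(path):
    # Yield the text of each data block of a (possibly gzipped) CIF file, one at a time.
    opener = gzip.open if path.endswith('.gz') else open
    chunk = None  # None until the first data_ line is seen
    with opener(path, 'rt') as f:
        for line in f:
            if line.startswith('data_'):   # naive: a new block starts here
                if chunk:
                    yield ''.join(chunk)
                chunk = []
            if chunk is not None:
                chunk.append(line)
    if chunk:
        yield ''.join(chunk)

# Parse one block at a time instead of the whole 319-block document.
for text in blocks_as_text('5SDS-sf.cif.gz'):
    block = gemmi.cif.read_string(text).sole_block()
    value = block.find_value('_symmetry.space_group_name_H-M')
    if value is not None:
        print(block.name, value)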

I tried this command and it ran successfully for me, using 3GB of memory:

Maximum resident set size (kbytes): 2940204

The only thing loaded into memory here was the uncompressed content of the gz file (2936157 kB).

What version/build of gemmi do you use?

How much swap space do you have?

To test it without running out of memory, set ulimit -m to some safe value.
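
If the jobs are launched from Python, roughly the same cap can be set per child process with the standard resource module. A sketch, using the address-space limit (the ulimit -v / RLIMIT_AS counterpart), which tends to be more reliably enforced on Linux than the RSS limit:

import resource
import subprocess

def run_with_memory_cap(cmd, max_bytes):
    # Cap the child's address space after fork(), before exec'ing the command.
    # (preexec_fn is POSIX-only, which is fine for the Ubuntu box above.)
    def set_limit():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=set_limit)

# e.g. let gemmi grep fail with an allocation error instead of exhausting the machine
run_with_memory_cap(['gemmi', 'grep', '_symmetry.*', '-t', '5SDS-sf.cif.gz'],
                    max_bytes=4 * 1024**3)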

I'm trying to run multiple similar jobs in parallel, since most of them are for very small, single-data-block examples (and I don't want to run one file at a time when going e.g. through the whole PDB). When it then hits one of those very large files, it crashes my machine.

Is it also using 3GB of memory on your computer?

I've been using the /bin/time -v command to check peak memory usage, although I don't know how reliable it is.

If you want to convert only the first block with cif2mtz, don't specify -B 1, just the filenames. It will be much faster, because this case is optimized, but it works only if the first block is up to 1GB.

Then all the blocks need to be read anyway, to check what items they have.
You could, if it's worth it, write a Python script that converts only selected blocks, see:
https://gemmi.readthedocs.io/en/latest/hkl.html#mtz-mmcif
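
A rough sketch of such a script, based on that documentation page: it assumes gemmi's Python bindings (gemmi.cif.read, gemmi.as_refln_blocks, gemmi.CifToMtz) and picks blocks by index, with made-up command-line handling. Note that gemmi.cif.read still loads the whole uncompressed file into memory; only the MTZ conversion is restricted to the selected blocks.

import sys
import gemmi

# hypothetical usage: python convert_blocks.py 5SDS-sf.cif.gz out_prefix 0 5 17
path, prefix, *indices = sys.argv[1:]

doc = gemmi.cif.read(path)            # still loads the whole uncompressed file into memory
rblocks = gemmi.as_refln_blocks(doc)  # wraps the data blocks as reflection blocks
cif2mtz = gemmi.CifToMtz()
for i in (int(s) for s in indices):
    mtz = cif2mtz.convert_block_to_mtz(rblocks[i])
    mtz.write_to_file('%s_%d.mtz' % (prefix, i))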