ccpem/mrcfile

Permissive validation causes memory errors

luke-c-sargent opened this issue · 2 comments

when using validate on a file that is not an mrcfile, being permissive by default can lead to problematic memory allocations; in my case, a file that produced a number of warnings eventually attempted to allocate a 6TB(!) bytearray which would of course cause an error.

as mitigation, i caught the warnings as exceptions and counted that as a False result, but this would leave the file handler open, which could corrupt data. I can skip validator in my code and go straight to:

try:
  with load_functions.open(filepath, permissive=False) as mrc:
    return mrc.validate(print_file=print_file)
except ValueError:
    return False

... but thought it might be useful to allow validate() to set the permissiveness level; PR #32

Happy to make any alterations if this is useful, or stick to the initial validate avoidance strategy, or whatever is best. My first pass at docstring text could probably use a review.

context: Galaxy sniffs input data to determine it's type; mrcfile is one, and the sniffer currently calls validate(). my guess is validator is hard-coded to permissive=True because it's intended for a file you strongly suspect to be valid, not any possible file (as Galaxy uses it).

Thanks for raising this, I can certainly see how it would be a problem! And thankd for the suggested fix too. Allowing permissive=False in validate() might not be a bad way to deal with it, but I feel like there ought to be a cleaner way that would fix the problem more generally. Ideally, we'd want to avoid allocating 6 TB arrays when someone calls open(file, permissive=True) on a bad file too! And if we fix it there then the problem with validate() should be fixed automatically.

For simple files, the solution could be as simple as checking the file size and comparing to the expected size of the data block before trying to read it, but the situation is more complicated for compressed files since they could legitimately be much larger than the file itself (and as far as I understand there's no way to know how big they might be without decompressing the whole thing). I'll have a look into some options and think about what's best.

This should now be fixed. The size of the array allocated for the data is now limited to the size of the file minus the size of the headers, so even if the data block size parameters appear to specify an enormous data block it won't allocate that much memory for it.

As an aside, from a look at your Galaxy code that uses this, I would suggest that you probably don't actually want to be using the validate() function. validate() ensures that a file strictly adheres to the MRC2014 standard, and it's mainly intended for developers to make sure files they create are fully compliant with the spec. Many files "in the wild" have one or more very minor problems meaning they will fail strict validation but are still perfectly usable. If your intention is to see if a particular file can be successfully read as an MRC file, it's probably better to just use mrcfile.open() with the default permissive=False and don't worry about validation - if it opens without raising an exception, it's "valid enough" to be used.