ccpem/mrcfile

Option to calculate header.rms in float32?

bforsbe opened this issue · 7 comments

I'm trying to optimize a Python module that writes some files using mrcfile, and frequent MRC output runs into an unexpected performance bottleneck.

self.header.rms = np.float32(self.data.std(dtype=np.float64))

This line accounts for 80% of the time spent setting data in a new MRC file. I appreciate that the rms field needs to be set, but casting to double just to prevent overflow is a bit much, no? If calculating the rms overflows, the map values are probably crazy anyway, unless I'm missing something obvious.

How would it fit into the scheme of things to either omit rms and set it to NaN, or to skip the cast to f64 and raise an error (or an exception with an f64 fallback)?

Hmm, interesting. I remember putting in the f64 accumulator because I had seen some overflows occur, but from some quick testing on normal files it now looks like it's probably not necessary, and using f64 seems to take about twice as long and twice as much memory as f32, which is a big price to pay. Those lines pre-date this git repo, so I can't find anything more about why and when I made the change.
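As a rough way to reproduce that comparison (a minimal sketch; the array size is arbitrary and the exact timings are machine-dependent):

import time

import numpy as np

# Arbitrary test volume; float32 is the typical on-disk dtype for MRC maps.
data = np.random.rand(256, 256, 256).astype(np.float32)

for acc_dtype in (np.float32, np.float64):
    start = time.perf_counter()
    rms = data.std(dtype=acc_dtype)  # accumulate in the given precision
    elapsed = time.perf_counter() - start
    print(f"{np.dtype(acc_dtype).name}: rms={rms:.6f}, {elapsed:.3f} s")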

I appreciate that the rms field needs to be set

Actually, it doesn't. Some programs might handle files without it slightly incorrectly (e.g. by using inappropriate display contrast settings), but it shouldn't cause anything worse than a minor inconvenience. I wouldn't set it to NaN though; that really could cause problems for other software. The standard convention is to set the rms to a negative value to indicate that it hasn't been calculated.
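For illustration, a minimal sketch of that convention using the public API (the filename is hypothetical, and -1.0 is just one choice of negative sentinel):

import numpy as np
import mrcfile

data = np.random.rand(64, 64, 64).astype(np.float32)

with mrcfile.new('example.mrc', overwrite=True) as mrc:
    mrc.set_data(data)     # fills in the header statistics from the data
    mrc.header.rms = -1.0  # negative value signals "rms has not been calculated"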

I think it's probably a good idea to switch back to f32 for the rms calculation anyway, so I'll do that, and maybe add a check so it can either raise an error or just leave rms un-set if there is an overflow.

The other question is, would it be useful for you to have a flag to turn off the stats calculations completely? Or still calculate min/max/mean but not rms?

In situations where the rms field is unimportant one could probably just use the _set_data method, so if avoiding code clutter is a priority then a flag seems excessive. On the other hand, it would be nice to still have the validation that happens on set. But this is really a design question that requires a bigger perspective than I have. I'll work with what you provide, which is excellent software fit for purpose 😉
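For context, the workaround alluded to here might look like the sketch below; _set_data is a private method of the library, so its exact signature and behaviour are assumptions on my part and could change between releases. The filename is hypothetical.

import numpy as np
import mrcfile

data = np.random.rand(64, 64, 64).astype(np.float32)

with mrcfile.new('example_no_stats.mrc', overwrite=True) as mrc:
    # Assumption: the private _set_data stores the array without running
    # the statistics update that the public set_data performs.
    mrc._set_data(data)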

In short, you tell me. Should I be able to skip the rms field? Will any software out there care that the rms field is unset? If so, you should basically not enable me to skip it 😉

I just got a related "issue": when I use set_data(), it warns me that

Error in data statistics: RMS deviation is 1.7162400484085083 but the value in the header is 1.7162177562713623

Presumably due to the difference in calculation precision between update_header_stats()

...
self.header.rms = np.float32(self.data.std(dtype=np.float64))
...

and validate()

...
real_rms = self.data.std()
...
if (self.header.rms >= 0 and not np.isclose(real_rms, self.header.rms)):
...

Would be cool if I could bypass or catch this printout, since users might think something went wrong if the word "Error" shows up, even if it's only in the 5th decimal of the map's rms :)
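As an aside, assuming the printout comes from the validate machinery, the documented mrcfile.validate() function accepts a print_file argument, so the messages can be captured rather than written to stdout (the filename below is hypothetical):

import io
import mrcfile

log = io.StringIO()
is_valid = mrcfile.validate('output.mrc', print_file=log)  # messages go to the buffer

if not is_valid:
    # surface (or silently log) the details however suits the application
    print("Validation messages:", log.getvalue())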

Yes, that makes sense - though if you're creating the files yourself and we update the RMS calculation so it uses float32 in both places, then there shouldn't be a mismatch. I might also loosen the comparison a bit more. I can't remember off-hand how close the values have to be for np.isclose, but maybe it's overly restrictive. I'll probably also soften the wording of the message a bit; "Error in data statistics" seems quite harsh, so maybe something like "Data statistics appear to be incorrect" would be better.
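For reference, np.isclose defaults to rtol=1e-05 and atol=1e-08, and with the two values from the warning above the check does indeed fail:

import numpy as np

real_rms = 1.7162400484085083    # recalculated from the data
header_rms = 1.7162177562713623  # value stored in the header

# np.isclose(a, b) tests |a - b| <= atol + rtol * |b|, with rtol=1e-05 and atol=1e-08
print(abs(real_rms - header_rms))        # ~2.23e-05
print(1e-08 + 1e-05 * abs(header_rms))   # ~1.72e-05, so the tolerance is exceeded
print(np.isclose(real_rms, header_rms))  # False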

Sorry it's taking a while to deal with this one, hopefully I'll have some time to get this done and released this week.

I've just tried quickly implementing this, but I do get some overflow warnings from the mrcfile test suite. I don't have time to investigate that today and I won't be able to look at it next week, but I will hopefully figure out what's going on and put in a more sophisticated fix as soon as I can after that.

This is done now, and released in mrcfile v1.4.3. I've changed the calculations to just use float32. The overflow warnings were from some test arrays that deliberately used very large values, but since this kind of thing seems to occur very rarely in real files, I've just made the values in those arrays smaller so they don't trigger an overflow. If someone does get an overflow (typically by using data with values $> 10^{19}$), they will see a RuntimeWarning and the header field will end up set to inf.
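A quick way to see the overflow behaviour described here, using plain NumPy float32 arithmetic (independent of mrcfile itself):

import numpy as np

# Deviations on the order of 1e19-1e20 square to values beyond float32's
# maximum (~3.4e38), so the sum of squared deviations overflows to inf.
extreme = np.array([0.0, 1e20], dtype=np.float32)
print(extreme.std())  # emits a RuntimeWarning about overflow and prints inf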