cerfacs-globc/icclim

Missing _FillValue should be handled nicely

Closed this issue · 7 comments

Missing _FillValue attribute of the input variable in an input file results in a crash. This is not desirable as _FillValue is not a required attribute according to NUG and CF.
A possible solution could as follows:
In icclim.py, near line 245, change code to read

try: ####----- LB 20170818
    fill_val = ncVar._FillValue.astype('float32') # fill_value must be the same type as "ind_type", i.e. 'float32'
except AttributeError: ####----- LB 20170818
    fill_val = numpy.float32(1.e20) # just an example literal constant ####----- LB 20170818

and similar changes near line 410. This seems to work for some test files, but I leave it to experts to check more carefully, then decide and implement suitable code.

I agree that this code would be a good solution. However in the except we must take care also of the variable type, as it may not be float32.

Agreed in principle, but two observations:

  1. If the input file(s) does not have a _FillValue there are probably no missing values (or it is not a well-behaved nc file --- see below for a further thought), in which case the value assigned to fill_val will (only) be used in the output file.
  2. Given the first observation, it seems more important to handle this in the try section as fill_value is used to 'interpret' the input data. And, then, if the input _FillValue is not of the same type as the intended output _FillValue the conversion should done at the appropriate place. Moreover, in the output file fill_val is used to set both the _FillValue attribute and the missing_value attribute.
    On further thoughts, while considering this part of the code, it should also be amended to handle also a missing_value attribute in the input file, which might occur instead of -- or in combination with -- a _FillValue attribute. Often/typically/normally they are intended to serve the same role, even though their intended function are in principle not necessarily the same.

Comments on the 2 observations:
1- Yes, but if you look at NCAR NCL, when you have a NetCDF variable, it always have a _FillValue associated with it. This is why in icclim we did the same. Having a very large value is usually set by default if no _FillValue is provided. This is why in your example this value would depend on the variable type, since 1e20 is not valid for int, for example. Also, all over the code it is assumed that a _FillValue is available, because in CMIP5 data we always have a _FillValue. I think the best way would be to stick to set fill_val to a very high value if it is not available in the variable metadata.
2- The input _FillValue must always be the same type as the variable type in the NetCDF file. missing_value attribute is the former attribute name that was used, but to support legacy files it is better to assume that we can have either _FillValue, missing_value, or both, and that they all mean the same thing.

On the first point:
i) I am not (very) familiar with NCAR NCL, but it seems that they impose additional requirements compared to NUG and CF. Thus I am not (at all) sure that NCAR NCL is always something that can be regarded as a reference for standard requirements.
ii) I always thought that icclim should have a broader scope than CMIP5, e.g. ECAD, EOBS, various reanalyses, and more ...
iii) I have a hands-on use case where GCM data produced by the HELIX project do not have _FillVal defined, and those files where cmorized.... Whether that was an initial mistake by the HELIX people is another thing, and they can always point to NUG and CF, which are the two key standards.
iv) The netCDF4 library defines default _FillValues for all datatypes (see below). The float32 value 1.e20 is not among them, and I am not sure where it came from, but it seems to be used throughout CMIP5, and in HELIX.

Now, having said this, I do understand that the code needs it, hence the quick fix I suggested. I will see what I come up with.

On the second point:
i) I am not sure what happens if the input data is of another type than float32 (eg. float64, if that ever happens) and the calculations and output data is done in float32. This should preferably be checked and handled, at least with an error message stating that this cannot currently be handled.
ii) This reminds me of issue #24, which concludes that (some) calculations should be done in float64. If the input data is float32, this may/will break the type agreement between input data and internal data if not taken care of.

python
Python 2.7.9 (default, Feb 19 2015, 15:41:07)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import netCDF4
netCDF4.default_fillvals
{'i8': -9223372036854775806,
'f4': 9.969209968386869e+36,
'u8': 18446744073709551614L,
'i1': -127,
'u4': 4294967295L,
'S1': '\x00',
'i2': -32767,
'u1': 255,
'i4': -2147483647,
'u2': 65535,
'f8': 9.969209968386869e+36}

On the first point:
Yes I agree that we should not be too strict apart from the CF conventions. In the current implementation of icclim given the time decisions had to be taken to have a working code, so being totally generic was difficult to achieve at this time.
What we can do is to also accepts files with no _FillValue or missing_value attribute, and give the netCDF4 default fillvals when reading the data, with the proper datatype.
The importance of having the same type, e.g. float32 vs float64, is only when writing the file. Within the code if all calculations are done in float64, when if the fill value is float32 that does not produce any problem, except if you would have the exact value in the field itself.
So, to conclude, I suggest to:
For the indices, we know that we will always have floatxx for the input variables, never integer or other imho. The fill value, if available in the attributes, we should use it in the proper type, and if not there, we should to check what is the type of the variable. Some answers on how to check the type is here:
https://stackoverflow.com/questions/28292542/how-to-check-if-a-number-is-a-np-float64-or-np-float32-or-np-float16
and assign a corresponding default fill value using the netCDF4 defaults.
What do you think?

Sounds good, I will have a go at this.

With the new commit this issue seems to have been solved