Unidata/netcdf4-python

Interpretation of reserved _Unsigned attribute written by netCDF-Java

deeplycloudy opened this issue · 19 comments

Unsigned integer data written by NetCDF-Java does not use an unsigned integer NetCDF type (e.g. uint16), but rather sets the reserved attribute _Unsigned on a (signed) int16 type. The result for the user of netcdf4-python is data which must be manually corrected after being read in.
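For concreteness, here is roughly the manual fix-up that is currently required (the file and variable names below are only illustrative):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("radar_moments.nc")          # illustrative file name
    v = nc["counts"]                          # illustrative variable name
    data = v[:]
    # Honor the netCDF-Java convention by hand: reinterpret the bit pattern.
    if "_Unsigned" in v.ncattrs() and str(v.getncattr("_Unsigned")).lower() == "true":
        data = data.view(data.dtype.str.replace("i", "u"))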

It is my understanding that netCDF-C tries to "do no magic", so this is not a netcdf4-python bug per se. #493 also notes the preference of netCDF4-python for leaving metadata interpretation to downstream applications. However, this is not a metadata convention in the CF sense, but rather a documented low-level implementation decision within netCDF-Java: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/Netcdf4.html

Differences between netCDF-4 C and Java libraries for netCDF4 files
Unsigned types
The C library uses unsigned integer types: NC_UBYTE, NC_USHORT, NC_UINT, NC_UINT64.
The Java library does not have separate types for unsigned integers, but adds the reserved attribute _Unsigned = "true" when the variable is unsigned. One can check this with Variable.isUnsigned(), Attribute.isUnsigned(), and Array.isUnsigned(). Conversions done by the library are aware of this convention. Java does not have unsigned types, and we don't want to double the internal memory requirements by widening the data.

I am in favor of a feature to assist users in correcting for signedness on read when this attribute is present. One possible route:

  • Add a method set_auto_signedness(True) to request that the conversion be performed.
  • If that option is set, perform the conversion within the read slice method (a rough sketch follows this list).
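A rough sketch of the shape that option could take - purely hypothetical, not an existing netCDF4-python API:

    import numpy as np

    class UnsignedAwareVariable(object):
        """Hypothetical wrapper sketch, not part of netCDF4-python."""

        def __init__(self, variable):
            self._v = variable
            self._auto_signedness = False

        def set_auto_signedness(self, flag):
            # Request that the _Unsigned = "true" convention be honored on read.
            self._auto_signedness = bool(flag)

        def __getitem__(self, key):
            data = self._v[key]
            if (self._auto_signedness
                    and "_Unsigned" in self._v.ncattrs()
                    and str(self._v.getncattr("_Unsigned")).lower() == "true"
                    and data.dtype.kind == "i"):
                # Same bytes, unsigned interpretation.
                data = data.view("u%d" % data.dtype.itemsize)
            return data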

I would be less in favor of a standalone conversion method; it would mean that every time users read data they must remember to call it. But even that would be more helpful than having every user write their own conversion when encountering unsigned data written by netCDF-Java.

Thanks to @lesserwhirls for helping me understand some of the things above; any errors remain mine!

What exactly does the Java library do when it reads data with _Unsigned=True? Does the C library recognize this special attribute? I don't see any mention of it in the NUG.

I suppose we could return a view of the array with the appropriate unsigned dtype if the variable has _Unsigned=True. I don't like the idea of adding a set_auto_signedness - if the attribute is set there should be no ambiguity about what is expected.

On the other hand, it's not very hard for a user to just do data = data.view(np.uint32)...

I have to ask though - what is the point? Making a view or a copy to an unsigned type with the same number of bytes is not going to change the data at all.

The C library does not check for _Unsigned. That is in keeping with the "do no magic" principle relayed to me by @lesserwhirls. The string does not appear in code in the netcdf-c repository.

It looks like the java library checks for the _Unsigned attribute, and then calls either the signed or unsigned read functions as appropriate. See the code here. I believe these are calls to the (wrapped) C library functions. If you poke around the Java code, you find that thredds suppresses the _Unsigned attribute in some (all?) circumstances, perhaps because it's been dealt with by the IO code. I also found some further docs on the _Unsigned convention.

I agree that the data.view(np.uint32) operation is not tricky.

I suppose it comes down to the philosophy of this library:

  • Eliminate as many of the machine-interpretable steps as possible so as to deliver science-relevant numbers in an array.
  • Act as a mostly transparent wrapper around netcdf-c, serving as a byte delivery service.

I think this is a case for the former, because it corrects for a result of the machine encoding process.

Recent experience indicates that several experienced scientists who regularly use NetCDF, but were dealing with data from a new instrument, were surprised by the current behavior, to the point that we worried we had corrupt data.

If a decision is made to automatically correct for signedness, it might be helpful for old user code (that already checks / corrects for the attribute) to have a way to toggle automatic interpretation. That would be the main purpose of set_auto_signedness.

@deeplycloudy Have you raised this issue with the maintainers of NetCDF-C? If this is a problem for Python users, it is certainly an issue for users of the C API as well, and the fix would probably be more appropriately applied upstream. And if the NetCDF-C maintainers don't think this is a good idea, we should consider why the Python interface should be any different. NetCDF-Python is not strictly a transparent wrapper around netCDF-C, but that's absolutely a nice property to preserve if possible.

Java does not have unsigned types, and we don't want to double the internal memory requirements by widening the data.

I don't know very much about Java, but this seems like a totally backwards solution to me. I can understand that this may have been a convenient choice, but the mere fact that Java does not have built-in unsigned types does not mean it cannot write unsigned data in the standard way -- it just makes it more difficult. By the same reasoning, Python cannot handle integers with fixed size, because the built-in int has unlimited size. Using an ad-hoc language specific convention is clearly a pretty terrible idea if you value cross-compatibility, and it isn't something that should be done silently. So conversely, have you brought this up with the netCDF-Java developers?

So I'm afraid I didn't give @deeplycloudy the whole picture on this (as I wasn't 100% clear myself). Much of this was done before my time, but here is the way I understand the situation.

The _Unsigned attribute came about when trying to represent other data models in netCDF-3 (for example, HDF5). This is a language-agnostic question. In terms of reading, netCDF-Java decided to not widen unsigned data (one choice given the lack of native unsigned data types in Java), and has several methods that "do the right thing" to represent unsigned data, based on the presence of the _Unsigned attribute. When writing data in netCDF-3, the question became what to do with unsigned types, which are not directly supported by the netCDF-3 data model. The decision was made to keep the bit pattern of the unsigned type, write the data as a signed type, and add the _Unsigned attribute to indicate to the reader that the data are unsigned. Note that all of this predates netCDF-4 and CDF5, both of which support the unsigned types in their data model and can be read by netCDF-C. The _Unsigned convention is also used in the netCDF-4 classic model. While the _Unsigned attribute is mentioned in the netCDF "Best Practices" guide (part of the netCDF-C docs), it is only one of a few choices. netCDF-C leaves it up to the downstream application as to how to handle unsigned data.
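A tiny illustration of why this is lossless - the signed value is just a relabeling of the unsigned bit pattern:

    import numpy as np

    unsigned = np.array([0, 255], dtype=np.uint8)
    signed = unsigned.view(np.int8)     # what the _Unsigned convention stores on disk
    print(signed)                       # [ 0 -1]
    print(signed.view(np.uint8))        # [  0 255] -- round-trips exactly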

Fast-forward several years, and netCDF-Java bit the bullet and wrapped the C library to support writing netCDF-4 files (it became clear that the HDF C library would be the only way forward to write HDF5 files, for several reasons). Although Java does not have native unsigned types, netCDF-Java does write unsigned types properly as long as they are defined as such in the CDM. That is, when writing a netCDF-4 (extended data model) file (which is done through the C library), netCDF-Java will write unsigned types and does not need to use the _Unsigned convention.

It is possible, of course, that someone could use that convention when writing out a netCDF-4 extended data model file, but they certainly do not need to do so. If a developer is upgrading I/O code to write netCDF-4 extended model files, it's possible they overlooked the code for handling unsigned types that was needed in netCDF-3 (or netCDF-4 classic) and continue to write out unsigned data in the same way they used to.

In talks with the netCDF-C group about adding the ability to convert unsigned data in netCDF-3 (netCDF-4 classic) at the C level, we decided against it as the library would be changing the data type as encoded in the file - netCDF-C wants to represent as closely as possible what is actually encoded in the file, and leave the rest to downstream application (that is, do no magic or as little as possible).

So, bottom line (tl;dr): the _Unsigned attribute came about when trying to represent other data models in netCDF-3 (before netCDF-4 and CDF5 existed), and was chosen over widening the data. Once netCDF-Java gained the ability to write data in a format that supported unsigned types, it did the correct thing (i.e., writing unsigned types as unsigned, without the need for the _Unsigned attribute). Even though they are using data formats that support unsigned types, it appears some data providers continue to write unsigned data using the _Unsigned convention, and @deeplycloudy is lucky enough to have some of those data.

The question at hand is what support, if any, should netCDF4-python give users in dealing with unsigned types written as signed types in netCDF-3 or netCDF-4 classic, as well as unsigned types encoded using the _Unsigned convention in netCDF-4 extended (even though it's completely unnecessary). To me, this sort of feels like the set_auto_* methods available on Dataset.
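For reference, the existing toggles this would mirror (file and variable names here are illustrative):

    from netCDF4 import Dataset

    nc = Dataset("example.nc")
    nc.set_auto_maskandscale(False)   # Dataset-level toggle, applies to all variables
    nc["t"].set_auto_scale(True)      # or per-variable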

@lesserwhirls, in the Writing NetCDF files: Best Practices section of the NetCDF Users Manual (NUG) it says:

A new proposed convention is to create a variable attribute _Unsigned = "true" to indicate that integer data should be treated as unsigned.

Does "new proposed convention" mean it's not actually an approved NUG convention yet?

Does anyone have a sample file with _Unsigned=True so I can see what happens when netcdf4-python reads the data as is?

OK, this makes more sense. Thanks @lesserwhirls for providing context here. NetCDF3 doesn't support unsigned integers, but NetCDF-Java does. For reference, if you try to save an unsigned integer to a NetCDF3 file with NetCDF4-Python or SciPy, you get an error message.
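A minimal sketch of that failure mode (the exact exception type and message may vary by version):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("no_unsigned.nc", "w", format="NETCDF3_CLASSIC")
    nc.createDimension("d", 2)
    try:
        nc.createVariable("ub", np.uint16, ("d",))
    except Exception as err:
        print("netCDF-3 cannot store unsigned types:", err)
    nc.close()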

Here are two very simple netCDF files (netCDF-3 and netCDF-4 extended) with a byte and unsigned byte variable, as generated by netCDF-Java:

ubyte.nc3.zip
ubyte.nc4.zip

Here is the CDL for ubyte.nc4:

netcdf ubyte {
dimensions:
	d = 2 ;
variables:
	ubyte ub(d) ;
	byte sb(d) ;
data:

 ub = 0, 255 ;

 sb = -128, 127 ;
}

and for ubyte.nc3:

netcdf ubyte {
dimensions:
	d = 2 ;
variables:
	byte ub(d) ;
		ub:_Unsigned = "true" ;
	byte sb(d) ;
data:

 ub = 0, -1 ;

 sb = -128, 127 ;
}
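For comparison, a hedged sketch of producing the same ubyte.nc3-style layout from netCDF4-python (file name is illustrative; setncattr is used because the attribute name starts with an underscore):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("ubyte_convention.nc", "w", format="NETCDF3_CLASSIC")
    nc.createDimension("d", 2)
    ub = nc.createVariable("ub", np.int8, ("d",))
    ub.setncattr("_Unsigned", "true")                           # the netCDF-Java convention
    ub[:] = np.array([0, 255], dtype=np.uint8).view(np.int8)    # stored as 0, -1
    nc.close()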

OK, it looks like creating a view of the data with dtype = uint8 does the right thing.

    >>> from netCDF4 import Dataset
    >>> f = Dataset('ubyte.nc3')
    >>> f
    <type 'netCDF4._netCDF4.Dataset'>
    root group (NETCDF3_CLASSIC data model, file format NETCDF3):
        dimensions(sizes): d(2)
        variables(dimensions): int8 ub(d), int8 sb(d)
        groups:
    >>> v = f['ub']
    >>> d = v[:]
    >>> d
    array([ 0, -1], dtype=int8)
    >>> import numpy as np
    >>> d.view(np.uint8)
    array([  0, 255], dtype=uint8)
    >>>

So, if the variable has an attribute _Unsigned=True and has a signed integer type, netcdf4-python could return a view of the data with the corresponding unsigned integer type.
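A sketch of the dtype mapping such read logic could apply (names here are illustrative, not the library's internals):

    import numpy as np

    SIGNED_TO_UNSIGNED = {np.dtype("i1"): np.uint8,
                          np.dtype("i2"): np.uint16,
                          np.dtype("i4"): np.uint32,
                          np.dtype("i8"): np.uint64}

    def as_unsigned(data):
        # Return an unsigned view of signed integer data; pass other dtypes through.
        if data.dtype in SIGNED_TO_UNSIGNED:
            return data.view(SIGNED_TO_UNSIGNED[data.dtype])
        return data

    print(as_unsigned(np.array([0, -1], dtype=np.int8)))   # [  0 255]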

I don't see any downsides to this right off the bat - surely this is what any user would want?

Went ahead and implemented this in pull request #658.

I'll echo @shoyer's thanks for the context, @lesserwhirls. I agree with @jswhit that this is what any user would want. I'd be glad to test the PR with the dataset where I originally noticed this problem.

Is there a way to have conda use the travis builds locally so I can test this without setting up a build environment for the NetCDF stack?

@deeplycloudy sadly, no. It should be enough just to build netcdf4-python though, which isn't too bad IIRC.

@dopplershift Yeah, I was able to build it from source. Thanks for the reply!

Two things are making me reconsider this pull request.

  1. @shoyer pointed out that xarray wants the Variable dtype to be the same as the dtype of the data returned.

  2. @deeplycloudy pointed out that there is an interaction between the conversion to unsigned data and the unpacking of packed integers via scale_factor and add_offset.

Since it's so simple for the user to create a view with an unsigned dtype, perhaps it's better to avoid adding this 'magic' in the library. I'm sure there will be other unintended consequences.

Where there is documented, machine-readable, actionable metadata, it is a delightful thing when software acts on it for the end user. That has generally been my experience with netcdf4-python and xarray, so I'm hesitant to give up on this.

In cases where there is a scale factor and offset, it's then up to the user to both know to take an unsigned view of the data and apply the scaling. It's no longer a one-liner.
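Roughly what the manual path looks like when both conventions are in play (file and variable names are illustrative):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("example.nc")
    v = nc["packed_counts"]
    v.set_auto_scale(False)                  # get the raw packed integers
    raw = v[:]
    if "_Unsigned" in v.ncattrs() and str(v.getncattr("_Unsigned")).lower() == "true":
        raw = raw.view(raw.dtype.str.replace("i", "u"))   # unsigned view first
    data = raw * v.scale_factor + v.add_offset            # then unpack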

I noted a fix for the scale factor problem on the PR.

Not to hijack this issue, but...

Interestingly enough, the NUG (as well as CF) mentions that the attributes valid_min and valid_max can be used to indicate signedness (the older signedness attribute is deprecated for this purpose). Also, in the best practices doc, although it is stated that

A new proposed convention is to create a variable attribute _Unsigned = "true" to indicate that integer data should be treated as unsigned.

the text goes on to say, later in the same document:

A conventional way to indicate whether a byte, short, or int variable is meant to be interpreted as unsigned, even for the netCDF-3 classic model that has no external unsigned integer type, is by providing the special variable attribute _Unsigned with value "true".

(this section also talks about what to do with packed data). Then again, _Unsigned isn't listed in the attribute conventions appendix. Not quite sure what to think.

@WardF, @DennisHeimbigner - what do you guys think?

@lesserwhirls the valid_min and valid_max issue has been mentioned in #493 as well.

Pull request merged - data is now returned as a view with the corresponding unsigned integer dtype by default if _Unsigned=True; this can be disabled with set_auto_scale(False).
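With that merged, reading the sample ubyte.nc3 above should look roughly like this:

    from netCDF4 import Dataset

    nc = Dataset("ubyte.nc3")
    print(nc["ub"][:])              # expected: [  0 255], the unsigned view by default
    nc["ub"].set_auto_scale(False)
    print(nc["ub"][:])              # expected: [ 0 -1], the stored signed bit pattern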