Dataset class should support encoding parameter to override global attribute
thehesiod opened this issue · 20 comments
In fact the whole idea of having a global encoding property is a bad idea, because then you can't support multiple Datasets with different encodings. The real fix should be to deprecate netCDF4.encoding.
as an example, the MADIS meso files are encoded in what appears to be cp1252.
if someone from the team can give feedback on how they'd like this achieved, I can work on it.
I suggest adding a set_encoding method to override the global value; there are already too many kwargs in Dataset.__init__. The new method should also have a kwarg to optionally override the global value of unicode_error.
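Roughly what I have in mind (a sketch of the suggestion only; the internal attribute names are made up):

```python
# Hypothetical sketch of the proposed API -- not existing netCDF4 code.
class Dataset:
    def set_encoding(self, encoding, unicode_error=None):
        """Override the module-level default encoding for this Dataset.

        If unicode_error is given, it overrides the global
        unicode_error setting as well.
        """
        self._encoding = encoding  # made-up attribute name
        if unicode_error is not None:
            self._unicode_error = unicode_error  # made-up attribute name
```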
not possible because we need the encoding during init, for example when it calls _get_dims. Will look into unicode_error.
ok added support for encoding_errors
looking into why unittests are failing
Does netCDF really support arbitrary encodings for strings but not have any way of indicating them in the data model? That seems like a disaster waiting to happen...
from what I understand, yes! I ran into this with a MADIS mesonet dataset that had the string "Annœullin" in it (it contains the character \x9c). From what I saw there was nothing specifying the encoding :( After some grepping around it seemed this was most likely CP1252 from someone generating the files on a Windows box.
basically netCDF 3.x was not designed with a true "string" type; one was only added in netCDF 4.x. For a "self-describing" format this was a huge oversight.
Even with the netcdf-4 NC_STRING type there is no concept of an encoding in the data model. It's just stored as a string of bytes in the file, and the client has to know how to decode it into a string. I can't find anything in the CF metadata standard that relates to string encoding, so there's no standard way for the client to figure out how to decode it. The python interface always returns a string, and the encoding is currently defined by a global module variable. I gather the purpose of pull request #655 is to at least allow the user to change the encoding on a per-Dataset basis, so multiple Datasets can be accessed at once with different encodings specified.
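For example (assuming the override lands as an init keyword; the argument name and filenames here are guesses, not the merged API):

```python
import netCDF4

# Two files with different character encodings, open at the same time --
# something a single module-level encoding setting can't express.
madis = netCDF4.Dataset('madis_mesonet.nc', encoding='cp1252')  # hypothetical kwarg
other = netCDF4.Dataset('model_output.nc', encoding='utf-8')    # hypothetical kwarg
```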
When it comes to names of variables, dimensions, attributes, groups, and types netcdf-c always uses UTF-8 encoding.
Just noticed this at http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html
Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations
and here http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_utilities_guide.html
The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.
which suggests that for NC_STRING variables (and attributes?) we should look for an attribute _Encoding.
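e.g., since netCDF attributes are exposed as Python attributes, the lookup could be as simple as (the variable name and the utf-8 fallback are my assumptions):

```python
# getattr with a default handles the "no _Encoding attribute" case:
enc = getattr(ds.variables['stationName'], '_Encoding', 'utf-8')  # hypothetical variable
```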
hmm, still need to support files which don't specify the _Encoding. But it sounds like honoring _Encoding needs to be added to netcdf4-python? In my particular case there was no _Encoding attribute.
I doubt that it is really used much. Perhaps we should check for it though. @WardF or @DennisHeimbigner - if you get a chance, could you read through this thread and comment?
btw, another thing: my PR applies the encoding to all the places default_encoding was used before. Based on what you found, it sounds like some things like attribute names should always be UTF-8, with some restrictions (which I'm guessing aren't checked for in the cython code).
You are correct: all netcdf names are assumed to be utf8, except that the character '/' is always disallowed. When names occur inside, say, a cdl file, then certain other characters must be back-slash escaped.
The default encoding for strings (as opposed to characters) is utf8, and ncdump, for example, will assume that in the absence of any _Encoding attribute. The _Encoding attribute is currently ignored in the netcdf-c code (reminder to self: add issue about at least recognizing it).

Character data is the real problem. Technically, it also defaults to utf8, which means the ascii subset of utf8. In practice, characters can have any 8-bit bit pattern.
For character data, ASCII with errors='surrogateescape' can be a good option for decoding into Python strings when there is some possibility that somebody has stuffed arbitrary bytes in there. At least then, you can safely encode back into bytes and decode with the right encoding.
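For example (plain Python, showing why the round trip is lossless; the sample bytes are the cp1252 spelling of "Annœullin" mentioned above):

```python
raw = b'Ann\x9cullin'  # cp1252 bytes for "Annœullin"

# Decoding as ASCII with surrogateescape never fails; the \x9c byte
# becomes the lone surrogate '\udc9c' instead of raising.
s = raw.decode('ascii', errors='surrogateescape')

# Encoding back with the same error handler recovers the original
# bytes exactly, so the correct encoding can still be applied later.
assert s.encode('ascii', errors='surrogateescape') == raw
print(raw.decode('cp1252'))  # Annœullin
```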
One more point. Character typed and String typed attributes must always be UTF8, because there is no way to specify _Encoding for an attribute.
The _Encoding attribute was under discussion recently on the CF mailing list. @rsignell-usgs and I were just this morning discussing this topic and decided to create an issue on the Unidata/netcdf-c repo to update the NUG wording around the _Encoding attribute.
hah, as a side-note, I just found that some MADIS mesonet files are NOT in cp1252, as they fail to decode with that encoding; so it seems the files are a mixture of encodings, without specifying what those encodings are :(
update: even worse, one has garbage data, as it doesn't seem to be in any reasonable encoding... another idea, then, is adding encoding validation when setting string data.
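e.g. a hypothetical write-time check (validate_string is made up, not part of my PR):

```python
def validate_string(value, encoding):
    """Fail fast at write time if a string can't be represented in the
    declared encoding, instead of silently storing undecodable bytes."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError as err:
        raise ValueError(f'string is not valid {encoding}: {err}') from None
```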
So to summarize...

1. we should look for an _Encoding attribute for NC_STRING variable data and use it for decoding, otherwise either use a default value or the value specified by a new set_encoding Dataset method (see the sketch below).
2. for decoding character data into python strings we should use ASCII with errors=surrogateescape (i.e. in the chartostring utility function). In stringtochar we should encode using ascii.
3. For names of variables, dimensions, attributes and groups, always use UTF-8.
4. For attributes (either string or character), Dennis suggests we should always use UTF-8.
Does this sound reasonable?
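To make (1) and (2) concrete, a minimal sketch (these helper names are made up, and `raw` stands for undecoded bytes coming back from the C layer -- not how the cython code is actually structured):

```python
def decode_string_element(raw, var, default='utf-8'):
    # (1) NC_STRING data: honor _Encoding when present, else fall back
    # to a default (or whatever set_encoding configured).
    enc = getattr(var, '_Encoding', default)
    return raw.decode(enc)

def decode_char_array(raw):
    # (2) NC_CHAR data (as in chartostring): ascii + surrogateescape,
    # so arbitrary 8-bit patterns survive a round trip.
    return raw.decode('ascii', errors='surrogateescape')
```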
For (1), do we really need a set_encoding method, or should we just rely on an _Encoding attribute and use UTF-8 if it's not there?
For (4), could we look for a Dataset or Variable _Encoding attribute and assume it applies to NC_STRING and NC_CHAR attributes?
Feedback on the parts affecting me:

1. I'd prefer having an encoding __init__ parameter, as the current code relies on it during initialization. It also makes more sense to me: I don't think you want to be able to change the encoding mode of a Dataset after it's been opened; logically it should be fixed once it's open (see the sketch below).
2. not sure what this means; I know of CDF "classic" files that aren't utf-8, and doing this will make consumers of these files more difficult (they'll have to parse each string 2x)... instead of that, simply pass an encoding parameter, or perhaps it could be named fallback_encoding (used when _Encoding is not specified).

3 + 4. After doing this I think it will greatly simplify where encoding is used.
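i.e. roughly (the parameter name and attribute here are suggestions, not merged code):

```python
# Hypothetical shape of the constructor change:
class Dataset:
    def __init__(self, filename, mode='r', encoding='utf-8', **kwargs):
        # Fixed at open time, since _get_dims() etc. already need it
        # during initialization; also serves as the fallback when a
        # variable carries no _Encoding attribute, so consumers don't
        # have to parse each string twice.
        self._encoding = encoding
```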