Permit errors='surrogateescape' and others
candlerb opened this issue · 8 comments
Relates to commit #82
This introduces the errors
attribute, but it rejects all values apart from a hard-coded few. In particular, errors='surrogateescape'
should be allowed (PEP 383, introduced in python 3.1) but is blocked.
However there are other defined error handlers, plus it's possible to add more dynamically with register_error.
Therefore I think it's better not to perform any validation here, but allow the provided value to be passed through to bytes.decode
. Unknown values are rejected as soon as you try to use them.
>>> b'foo\xff'.decode("utf-8", "surrogateescape")
'foo\udcff'
>>> b'foo\xff'.decode("utf-8", "flurble")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown error handler name 'flurble'
>>>
Or, if you must validate before use, then use codecs.lookup_error
instead of a hard-coded list.
>>> codecs.lookup_error("surrogateescape")
<built-in function surrogateescape>
>>> codecs.lookup_error("flurble")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown error handler name 'flurble'
Fair point. Would you like to do this and send a PR?
Which behaviour do you want: don't validate the encoding, or validate it via codecs.lookup_error
? The former seems more natural to me, but is a minor behaviour change.
I prefer the latter one. It's the current behavior.
If we don't validate the coding in the Reader_init
phase, it is possible that reader.gets
will raise a LookupError
. Maybe I'm a little pedantic but it's kind of an API change.
OK - can you tell me the correct way to call codecs.lookup_error
from C ?
OK, done. However I'm not sure if my unit test test_decode_error_with_surrogateescape_errors
will work for python2.7; it could be made conditional or removed entirely.
Fixed it
I have now added Py_DECREF on the temporary objects. Otherwise, if I add
fprintf(stderr, "codecs refs=%ld", (long)Py_REFCNT(codecs));
then I see the reference count increasing on each call.