JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors

Question

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors

Closed this issue 5 years ago · 24 comments

Hi!

I'm working on a DRF-based backend and we get 500 errors caused by specific Unicode characters…
I'm opening an issue here because it's not specific to our code or setup. I would also like to share a workaround and hear your thoughts about it. I'm aware of #6895 and #6633 and checked that DRF 3.10.3 is still affected by this issue.

The problem

Make sure you use rest_framework.parsers.JSONParser in REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] in settings. Now pass JSON to an API endpoint of yours that looks like this: {"title": "\ud83d"}. Instead of title use some field that your serializer supports and that is backed by a CharField, explicitly or implicitly. That input is ASCII technically but the JSON decoder will interpret \ud83d and turn it into an str instance equal to chr(0xd83d), i.e. a string with is a code point from the surrogates block which cannot be encoded to UTF-8 (or UTF-16 or ..) — because Isolated surrogate code points have no general interpretation —, see:

In [1]: import sys; sys.version_info.major                                                                                                                                                                           
Out[1]: 3

In [2]: chr(0xd83d).encode('utf-8')                                                                                                                                                                                  
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

In [3]: import json; json.loads('{"title": "\\ud83d"}')['title'].encode('utf-8')                                                                                                                                     
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So my CharField title now contains a Python 3 string '\ud83d' and the code in the serializer starts working with it and we will only learn that we received malformed data in the first place once we try to store it into a database or when we use it while rendering the reply. That's rather late — maybe too late?

To write a test for this case for your own API, you could do something like this:

def test_..............(self):
    detail_url = reverse(...................)
    data = '{"some-charfield-of-yours": "\\ud83d"}'

    response = self.client.patch(detail_url, data, content_type='application/json')

    ...........

Workaround

One way to workaround this problem globally and deny malformed input from even getting to your serializers is to use a derived JSON parser for REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] like this:

# This code is from https://github.com/encode/django-rest-framework/issues/7026#issue-514872212
# Licensed under the BSD license as DRF itself

import os

from rest_framework.exceptions import ParseError
from rest_framework.parsers import JSONParser
from rest_framework.utils import json


class JsonParserThatRejectsSurrogateCodePoints(JSONParser):

    @staticmethod
    def _reject_surrogate_code_points(parsed):
        try:
            with open(os.devnull, 'w') as f:
                json.dump(parsed, f, ensure_ascii=False)
        except UnicodeEncodeError as e:
            raise ParseError(f'Parsed JSON contains surrogates (code points 0xD800 to 0xDFFF) - {e}')

    def parse(self, stream, media_type=None, parser_context=None):
        parsed = super().parse(stream, media_type=media_type, parser_context=parser_context)
        self._reject_surrogate_code_points(parsed)
        return parsed

I have not measured the performance penalty if this approach, yet. The upside is that only once single place of code needs to be touched to get all API endpoints on dry ground.

Discussion

I would love to hear how you handled this situation in your backend, if this is something you expect DRF users to handle themselves or would want to protect against upstream, and what other approaches come to your mind.

Many thanks in advance,

Sebastian

moseb commented 5 years ago

Any news?

Answer 1 · 2019-10-30T18:17:38.000Z

That input is ASCII technically

Your input encoding should be provided by the client. If none was sent, DRF will fall back to Django's default DEFAULT_CHARSET which - unless overridden - is utf-8.

Since you are expecting string to be ASCII, you could:

enforce the client to be sending an explicit encoding matching the request body
configure Django to use ASCII as fallback
override the JSONParser to use another default encoding.

Answer 2 · 2019-10-31T09:45:33.000Z

@xordoquy I think there is a misunderstanding here: My "That input is ASCII technically" was trying to emphasize that there are no funky bytes in the request payload: Things only get funky once the JSON parser comes into play. The string {"title": "\\ud83d"} is both well-formed ASCII and well-formed UTF-8 (due to use of a rather limited set of characters). Let me demonstrate:

# ipython3
[..]
In [1]: b'{"title": "\\ud83d"}'.decode('ascii')
Out[1]: '{"title": "\\ud83d"}'

In [2]: b'{"title": "\\ud83d"}'.decode('utf-8')
Out[2]: '{"title": "\\ud83d"}'

The client does send a UTF-8 encoding header Content-Type: application/json;charset=UTF-8 matching its payload and the server also has DEFAULT_CHARSET set to default UTF-8, so the setup is sane and mainstream.

If you want to see the bug in action:

$ curl --data '{"username":"drfbug7026","password":"\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)

Could you re-open this ticket please? Thank you!

Answer 3 · 2019-10-31T10:07:03.000Z

curl --data '{"username":"drfbug7026","password":"\\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
{"non_field_errors":["Unable to log in with provided credentials."]}

"\ud83d" is not a valid utf8 content.

Answer 4 · 2019-10-31T10:08:45.000Z

You have one backslash too many in there. Please copy my curl line 1:1.

Answer 5 · 2019-10-31T10:09:11.000Z

Scratch my previous comment.

Answer 6 · 2019-10-31T10:14:09.000Z

What troubles me is:

json.dumps({"title": chr(0xd83d)}).encode('utf8')
b'{"title": "\\ud83d"}'

Which means the double backslash is expected somehow though and I trust Python for those.

Answer 7 · 2019-10-31T10:41:43.000Z

The double backslash on Python level makes a single backslash on JSON level, so it's only displayed as two:

In [7]: len(b'\\')                                                                                                                                                                                                   
Out[7]: 1

It's interesting that the JSON encoder is waterproof to this problem.

Answer 8 · 2019-10-31T11:45:54.000Z

There is a pull request #7028 now. It has a different approach to a fix, happy to discuss.

Answer 9 · 2019-11-13T16:54:33.000Z

Python upstream knows about this issue and mentions it in their Python 3 JSON documentation:

The RFC does not explicitly forbid JSON strings which contain byte sequences that don’t correspond to valid Unicode characters (e.g. unpaired UTF-16 surrogates), but it does note that they may cause interoperability problems. By default, this module accepts and outputs (when present in the original str) code points for such sequences.

So as far as Python is concerned, this is a feature and not a bug. (This Python issue/fix seems related: https://bugs.python.org/issue17906)

I think for django-rest-framework that means that if we want to protect users from this issue we can (a) detect and deny such input or (b) auto-correct it (to the extend possible).

I can think of three different approaches/places to attack this problem:

Globally, before parsing JSON (e.g. using a regex replace loop) — the rejecting version is demo'ed in #7028
Globally, while parsing JSON (but object_pairs_hook and object_hook during parsing are not enough to cover all cases)
Globally, after parsing JSON (e.g. using a recursive post-processor)
Selectively, during input validation in most field classes in rest_framework.fields (not just CharField) (e.g. using a code-point string post-processor)

What do you think?

Answer 10 · 2019-11-19T10:00:52.000Z

Any thoughts?

Answer 11 · 2019-11-19T10:16:09.000Z

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. It seems to me that there's a reasonable validation rule that we might want to apply at that layer of interface which is something like "this string must be able to encode into x", where x defaults to utf-8, but could also be some other character set.

Answer 12 · 2019-11-23T21:41:49.000Z

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. [..]

I have just created a pull request #7067 to demo and for discussion.

Answer 13 · 2019-12-17T13:51:51.000Z

Any thoughts?

Answer 14 · 2020-01-02T10:31:34.000Z

Happy new year! Any thoughts about pull request #7067 for a fix?

Answer 15 · 2020-01-02T12:33:09.000Z

@tomchristie

Answer 16 · 2020-01-02T12:48:00.000Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

Answer 17 · 2020-01-02T13:09:37.000Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

I didn't spam. I felt like the guy is not having a proper response or attention for the work he is doing. So I thought it might be better to mention the core maintainer of the project who is sponsored to work on this particular project for part/full-time.

Answer 18 · 2020-01-02T13:21:16.000Z

Pinging folks directly without adding anything of value is spammy behavior.
You have a track record here, and I'm asking you to please be more aware of that.

FWIW I'm still working flat-out: https://github.com/tomchristie/

That time happens to be largely on httpx right at this moment, which our sponsors are fully aware of, via the monthly reports.

Answer 19 · 2020-01-02T13:31:31.000Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

Speaking for myself, there were more than 7 days between my pings which I would consider okay in my own projects. (If that frequency of pinging is too high for DRF, please share what frequency of pinging is considered okay.)

Answer 20 · 2020-01-02T13:43:41.000Z

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

Answer 21 · 2020-01-02T14:14:33.000Z

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

I do myself a sponsored open source maintainer. And I found your comment very disrespectful. We never consider any one mentioning us and don't show ego. please don't try to teach me what is spammy behavior and what is not. I do participate in more open-source projects but never got triggered by someone mentioning me directly. I rather found your comment useless and adding no real value.

Answer 22 · 2020-01-06T17:21:19.000Z

With #7067 merged — is this ready to be closed?

Answer 23 · 2020-01-06T17:41:07.000Z

Yup, thanks!