ActivityWatch/aw-server-rust

Importing of buckets with invalid utf-8 strings fails

johan-bjareholt opened this issue · 2 comments

I once visited a website that deliberately put invalid UTF-16 strings in its title. This should be very rare, as the site went to great lengths to mess up the characters (it essentially did so to be artistic; the site is virtualself.co).

When I try to import such buckets from aw-server-python into aw-server-rust, it fails with the following:

thread '' panicked at 'Failed to deserialize import data as JSON to bucket format: Error("unexpected end of hex escape", line: 24, column: 291)', aw-server/src/endpoints/import.rs:61:38

If someone wants to reproduce this, here is a minimal export file:

{
    "buckets": {
        "my-test-bucket": {
            "id": "my-test-bucket",
            "hostname": "johan-desktop",
            "client": "test-client",
            "created": "2020-01-28T12:43:29.137000+00:00",
            "type": "test",
            "events": [
                {
                    "timestamp": "2020-01-28T12:43:14.141000+00:00",
                    "duration": 4.993,
                    "data": {
                        "url": "https://virtualself.co/",
                        "title": "\ud835\udc47\ud835\ude29\ud835\udc56\ud835\udc60 \ud835\udc64\ud835\udc56\ud835\udc59\ud835\udc59 \ud835\udc4f\ud835\udc52\ud835\udc50\ud835\udc5c\ud835\udc5a\ud835\udc52 \ud835\udc61\ud835\ude29\ud835\udc52 \ud835\udc53\ud835\udc56\ud835\udc5b\ud835\udc4e\ud835\udc59 \ud835\udc5d\ud835\udc4e\ud835\udc5f\ud835\udc61\ud835\udc56\ud835\udc50\ud835\udc59\ud835\udc52-\ud835\udc38\ud835\udc65\ud835\udc5d\ud835\udc52\ud835\udc5f\ud835",
                        "audible": true,
                        "incognito": false,
                        "tabCount": 57
                    }
                }
            ]
        }
    }
}

The issue is that the string ends with a lone "\ud835", which is not allowed in UTF-8/16. "\ud835" is a high surrogate, so it must be followed by a second escape (a low surrogate in the range \udc00 to \udfff) to form a valid pair.
Our Python server happily accepted this invalid string, while serde_json does not.
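
To illustrate the pairing rule, here is a minimal sketch using only the Rust standard library (not serde_json, and not code from aw-server-rust). The helper name is_well_formed_utf16 is made up for this example.

```rust
/// Hypothetical helper: returns true if the UTF-16 code units are well formed,
/// i.e. every high surrogate is immediately followed by a low surrogate
/// and no low surrogate appears on its own.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // `char::decode_utf16` yields an `Err` for every unpaired surrogate.
    char::decode_utf16(units.iter().copied()).all(|unit| unit.is_ok())
}

/// Hypothetical helper: tries to build a Rust `String` from UTF-16 code units.
/// A `String` must be valid UTF-8, so a lone surrogate cannot be represented.
fn utf16_to_string(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

With these, the pair 0xD835 0xDC47 (the first two escapes of the title) is accepted and decodes to the single code point U+1D447, while the trailing 0xD835 on its own is rejected. That is the same condition serde_json reports as "unexpected end of hex escape".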

I posted a Stack Overflow question about this:

https://stackoverflow.com/questions/64114043/serde-json-ignore-parts-of-data-if-parsing-it-fails

I also realized that it might be possible to fix this in legacy_import.rs, since that parses each event separately. If parsing a specific event fails, we can just skip that event. I think that might be a good enough solution.
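
A rough sketch of the "skip events that fail to parse" idea, using only the standard library. The function name import_events_lenient and the parse parameter are hypothetical and are not taken from legacy_import.rs. In the real importer, parse would presumably be something like a serde_json::from_str call for the event type.

```rust
use std::fmt::Display;

/// Hypothetical sketch: parse each raw event on its own and keep the ones
/// that succeed. Returns the parsed events and the number of skipped events.
/// `parse` stands in for the real per-event deserializer.
fn import_events_lenient<T, E: Display>(
    raw_events: &[&str],
    parse: impl Fn(&str) -> Result<T, E>,
) -> (Vec<T>, usize) {
    let mut imported = Vec::with_capacity(raw_events.len());
    let mut skipped = 0;
    for (index, raw) in raw_events.iter().copied().enumerate() {
        match parse(raw) {
            Ok(event) => imported.push(event),
            Err(err) => {
                // Report the bad event instead of aborting the whole import.
                eprintln!("skipping event {}: {}", index, err);
                skipped += 1;
            }
        }
    }
    (imported, skipped)
}
```

For example, with integer parsing as a stand-in for event parsing, import_events_lenient(&["1", "x", "3"], |s| s.parse::<i32>()) keeps 1 and 3 and reports one skipped entry. The trade-off is that the event with the broken title is silently lost from the import, so logging or counting skipped events seems worthwhile.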