JakobOvrum/Dirk

non-UTF-8 text encodings make the framework crash

Closed this issue · 4 comments

Example output from putting a debugging writeln into the client code:

PRIVMSG "#xxxxxxx",x"54 68 65 20 49 50 20 72 61 6E 67 65 20 69 73 20 6F 77 6E 65 64 20 62 79 20 48 65 74 7A 6E 65 72 2C 20 61 20 73 65 72 76 65 72 20 68 6F 73 74 65 72 2C 20 73 6F 20 69 74 27 73 20 6F 62 76 69 6F 75 73 6C 79 20 6E 6F 74 20 61 20 6C 65 67 69 74 20 68 6F 6D 65 20 49 C3 50"c
src/rt/util/utf.d, 290, invalid UTF-8 sequence

IRC has no specific encoding for the messages. It's a binary protocol so to speak. Therefore, not crashing on invalid UTF-8 sequences would be preferential.

Crash can easily be triggered by using telnet to send some fun metacharacters.

The best course of action would be to fall back to CP1252 if UTF-8 decoding fails, as it is a superset of ISO-8859-1 used by Microsoft. Diggler which is built on Dirk should also not crash if this fails too.

I've planned adding a flag that determines text encoding, either 8-bit agnostic or UTF-8. The reason it needs to know about UTF-8 is for things like codepoint-aware and grapheme-aware splitting of lines for outgoing text (which is not yet implemented).

It shouldn't crash for incoming text, it should just pass it onto the user indiscriminately. Where exactly is the "crash" happening? Is it an exception or a segfault?

It appears to be an exception. On second thought, it might be in my code which has a client.onMessage callback. I'm running a debug build now to see if I can reproduce this and get a trace.

Okay, so even with a debug build I don't seem to get a trace that shows me where it encountered the exception. But I think I've found where the issue lies.

As I said, I do have some custom code being called through client.onMessage. I've wrapped this in a try-catch block catching UnicodeException. After testing this, I've found that it appears to fix the issue.

I think it may be worth considering to only pass well-formed strings to the onMessage delegates, so that not every delegate has to account for non-UTF-8 data being passed in as char[]s.

Alternatively, the type of the message should be ubyte[] instead, to make it clear that it is not guaranteed to be valid plain-text data.

Part of this would be fixed by a mode that transcodes to UTF-8. I filed this as #13.

Making the parameter type ubyte[] would be a big breaking change and complicates common user code quite a bit. Even with a transcoding mode, invalid UTF-8 is always a possibility, so some kind of formalized handling of it is an idea, but should be proposed in a separate enhancement request.