mjackson/mach

Mach 1.0 truncating UTF characters

Closed this issue · 11 comments

I'm getting a bunch of garbage characters in my app. I reduced it down to the following test:

var mach = require("mach");

var stack = new mach.stack();
stack.use(mach.charset, "utf-8");

stack.route(
    "*",
    function (connection) {
        connection.html("<!DOCTYPE html><body>This should be a triangle: \"▼\"</body>");
    }
);

mach.serve(stack, 8089);

Here's the hex result, as viewed through Charles:

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Fri, 14 Nov 2014 21:36:45 GMT
Connection: keep-alive
Transfer-Encoding: chunked

00000000  3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 3c   <!DOCTYPE html><
00000010  62 6f 64 79 3e 54 68 69 73 20 73 68 6f 75 6c 64   body>This should
00000020  20 62 65 20 61 20 74 72 69 61 6e 67 6c 65 3a 20    be a triangle: 
00000030  22 bc 22 3c 2f 62 6f 64 79 3e                     " "</body>      

As you can see, 0x25bc is being returned as 0xbc. In the browser, it is rendered as char code 0xbc if I use no charset and 0xfffd if I do. (0xfffd is literally "replacement character" - e.g. the browser doesn't know WTF it's supposed to be).

I've never used Charles to convert an HTTP response to hex before, but based on what I'm seeing, it looks like only the last two characters of a four-character hex code are being returned:

expected received
0x 25 bc 0x bc
0x 11 cd 0x cd
0x 12 34 0x 34

Any idea what would cause this? I'm certain it worked before the 1.0 push.

Hmm, this could be due to the fact that we're now using bodec to encode binary data instead of node's Buffer module directly. I did this because I wanted a way to do binary data in the browser without importing a ton of code that we don't need (ala Browserify's Buffer shim).

Maybe this has something to do with that?

@appsforartists Are you still seeing this issue? If so, I'd love to get a failing test case for it and squash this bug.

Yeah I am. I just installed a fresh copy of all my npm dependencies (including mach@1.0.2) on the server, and I still see it. Oddly, I don't always see it on my local machine. Could be worth investigating if there are any differences between the two. I'll let you know.

I looked into it. The npm stuff is all identical. The difference is that on the dev server, I inline the scripts (so they are passed through mach). On my local machine, they are served by Webpack, which serves them correctly.

FWIW, here's the result of npm list:

├─┬ mach@1.0.2
│ ├── bodec@0.1.0 (git://github.com/mjackson/bodec#1c31ee1311969f4471e3582cdcefe1a93b711288)
│ ├─┬ bufferedstream@3.0.7
│ │ ├── bodec@0.1.0 (git://github.com/mjackson/bodec#1c31ee1311969f4471e3582cdcefe1a93b711288)
│ │ └── events@1.0.2
│ ├── describe-property@1.0.1
│ ├── object-assign@2.0.0
│ ├── qs@2.3.3
│ ├── redis@0.11.0
│ ├── strftime@0.8.2
│ └── when@3.6.4

(FWIW, bodec is now tagged at 1.0.0, so you can update your dependency.)

I am having a similar problem. If I read a JSON file with unicode characters, parse it on the server, then send it to the browser with conn.json(), the browser cannot parse the response. If I serve the file up directly with mach.file, however, everything works fine.

Here is a tiny app that demonstrates the problem: https://dl.dropboxusercontent.com/u/22414/mach-unicode-testcase.zip

Just npm install then browse to http://localhost:5000/index.html

EDIT: To clarify, in the first case the response is delivered to the browser with HTTP status 200, but the body can't be parsed with JSON.parse().

@appsforartists @nicholascloud thanks for the report. i'll look into this today.

@appsforartists @nicholascloud Just FYI, this is fixed and published in version 1.1.0. The issue was with how I was treating strings in the BufferedStream library as binary by default, instead of utf-8. Silly me.

Thanks for your help!

Thanks @mjackson. Might I recommend augmenting your Message unit tests with non-ASCII characters to prevent regressions? 😃

@appsforartists excellent idea.

@mjackson Excellent ❗