Non-UTF-8 byte sequences in body cause crashes

Question

Closed this issue 2 years ago · 5 comments

When running GT.M in UTF-8 mode, we experience crashes when HTTP request and response bodies contain non-UTF-8 byte sequences.

This is because, in UTF-8 mode, the string manipulation routines (for example $length and $extract) consider such byte sequences invalid and generate errors.

For GT.M, the solution is to use the byte-oriented counterparts instead, for example $zlength and $zextract. I will submit a pull request with these changes for GT.M. I'm not sure whether these changes break the Cache support, however, so I don't know if they will be usable as is. But maybe they can serve as a starting point.

Example:

GTM>w $l($zc(255))
%GTM-E-BADCHAR, $ZCHAR(255) is not a valid character in the UTF-8 encoding form

GTM>w $zl($zc(255))
1
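
To extend the example slightly, here is a minimal sketch of the byte-oriented functions applied to a body-like string. The routine name bytedemo and the variable names are made up for illustration; this is not code from the web server, only a way to see $zlength, $zextract and $zascii handle a byte that is invalid in UTF-8.

```mumps
bytedemo ; hypothetical routine name, for illustration only
 new body,i
 ; "Hi" followed by byte 255, which is not valid UTF-8
 set body=$zchar(72,105,255)
 ; byte count; no UTF-8 validation takes place
 write $zlength(body),!
 ; first two bytes, here the ASCII text "Hi"
 write $zextract(body,1,2),!
 ; walk the string byte by byte and print each byte value
 for i=1:1:$zlength(body) write $zascii(body,i)," "
 write !
 quit
```

Replacing $zlength with $length in this routine should reproduce the BADCHAR error shown above when the process runs in UTF-8 mode.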

Environment

M Web Server version: 1.1.2

GTM>w $zversion
GT.M V7.0-000 Linux x86_64
GTM>w $zchset
UTF-8

Answer 1 · 2022-07-12T16:19:45.000Z

@jensli Letting you know that I acknowledge receiving this issue.

This issue is a bit weird, because if you look at line 175, we are specifically expecting to be in M mode at the point this communication happens. Did you try another version of GT.M or YottaDB to see if there is a regression in GT.M?

175  X:%WOS="GT.M" "U %WTCP:(delim=$C(13,10):chset=""M"")" ; VEN/SMH - GT.M Delimiters

As I mentioned to you before, if you are customers, I can spend work time on the issue so that I can get it resolved. For now, I can tell you that your fix is good enough for what you need to do.

Answer 2 · 2022-07-12T16:47:38.000Z

And my apologies... let me thank you for your efforts to tell me about the issues. They will get fixed in due time.

Answer 3 · 2022-07-13T08:57:53.000Z

This issue is a bit weird, because if you look at line 175, we are specifically expecting to be in M mode at the point this communication happens. Did you try another version of GT.M or YottaDB to see if there is a regression in GT.M?
175  X:%WOS="GT.M" "U %WTCP:(delim=$C(13,10):chset=""M"")" ; VEN/SMH - GT.M Delimiters

I'm having trouble finding in the documentation exactly what the device parameter chset=""M"" does, but my guess is that it works like this: it sets the expected character encoding for the input data, so bytes are read from the device into a string without any checks or transformations. One byte in the input becomes one byte in the resulting string. Later, when the resulting string is passed to $length, the invalid UTF-8 byte sequence is detected and the error is generated.

All I really know is that I got %GTM-E-BADCHAR, and that it worked once I switched from $l to $zl.
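
If that guess is right, the sequence can be mimicked without a socket. The sketch below rests on that assumption: $zchar stands in for a read from a chset=""M"" device by building a string of raw, unvalidated bytes, and chsetdemo is a hypothetical routine name. It is meant to show where the error would surface, not how the web server actually reads the body.

```mumps
chsetdemo ; hypothetical routine name, for illustration only
 new body
 ; stand-in for a body read from the socket: raw bytes, not validated
 set body=$zchar(123,255,125)
 ; byte-oriented length: no UTF-8 validation, so this succeeds
 write $zlength(body),!
 ; character-oriented length: in UTF-8 mode with default settings,
 ; this is the call expected to raise %GTM-E-BADCHAR
 write $length(body),!
 quit
```

The point of the sketch is that nothing fails while the bytes are stored; the failure only appears once a character-oriented function inspects them.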

I have only tested with GTM 7.0.

I have made the changes locally in our product, and that solves the problem for us.

Some further observations:

I can see that $length is used in many more places in the code base. That has not caused problems for us, probably because those uses handle the HTTP header, and the HTTP header is always ASCII. (I have not checked this.)
The JSON routines also use $length. But that should not be a problem, since JSON should always be UTF-8.

Answer 4 · 2022-07-13T09:02:58.000Z

As I mentioned to you before, if you are customers, I can spend work time on the issue so that I can get it resolved.

We have had an initial discussion with Bhaskar about a support contract. Hopefully, in the autumn we will have time to move forward with that.

Also, we have managed to fix all problems locally, so this is neither a blocker nor urgent for us. I am mostly reporting this to try to help the project a little. :)

Answer 5 · 2022-08-15T15:14:02.000Z

Fixed in commit 4f6107a.