Wrong end_pos for Chinese characters
debanjandhar12 opened this issue · 2 comments
debanjandhar12 commented
Description:
The end_pos data is calculated incorrectly when the input contains Chinese characters.
Example:
Code:
let mldocsOptions = {
  "toc": false,
  "heading_number": false,
  "keep_line_break": false,
  "format": "Org",
  "heading_to_list": false,
  "exporting_keep_properties": false,
  "inline_type_with_pos": true,
  "export_md_remove_options": [],
  "hiccup_in_block": true
};
Mldoc.parseJson(
  "我能做的任何我想要做到的事情",
  JSON.stringify(mldocsOptions),
  JSON.stringify({})
);
Output:
[[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":42}]]
The expected output is [[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":14}]], since the string "我能做的任何我想要做到的事情" has a length of 14.
RCmerci commented
start_pos and end_pos here are calculated in bytes (of the UTF-8 encoded input), not in characters.
e = new TextEncoder("utf-8")
// TextEncoder {encoding: 'utf-8'}
e.encode("我能做的任何我想要做到的事情")
// Uint8Array(42) [230, 136, 145, 232, 131, 189, 229, 129, 154, 231, 154, 132, 228, 187, 187, 228, 189, 149, 230, 136, 145, 230, 131, 179, 232, 166, 129, 229, 129, 154, 229, 136, 176, 231, 154, 132, 228, 186, 139, 230, 131, 133, buffer: ArrayBuffer(42), byteLength: 42, byteOffset: 0, length: 42, Symbol(Symbol.toStringTag): 'Uint8Array']
e.encode("我能做的任何我想要做到的事情").length
// 42
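If character offsets are needed on the JavaScript side, the byte offsets can be converted after parsing. The following is only a minimal sketch: byteOffsetToIndex is a hypothetical helper, not part of Mldoc, and it assumes the positions are UTF-8 byte offsets that fall on character boundaries.

```javascript
// Hypothetical helper (not part of Mldoc): convert a UTF-8 byte offset
// into a JavaScript string index (counted in UTF-16 code units).
// Assumes byteOffset falls on a character boundary; an offset in the
// middle of a multi-byte character would decode to a replacement character.
function byteOffsetToIndex(text, byteOffset) {
  const bytes = new TextEncoder().encode(text);
  // Decode only the bytes before the offset, then measure the result.
  return new TextDecoder().decode(bytes.subarray(0, byteOffset)).length;
}

const text = "我能做的任何我想要做到的事情";
const endIndex = byteOffsetToIndex(text, 42);
console.log(endIndex); // 14
console.log(text.slice(0, endIndex) === text); // true
```

This re-encodes the whole input on every call, so for many positions in the same text it would be cheaper to encode once and reuse the byte array.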
debanjandhar12 commented
I see. Thanks a lot for the help.
I looked into OCaml after posting the issue, and it seems its strings are arrays of 8-bit characters (bytes), so I guess this makes sense.