logseq/mldoc

Wrong end_pos for Chinese characters

debanjandhar12 opened this issue · 2 comments

Description:

The end_pos value is calculated incorrectly when the input contains Chinese characters.

Example:

Code:

let mldocsOptions = {
  "toc": false,
  "heading_number": false,
  "keep_line_break": false,
  "format": "Org",
  "heading_to_list": false,
  "exporting_keep_properties": false,
  "inline_type_with_pos": true,
  "export_md_remove_options": [],
  "hiccup_in_block": true
};

Mldoc.parseJson("我能做的任何我想要做到的事情",
   JSON.stringify(mldocsOptions),
   JSON.stringify({})
);

Output:

[[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":42}]]

The expected output is [[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":14}]], because the string "我能做的任何我想要做到的事情" has a length of 14 characters.

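start_pos and end_pos here are byte offsets into the UTF-8 encoded input, not character counts. Each of these Chinese characters takes 3 bytes in UTF-8, so 14 characters give 42 bytes: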

e = new TextEncoder("utf-8")
// TextEncoder {encoding: 'utf-8'}
e.encode("我能做的任何我想要做到的事情")
// Uint8Array(42) [230, 136, 145, 232, 131, 189, 229, 129, 154, 231, 154, 132, 228, 187, 187, 228, 189, 149, 230, 136, 145, 230, 131, 179, 232, 166, 129, 229, 129, 154, 229, 136, 176, 231, 154, 132, 228, 186, 139, 230, 131, 133, buffer: ArrayBuffer(42), byteLength: 42, byteOffset: 0, length: 42, Symbol(Symbol.toStringTag): 'Uint8Array']
e.encode("我能做的任何我想要做到的事情").length
// 42
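
If you need JavaScript string indices rather than byte offsets, one option is to convert on the caller's side. The following is only a minimal sketch: byteOffsetToStringIndex is a hypothetical helper, not part of the mldoc API, and it assumes the offset falls on a UTF-8 character boundary (which should hold for the start_pos/end_pos values shown above).

```javascript
// Hypothetical helper (NOT part of mldoc): convert a UTF-8 byte offset
// into a JavaScript string index (counted in UTF-16 code units).
// Assumption: byteOffset lies on a character boundary of the encoded text.
function byteOffsetToStringIndex(text, byteOffset) {
  // Encode the whole input to UTF-8 bytes.
  const bytes = new TextEncoder().encode(text);
  // ignoreBOM: true keeps a leading U+FEFF in the decoded output,
  // so the resulting length is not off by one for such inputs.
  const decoder = new TextDecoder("utf-8", { ignoreBOM: true });
  // Decode only the bytes before the offset; the length of that prefix
  // is the matching index into the original string.
  return decoder.decode(bytes.subarray(0, byteOffset)).length;
}

const text = "我能做的任何我想要做到的事情";
console.log(byteOffsetToStringIndex(text, 0));  // 0
console.log(byteOffsetToStringIndex(text, 42)); // 14
```

Note that the result is in UTF-16 code units, the same unit as String.prototype.length, so characters outside the Basic Multilingual Plane (such as most emoji) count as 2.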

I see. Thanks a lot for the help.
I looked into OCaml after posting the issue, and it seems its strings are arrays of 8-bit characters (bytes). So I guess this makes sense.