index() function returns wrong offset for non-ascii chars
atschabu opened this issue · 3 comments
I'm trying to strip the trailing part of a string. Something like sub("!.*"; "")
doesn't work for me, because it gives a Segmentation fault when the text is too long. So I tried this route instead:
$ jq '.msg | .[0:index("!")]'
which works fine with input like:
{"msg": "hello world!"}
but fails when the text contains non-ASCII (multi-byte) characters:
{"msg": "здравствуй мир!"}
$ echo '{"msg": "здравствуй мир!"}' | jq '.msg | index("!")'
27
$ echo '{"msg": "hello world!"}' | jq '.msg | index("!")'
11
$ jq --version
jq-1.5
$ uname -a
Darwin atschabu-C02SF0UTG8WM 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64
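For reference, the 27 reported above equals the UTF-8 byte length of the text before the "!", which suggests index is counting bytes rather than characters. A minimal sketch to check that, assuming a UTF-8 terminal and script encoding (the variable names are only illustrative):

```shell
# Text that precedes the "!" in the failing input.
prefix='здравствуй мир'

# wc -c counts bytes: 13 two-byte Cyrillic letters plus one single-byte space.
# tr strips the padding that some wc implementations (e.g. on macOS) add.
byte_count=$(printf '%s' "$prefix" | wc -c | tr -d '[:space:]')
echo "$byte_count"
```

The string has 14 characters, so a character-based offset for "!" would be 14, not 27.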
There is some documentation about this on the "Pitfalls" page (https://github.com/stedolan/jq/wiki/How-to:-Avoid-Pitfalls)
In brief, you can use match/1:
echo '{"msg": "здравствуй мир!"}' | jq '.msg | match("!").offset'
14
This works in jq 1.5 and later.
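Applied to the original goal of cutting the string at the "!", the offset from match can be fed straight into the slice, since string slices are indexed by codepoint as well. A minimal sketch, assuming jq 1.5 or later is on the PATH; strip_after_bang is a hypothetical helper name, not part of jq:

```shell
# Hypothetical helper: print the part of .msg that precedes the first "!".
# match/1 reports its offset in codepoints, the same unit string slices use.
strip_after_bang() {
  jq -r '.msg | .[0:match("!").offset]'
}

echo '{"msg": "здравствуй мир!"}' | strip_after_bang
echo '{"msg": "hello world!"}' | strip_after_bang
```

Note that match produces no output when the string contains no "!", so in that case this filter prints nothing for that input.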
By the way, could you please give more details about the failure of sub/2. Here is an illustration that it does not always fail when given a long string:
jq1.5 -n '[range(0;100000) | "a"] | join("") + "!xx" | sub("!.*";"") | length'
100000
My bad. I hadn't even realized there is a wiki. I took all my information from the manual, which doesn't mention that index is byte-wise. I'll give match a go.
I still haven't figured out exactly when the Segmentation fault happens, as I haven't yet found the input that produces it. For now I'm assuming it is related to issue 922 until I can prove otherwise.
I guess we can close this one, and I'll open a new ticket if my segmentation fault turns out not to be related to 922.
No, this is a bug. We should fix it.