index() function returns wrong offset for non-ascii chars

Question

index() function returns wrong offset for non-ascii chars

atschabu opened this issue 7 years ago · 3 comments

I'm trying to strip away some text from part of a text. Trying to use something like sub("!.*"; "") doesn't work, as it is giving me a Segmentation fault when text is too long. So I tried to go this route:

$ jq '.msg | .[0:index("!")]'

which works fine with input like:
{"msg": "hello world!"}
but fails when text contains wide characters:
{"msg": "здравствуй мир!"}

$ echo '{"msg": "здравствуй мир!"}' | jq '.msg | index("!")'
27

$ echo '{"msg": "hello world!"}' | jq '.msg | index("!")'
11

$ jq --version
jq-1.5
$ uname -a
Darwin atschabu-C02SF0UTG8WM 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64

Answer 1 · 2017-06-19T03:15:30.000Z

There is some documentation about this on the "Pitfalls" page (https://github.com/stedolan/jq/wiki/How-to:-Avoid-Pitfalls)

In brief, you can use match/1:

echo '{"msg": "здравствуй мир!"}' | jq '.msg | match("!").offset'
14

This works in jq 1.5 and later.

By the way, could you please give more details about the failure of sub/2. Here is an illustration that it does not always fail when given a long string:

 jq1.5 -n '[range(0;100000) | "a"] | join("") + "!xx" | sub("!.*";"") | length'
100000

Answer 2 · 2017-06-19T17:03:49.000Z

My bad. I haven't even realized there is a wiki. I took all the information from the manual, which didn't mention anything about index being byte wise. I'll give match a go.

I still haven't figured out when exactly the Segmentation fault is happening, as I couldn't find the input yet which is producing it. But I went by the assumption it is related to issue 922 until I can proof the opposite.

I guess we can close this one, and I'll open a new ticket, in case my segmentation fault issue is not related to 922.

Answer 3 · 2017-11-28T19:00:25.000Z

No, this is a bug. We should fix it.