Pretty printing code (esp. indent) has issues with non-ASCII unicode strings

Question

fingolfin opened this issue 8 months ago · 6 comments

While adding indentation to some printing code that prints non-ASCII text (here: a "wedge" symbol, ∧), I ran into the error below. I don't have time to produce a minimal reproducer right now, but I don't want to forget about this. It shouldn't be hard to make one: write a show method that prints a bunch of ä or whatnot, then indent-print it.

│   exception =
│    StringIndexError: invalid index [14], valid nearby indices [12]=>'∧', [15]=>'e'
│    Stacktrace:
│      [1] string_index_err(s::String, i::Int64)
│        @ Base ./strings/string.jl:12
│      [2] SubString{String}(s::String, i::Int64, j::Int64)
│        @ Base ./strings/substring.jl:35
│      [3] SubString
│        @ ./strings/substring.jl:41 [inlined]
│      [4] SubString
│        @ ./strings/substring.jl:47 [inlined]
│      [5] SubString
│        @ ./strings/substring.jl:43 [inlined]
│      [6] getindex
│        @ ./strings/substring.jl:281 [inlined]
│      [7] _write_line(io::AbstractAlgebra.PrettyPrinting.IOCustom{IOContext{IOBuffer}}, str::SubString{String})
│        @ AbstractAlgebra.PrettyPrinting ~/.julia/packages/AbstractAlgebra/R29qD/src/PrettyPrinting.jl:1595
│      [8] write(io::AbstractAlgebra.PrettyPrinting.IOCustom{IOContext{IOBuffer}}, str::String)
│        @ AbstractAlgebra.PrettyPrinting ~/.julia/packages/AbstractAlgebra/R29qD/src/PrettyPrinting.jl:1634
│      [9] print(io::AbstractAlgebra.PrettyPrinting.IOCustom{IOContext{IOBuffer}}, s::String)
│        @ Base ./strings/io.jl:246

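The problem is that in a non-ASCII string not every integer index is valid, but the code in _write_line(io::IOCustom, str::AbstractString) implicitly assumes that every index is. Note that length(str) counts characters, while str[1:firstlen] indexes by byte (code unit). The crash is here: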

...
  firstlen = min(spaceleft, length(str))
  firststr = str[1:firstlen]  # <-- here
...

What we really want is to take the first few "graphemes" (?), not the first few bytes of the string. Perhaps Unicode.graphemes would be a way to resolve this. Dunno.
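
A minimal sketch of both options, for illustration only. first(str, n) from Base takes the first n characters without touching byte indices; first_graphemes is a hypothetical helper name (not part of AbstractAlgebra) that does the same per grapheme. How either would be wired into _write_line is left open.

```julia
using Unicode  # stdlib; provides Unicode.graphemes

s = "a∧b"  # '∧' takes 3 bytes in UTF-8, so s[1:3] throws a StringIndexError

# first(str, n) returns the first n characters, whatever their byte length.
prefix = first(s, 2)

# Hypothetical helper: the first n grapheme clusters, so that a base letter
# and its combining marks stay together.
first_graphemes(str::AbstractString, n::Integer) =
    join(Iterators.take(Unicode.graphemes(str), n))

t = "α\u0301β"                       # α + combining acute accent + β
by_char = first(t, 1)                # only the α; the accent is cut off
by_grapheme = first_graphemes(t, 1)  # α together with its accent
```

The character-based variant is enough to avoid the StringIndexError; the grapheme-based one additionally keeps combining marks attached to their base letter.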

Answer 1 · 2024-01-22T15:04:05.000Z

Not a bug, but a feature (I disregarded Unicode on purpose).

Answer 2 · 2024-01-22T16:51:07.000Z

A further issue with this Unicode stuff is that characters/graphemes/whatever do not all have the same display width, so automatic line breaks are not that easy to get right.

Answer 3 · 2024-01-23T00:55:32.000Z

For width, there is Base.Unicode.textwidth
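
As a rough sketch of how textwidth could drive the line-breaking decision: the function below (prefix_fitting is a made-up name, not existing AbstractAlgebra code) returns the longest prefix whose display width fits a given budget. It works per character, not per grapheme, which is an assumption made to keep the example short.

```julia
# Hypothetical helper: longest prefix of `str` whose display width is <= `width`.
function prefix_fitting(str::AbstractString, width::Integer)
    used = 0     # display columns consumed so far
    last_ok = 0  # start index of the last character that still fits (0 = none)
    for (i, c) in pairs(str)  # pairs yields only valid indices
        w = textwidth(c)
        used + w > width && break
        used += w
        last_ok = i
    end
    # Slicing up to a valid character index includes that whole character.
    return str[1:last_ok]
end

narrow = prefix_fitting("äöü", 2)     # each of these letters is 1 column wide
wide = prefix_fitting("日本語", 3)     # each of these is 2 columns wide
```

Because zero-width combining marks add nothing to the running total, they stay with the preceding letter as long as that letter fits.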

Answer 4 · 2024-01-23T01:03:28.000Z

For that matter, there is also chop, which removes characters (not bytes) from the start and/or end of a string, along with its various "siblings". Alas, it does not respect full graphemes: a composite character (e.g. an "a" followed by a combining accent) gets broken up.

Indeed, consider:

julia> s = "ά"
"ά"

julia> chop(s)  # removes last character = accent
"α"

julia> collect(s)
2-element Vector{Char}:
 'α': Unicode U+03B1 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)       # <- here's the accent

julia> using Unicode ; Unicode.graphemes(s)
length-1 GraphemeIterator{String} for "ά"

Yes, handling Unicode is messy and complex (like most real world things).

Answer 5 · 2024-01-24T10:40:23.000Z

I will look into this

Answer 6 · 2024-01-24T10:56:08.000Z

Some more notes: Julia has nextind (https://docs.julialang.org/en/v1/base/strings/#Base.nextind) and its siblings for stepping between valid indices, and eachindex, which yields all valid indices of a string.
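
To illustrate these functions, here is a small sketch on a string with the same wedge symbol. The last two lines show one possible way to turn a character count into a safe slice; the variable names mirror the snippet from the issue, but this is not the actual fix, and it assumes 0 <= firstlen <= length(s).

```julia
s = "a∧b"  # 3 characters, 5 bytes

valid = collect(eachindex(s))  # start index of each character
after_wedge = nextind(s, 2)    # index of the character following '∧'
before_b = prevind(s, 5)       # index of the character preceding 'b'
inside = isvalid(s, 3)         # byte 3 lies inside '∧'

# nextind(s, 0, n) is the start index of the n-th character, so this slice
# holds the first `firstlen` characters and never splits one.
firstlen = 2
firststr = s[1:nextind(s, 0, firstlen)]
```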