Tokenizer incorrect handling of newlines and line continuation following ?

Question

Tokenizer incorrect handling of newlines and line continuation following ?

Closed this issue 2 months ago · 5 comments

lukaszsamson commented 3 months ago

Elixir and Erlang/OTP versions

Erlang/OTP 27 [erts-15.2.5] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Elixir 1.18.4 (compiled with Erlang/OTP 27)

Operating system

any

Current behavior

There are 4 cases involving char parsing where the behaviour is wrong or questionable

Raw <LF> following ?: tokenizer consumes newline and does not advance line, following tokens have invalid position meta

iex(3)> :elixir_tokenizer.tokenize(~c"?\n1", 1, 1, []) |> elem(4) |> Enum.reverse
[{:char, {1, 1, ~c"?\n"}, 10}, {:int, {1, 3, 1}, ~c"1"}]

Raw <CR><LF> following ?: tokenizer consumes carriage return, emits EoL, advances line

iex(4)> :elixir_tokenizer.tokenize(~c"?\r\n1", 1, 1, []) |> elem(4) |> Enum.reverse
[
  {:char, {1, 1, ~c"?\r"}, 13},
  {:eol, {1, 3, 1}},
  {:int, {2, 1, 1}, ~c"1"}
]

Escaped <LF> (line continuation): tokenizer consumes newline despite the escape, does not advance line, following tokens have invalid position meta

iex(5)> :elixir_tokenizer.tokenize(~c"?\\\n1", 1, 1, []) |> elem(4) |> Enum.reverse
[{:char, {1, 1, ~c"?\\\n"}, 10}, {:int, {1, 4, 1}, ~c"1"}]

Escaped <CR><LF> (line continuation): tokenizer consumes carriage return despite escape, emits EoL despite escape, advances line

iex(6)> :elixir_tokenizer.tokenize(~c"?\\\r\n1", 1, 1, []) |> elem(4) |> Enum.reverse
[
  {:char, {1, 1, ~c"?\\\r"}, 13},
  {:eol, {1, 4, 1}},
  {:int, {2, 1, 1}, ~c"1"}
]

Expected behavior

All the cases emit warnings and are not exactly correct code. Proper escape sequences should be used in valid elixir (?\n and ?\r).

I don't have a strong opinion on how the code should work. I feel that char parsing should not consume raw <LF> and '' but that may be required for backwards compatibility. It shouldn't also consume line continuations.

One thing I'm certain is that when a newline is consumed, line should be advanced

Answer 1 · 2025-08-19T17:06:55.000Z

It might be worth to consider all new line code points as unicode has a few, I’ve implemented a similar solution for my library, with all known new line and whitespace code points. Although my library works on binaries.

Answer 2 · 2025-08-31T08:24:49.000Z

Actually, inconsistent supporting of them can come with security issues and spoofing attacks. Unicode recommends supporting a subset of all defined ones and raising for the undefined ones: https://www.unicode.org/reports/tr55/

We raise it exactly because supporting all of them can lead to formatting issues, incompatibility with editors, and so forth.

Answer 3 · 2025-08-31T13:07:49.000Z

Actually, inconsistent supporting of them can come with security issues and spoofing attacks. Unicode recommends supporting a subset of all defined ones and raising for the undefined ones: https://www.unicode.org/reports/tr55/

We raise it exactly because supporting all of them can lead to formatting issues, incompatibility with editors, and so forth.

Yes, it's important to normalize whitespace or line breaks, in most cases the lexer or tokenizer would increment an integer or output: \n

Answer 4 · 2025-08-31T13:54:20.000Z

To be clear, that's not what I meant at all. I am saying that supporting all Unicode line breaks is not the actual recommendation for source code, regardless of normalization. Furthermore, if your language supports some, but that's not mirrored in editors, then it can be the source of security issues. Normalization does not matter because the editors will see the raw text, not what you normalize. Whitespace, bidirectional controls, can also be used to hide intent in the actual code. So considering all known line break, whitespace, bidi, etc should not be a blanket recommendation.

Answer 5 · 2025-08-31T21:04:08.000Z

To be clear, that's not what I meant at all. I am saying that supporting all Unicode line breaks is not the actual recommendation for source code, regardless of normalization. Furthermore, if your language supports some, but that's not mirrored in editors, then it can be the source of security issues. Normalization does not matter because the editors will see the raw text, not what you normalize. Whitespace, bidirectional controls, can also be used to hide intent in the actual code. So considering all known line break, whitespace, bidi, etc should not be a blanket recommendation.

Yes, normalization would require formatting. I don't believe there is any way to normalize zero width, bidi or other special control characters and they should not be allowed in keywords (raising is right). Although line breaks aka new line vertical space and whitespace is trivial to normalize.