Using backtick in byte sequences

Question

wlammen opened this issue 4 years ago · 6 comments

Hi,

This is my first post to this group, so please forgive any rules I may have violated.

I refer to section 4.4 of the infra standard. In the beginning I read

A byte sequence is a sequence of bytes, represented as a space-separated sequence of bytes. Byte sequences with bytes in the range 0x20 (SP) to 0x7E (~), inclusive, can alternately be written as a string, but using backticks instead of quotation marks...

If I strictly follow this rule, a sequence of 5 octets 0x30 0x60 0x20 0x60 0x30 (0x60 = `, 0x30 = 0) is alternatively encoded as `0` `0`. How is this distinguished from two juxtaposed sequences, each containing just the character 0? Usually one would use an escaping scheme to include a backtick in a backtick-delimited display variant, or exclude it altogether. A clarifying note on this situation seems to be missing.
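The collision can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `render` helper that writes a byte sequence in the backtick notation quoted above; the helper is not part of Infra, it only mirrors the "backticks instead of quotation marks" rule.

```python
def render(seq: bytes) -> str:
    """Write a byte sequence in backtick notation (hypothetical helper).

    Only bytes in the range 0x20 (SP) to 0x7E (~) may be written this way.
    """
    if not all(0x20 <= b <= 0x7E for b in seq):
        raise ValueError("byte outside the range 0x20..0x7E")
    return "`" + seq.decode("ascii") + "`"


# One five-byte sequence: 0 ` SP ` 0
single = render(bytes([0x30, 0x60, 0x20, 0x60, 0x30]))

# Two one-byte sequences written next to each other, separated by a space
pair = render(b"0") + " " + render(b"0")

# Both spellings produce exactly the same characters.
print(single == pair)
```

Once the formatting is stripped, nothing in the resulting text tells the two readings apart.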

Note: parsing a byte sequence (actually an octet sequence, outside of this standard) in this format is difficult if a backtick is not immediately recognized as a delimiter. In fact, as things stand, only the last backtick delimits the string, meaning you have to read the whole octet stream, search backwards for the last `, and you MUST never let two such encodings appear in the same stream.
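To illustrate the parsing problem, here is a rough sketch with two hypothetical readers: one treats the first backtick after the opening one as the closing delimiter, the other searches backwards for the last backtick. Neither is defined by Infra; they only show that the two strategies disagree on the same input.

```python
def read_until_first(text: str) -> bytes:
    """Close the literal at the first backtick after the opening one."""
    start = text.index("`")
    end = text.index("`", start + 1)
    return text[start + 1:end].encode("ascii")


def read_until_last(text: str) -> bytes:
    """Close the literal at the last backtick in the whole input."""
    start = text.index("`")
    end = text.rindex("`")
    return text[start + 1:end].encode("ascii")


text = "`0` `0`"
first = read_until_first(text)  # a single byte 0x30
last = read_until_last(text)    # the five bytes 0x30 0x60 0x20 0x60 0x30

print(first, last)
```

The second reader needs the complete input before it can return anything, which is the backwards search described above.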

Wolf Lammen

PS: I originally wrote this post using plain backticks ` instead of the GitHub-conformant \` variant, which yielded illegible text. This illustrates the problem immediately.

Answer 1 · 2020-10-03T19:49:36.000Z

This is a fair point. The situation isn't quite as bad as you point out, for two reasons:

1. There's no valid meaning given in Infra for two byte sequences juxtaposed next to each other. So the only valid interpretation of `0` `0` is as the byte sequence 0x30 0x60 0x20 0x60 0x30.
2. The inner backticks are in code font, while the outer backticks are not. (This is very subtle, and more noticeable on GitHub than it is in WHATWG specs with the default stylesheet.)

Nevertheless, it might be a good idea for us to prohibit this.

I guess the same issue applies to string literals (https://infra.spec.whatwg.org/#strings) and the " character, hmm.

I'm unsure whether we should proactively change Infra to prohibit this, or first wait for an actual spec to run into this problem in the wild.

Answer 2 · 2020-10-04T06:26:31.000Z

Let me give a quick answer to your two points:

I am pretty sure that in real applications octet sequences are built up from fragments through concatenation. If you want to document this process in a text, juxtaposition is a suggestive means to illustrate this. Or think of a sequence containing two lines. Isn't `line`0x0D`other line` a viable option then?

Let me elaborate on this a bit further.

A copy of your post was sent to my email account with all formatting stripped. I saw a bare

...only valid interpretation of `0` `0` is...

If you have rich formatting available, there are better ways to display octet sequences with text contents than a backtick delimited string. Using a particular background colour is one such option.

Let's assume rich formatting is not available, as demonstrated in the email example. Or think of people with disabilities who have access to textual information only. This is where the backtick string format should show its strength.

Unfortunately, it is too limited (unspecified) to fulfil its purpose, because you cannot embed it in other text without running into the conflicts I described in my previous post. Sad. Usually a standard helps out here, and IMHO you should not wait until a rich set of incompatible interpretations has grown up.

Wolf Lammen

Answer 3 · 2020-10-04T15:36:45.000Z

"juxtaposition is a suggestive means to illustrate this"

It's not a valid one according to Infra, though. You can use words, like "concatenated with". Or you can just do the concatenation yourself. You cannot just put two things next to each other and assume that people understand that means concatenation.

Answer 4 · 2020-10-04T16:54:06.000Z

Of course, you must explain your syntax and methods somewhere; I agree with that. But this won't help, because I can replace the space character in my example and work with `0` concatenated with `0`, and you run into exactly the same trouble.

The problem is one of principle. A backtick indicates a context switch from normal text to quoted text and vice versa, ...or not, if it is itself a quoted character. But there is no precise way to decide which is the case.
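For comparison, this is roughly what the escaping scheme mentioned in the opening post could look like. It is a sketch only: Infra defines no escape mechanism, and both the backslash convention and the function names below are assumptions made for illustration.

```python
from typing import Tuple


def render_escaped(seq: bytes) -> str:
    """Write a byte sequence in backtick notation, escaping ` and \\ (assumed scheme)."""
    if not all(0x20 <= b <= 0x7E for b in seq):
        raise ValueError("byte outside the range 0x20..0x7E")
    body = seq.decode("ascii").replace("\\", "\\\\").replace("`", "\\`")
    return "`" + body + "`"


def parse_escaped(text: str, pos: int = 0) -> Tuple[bytes, int]:
    """Parse one literal starting at text[pos].

    Returns the bytes and the index just past the closing backtick.
    """
    if text[pos] != "`":
        raise ValueError("literal must start with a backtick")
    out = bytearray()
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # The character after a backslash is taken literally.
            if i + 1 >= len(text):
                raise ValueError("dangling escape")
            out.append(ord(text[i + 1]))
            i += 2
        elif ch == "`":
            # An unescaped backtick always closes the literal.
            return bytes(out), i + 1
        else:
            out.append(ord(ch))
            i += 1
    raise ValueError("unterminated literal")


data = bytes([0x30, 0x60, 0x20, 0x60, 0x30])
literal = render_escaped(data)
parsed, end = parse_escaped(literal)
print(parsed == data, end == len(literal))
```

With such a scheme, every unescaped backtick is a context switch, so a reader can decide at each character whether it is inside or outside the quoted text without looking ahead.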

"It's not a valid one according to Infra, though."

I cannot find anything in the Infra standard that objects to using juxtaposition. What exactly is violated? That something is not explicitly explained does not prohibit its use, as long as it at least matches your own rules extending the standard.

Answer 5 · 2020-10-05T08:07:35.000Z

We already have quotes appearing inside strings, e.g., https://fetch.spec.whatwg.org/#example-header-list-get-decode-split. I wouldn't mind trying to emphasize the difference more between what's inside and outside though.

I'm not sure I understand the other problem. Unless something is explained to have meaning, it doesn't.

Answer 6 · 2020-10-05T09:31:50.000Z

On what occasions do you need a string representation of a byte (octet) sequence?

1. Not in machine-to-machine communication. Here it is better to transport the octets as they are; no tampering and no string translation are needed.
2. In machine-to-human communication: yes, it helps to have data displayed in a legible form. Such communication occurs, for example,

   when data is displayed for debugging or logging purposes;
   when human input is fed into a machine, be it simple data input or a program.

3. In a human-to-human context. This is the case, for example, with documentation, manuals and the like. Or the URL standard.

In cases 2 and 3 the string will most likely be embedded in some outer text. And here the Infra standard is, IMO, flawed, because the suggested encoding does not reliably mark the delimiting points.

In case 3 something (like juxtaposition) can have a meaning without detailed explanation, because humans are often capable of filling in the gaps. In fact, the English language itself is not exactly defined. I can tell, since it is not my mother tongue.

In general, written text uses standards, but is not under full control of them.