w3c/csswg-drafts

[css-text-3] Segment Break Transformation Rules around CJK Punctuation

MurakamiShinyu opened this issue · 17 comments

(There are related discussions in #4992, #5017, and w3c/jlreq#211)

(I wrote about this topic in Japanese at https://lists.w3.org/Archives/Public/public-i18n-japanese/2020AprJun/0232.html and its thread)

When I write Japanese text with manual line breaks, I prefer to insert line breaks after an ideographic/fullwidth full stop or comma [。、.,] rather than between Kanji/Hiragana/Katakana letters, because full stops and commas are break points in thought and I can naturally press the Enter key there. So it is very important to be able to put line breaks after CJK punctuation without causing extra space. I guess this is not just my personal preference but common to many people. (I believe it's the same for Chinese, and also for Korean when using CJK punctuation.)

e.g.,

日本語のテキストに、
English textを埋め込む。

should be transformed to

日本語のテキストに、English textを埋め込む。

and not to

日本語のテキストに、 English textを埋め込む。

However, the current draft's Segment Break Transformation Rules do not meet this requirement. According to these rules, the segment break is discarded only if both the characters before and after the segment break belong to the space-discarding character set, and converted to a space otherwise.
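To make the two-sided check concrete, here is a minimal Python sketch of the rule as summarized above. The function names are hypothetical, and `is_space_discarding` only approximates the draft's space-discarding character set by East Asian Width (F or W), which is an assumption; the draft defines its own set. Only the two characters adjacent to the segment break are modelled.

```python
import unicodedata


def is_space_discarding(ch: str) -> bool:
    """Assumption: approximate the draft's set by East Asian Width F/W."""
    return bool(ch) and unicodedata.east_asian_width(ch) in ("F", "W")


def join_current_draft(before: str, after: str) -> str:
    """Join two source lines following the rule summarized in the text."""
    prev_ch, next_ch = before[-1:], after[:1]
    if is_space_discarding(prev_ch) and is_space_discarding(next_ch):
        return before + after  # segment break is discarded
    return before + " " + after  # otherwise it becomes a space


print(join_current_draft("日本語の", "テキスト"))
print(join_current_draft("日本語のテキストに、", "English textを埋め込む。"))
```

Under these assumptions the first call joins the lines with no space, while the second keeps a space between the ideographic comma and "English", which is the behavior this issue is about.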

Line break treatment in TeX with CJK support

TeX has been used for Japanese typesetting since a Japanese TeX, pTeX, was developed in 1987. pTeX and its derivatives and successors treat line breaks as follows:

  • If the character before the line break is a Japanese character, then the line break is removed.
  • Otherwise, the line break is converted to a space.

This has been the de facto standard for Japanese TeX users for the last 30 years.

(See the LuaTeX-ja documentation, "13 Linebreak after a Japanese Character", for details)
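The one-sided pTeX-style rule above can be sketched in a few lines of Python. This is an illustration of the two bullets only, not pTeX's actual implementation; the function names are hypothetical, and treating "Japanese character" as East Asian Width F or W is an assumption made for the sketch.

```python
import unicodedata


def is_japanese_char(ch: str) -> bool:
    """Assumption: treat East Asian Width F/W as a 'Japanese character'."""
    return bool(ch) and unicodedata.east_asian_width(ch) in ("F", "W")


def join_ptex_style(before: str, after: str) -> str:
    """Only the character before the line break decides the outcome."""
    if is_japanese_char(before[-1:]):
        return before + after  # line break is removed
    return before + " " + after  # line break becomes a space


print(join_ptex_style("日本語のテキストに、", "English textを埋め込む。"))
print(join_ptex_style("English text", "を埋め込む。"))
```

Because nothing after the break is inspected, the author can decide whether a break is safe without knowing what will be typed next.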

With this rule, authors can put a line break after Japanese punctuation without causing extra space when a non-Japanese character follows the line break. So I think this rule has an advantage over the current CSS draft.

I am not a TeX expert and have only limited knowledge of Japanese TeX, so I asked TeX experts on Twitter and got some useful information.

  • https://twitter.com/watayan/status/1260142719562731525

    (Translation from Japanese) When writing in TeX, I appreciate the rule of "depending on the character at the end of a line". And when writing HTML, I put line breaks only where extra space is tolerable, with a feeling of giving up.

Such Japanese users will be disappointed if the Segment Break Transformation Rules cause extra space between Japanese punctuation and a non-Japanese character.

  • https://twitter.com/zr_tex8r/status/1260150913118818304

    (Translation from Japanese) In the case of "XeLaTeX + xeCJK package" which is a "TeX for Chinese" widely used in China, the rule (simplified) is "Ignore line break if both before and after the line break are CJK" by default. It can be changed by setting.

    • https://twitter.com/zr_tex8r/status/1261663712076685313

      (Translation from Japanese) I tried to typeset the following sources with the default settings of xeCJK. Unexpectedly, all three outputs are the same: "no extra space occurs".

      中文。 English。
      
      中文。English。
      
      中文。
      English。
      

This behavior in "XeLaTeX + xeCJK package" is very interesting to me. I found the following description in the README of xeCJK:

  • Spaces automatically ignored between CJK characters.
  • Special effects on full-width CJK punctuation.
  • Automatic adjustment of the space between CJK and other characters.

In XeLaTeX + xeCJK, line breaks in the source are treated as spaces, and spaces are ignored between two CJK characters. In addition, spaces are ignored between CJK punctuation and a non-CJK character, as one of the "Special effects on full-width CJK punctuation". As in Japanese TeX (pTeX etc.), authors can put a line break after CJK punctuation without causing extra space when a non-CJK character follows the line break.
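The behavior described in the paragraph above can be approximated with a short Python sketch. It follows only that description and is not xeCJK's implementation; the function names are hypothetical, and classifying "CJK" by East Asian Width F or W is an assumption.

```python
import unicodedata


def is_cjk(ch: str) -> bool:
    """Assumption: East Asian Width F/W stands in for 'CJK character'."""
    return unicodedata.east_asian_width(ch) in ("F", "W")


def is_cjk_punct(ch: str) -> bool:
    """A CJK character whose Unicode general category is punctuation (P*)."""
    return is_cjk(ch) and unicodedata.category(ch).startswith("P")


def collapse_as_described(source: str) -> str:
    """Treat line breaks as spaces, then drop spaces as described above."""
    text = source.replace("\n", " ")
    out = []
    for i, ch in enumerate(text):
        if ch == " " and 0 < i < len(text) - 1:
            prev_ch, next_ch = text[i - 1], text[i + 1]
            between_cjk = is_cjk(prev_ch) and is_cjk(next_ch)
            if between_cjk or is_cjk_punct(prev_ch) or is_cjk_punct(next_ch):
                continue  # space is ignored
        out.append(ch)
    return "".join(out)


for src in ("中文。 English。", "中文。English。", "中文。\nEnglish。"):
    print(collapse_as_described(src))
```

With these assumptions, the three sources quoted from the tweet all come out as the same string, with no space after the ideographic full stop.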

Proposal to fix Segment Break Transformation Rules

I propose to add one rule to the Segment Break Transformation Rules before the last "Otherwise … converted to a space":

  • Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or Space Separator (Zs), then the segment break is removed.

(U+3000, ideographic space, is probably the only character that belongs to the space-discarding character set and is a Space Separator Zs)

With this rule, no extra space occurs in the following examples:

日本語のテキストに、
English textを埋め込む。

日本語のテキストに、English textを埋め込む。

日本語のテキストにEnglish text
(英語のテキスト)
を埋め込む。

日本語のテキストにEnglish text(英語のテキスト)を埋め込む。

(In this example, fullwidth parentheses are used)

日本語のテキスト! 
English textを埋め込む。

日本語のテキスト! English textを埋め込む。

(In this example, there is an ideographic space U+3000 ' ' after the '!')
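A minimal Python sketch of the proposed additional rule, placed after the existing two-sided check, may help when testing the examples above. The function names are hypothetical, and approximating the space-discarding character set by East Asian Width (F or W) is an assumption; the draft defines its own set. Fullwidth characters and U+3000 are written as escapes so they are unambiguous.

```python
import unicodedata


def is_space_discarding(ch: str) -> bool:
    """Assumption: approximate the draft's set by East Asian Width F/W."""
    return bool(ch) and unicodedata.east_asian_width(ch) in ("F", "W")


def is_discarding_punct_or_space(ch: str) -> bool:
    """Space-discarding AND Unicode Punctuation (P*) or Space Separator (Zs)."""
    if not is_space_discarding(ch):
        return False
    category = unicodedata.category(ch)
    return category.startswith("P") or category == "Zs"


def join_proposed(before: str, after: str) -> str:
    prev_ch, next_ch = before[-1:], after[:1]
    if is_space_discarding(prev_ch) and is_space_discarding(next_ch):
        return before + after  # existing rule: both sides
    if is_discarding_punct_or_space(prev_ch) or is_discarding_punct_or_space(next_ch):
        return before + after  # proposed rule: either side
    return before + " " + after  # otherwise: converted to a space


print(join_proposed("日本語のテキストに、", "English textを埋め込む。"))
print(join_proposed("日本語のテキスト\uff01\u3000", "English textを埋め込む。"))
print(join_proposed("First sentence.", "Second sentence."))
```

In this sketch the ASCII full stop U+002E is not in the approximated set, so the last call still yields a space between the two English sentences.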

xfq commented

I agree that there are indeed many people using this kind of semantic linefeeds.

(Here's a test illustrating the current behavior.)

  • Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or Space Separator (Zs), then the segment break is removed.

Are you sure you want to remove the segment break for:

First sentence.
Second sentence.

to produce:

First sentence.Second sentence.

? Do you mean to also check "EAW=F or W" or something like that?

We discussed this on the JLTF mailing list, but I'm not positive about it. This allows line breaking after punctuation, which is nice, but since authors need to make adjustments anyway, as @r12a pointed out in another issue, adding this rule makes it hard for authors to predict whether a space will be inserted. I think being easy to predict is more important than the cases it helps.

  • Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or Space Separator (Zs), then the segment break is removed.

Are you sure you want to remove the segment break for:

First sentence.
Second sentence.

to produce:

First sentence.Second sentence.

No.
The rule I wrote is "… belongs to the space-discarding character set and is a Unicode Punctuation (P*)…."
The (non fullwidth) full stop U+002E (.) does not belong to the space-discarding character set, so this rule is not applied in this case.

We discussed this on the JLTF mailing list, but I'm not positive about it. This allows line breaking after punctuation, which is nice, but since authors need to make adjustments anyway, as @r12a pointed out in another issue, adding this rule makes it hard for authors to predict whether a space will be inserted. I think being easy to predict is more important than the cases it helps.

I think it will be hard to predict that a linefeed after a CJK punctuation will cause extra space depending on the following character.
And that makes semantic linefeeds (mentioned by @xfq) impossible. Why can you ignore this requirement?

The rule I wrote is "…

Ah, thanks, I missed it.

And that makes semantic linefeeds (mentioned by @xfq) impossible.

I think being easy to predict is more important than semantic linefeeds. IIRC @r12a said that we should expect authors to change line breaking to disambiguate, and that may require breaking lines where it does not feel natural to authors. I agree with that.

Semantic line feeds are great if we can get them perfect: authors would not need to know any rules, just write text with line breaks at arbitrary points, and CSS would be smart enough to handle it. We know that's technically not possible, and we have to rely on authors to disambiguate.

In that circumstance, I think easy-to-predict is more important than making it a little smarter.

Just to be clear, my position stays; i.e., if Unicode adds a property for this purpose, CSS can use it. If you want to add more rules to make it smarter, I recommend discussing it at Unicode.

The semantic line feed is great if we can get perfect on that.

Yes, we can get it perfect, because my proposal enables semantic linefeeds at CJK punctuation such as [。、.,] without causing extra space. Authors will not need to know any rules and can put linefeeds at the end of a sentence (full stop), at a (fullwidth/ideographic) comma, and at other punctuation.

You may argue that it's not perfect because putting linefeeds around quotation marks [‘’“”] (which do not belong to the space-discarding character set) may cause unexpected space (as discussed in #5017). But we can say "don't break at ambiguous punctuation if you don't want extra space."

In that circumstance, I think easy-to-predict is more important than making it a little smarter.

Your "easy-to-predict" will not be easy to predict for many users. I repeat: it will be hard to predict that a linefeed after a CJK punctuation will cause extra space depending on the following character.

Adding the rule around CJK punctuation is not just "making a little smarter". Without this rule, semantic linefeeds are not possible for Japanese and Chinese.

Just to be clear, my position stays; i.e., if Unicode adds a property for this purpose, CSS can use it. If you want to put more rules to make it smarter, I recommend to discuss at Unicode.

The rule I proposed uses only existing Unicode properties and the space-discarding character set. So I don't understand why a new Unicode property is necessary.

@MurakamiShinyu’s argument is convincing to me. Semantic line breaks in source code are one of the main use cases for collapsing segment breaks in general, so I think it's important to support them if we can do so without creating any major problem, and consistency with TeX makes sense here.

@kojiishi Wrt Unicode, they've indicated a lack of interest in creating any property for this use case. They might revise that position in the future, but in any case it's up to us, the users of such "unbreaking" behavior, to figure out what we need, draft it up, and try it out. Unicode might be more willing to establish such a property once we've established its usage better and given them a concrete starting point that they can validate and maintain.

@jfkthame Any thoughts? Would it be reasonably implementable in Gecko?

Wrt Unicode,...

Ah, that was not what I wanted to mean, sorry. I meant I'm fine to switch to a new property if Unicode adds one specific for this purpose. But it wasn't my main point, sorry for the confusion.

It looks like "easy to use" means something different for me and for Murakami-san, but I'm against adding this. At least for me, and I think for authors too, this addition makes it difficult to author HTML. I expect to see more errors than without this addition.

Thank you @MurakamiShinyu for sharing what TeX is doing. The rule must have been polished over time and proven to work, and I agree with your argument.

I have a question, maybe for @fantasai: what is the reason that the current rule requires both sides to be space-discarding for not inserting a space?

The reason I ask is that if the rule requires you to look ahead (at the character after), it is hard to determine whether you can insert a semantic line break at a particular place until you know what you will type next. Unless you are copying some text, you will be thinking about what you type while you type. You would pause when your typing catches up with your thought. That would be one of the best times to hit the return key and give your brain a bit more time. You now know what to type and start typing. At this moment you notice: oh, I should not have hit the return key after the previous word, because a space will or will not be inserted and that is not what I wanted. Rules that require look-ahead are OK for batch processing but not great for typing while thinking.

This will not typically happen for English, as having two or more consecutive space-discarding characters is (extremely) rare. However, this scenario happens often in Japanese, as many nouns are spelled in Latin letters. An extreme case is technical documents. I believe the situation is the same for Chinese, though probably to a somewhat lesser extent.

What is the reason that the current rule requires both sides be space-discarding for not inserting a space?

Other people may have different ideas, but to me, this is for web compat. There is existing text that has line breaks between CJ and alphabetic characters, and all existing browsers insert a space there. Authors may or may not want a space there, but as long as there are authors expecting that behavior, it isn't easy to change.

One possible idea to (possibly) solve both points: how about changing the criteria to:

If the previous line ends with one of the following 4 characters:

U+3001 IDEOGRAPHIC COMMA
U+3002 IDEOGRAPHIC FULL STOP
U+FF0C FULLWIDTH COMMA
U+FF0E FULLWIDTH FULL STOP

I think this should cover most of the cases the proposal wants to improve, while keeping the rules easy to understand, remember, and predict.
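A minimal Python sketch of this four-code-point criterion follows. It assumes that the intended effect is to remove the segment break when the previous line ends with one of the listed characters, and that the existing two-sided check stays in place; both points are assumptions, as are the function names and the East Asian Width approximation of the space-discarding character set.

```python
import unicodedata

# The four code points listed above.
BREAK_REMOVING_ENDINGS = {
    "\u3001",  # IDEOGRAPHIC COMMA
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uff0c",  # FULLWIDTH COMMA
    "\uff0e",  # FULLWIDTH FULL STOP
}


def is_space_discarding(ch: str) -> bool:
    """Assumption: approximate the draft's set by East Asian Width F/W."""
    return bool(ch) and unicodedata.east_asian_width(ch) in ("F", "W")


def join_four_code_points(before: str, after: str) -> str:
    prev_ch, next_ch = before[-1:], after[:1]
    if is_space_discarding(prev_ch) and is_space_discarding(next_ch):
        return before + after  # existing two-sided rule
    if prev_ch in BREAK_REMOVING_ENDINGS:
        return before + after  # assumed effect of the four-character rule
    return before + " " + after


print(join_four_code_points("日本語のテキストに、", "English textを埋め込む。"))
print(join_four_code_points("日本語のテキスト\uff01", "English textを埋め込む。"))
```

Under these assumptions the first call removes the break after the ideographic comma, while the second still inserts a space after the fullwidth exclamation mark, since it is not one of the four characters.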

I agree web compatibility is important. However, perfect web compatibility will be impossible unless we give up all space-discarding rules. For example, space discarding between two Katakana/Hiragana/Kanji letters is not always safe, because some Japanese text uses a space (U+0020) in Katakana compound words (e.g., "エンド ユーザー" or "クイック スタート" in Microsoft's Japanese documents), or uses わかち書き with spaces between words. But those are relatively exceptional cases, and we can expect that most Japanese text authors will not put line breaks where spaces are important.

We need to find the best balance between improvement and compatibility.

Thank you @kojiishi for rethinking. Yes, ideographic/fullwidth commas and full stops [、。,.] cover most of the cases. However, many people will complain about it: why do line breaks after the fullwidth colon, semicolon, exclamation mark, and question mark [:;!?] cause extra spaces? Those characters are listed in the same Pause or Stop Punctuation Marks category in CLReq. And that does not cover the cases I gave as examples:

日本語のテキストにEnglish text
(英語のテキスト)
を埋め込む。
↓
日本語のテキストにEnglish text(英語のテキスト)を埋め込む。
(In this example, fullwidth parentheses are used)

日本語のテキスト! 
English textを埋め込む。
↓
日本語のテキスト! English textを埋め込む。
(In this example, there is an ideographic space U+3000 ' ' after the '!')

I don't think these space-discarding cases cause a web compatibility problem.

So I still believe that the rule I proposed has the best balance between improvement and web compatibility.

I understand that the current draft's rule, which requires both sides to belong to the space-discarding character set for not inserting a space, is for web compatibility, but this rule alone cannot meet the requirements of semantic line breaks. So I proposed the additional rule that requires either side to belong to the strong space-discarding character set (= a subset of the space-discarding character set limited to Unicode Punctuation and Space Separator). I think it is easy to understand that a strong space-discarding character needs to be on only one side to discard the space, because these characters are natural or semantic break points in CJK text.

I think this rule is easier to understand, remember, and predict than limiting it to only ideographic/fullwidth commas and full stops. We can understand and remember that ambiguous punctuation is not included in the space-discarding character set because such punctuation, e.g. left and right quotation marks, em-dash, ellipsis, etc., can be used in non-CJK text.

The root issue of this discussion is that we disagree on the goal of this feature. You seem to think I'm talking about web compat; sorry if I led you to that misunderstanding.

IIUC, you and @fantasai want to make semantic line breaks the goal of this feature. I disagree with that. The goal of this feature for me is to help some use cases, such as making version control systems easier to use.

I think we should not make semantic line breaks a goal, because that is heuristic. Heuristic rules make it harder for authors to predict and control the insertion or removal of the space, so we have conflicting kinds of ease of use.

As @r12a pointed out in his comment, there will be a lot of cases where authors must adjust line breaks to resolve ambiguities. This happens often enough that I think making it easier is critical for ease of use for authors. Heuristic rules make this harder, so I'm against adding heuristic rules.

You and @fantasai feel so strongly that I can live with the 4 code points, but that's the maximum for me. If you don't think that's sufficient and think adding nothing at all is better than adding 4, I'm happy with that, but I'm against adding more.

I'm just trying to catch up with the ideas here... my first reaction on looking at the current draft is that I think it's a mistake to define a space-discarding character set in CSS in terms of Unicode blocks or ranges. This is a maintenance headache in the making -- as hinted at, I think, by the note about "For future revisions of [UNICODE]...". It's also an issue in that the contents of "blocks" are not guaranteed to be homogeneous.

(In other words, I'm inclined to disagree with the decision that was reached in #337. But I'll need to do more re-reading of the various discussions to figure out what -- if anything -- I think would be a better way forward.)

@kojiishi I'm sympathetic to the argument that the rule should be easy for authors to understand. But I'm not sure: why is it easier to list only those 4 characters and not other CJK/fullwidth punctuation such as brackets and the semicolon?

The CSS Working Group just discussed Line break collapsing in CJK, and agreed to the following:

  • RESOLVED: Leave undefined in L3
The full IRC log of that discussion
<fantasai> Topic: Line break collapsing in CJK
<myles> fantasai: instead of solving this, we might need to leave this undefined for L3 because it's the only significant issue that is open against L3 - the rules for collapsing line breaks
<astearns> github: https://github.com//issues/5086
<myles> fantasai: it needs some work and coordination for unicode. If we want text-3 to CR in the next 6 month, we'll need to mark the behavior as undefined and work on it for l4.
<florian> q+
<myles> fantasai: If we want to discuss this, we can dig into it.
<Rossen_> q?
<Rossen_> ack florian
<myles> florian: The goal is laudable. It's about making line unbreaking useful for non-latin scripts. It's useful. But all the efforts of doing it have run into complexity
<myles> florian: so this is the best thing to do at this point.
<myles> addison: it seems like an important thing to solve. Superficially seems simple, but once you dig in, it isn't.
<myles> addison: it won't be solved quickly.
<myles> Rossen_: So: Mark as undefined in l3, work on it in l4.
<myles> Rossen_: Anyone with other ideas, or objections?
<myles> RESOLVED: Leave undefined in L3

Prettier v3 happens to adopt this kind of specification.

A newline is trimmed instead of being replaced with a space when both of the following conditions hold:

  • At least one of the adjacent characters is punctuation
  • At least one of the adjacent characters is CJK, except for Korean letters

https://github.com/prettier/prettier/blob/dc8df0ab80e713f683177510fe0ce06840f5d1a5/src/language-markdown/print-whitespace.js#L159-L161

CSS segment break transformation rules are important not only for HTML but also for Markdown (and AsciiDoc). Many Markdown converters preserve line breaks in the document and leave them to the rules in CSS.