jgm/doclayout

Wrong character width in full-width symbol

Opened this issue · 11 comments

lazex commented

This is my source markdown.

+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯      | a       |
+---------+---------+---------+
| row3    | ✕      | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+

I got the following result:

<table style="width:42%;">
<colgroup>
<col style="width: 13%" />
<col style="width: 13%" />
<col style="width: 13%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;"></th>
<th style="text-align: center;">column1</th>
<th style="text-align: center;">column2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">row1</td>
<td style="text-align: center;">x</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row2</td>
<td style="text-align: center;">◯ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="odd">
<td style="text-align: left;">row3</td>
<td style="text-align: center;">✕ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row4</td>
<td style="text-align: center;">あ</td>
<td style="text-align: center;">a</td>
</tr>
</tbody>
</table>

The problem is in the following lines.

<td style="text-align: center;">◯ |</td>

and

<td style="text-align: center;">✕ |</td>

These results include a stray | character.

I can modify the source markdown to get the expected result as follows.

+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯       | a       |
+---------+---------+---------+
| row3    | ✕       | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+

However, the source table then does not look good.

I think half-width and full-width characters are being misjudged.
◯ and ✕ are full width characters as well as あ.

Command line

sudo docker run --rm --mount type=bind,source=$(pwd),destination=/data pandoc/core -o out.html src.md

Version

# pandoc --version
pandoc 2.14.2
Compiled with pandoc-types 1.22, texmath 0.12.3.1, skylighting 0.11,
citeproc 0.5, ipynb 0.1.0.1
User data directory: /root/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented

> ◯ and ✕ are full width character as well as あ.

I don't think that's true. At least, on my terminal the first two each take up one column and the third takes up two. The same is true of how they display in the code block above.

In fact this works just fine!

+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯       | a       |
+---------+---------+---------+
| row3    | ✕       | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+

Check it on try pandoc.

Edit: There is something a bit odd here. In the code block above (as in yours), the pipes on the last line aren't fully lined up. However, they do appear exactly lined up in my text editor. I don't know how to explain that, but what we're aiming for is proper alignment in a text editor.

Let's see what happens if we add an extra space in that last line:

+---------+---------+---------+
| row4    | あ       | a       |
+---------+---------+---------+

That's definitely not lined up. So the slight misalignment in the code block as rendered in the browser seems to be a browser rendering bug of some kind. The browser definitely isn't treating the character as single-wide, but it's not giving it full double width either.

Upshot: not a bug, as far as I can see.

lazex commented

In my environment, they are displayed as full-width characters in text editors such as Windows Notepad and VSCode.
Maybe it depends on the locale or font.
My environment is Japanese.
The attached screenshot shows the view using Notepad.

[Screenshot: notepad]

jgm commented

OK. That explains it. In https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt we see

25EF;A           # So         LARGE CIRCLE

The "A" means "ambiguous." "Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default." So in your locale it is wide.

doclayout is the library we use to compute "real widths" for layout. It currently just treats all ambiguous characters as narrow. I'll move this issue to doclayout as a suggestion for further improvement. (It would require some way to make doclayout's functions locale-sensitive, not a small change.)
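To make the idea of a locale-sensitive width concrete, here is a minimal sketch. It is not doclayout's code: `Context`, `WidthClass`, `widthClass`, `charWidthIn` and `realLengthIn` are hypothetical names, and the classification table is a toy covering only U+25EF (listed as "A" above) and the hiragana letters.

```haskell
-- | How ambiguous-width characters should be measured (hypothetical type).
data Context = NarrowContext | WideContext
  deriving (Eq, Show)

-- | A deliberately tiny stand-in for the East Asian Width property.
data WidthClass = Narrow | Wide | Ambiguous
  deriving (Eq, Show)

-- | Toy classification; a real table would be derived from EastAsianWidth.txt.
widthClass :: Char -> WidthClass
widthClass c
  | c == '\x25EF'                  = Ambiguous  -- LARGE CIRCLE, class "A"
  | c >= '\x3041' && c <= '\x3096' = Wide       -- hiragana letters, e.g. U+3042
  | otherwise                      = Narrow     -- toy default for everything else

-- | Width of one character; only ambiguous characters depend on the context.
charWidthIn :: Context -> Char -> Int
charWidthIn ctx c = case widthClass c of
  Narrow    -> 1
  Wide      -> 2
  Ambiguous -> if ctx == WideContext then 2 else 1

-- | Context-aware counterpart of a "real length" function.
realLengthIn :: Context -> String -> Int
realLengthIn ctx = sum . map (charWidthIn ctx)
```

With this sketch, `realLengthIn NarrowContext "\x25EF"` is 1 and `realLengthIn WideContext "\x25EF"` is 2, while a hiragana letter measures 2 in both contexts. That is the behaviour difference between the two environments in this thread.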

jgm commented

@Xitian9 - I believe you mentioned the possibility that this issue would arise!

Xitian9 commented

Ha! That was fast.

I guess the question is how do we accurately and reliably determine the width. If there is surrounding context then it should be straightforward: we can add a context specifier to the MatchState. However, in this situation it looks like there would be no surrounding context, just a bare character put into a table. Can we try to guess based on other characters in the column? In the row? Some other way? I sense dangerous creatures this way.

jgm commented

One approach would be to add a function that allows you to locally set the context, such as

withWideContext (literal "◯")

Pandoc could then put the whole document in withWideContext if the locale is a wide-character locale. This global setting could be overridden in parts of the document that were marked up as different languages using withNarrowContext. Or we could have withLocale locale. Just some ideas.
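As a rough illustration of how such combinators might behave, here is a sketch over a toy document type. This is not doclayout's `Doc`: the `Doc` constructors, `docWidth`, and the one-character width table are assumptions made only for this example, and `withWideContext` / `withNarrowContext` are the hypothetical functions proposed above.

```haskell
-- | Hypothetical measuring context.
data Context = NarrowContext | WideContext
  deriving (Eq, Show)

-- | Toy document type with an explicit context-override node.
data Doc
  = Literal String
  | Concat [Doc]
  | WithContext Context Doc

withWideContext, withNarrowContext :: Doc -> Doc
withWideContext   = WithContext WideContext
withNarrowContext = WithContext NarrowContext

-- | Toy width table: only U+25EF is treated as ambiguous.
charWidth :: Context -> Char -> Int
charWidth ctx c
  | c == '\x25EF' = if ctx == WideContext then 2 else 1
  | otherwise     = 1

-- | Measure a document; the innermost enclosing override wins.
docWidth :: Context -> Doc -> Int
docWidth ctx (Literal s)         = sum (map (charWidth ctx) s)
docWidth ctx (Concat ds)         = sum (map (docWidth ctx) ds)
docWidth _   (WithContext ctx d) = docWidth ctx d
```

For example, `docWidth WideContext (Concat [Literal "\x25EF", withNarrowContext (Literal "\x25EF")])` gives 3: the first circle is measured as wide by the global setting, the second as narrow by the local override.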

Xitian9 commented

Good idea. Next problem: there are a lot of ambiguous characters in the Unicode spec. There are 198 separate entries (which include ranges) in EastAsianWidth.txt.

It is error-prone and tedious to define these ourselves. Maybe we should teach doclayout how to read EastAsianWidth.txt and generate it itself. This could be done similarly to how the emoji are handled in emojis. Thoughts?

jgm commented

> It is error-prone and tedious to define these ourselves. Maybe we should teach doclayout how to read EastAsianWidth.txt and generate it itself. This could be done similarly to how the emoji are handled in emojis.

Makes sense to me. (We should use the approach in emojis, where the parsing code isn't part of the library and thus doesn't add dependencies.)
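A generator script along these lines might start from something like the sketch below, which parses lines in the `25EF;A  # ...` shape quoted earlier and collects the ambiguous ranges. The names `Entry`, `parseLine` and `ambiguousRanges` are made up for illustration, and it uses only `base`, so it could live outside the library as suggested.

```haskell
import Data.Char (isSpace)
import Numeric (readHex)

-- | One parsed entry: first code point, last code point, width class.
data Entry = Entry Int Int String
  deriving (Eq, Show)

-- | Strip leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- | Parse a complete hexadecimal string, rejecting trailing junk.
parseHex :: String -> Maybe Int
parseHex s = case readHex s of
  [(n, "")] -> Just n
  _         -> Nothing

-- | Parse "XXXX;C" or "XXXX..YYYY;C", ignoring anything after '#'.
-- Comment lines and blank lines yield Nothing.
parseLine :: String -> Maybe Entry
parseLine line =
  case break (== ';') (takeWhile (/= '#') line) of
    (range, ';' : cls) -> do
      let (lo, rest) = break (== '.') (trim range)
      from <- parseHex lo
      to <- case rest of
        ""             -> Just from
        '.' : '.' : hi -> parseHex hi
        _              -> Nothing
      Just (Entry from to (trim cls))
    _ -> Nothing

-- | All ranges whose class is "A" (ambiguous) in the file contents.
ambiguousRanges :: String -> [(Int, Int)]
ambiguousRanges contents =
  [ (a, b) | Just (Entry a b "A") <- map parseLine (lines contents) ]
```

Applied to the line quoted from EastAsianWidth.txt above, `parseLine` yields `Just (Entry 9711 9711 "A")`, since 0x25EF is 9711. The resulting ranges could then be emitted as a generated Haskell table.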

jgm commented

@Xitian9 has now provided a context-aware realLength function.

Now it remains to figure out how to modify the rest of the library so that it can be used. It's not as easy as I'd originally thought. For example, we have a literal :: HasChars a => a -> Doc a which calls realLength. How is this going to know which context to use?

One approach would be to change the Doc a type so that it's something like Reader Context (DocT a). literal could then use ask to retrieve the right context. local could be used for local changes in the context (wide or narrow) depending on e.g. lang attributes. This would probably slow things down somewhat, but I don't currently have other ideas.
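The shape of that idea can be sketched with the plain function monad from `base`, which behaves like a reader. Everything here is an assumption for illustration: `InContext`, `Measured`, `literalR`, `besideR`, `askCtx` and `localCtx` are invented names, and the real `Doc a` is far richer than a width paired with a string.

```haskell
-- | Hypothetical measuring context.
data Context = NarrowContext | WideContext
  deriving (Eq, Show)

-- | A computation that can read the current context (a reader in disguise).
type InContext a = Context -> a

-- | Stand-in for a rendered fragment: its width and its text.
type Measured = (Int, String)

-- | Toy width table: only U+25EF is treated as ambiguous.
charWidth :: Context -> Char -> Int
charWidth ctx c
  | c == '\x25EF' = if ctx == WideContext then 2 else 1
  | otherwise     = 1

-- | Analogue of 'ask': retrieve the current context.
askCtx :: InContext Context
askCtx = id

-- | Analogue of 'local': run a computation under a modified context.
localCtx :: (Context -> Context) -> InContext a -> InContext a
localCtx f m = m . f

-- | A literal measures itself using whatever context is in scope.
literalR :: String -> InContext Measured
literalR s = do
  ctx <- askCtx
  pure (sum (map (charWidth ctx) s), s)

-- | Horizontal concatenation; both sides see the same ambient context.
besideR :: InContext Measured -> InContext Measured -> InContext Measured
besideR a b = do
  (wa, sa) <- a
  (wb, sb) <- b
  pure (wa + wb, sa ++ sb)
```

Running `besideR (literalR "\x25EF") (localCtx (const NarrowContext) (literalR "\x25EF")) WideContext` gives a width of 3: the first literal picks up the ambient wide context and the second is locally overridden to narrow.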

jgm commented

The Reader approach would require a lot of changes. Maybe we could do something simpler, e.g. just adding literalWide. This would require that the calling program keep track of the context and use literal or literalWide accordingly.

EDIT: The problem with this approach is that we sometimes use realLength again after re-rendering, e.g. in minOffset or when stuffing text into a block. Actually, that's a feature of the code I don't like. If there were a way to handle these things without re-rendering, things would go more smoothly (and performance would be better).

jgm commented

To be clearer, the central problem is this: we have

 data Doc a = Text Int a            -- ^ Text with specified width.
           | Block Int [a]           -- ^ A block with a width and lines.
           | VFill Int a             -- ^ A vertically expandable block;
                   -- when concatenated with a block, expands to height

and the constructors for Block and VFill take an a rather than a Doc a as stuffing. In fact, when we construct a block we render its contents and just store the rendered lines. When we merge two blocks, we can then create a superblock that combines their lines.

The problem is, even if we introduced something like literalWide, this contextual information would be lost once things got inside a block, because of things like

  -- | Like 'lblock' but aligned to the right.                          
  rblock :: HasChars a => Int -> Doc a -> Doc a                     
  rblock w = block (\s -> replicateChar (w - realLength s) ' ' <> s) w   

which makes the block left-padded with spaces depending on the real lengths of the rendered lines.
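The effect can be seen in isolation with a small sketch that takes the measuring function as an explicit parameter. `padLeftTo`, `realLengthIn` and the one-character width table are hypothetical stand-ins, not doclayout functions; only the padding expression mirrors `rblock` above.

```haskell
-- | Hypothetical measuring context.
data Context = NarrowContext | WideContext
  deriving (Eq, Show)

-- | Toy context-aware length: only U+25EF is treated as ambiguous.
realLengthIn :: Context -> String -> Int
realLengthIn ctx = sum . map width
  where
    width c
      | c == '\x25EF' = if ctx == WideContext then 2 else 1
      | otherwise     = 1

-- | Right-align a rendered line in a field of width w, as rblock does,
-- but with the measuring function passed in explicitly.
padLeftTo :: (String -> Int) -> Int -> String -> String
padLeftTo measure w s = replicate (w - measure s) ' ' ++ s
```

For a field of width 4, `padLeftTo (realLengthIn NarrowContext) 4 "\x25EF"` adds three spaces, while `padLeftTo (realLengthIn WideContext) 4 "\x25EF"` adds two. If the block only keeps the rendered string and always measures it the narrow way, a line meant for a wide context ends up one column too long.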

So we'd need some kind of large-scale design change in order to introduce a way of changing the context from "wide" to "narrow" for part of the rendered document. Probably the most straightforward approach is to change the type of Block and VFill so they take Doc a instead of a as stuffing, as well as an explicit horizontal alignment. But that entails a lot of other changes.