words are missing or out of order

Question

trzhong opened this issue 4 years ago · 9 comments

I read a Chinese-language EPUB using epr on macOS 10.15.4 with Python 3.7, and it was displayed like this:

窦文涛：今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的都无数次采访过您，通过电话连线。今天终于是见着真人了，我觉得您真是很有风度的一眉立目的那么一款，没想到看上去很温婉。样子的时候，会觉得您是穿着警服有点横

The same content displayed in iBooks is:

窦文涛：今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的是第一次见到您，但是在我和傅见锋[2] 做的节目当中，我们好像都无数次采访过您，通过电话连线。今天终于是见着真人了，我觉得您真是很有风度的一位女士！原来他们做点好采访，我没见到您样子的时候， 会觉得您是穿着警服有点横 眉立目的那么一款，没想到看上去很温婉。~~会觉得您是穿着警服有点横~~

This is not limited to this paragraph or this book; many others have the same problem.

Answer 1 · 2020-05-12T09:27:44.000Z

This is crucial. I will try a Chinese EPUB when I'm free. Originally this only supported English, but I will have a look.

Answer 2 · 2020-05-12T23:07:49.000Z

Hey there. I just looked into it, and it seems to be beyond my capability, sorry. I hope someone else can make a PR for this issue. It probably has something to do with the HTMLtoLines(HTMLParser) class, if anyone cares to help fix it.

Answer 3 · 2020-05-15T15:53:27.000Z

Since "textwrap.wrap()" cannot handle Chinese characters properly, I tried adding the code below to "HTMLtoLines.get_lines":

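            else:
                # Shrink the wrap width when the paragraph contains multi-byte
                # (e.g. CJK) characters, which occupy two terminal columns.
                w = width
                l = len(i)  # number of characters
                cjk_l = len(i.encode(encoding='UTF-8'))  # number of UTF-8 bytes
                # Rough estimate of the single-byte (ASCII) character count
                asc_l = int((l * 3 - cjk_l) / 3)
                if cjk_l > l:  # at least one multi-byte character is present
                    w = int(w * l / (l * 2 - asc_l))
                text += textwrap.wrap(i, w) + [""]
        return text, self.imgs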

Although it does display the content correctly, I don't think this is the best solution. I prefer a better wrap library.
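
To illustrate the mismatch discussed here: textwrap.wrap() limits each line by character count, while terminals usually draw CJK characters two columns wide. The following is a minimal sketch, assuming that the standard-library unicodedata.east_asian_width categories "W" and "F" are a good enough proxy for two-column characters. The helper display_width is hypothetical and is not part of epr.

```python
import textwrap
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal columns: wide/fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


sample = "今天终于是见着真人了，我觉得您真是很有风度的一位女士！"
width = 10
lines = textwrap.wrap(sample, width)

# Each line respects the limit in characters, but not in terminal columns.
for line in lines:
    print(len(line), display_width(line), line)
```

Every line stays within 10 characters, yet an all-CJK line needs about twice that many columns, which is why a character-based wrap can overflow the intended width.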

Answer 4 · 2020-05-15T22:15:25.000Z

Wow, that's impressive troubleshooting... After I read your comment, I did some googling, and found this: https://bugs.python.org/issue24665

Indeed, as you said, textwrap.wrap() cannot handle Chinese characters properly. It seems the issue about CJK support in textwrap was closed with a "rejected" resolution, apparently because of some confusion. So I don't think we will get support for non-Latin scripts soon. For now I will list this issue as a limitation in the README while we wait for a better wrap library, as you suggested.
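
For reference, a wrap that counts terminal columns instead of characters can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the fix adopted by epr: char_columns and wrap_by_columns are hypothetical names, width is approximated from unicodedata.east_asian_width, and the sketch ignores word boundaries and any language-specific line-breaking rules.

```python
import unicodedata
from typing import List


def char_columns(ch: str) -> int:
    """Approximate columns for one character: 2 for wide/fullwidth, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def wrap_by_columns(text: str, width: int) -> List[str]:
    """Split text into lines whose approximate display width is <= width."""
    if width < 2:
        raise ValueError("width must be at least 2 to fit a wide character")
    lines: List[str] = []
    current: List[str] = []
    used = 0  # columns consumed on the current line
    for ch in text:
        cols = char_columns(ch)
        # Start a new line when the next character would not fit.
        if current and used + cols > width:
            lines.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += cols
    if current:
        lines.append("".join(current))
    return lines


for line in wrap_by_columns("abc今天终于见着真人了", 10):
    print(line)
```

Because it breaks between arbitrary characters, this approach suits text without spaces; for mixed English and CJK paragraphs a real wrapping library would still be preferable.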

Answer 5 · 2020-07-12T11:25:59.000Z

@trzhong hey there, you might want to try https://github.com/aeosynth/bk as an alternative.

Answer 6 · 2020-07-17T06:24:38.000Z

I added support for wide characters to bk. There may be other issues; for example, I don't know the line-breaking rules for Asian text.

1Q84 by Murakami rendered to 30 columns:

Answer 7 · 2020-09-27T15:27:00.000Z

I'm still using my patch. Thanks for the information.

Answer 8 · 2021-01-17T15:09:17.000Z

Finally, I found [rich] as a solution to replace [textwrap].

from rich import cells
replace all [textwrap.text] with [cells.chop_cells]

That's all.

Answer 9 · 2021-01-18T00:43:34.000Z

Wow, https://github.com/willmcgugan/rich seems so powerful and feature-rich. Thanks for pointing that out, mate. I'll try to implement it in epy.
