Benature/obsidian-text-format

Req: replace ligatures in PDF text

glocal opened this issue · 6 comments

Possibly related to #23: it would be good to replace ligatures with their separate characters when cleaning up text copied from a PDF file. The only time I see the ft, fl and fi ligatures is when I copy from a PDF, and I then have to replace them by hand. A complete list is here.

To confirm your request: you want to replace ligatures such as ﬀ with ff.

And in the Wikipedia table you linked, you want to replace the text in the Ligature column with the text in the Non-Ligature column.


Do I understand correctly?

Please try v1.8.1. If you still have problems, you can re-open this issue.

Unfortunately, the problem is still there. E.g., take the sentence below from the Wikipedia page I referenced:

Other ligatures with the letter f include fj,[a] fl (ﬂ), ff (ﬀ), ffi (ﬃ), and ffl (ﬄ).

In every set of brackets there is a single character. In plain text these characters should be split into their component letters. Ligatures are often found in PDFs (well, the ones I use anyway), where they make certain letter combinations look good in print. The problem is that when pasted as plain text, these ligatures either show up as funny-looking symbols if the editor can't cope with Unicode, or they display properly but aren't recognised by search, spellchecking, content indexing, etc. The latter is the problem I am having with Obsidian.
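To make the search problem concrete, here is a minimal sketch of why a pasted ligature defeats a plain substring search. It is not taken from Obsidian or from this plugin: `containsLoosely` is a hypothetical helper, and it relies on the standard `String.prototype.normalize("NFKC")`, which expands ligatures but also folds other compatibility characters, so it is broader than what this issue asks for.

```typescript
// Text as it might arrive from a PDF: "\uFB01" is the single-character
// "fi" ligature and "\uFB02" is the single-character "fl" ligature.
const pasted = "The pro\uFB01le is stored in a \uFB02at \uFB01le.";

// A plain substring search compares code points, so "profile" is not found.
const plainHit: boolean = pasted.includes("profile");

// Hypothetical helper: compare both sides after NFKC normalization,
// which replaces the ligatures with their separate letters.
function containsLoosely(haystack: string, needle: string): boolean {
  return haystack.normalize("NFKC").includes(needle.normalize("NFKC"));
}

const looseHit: boolean = containsLoosely(pasted, "profile");

console.log(plainHit, looseHit);
```

The same mismatch would affect spellchecking and indexing, which is why replacing the ligatures once at paste time is more practical than normalizing at every lookup.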

This plugin is the obvious place to fix this. If you must be selective, almost all ligatures I see in practice start with f or s. I can't remember the last time I saw any other ligature in a PDF.
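A selective replacement along those lines could look roughly like the sketch below. It is not the plugin's actual code: `LIGATURES` and `replaceLigatures` are hypothetical names, and the table covers only the f and s ligatures in Unicode's Alphabetic Presentation Forms block (U+FB00 to U+FB06). Mapping the long-s ligature U+FB05 to plain "st" is a choice made here for readability in plain text.

```typescript
// Hypothetical mapping, limited to the Latin f/s ligatures U+FB00..U+FB06.
const LIGATURES: Record<string, string> = {
  "\uFB00": "ff",
  "\uFB01": "fi",
  "\uFB02": "fl",
  "\uFB03": "ffi",
  "\uFB04": "ffl",
  "\uFB05": "st", // long s + t ligature; written as plain "st" by choice
  "\uFB06": "st",
};

// Replace each ligature code point with its separate letters.
// Every other character, including ordinary letters, is left untouched.
function replaceLigatures(text: string): string {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => LIGATURES[ch] ?? ch);
}

console.log(replaceLigatures("The \uFB01rst o\uFB03ce \uFB02oor"));
```

Because the pattern matches only the seven ligature code points, plain words such as "with" pass through unchanged.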

v2.2.1

For a sentence like

Other ligatures with the letter f include fj,[a] fl (ﬂ), ff (ﬀ), ffi (ﬃ), and ffl (ﬄ).

The result of Replace ligatures is

Other ligatures vvith the letter f include fj,[a] f‌l (fl), f‌f (ff), f‌f‌i (ffi), and f‌f‌l (ffl).

Does the result not behave as you expect? (Maybe it's the w being changed to vv?)