coolwanglu/pdf2htmlEX

Copied text from converted html completely garbled


pdf2htmlEX is a great tool! Most of the time it works exactly as I need it.

However, in some cases text copied from the resulting HTML is garbled, even though the browser displays the HTML correctly and I can copy text from the original PDF without problems. See the example below and the attached file. The substitution is essentially one-to-one between the original characters and their replacements.
10094549.pdf

The garbled characters are present in the converted HTML itself (so how can the browser display them correctly?). I tried all the "--tounicode" variants, but none of them improved things. Ideally, I would like the HTML to contain the text in the correct encoding, because I need to insert highlights. Is there anything that can be done to preprocess the PDF, or to postprocess the HTML?

Thank you so much!!!

Cheers,
Robert

Usher syndrome is a heterogeneous autosomal recessive trait and the most common cause of
!"#$%&"'()%+$&,"&-& #$.$%/$($0"&-0."+-1& %$2$"",3$& .%-,.& -()& .#$& +".& 2*++*(& 2-0"$& *4

OK, here's a possible solution that worked for me. As you note, the replacement characters follow the order of the ASCII table exactly. The reason is that the HTML does not embed the full fonts, only partial ones containing the characters actually used. Presumably that is also why the browser still draws the right glyphs even though the underlying character codes are wrong.
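To make the one-to-one pattern concrete, here is a minimal Python sketch. It assumes (this is my guess from the example above, not something confirmed from the pdf2htmlEX source) that each distinct character gets the next free code starting at "!" in order of first appearance. The function names are made up for illustration, and a real document would need one such table per embedded font.

```python
def build_subset_mapping(reference_text, first_code=0x21):
    """Map each distinct character to consecutive ASCII codes, in order of
    first appearance. ASSUMPTION: codes start at '!' (0x21), as the example
    in this issue suggests; this is not taken from the pdf2htmlEX source."""
    mapping = {}
    for ch in reference_text:
        if ch not in mapping:
            mapping[ch] = chr(first_code + len(mapping))
    return mapping


def garble(text, mapping):
    """Apply the substitution (original character -> replacement)."""
    return "".join(mapping[ch] for ch in text)


def ungarble(text, mapping):
    """Invert the substitution; characters not in the table pass through."""
    reverse = {code: ch for ch, code in mapping.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = build_subset_mapping("Usher syndrome")
    scrambled = garble("Usher syndrome", table)
    print(scrambled)
    print(ungarble(scrambled, table))
```

If the assumption holds, a table like this could in principle be used to postprocess the HTML text, but only when a known-good copy of the text is available to derive the table from.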

A fix for this problem is to convert the PDF to PostScript and back via pdftops and pstopdf, using the xpdf package. The resulting PDF can then be converted with pdf2htmlEX, and the resulting HTML now contains properly encoded text that can be copied and pasted.
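The round trip described above could be scripted roughly as follows. This is only a sketch: the helper names and the "-roundtrip" file suffix are made up, and the pstopdf call assumes the "-o" output option of the macOS tool, so adjust the command if your PostScript-to-PDF converter is invoked differently.

```python
import subprocess
from pathlib import Path


def roundtrip_commands(pdf_path, workdir="."):
    """Build the three commands: PDF -> PS -> PDF -> HTML.
    ASSUMPTION: pstopdf accepts '-o <output>' (as the macOS tool does)."""
    src = Path(pdf_path)
    work = Path(workdir)
    ps_file = work / (src.stem + ".ps")
    fixed_pdf = work / (src.stem + "-roundtrip.pdf")  # hypothetical name
    return [
        ["pdftops", str(src), str(ps_file)],
        ["pstopdf", str(ps_file), "-o", str(fixed_pdf)],
        ["pdf2htmlEX", str(fixed_pdf)],
    ]


def run_roundtrip(pdf_path, workdir="."):
    """Run each step in order and stop at the first failing command."""
    for cmd in roundtrip_commands(pdf_path, workdir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this works
    # even when the tools are not installed.
    for cmd in roundtrip_commands("10094549.pdf"):
        print(" ".join(cmd))
```

Calling run_roundtrip("10094549.pdf") would execute the steps for real, provided all three tools are on the PATH.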