christian-vigh-phpclasses/PdfToText

Some Japanese characters not shown/extracted properly

destinedjagold opened this issue · 6 comments

Hello and good day.

After testing with a couple of PDF files, I found that not all Japanese characters are extracted properly, and quite a few are affected. I've attached the test PDF I'm using so you can test it yourself.

Thank you for your time.
test_pdf_2.pdf

@christian-vigh-phpclasses Sir, could you help me test the attached PDF?
Right now I use pdfminer to convert PDFs to HTML, but it fails on this one:
https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip
After downloading, remove the .zip suffix to get the PDF file.

@christian-vigh-phpclasses Thank you, sir. We currently use ocrmypdf, a wonderful toolkit, to deal with these situations.

I am using 1.6.7, and the PDF from post #1 gives me gibberish output that looks like a mix of Japanese and Hindi. The file opens, well, sort of, on pdfparser.org.