smalot/pdfparser

preg_match(): Compilation failed: regular expression is too large at offset 38605

huihuangjiuai opened this issue · 1 comment

pdfparser version: 2.10.0

I have about 600,000 PDF files, and I use pdfparser to extract text from all of them. This kind of problem appeared to have been solved in 704, where it was probably caused by a Chinese encoding problem, but now it has come up again with the files below. Please help to solve it, thank you.
1710747436_65f7ef2ccfac97cc01c0803eda73f23a732316a07e2ab5f2c43ec0e162ac4.pdf
1710812373_65f8ecd57a825b030437578010bf1e1aa0ae31669f11f5fae857a58f22bbf.pdf
1710898905_65fa3ed9d61a7e6d996a0d361fbfe8efa9620742d2e0560f19c51edefd6f2.pdf

1710124871_65ee6f473b7eddbd7437d2803d8f95dff1747721dc88d4035d46343d9766e.pdf

Will we ever be rid of this one? 😆 😭

This one is caused by a specific character sequence in strings, where an escaped backslash comes immediately before an escaped parenthesis: (Sample \\\(string) The script only checks two characters back, so it sees an escaped backslash in front of the parenthesis and treats the parenthesis as "real". It should be checking more characters. That way it would find out that both the backslash and the parenthesis are escaped, and that the parenthesis shouldn't be counted.
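To illustrate the idea, here is a minimal sketch in Python, not the actual pdfparser code (which is PHP). All function names are hypothetical, and `naive_is_escaped` is only my guess at what a two-character look-behind amounts to. The more robust check counts the whole run of backslashes in front of the parenthesis: an odd count means the parenthesis itself is escaped.

```python
def naive_is_escaped(data: str, pos: int) -> bool:
    """Hypothetical two-character look-behind, as described above."""
    if pos < 1 or data[pos - 1] != "\\":
        return False
    # A second backslash is assumed to form an escaped backslash,
    # which leaves the parenthesis looking "real".
    return not (pos >= 2 and data[pos - 2] == "\\")


def is_escaped(data: str, pos: int) -> bool:
    """True if data[pos] follows an odd number of consecutive backslashes."""
    count = 0
    i = pos - 1
    while i >= 0 and data[i] == "\\":
        count += 1
        i -= 1
    return count % 2 == 1


def unescaped_paren_depth(data: str) -> int:
    """Net nesting depth, counting only parentheses that are not escaped."""
    depth = 0
    for pos, ch in enumerate(data):
        if ch in "()" and not is_escaped(data, pos):
            depth += 1 if ch == "(" else -1
    return depth


# The sample from this comment: three backslashes, then a parenthesis.
sample = r"(Sample \\\(string)"
inner = sample.index("(", 1)  # position of the inner parenthesis
```

On the sample string, `naive_is_escaped(sample, inner)` reports the inner parenthesis as not escaped, while `is_escaped(sample, inner)` reports it as escaped, so `unescaped_paren_depth(sample)` comes out balanced.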

Should be a simple fix, and something I should have done when accounting for pretty much this same issue in the Inline Image replacement area.