SamEdwardes/spacypdfreader

Suggestion - Improve speed performance when extracting text

Are there any plans to improve speed performance? The library is built on pdfminer.six, right? Would it be possible to speed things up in the future, perhaps using spaCy? If so, how could it be done? I'm happy to help if I can be useful.

Hey Victor - yes I would like to speed up performance! Currently the base is pdfminer.six. The task of converting a PDF to text is the bottleneck (as opposed to anything spaCy is doing).
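One rough way to check where the time goes is to time the two stages separately. This is only a sketch: `time_stages` is a hypothetical helper, not part of spacypdfreader, and `extract_text` and `nlp` stand in for whatever PDF-to-text function and spaCy pipeline you use.

```python
import time
from typing import Any, Callable, Dict


def time_stages(
    pdf_path: str,
    extract_text: Callable[[str], str],
    nlp: Callable[[str], Any],
) -> Dict[str, float]:
    """Time PDF-to-text extraction and NLP processing separately (hypothetical helper)."""
    start = time.perf_counter()
    text = extract_text(pdf_path)  # stage 1: PDF -> plain text
    after_extract = time.perf_counter()
    nlp(text)  # stage 2: plain text -> processed document
    after_nlp = time.perf_counter()
    return {
        "extract_seconds": after_extract - start,
        "nlp_seconds": after_nlp - after_extract,
    }
```

If `extract_seconds` dominates, swapping the extraction step should matter far more than tuning the spaCy side.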

I think the approach I would like to take is to have a default base (like pdfminer.six), but then allow users to plug in their own PDF to text extraction function as well. For example, I find pytesseract has the best accuracy, but it is really slow. Users should be able to choose their own extraction.
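A minimal sketch of what that pluggable design could look like. The names `TextExtractor`, `pdfminer_extractor`, and `pdf_to_doc` are made up for illustration and are not the library's actual API; the only real library call is `pdfminer.high_level.extract_text` from pdfminer.six.

```python
from typing import Any, Callable

# Hypothetical type alias: an extractor takes a PDF path and returns its text.
TextExtractor = Callable[[str], str]


def pdfminer_extractor(pdf_path: str) -> str:
    """Default extractor, backed by pdfminer.six."""
    # Imported lazily so that a custom extractor does not require pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def pdf_to_doc(
    pdf_path: str,
    nlp: Callable[[str], Any],
    extractor: TextExtractor = pdfminer_extractor,
) -> Any:
    """Extract text with the chosen extractor, then run the pipeline on it."""
    text = extractor(pdf_path)
    return nlp(text)
```

With this shape, `pdf_to_doc("report.pdf", nlp)` would use the pdfminer.six default, while passing `extractor=my_ocr_function` would swap in a slower but more accurate OCR-based function without touching the spaCy side.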

Closed by #4. You can now choose between different PDF parsers or implement a custom one.