SamEdwardes/spacypdfreader

Implement multi processing to speed up pytesseract

Closed this issue · 5 comments


Not sure my question is relevant here, but is there a way of using the multiprocessing pipe functionality from spaCy?
like in this:

```python
for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):
```

Hey, thanks for the question! I am actually not sure how, or if, `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` would speed things up.

My idea is to use multiprocessing to speed up pytesseract (e.g. performing the OCR on multiple pages at the same time).
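A minimal sketch of that idea, using only the standard library: `ocr_page` below is a stub standing in for the real per-page call (rendering a page to an image and running `pytesseract.image_to_string` on it), so the parallel pattern itself is runnable.

```python
from concurrent.futures import ProcessPoolExecutor


def ocr_page(page_number: int) -> str:
    # Stub for the real per-page OCR step, e.g.
    # pytesseract.image_to_string(rendered_page_image).
    return f"text of page {page_number}"


def ocr_pdf_parallel(num_pages: int, max_workers: int = 4) -> list[str]:
    # OCR each page in its own worker process; pool.map preserves
    # page order, so the result reads like the original document.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, range(1, num_pages + 1)))
```

Since OCR is CPU-bound, process-based (rather than thread-based) parallelism is the right fit here.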

Thanks Sam, my understanding is that it would speed things up if you are processing hundreds of pdf files at the same time.

I think the tricky thing with `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` is that I am not sure how spaCy implemented it.

Take a look at the implementation notes for spacypdfreader:

spacypdfreader/README.md

Lines 85 to 112 in d32cf8e

## Implementation Notes
spaCyPDFreader behaves a little bit different than your typical [spaCy custom component](https://spacy.io/usage/processing-pipelines#custom-components). Typically a spaCy component should receive and return a `spacy.tokens.Doc` object.
spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, `pdf_reader` takes a path to a PDF file and a `spacy.Language` object as parameters and returns a `spacy.tokens.Doc` object. This gives users an easy way to extract text from PDF files while still being able to use and customize all of the features spaCy has to offer via the `spacy.Language` object they pass in.
Example of a "traditional" spaCy pipeline component [negspaCy](https://spacy.io/universe/project/negspacy):
```python
import spacy
from negspacy.negation import Negex
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")
```
Example of `spaCyPDFreader` usage:
```python
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
```
Note that `nlp.add_pipe` is not used by spaCyPDFreader.

Because of this I am not sure if it will play nicely with `nlp.pipe` and setting `n_process`.
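One possible workaround (a sketch of a two-stage pattern, not how spacypdfreader works today): do the PDF-to-text step yourself first, then hand the plain strings to `nlp.pipe`, which keeps spaCy's own multiprocessing available. `pdf_to_text` is a hypothetical stub here, and the spaCy calls are shown as comments so the snippet stays self-contained:

```python
def pdf_to_text(path: str) -> str:
    # Hypothetical stand-in for the extraction that pdf_reader
    # performs internally (pdfminer or pytesseract under the hood).
    return f"contents of {path}"


def prepare_texts(paths: list[str]) -> list[str]:
    # Stage 1: turn every PDF into a plain string up front.
    texts = [pdf_to_text(p) for p in paths]
    # Stage 2 would hand the strings to spaCy, which can then use its
    # own multiprocessing (requires spaCy installed; comment only):
    #   nlp = spacy.load("en_core_web_sm")
    #   docs = list(nlp.pipe(texts, n_process=2, batch_size=100))
    return texts
```

This sidesteps the question of whether `pdf_reader` itself can participate in `nlp.pipe`, since spaCy only ever sees plain strings.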

Consider using Ray to implement multiprocessing. They have a good tutorial here: https://docs.ray.io/en/latest/data/examples/ocr_example.html.