SamEdwardes/spacypdfreader

Implement multi processing to speed up pytesseract

Closed this issue · 5 comments


Not sure my question is relevant here, but is there a way of using the multiprocessing pipe functionality from spaCy?
like in this:

```python
for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):
```

Hey, thanks for the question! I am actually not sure how, or if, `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` would speed things up.

My idea is to use multiprocessing to speed up pytesseract (e.g. performing the OCR on multiple pages at the same time).
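A minimal sketch of that idea, using only the standard library: `ocr_page` below is a stub standing in for the real per-page call (rendering a page to an image and running `pytesseract.image_to_string` on it), so the parallel pattern itself is runnable.

```python
from concurrent.futures import ProcessPoolExecutor


def ocr_page(page_number: int) -> str:
    # Stub for the real per-page OCR step, e.g.
    # pytesseract.image_to_string(rendered_page_image).
    return f"text of page {page_number}"


def ocr_pdf_parallel(num_pages: int, max_workers: int = 4) -> list[str]:
    # OCR each page in its own worker process; pool.map preserves
    # page order, so the result reads like the original document.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, range(1, num_pages + 1)))
```

Since OCR is CPU-bound, process-based (rather than thread-based) parallelism is the right fit here.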

Thanks Sam, my understanding is that it would speed things up if you are processing hundreds of pdf files at the same time.

I think the tricky thing with `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` is that I am not sure how spaCy implemented it.

Take a look at the implementation notes for spacypdfreader:

spacypdfreader/README.md

Lines 85 to 112 in d32cf8e

## Implementation Notes
spaCyPDFreader behaves a little bit different than your typical [spaCy custom component](https://spacy.io/usage/processing-pipelines#custom-components). Typically a spaCy component should receive and return a `spacy.tokens.Doc` object.
spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, `pdf_reader` takes a path to a PDF file and a `spacy.Language` object as parameters and returns a `spacy.tokens.Doc` object. This gives users an easy way to extract text from PDF files while still being able to use and customize all of the features spaCy has to offer via the `spacy.Language` object they pass in.
Example of a "traditional" spaCy pipeline component [negspaCy](https://spacy.io/universe/project/negspacy):
```python
import spacy
from negspacy.negation import Negex
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")
```
Example of `spaCyPDFreader` usage:
```python
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
```
Note that `nlp.add_pipe` is not used by spaCyPDFreader.

Because of this I am not sure if it will play nicely with `nlp.pipe` and setting `n_process`.
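One possible workaround (a sketch of a two-stage pattern, not how spacypdfreader works today): do the PDF-to-text step yourself first, then hand the plain strings to `nlp.pipe`, which keeps spaCy's own multiprocessing available. `pdf_to_text` is a hypothetical stub here, and the spaCy calls are shown as comments so the snippet stays self-contained:

```python
def pdf_to_text(path: str) -> str:
    # Hypothetical stand-in for the extraction that pdf_reader
    # performs internally (pdfminer or pytesseract under the hood).
    return f"contents of {path}"


def prepare_texts(paths: list[str]) -> list[str]:
    # Stage 1: turn every PDF into a plain string up front.
    texts = [pdf_to_text(p) for p in paths]
    # Stage 2 would hand the strings to spaCy, which can then use its
    # own multiprocessing (requires spaCy installed; comment only):
    #   nlp = spacy.load("en_core_web_sm")
    #   docs = list(nlp.pipe(texts, n_process=2, batch_size=100))
    return texts
```

This sidesteps the question of whether `pdf_reader` itself can participate in `nlp.pipe`, since spaCy only ever sees plain strings.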

Consider using Ray to implement multiprocessing. They have a good tutorial here: https://docs.ray.io/en/latest/data/examples/ocr_example.html.