change API to get page-level parallelization everywhere
bertsky opened this issue · 23 comments
In #278 we already had a case where the METS got broken due to Ctrl+C in a single process. We discussed parallelization but agreed to defer that topic. Now this is about providing a locking mechanism that allows parallel processing.
Currently it is impossible to run in parallel over a workspace, simply because it is too dangerous: the result of one actor could be silently lost due to another, without any indication of it. This hampers makefilization and similar efforts.
IIUC, getting FS locking in a portable way is tricky, if you want to get it right.
Does anyone have experience or solutions?
No experience let alone solutions. Sry.
Portability isn't of the essence though because I don't foresee the need to support non-POSIX filesystems.
No experience doing this with Python but the principle is simple and https://github.com/dmfrey/FileLock/blob/master/filelock/filelock.py (the accepted answer you linked to) looks like a reasonable solution.
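For the record, a minimal sketch of what an exclusive lock around a METS update could look like, assuming POSIX only (as noted above) and using `fcntl.flock` rather than the linked FileLock class:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def locked(lock_path):
    """Hold an exclusive advisory lock; blocks until other holders release it."""
    with open(lock_path, 'a') as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)

# hypothetical usage around a METS write:
# with locked('mets.xml.lock'):
#     workspace.save_mets()
```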
I'd like to bring in another idea: maybe file locking is the wrong way to think of page-level parallelization altogether. We have dependencies between successive processors in a workflow, so it does not make much sense to run these in parallel (without going into elaborate synchronization mechanisms like semaphores for all pages), except for a very narrow set of circumstances where workflows contain parallel branches. On the other hand, what really can be run independently (at least for most processors and workflows) are the pages within one processing step. Even now, a processor implementation can do that, e.g. by using a `multiprocessing.Pool.map()` on the `workspace.input_files`.
And there's the rub really: in the current design, `Processor.process()` implementations must iterate over the input files themselves. The responsibility for parallelization therefore lies with the module provider instead of the base class in core.
So for 3.0, what we should aim at is the following:
- deprecate `Processor.process()`
- create `Processor.process_page()` as an abstract method, taking a single `OcrdFile` as input, and change all implementations in the modules accordingly (which is a lot of work, but trivial)
- create `Processor.process_workspace()`, iterating over the input files and delegating to `process_page()` respectively (see the sketch below). Here we have the freedom to use a CPU-parallel implementation under user control. (Of course, as in the workspace-parallel scenario, we then have to make sure GPU-enabled processors do not over-allocate the GPU resources and cause memory races; either using a CPU fallback or retreating to a safe GPU "bottleneck".)
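A minimal sketch of that split, only to make the proposal concrete (the `input_files` helper and `save_mets()` are taken from the current core API, everything else is merely assumed):

```python
class Processor:
    def process(self):
        # legacy entry point, kept but deprecated: delegate to the new API
        self.process_workspace(self.workspace)

    def process_page(self, input_file):
        """Abstract: process a single OcrdFile (implemented by each module)."""
        raise NotImplementedError

    def process_workspace(self, workspace):
        """Provided by core: iterate over the input files, delegate per page.
        This is the single place where CPU-parallel execution could later be
        switched on under user control."""
        for input_file in self.input_files:
            self.process_page(input_file)
        workspace.save_mets()
```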
As a side note: One could of course always argue that workspace-parallel execution is much less complicated and ensures maximum throughput without extra design/implementation effort and synchronization overhead. But this does not cover the use-case of a single (large) workspace, or the use-case of minimum latency.
I guess this proposal should go through the spec first?
It's a good proposal. While the `Processor.process` approach with a thin wrapper around the METS was a neat idea at the time, it has run its course, and breaking this up will allow us more flexibility. I also agree that iterating over the files from within the processor is not a scalable approach and also leads to unnecessary and error-prone copy-paste boilerplate in the processors. We should also make sure that the `process_*` methods are implemented in a way that allows augmenting them at runtime with hooks or a method convention like "processor implements a method `_process_page`" while the BaseProcessor offers flexible `process_page` method wrappers.
I guess this proposal should go through the spec first?
Not really. The workspace-processor approach in core is just one implementation, and while it will be some work to refactor the processors, they will still follow the same CLI; the processor code just gets rid of a lot of loops and inconsistent implementations. From what I see, this change should not result in any changed behavior for users? I will think about it some more though, quite possibly I'm missing something here.
One thing we have to take into account are processors that run on multiple input file groups. For those, `process_page` will have to take a tuple of `OcrdFile`s for the same pageId. Our base implementation `process_workspace` will have to provide for that. (And if no full matches can be found for some pages, then either skip with an error or provide the processor with a `None` input.) This is what `ocrd-cor-asv-ann-evaluate` or `ocrd-cis-align` do on their own now. But there might be additional concerns in the processor, like MIME type filtering (if there is more than one file per page in some fileGrp).
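For illustration, a rough sketch of how core could build those per-page tuples (`mets.find_files()` and `OcrdFile.pageId` exist in core today; the function name and the `on_missing` behaviour are assumptions, and MIME type filtering is left out):

```python
def zip_input_files(workspace, file_grps, on_missing='skip'):
    """Group files from several input fileGrps into per-page tuples."""
    grouped = {}
    for file_grp in file_grps:
        for input_file in workspace.mets.find_files(fileGrp=file_grp):
            grouped.setdefault(input_file.pageId, {})[file_grp] = input_file
    for page_id, files in sorted(grouped.items()):
        if len(files) < len(file_grps) and on_missing == 'skip':
            continue  # or raise/log an error here
        # None stands in for fileGrps that have no file on this page
        yield tuple(files.get(file_grp) for file_grp in file_grps)
```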
There might even be processors which read multi-page files (like COCO segmentation, TIFF or PDF). For example, https://github.com/OCR-D/ocrd_segment/blob/master/ocrd_segment/import_coco_segmentation.py. See OCR-D/spec#142
My thoughts about parallelization:
- would it be possible to parallelize single processing steps automatically within given bounds (CPU, RAM)
- this would make it easier to start multiple workflows at a time
- there could be negative effects: CPU and disk cache thrashing, RAM bandwidth, slowed-down I/O due to disk latency ...
- e.g. ImageMagick-Q16 needs about 2 copies × pixels × 4 channels per pixel × 2 bytes per channel of RAM (see the worked example below). Parallelization could be limited by environment variables
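A quick back-of-the-envelope check of that formula (the page size is just an example):

```python
# Rough RAM estimate for one ImageMagick-Q16 operation, per the formula above.
width, height = 2500, 3500                     # a ~300 dpi scan, for example
copies, channels, bytes_per_channel = 2, 4, 2
ram = copies * width * height * channels * bytes_per_channel
print(f"{ram / 1e6:.0f} MB per image")         # -> 140 MB
```

So even a handful of concurrent instances on large scans adds up quickly.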
It is desirable to support this kind of parallel processing, because it allows getting a much faster OCR result for a single book.
But we should not forget that it is also possible to parallelize on the book level, and that already works of course. As long as there is still a huge number of books waiting for OCR, that kind of parallelization might be sufficient for many users.
Nevertheless, thinking about future improvements is reasonable, and thoughts discussed here could also trigger actions starting now. If a processor should process several pages in parallel, it is important to know the required resources. This is something which the maintainers of each processor know best. So the processor description (JSON) could be extended by information like required CPU cores, maximum RAM, number of GPUs and maybe more, which allows making best use of the available resources. Some of those values might depend on image size or other factors, so they will only be rough estimates, but even that can help. That additional information is also useful for parallelization on the book level, which would otherwise have to guess or work with trial and error.
A central problem for the page level parallelization seems to be updating the METS file.
Locking that file is only one of the possible solutions. Ideally, individual page results can be added to the METS file in arbitrary order. Writing them in order could mean that processes for higher page numbers have to wait until lower page numbers are finished, so RAM would have to be large enough for the running plus the waiting processes.
If the processing of each single page did not write directly to the METS file, but delegated that to a central software instance, that instance could manage orderly updates of the METS file without any locking.
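A minimal sketch of that central-instance idea, assuming a single writer process that owns `mets.xml` and consumes page results from a queue (the tuple layout and the `add_file` keyword arguments are only approximate):

```python
import multiprocessing as mp

def mets_writer(queue, workspace):
    """The only process ever touching mets.xml: consume results in any order."""
    while True:
        item = queue.get()
        if item is None:              # sentinel: all page workers are done
            break
        page_id, file_id, mimetype, local_filename = item
        workspace.add_file('OCR-D-OUT', ID=file_id, pageId=page_id,
                           mimetype=mimetype, local_filename=local_filename)
    workspace.save_mets()

# usage sketch: page workers only put tuples on the queue, never write the METS
queue = mp.Queue()
writer = mp.Process(target=mets_writer, args=(queue, workspace))
writer.start()
# ... run page workers, each calling queue.put((page_id, file_id, mime, path)) ...
queue.put(None)
writer.join()
```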
Or some kind of database, perhaps sqlite3, with conversion from/to METS as the first/final step.
* would it be possible to parallelize single processing steps automatically within given bounds (CPU, RAM)
Yes, that's what the above proposal is about. If we change the API for modules, moving from `process` to `process_page`, then we can do everything `multiprocessing` and other libraries can give us in core.
* this would make it easier to start multiple workflows at a time
What do you mean by that? You still cannot process multiple workflows on the same document, because we would need a METS locking/synchronization mechanism for that. And you can already run the same workflow on multiple documents with workflow-configuration now.
* there could be negative effects: CPU and disk cache thrashing, RAM bandwidth, slowed-down I/O due to disk latency ...
Of course, when running page-parallel, we must offer the user ways to say specifically what resources (not) to allocate. And the more intelligence we invest in automatic flow optimization, the less the user has to experiment with the settings.
* e.g. ImageMagick-Q16 needs about 2 copies × pixels × 4 channels per pixel × 2 bytes per channel of RAM. Parallelization could be limited by environment variables
I don't think we should use environment variables for that (at least not exclusively). See here for various mechanisms we could use for global options like that.
If a processor should process several pages in parallel, it is important to know the required resources. This is something which the maintainers of each processor know best. So the processor description (JSON) could be extended by information like required CPU cores, maximum RAM, number of GPUs and maybe more, which allows making best use of the available resources. Some of those values might depend on image size or other factors, so they will only be rough estimates, but even that can help. That additional information is also useful for parallelization on the book level, which would otherwise have to guess or work with trial and error.
That's a hard problem indeed. Even estimating the (usually non-linear) dependency between input size and resource allocation is a daunting task for implementors. And it also depends a lot on the parameters set by the workflow (e.g. model files or search depth). That's why I think realistically we should be agnostic about resource usage and let the user or workflow engine do the guesswork in real time. (But if there's one piece of information we do need implementors to specify statically, it's whether or not the processor tries to make use of a GPU. See `GPU = 1` in workflow-configuration rules for the GPU semaphore.)
A central problem for the page level parallelization seems to be updating the METS file.
That's not true if core itself provides the parallelization (as outlined above). The `Processor` class can serialize the page results according to their pageId and only write back to the METS once afterwards.
This problem only arises if you want to run different workflows concurrently on the same data.
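As a sketch of what that could look like in `process_workspace` (the per-page result tuple and the `add_file` keyword arguments are purely illustrative):

```python
from multiprocessing import Pool

def process_workspace(self, workspace, processes=4):
    input_files = list(self.input_files)
    with Pool(processes) as pool:
        # each worker returns (page_id, file_id, mimetype, content)
        # and never touches mets.xml itself
        results = pool.map(self.process_page, input_files)
    # write back once, in pageId order, from the parent process only
    for page_id, file_id, mimetype, content in sorted(results):
        workspace.add_file(self.output_file_grp, ID=file_id, pageId=page_id,
                           mimetype=mimetype, content=content)
    workspace.save_mets()
```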
@bertsky: I meant a completely separate parallel workflow (on other data), if it does not make sense to assign all free resources to a single workflow (e.g. if most processors do not use more than 8 CPUs and you have 32 of them).
I meant a completely separate parallel workflow (on other data), if it does not make sense to assign all free resources to a single workflow (e.g. if most processors do not use more than 8 CPUs and you have 32 of them).
In that case everything @stweil and I said about book/document-parallel processing with ocrd-make (or plain `ocrd process`) applies.
- would it be possible to parallelize single processing steps automatically within given bounds (CPU, RAM)
Yes, that's what the above proposal is about. If we change the API for modules, moving from `process` to `process_page`, then we can do everything `multiprocessing` and other libraries can give us in core.
BTW this is not necessarily true for bashlib-based processors. Currently these have to bake some kind of shell loop around `ocrd workspace find`. Whatever we "invent" for the `multiprocessing`-enabled new `Processor.process`, this would have to be emulated by these shell recipes via shell job control or GNU parallel constructs. (Perhaps we could standardize this usefully in bashlib though: delegating to `Processor.process` via a new entrypoint `ocrd bashlib process`, to be called something like `ocrd bashlib process myfunc "$@"`, with `myfunc` as the page-processing function defined somewhere in the script and exported to bashlib with `export -f myfunc`.)
Today I kept an eye on the resources used by the "workflow for slower processors" (https://ocr-d.de/en/workflows). I was surprised that there already exist several processors which create lots (> 10) of threads. They definitely occupy more than a single CPU core. Especially processors with TensorFlow seem to use all available cores (but very inefficiently). That makes efficient parallelization on the book level currently difficult or even impossible.
Ideally it would be possible to force each processor to use only a single core. Then we could process one book per core.
Especially processors with TensorFlow seem to use all available cores (but very inefficiently). That makes efficient parallelization on the book level currently difficult or even impossible.
Ideally it would be possible to force each processor to use only a single core. Then we could process one book per core.
Yes, that kind of behaviour is very undesirable. It should at least be possible to disable automatic multiprocessing/multithreading when running in parallel. This is especially problematic when running on clusters with explicit resource allocation: you'll get oversubscription problems when e.g. Python processors simply use `multiprocessing.Pool` (which internally uses `os.cpu_count()` unless multiscalarity is specified explicitly).
We've had this already with other libraries like OpenMP, OpenBLAS and numpy. For Tensorflow I am afraid that (again) we are in a very uncomfortable situation: for one, there is no simple environment variable to control CPU allocation; it must be done in code. Next, there have been issues with that, IIUC related to TF1 vs TF2 migration. But that ultimately means that we would have to introduce an environment variable to control this in every TF-enabled processor implementation...
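For reference, doing it in code with the TF2 API would look roughly like this in every TF-enabled processor (the `OCRD_NUM_THREADS` variable is hypothetical):

```python
import os
import tensorflow as tf

# hypothetical environment variable every TF-enabled processor would have to honour
num_threads = int(os.environ.get('OCRD_NUM_THREADS', '1'))
# must run before the first TF operation initializes the runtime
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.threading.set_inter_op_parallelism_threads(num_threads)
```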
But that ultimately means that we would have to introduce an environment variable to control this in every TF-enabled processor implementation...
... or new parameters for the processors which use TF. I think that's also necessary in environments with one or more GPUs: parallel runs need to know which CPU core(s) and which GPU(s) to use. That sounds like a real challenging task, and optimized usage of all resources cannot be achieved with simple environment variables or parameters unless each processor uses only a single CPU core.
... or new parameters for the processors which use TF.
Yes. My point was: changes to the modules require maximal effort and are difficult to enforce/maintain uniformly.
I think that's also necessary in environments with one or more GPUs
GPU resources can be fully controlled via `CUDA_VISIBLE_DEVICES` (regardless of the framework, TF / Torch / ...), though. (Also, for multi-GPU, the code needs to be prepared anyway.)
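For example, a book-parallel runner could pin each worker like this (the device index is of course up to the scheduler):

```python
import os

# pin this worker to the first GPU only; an empty string hides all GPUs.
# Must be set before TF/Torch initializes CUDA, i.e. before importing the framework.
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '0')

import tensorflow as tf  # or torch
```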
parallel runs need to know which CPU core(s) and which GPU(s) to use. That sounds like a real challenging task, and optimized usage of all resources cannot be achieved with simple environment variables or parameters unless each processor uses only a single CPU core.
I think the only useful and general restriction we can request modules to provide is running 1 CPU core and 0/1 CUDA device when needed (via envvar / param / cfg), instead of automatic resource maximization. This would at least allow some parallelism at the workflow level. The runtime will have to guess optimal multi-scalarity (without over/under-allocation) from that anyway, because modules themselves cannot be expected to give reliable estimates under all circumstances. And as long as enough documents are run in parallel, the differences in resource utilisation among the various processors in a workflow (vertically) will even out (horizontally) anyway.
Ideally it would be possible to force each processor to use only a single core. Then we could process one book per core.
From the view of the Heidelberg digitization dept. (final check), it would be good to process a single book as fast as possible.
@jbarth-ubhd, tearing out the pages of the book to split it into many is not a solution?
To be serious: technically it is fine to split a book into many books. That means you could make several METS files `mets-1.xml` ... `mets-n.xml` from a single `mets.xml`, with `n` being the number of parallel workflows. Each of those new METS files would only include a part (about `1/n`) of the original page range. Then `n` workflows can run in parallel with `n` different METS files (parallelization on the book level). After the last one has finished, the results could be merged into a single METS file again. Or you can use `ocrd workspace bulk-add` to add only the required results (for example ALTO files) to your final METS file.
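A rough sketch of what such a split could look like with plain lxml, using contiguous page chunks (a real implementation would also have to take care of metadata, empty fileGrps and the later merge):

```python
from copy import deepcopy
from math import ceil
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}
PAGE_PATH = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"

def split_mets(path, n):
    """Write mets-1.xml ... mets-n.xml, each holding about 1/n of the pages."""
    tree = etree.parse(path)
    pages = tree.findall(PAGE_PATH, namespaces=NS)
    chunk = ceil(len(pages) / n)
    for i in range(n):
        keep_pages = {div.get('ID') for div in pages[i * chunk:(i + 1) * chunk]}
        root = deepcopy(tree.getroot())
        keep_files = set()
        for div in root.findall(PAGE_PATH, namespaces=NS):
            if div.get('ID') in keep_pages:
                keep_files.update(fptr.get('FILEID')
                                  for fptr in div.findall('mets:fptr', namespaces=NS))
            else:
                div.getparent().remove(div)
        # drop the files that only belong to the removed pages
        for file_ in root.findall('.//mets:fileSec//mets:file', namespaces=NS):
            if file_.get('ID') not in keep_files:
                file_.getparent().remove(file_)
        etree.ElementTree(root).write(f'mets-{i + 1}.xml',
                                      xml_declaration=True, encoding='UTF-8')
```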
Has anybody some experience with OCR servers which allow many parallel threads? Maybe something like an AMD EPYC with 64 CPU cores / 128 threads, or even better a server with dual processors and 256 threads? We only have a small Intel machine with 16 cores / 32 threads ...
tearing out the pages of the book to split it into many is not a solution?
This has been my suggestion in the 2000s as I saw the large cutting machines from the bookbinding dept :-) .
Yes, it could be done this way in parallel. I think sbb_textline will be the bottleneck if it needs 1 GPU and 10 GB of GPU RAM per instance.
We have a server with 2* Xeon 6150 (36 Cores, 72 Threads, 128 GB RAM), but no GPU unfortunately
Getting back to the original topic of changing the process API to support catching page-level processor failures or employing page-level parallelism in core: besides the concerns already raised about multi-fileGrp input, multi-page input processors and bashlib processors, I want to point out another aspect we forgot, which will likely complicate our porting efforts across processor implementations.
The current API does not make it easy to set up the processor state (resolving file paths, loading large files or initializing large memory objects, setting up configuration and informing about all of that): the constructor is not only used before entering `.process()`, it is also needed for the other runtime contexts (help, dump, version, list resources). To differentiate the former from the latter, you need to check whether the processor instance already has an `output_file_grp` attribute or the like. And that's more writing effort than just placing all initialization straight into the `process` method before the loop. So most implementations have not bothered to spell out which parts pertain to setup and which to processing.
Under the new API, however, that distinction will become obligatory.
EDIT Perhaps the new API should include this directly with a new abstract method `Processor.setup()`. Then `process_workspace` would call this once before entering its `process_page` loop/pool, and call it again whenever the latter came back with an exception. (Not sure how to deal with outright `sys.exit` calls though.)
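A minimal sketch of that idea on top of the `process_workspace`/`process_page` shape proposed above (the `self.logger` attribute and the error handling are just assumptions):

```python
class Processor:
    def setup(self):
        """Abstract: resolve paths, load models, allocate large objects.
        Only called in the actual processing context, never for help/dump/version."""
        raise NotImplementedError

    def process_page(self, input_file):
        raise NotImplementedError

    def process_workspace(self, workspace):
        self.setup()
        for input_file in self.input_files:
            try:
                self.process_page(input_file)
            except Exception as err:
                # log the failure, re-initialize, and continue with the next page
                self.logger.exception("page %s failed: %s", input_file.pageId, err)
                self.setup()
        workspace.save_mets()
```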