chrissimpkins/crypto

Parallel execution of gpg subprocesses


It seems that running multiple gpg instances in parallel could significantly improve performance over the current sequential approach. Since the same passphrase is used for all files anyway, it would make sense to allow them to be encrypted/decrypted at the same time. (The same obviously applies to my proposed creation of tar archives.) Of course, the user can already just start crypto multiple times, and this idea might be overly ambitious, but it sounds like a nice-to-have feature.
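For illustration, a minimal sketch of what this could look like, launching one gpg subprocess per file instead of encrypting sequentially. The passphrase, file names, and exact gpg flags below are assumptions, not the tool's current behavior; GnuPG 2.1+ may additionally need `--pinentry-mode loopback` for non-interactive passphrase entry.

```python
import subprocess

# Assumed inputs: the same passphrase is shared across all files, as noted above.
passphrase = "example-passphrase"
files = ["report.pdf", "notes.txt", "archive.tar"]

# Launch one gpg process per file instead of encrypting one file at a time.
procs = [
    subprocess.Popen(
        ["gpg", "--batch", "--yes", "--passphrase-fd", "0", "--symmetric", path],
        stdin=subprocess.PIPE,
    )
    for path in files
]

# Feed the shared passphrase to each process, then wait for all of them to finish.
for proc in procs:
    proc.communicate(input=passphrase.encode())
```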

This has definitely been on the to-do list. It's high priority, but it is going to require some refactoring of the encryption/decryption code, because the class methods I am using to compress (https://github.com/chrissimpkins/crypto/blob/master/lib/crypto/library/cryptor.py#L32) don't pickle, which prevents use of the multiprocessing module. This is very doable and not terribly difficult, but it will require that refactoring.
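For context, a rough sketch of what that refactor implies: multiprocessing.Pool workers receive their task function by pickling it, and a bound/class method like the ones in cryptor.py won't pickle under Python 2, while a plain module-level function will. The `encrypt_file` name here is a hypothetical stand-in for the existing gpg call:

```python
from multiprocessing import Pool

# A module-level function pickles cleanly, so it can be shipped to Pool workers.
# The equivalent bound/class method on the existing Cryptor class raises a
# PicklingError under Python 2, which is what currently blocks this approach.
def encrypt_file(path):
    # hypothetical stand-in for the gpg encryption call in cryptor.py
    return path

if __name__ == "__main__":
    files = ["report.pdf", "notes.txt", "archive.tar"]
    pool = Pool()
    try:
        results = pool.map(encrypt_file, files)
    finally:
        pool.close()
        pool.join()
```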

queued for future version TBD

Let's make this the default and choose the process count by comparing multiprocessing.cpu_count() and the number of requested files. We can split the files into roughly equal lots across separate worker processes using a worker pool, as you are doing in PR #16 (a sketch follows the links below). The CPU count returned by the Python function will be the upper limit on spawned worker processes.

IMO, this should be transparent to the user and should not require a command-line flag or an explicit process count. We can (and should) play around with performance tuning in the code. For this application, I don't believe there will be much demand for performance tuning on the user side. Let's provide a simple, automated approach that addresses CPU-bound compression and encryption when the system supports it.

Links for future reference:

  • multiprocessing.cpu_count(): Link
  • multiprocessing.Pool: Link
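A minimal sketch of the sizing logic described above, capping the worker count at min(multiprocessing.cpu_count(), number of files); pool.map already splits the file list into roughly equal lots across the workers. The `run_parallel` and `_worker` names are hypothetical, not existing functions in the codebase:

```python
import multiprocessing

def _worker(path):
    # hypothetical stand-in for the per-file gpg encryption/decryption call
    return path

def run_parallel(files):
    # Never spawn more workers than CPUs, or more than there are files,
    # and always keep at least one worker.
    worker_count = max(1, min(multiprocessing.cpu_count(), len(files)))
    pool = multiprocessing.Pool(processes=worker_count)
    try:
        # map() distributes the files across the workers in roughly equal lots
        results = pool.map(_worker, files)
    finally:
        pool.close()
        pool.join()
    return results

if __name__ == "__main__":
    print(run_parallel(["report.pdf", "notes.txt", "archive.tar"]))
```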

Yes, making it parallel by default and using the CPU count is a good idea in my opinion. One could argue that it should actually be cpu_count() - 1 to leave the user some headroom for other work, but I don't think gpg will generally max out the CPU, so cpu_count() should work just fine.

I agree

Let's chat more about this before you go through the trouble of a large refactoring of the code.