danielgatis/rembg

[FEATURE] Add --gpu 0 --gpu 1

zackees opened this issue · 4 comments

Is your feature request related to a problem? Please describe.
I would like to request parallel execution of rembg by exposing GPU selection through a flag such as --gpu 0 or --gpu 1.

Implicitly, --gpu 0 would be the default when running rembg; --gpu 1 would select the second GPU device, and so on. This would allow a front-end tool (like removebackground in https://github.com/zackees/zcmds) to parallelize transparent PNG generation for video, which is heavily bottlenecked by the ML processing step.

Describe the solution you'd like
~2x speed for dual+ GPU owners.

For anyone else coming here with the exact same issue, here are some notes I made while researching this:

onnxruntime does not allow querying of active cards. To query them from Python you need PyTorch or TensorFlow, and that only works if their very large GPU dependencies are installed (yuck). Otherwise, you can query the cards with nvidia-smi.
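As a rough sketch of the nvidia-smi route, the snippet below shells out to `nvidia-smi --query-gpu=index,name --format=csv,noheader` and parses the result. The function names `parse_gpu_list` and `list_gpus` are made up for illustration and are not part of rembg or onnxruntime; it also assumes nvidia-smi is on PATH and returns an empty list when it is not.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[int, str]]:
    """Parse 'index, name' lines as printed by nvidia-smi in CSV mode."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first comma so commas in a device name survive.
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def list_gpus() -> list[tuple[int, str]]:
    """Return (index, name) per NVIDIA GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    for index, name in list_gpus():
        print(index, name)
```

The length of the returned list could then be used as the default for something like a GPU count option in a front end.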

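You can manually set the GPU device per process via an environment variable, for example 0 in the shell that runs the first process and 1 in the shell that runs the second: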

export CUDA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=1
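For a front end that launches the workers itself, the same idea can be expressed without touching the parent shell: give each child process its own copy of the environment with `CUDA_VISIBLE_DEVICES` set. This is only a sketch. The helpers `env_for_gpu` and `launch_per_gpu` are hypothetical names, and it assumes the launched tool actually runs on CUDA, which is not a given (see the Windows/pipx note further down).

```python
import os
import subprocess


def env_for_gpu(gpu_index: int, base_env=None) -> dict:
    """Copy the environment and pin CUDA to a single device index."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_per_gpu(commands):
    """Start one process per command; command i only sees GPU i.

    `commands` is a list of argv lists. Returns the exit codes in order.
    """
    procs = [
        subprocess.Popen(cmd, env=env_for_gpu(i))
        for i, cmd in enumerate(commands)
    ]
    return [p.wait() for p in procs]
```

With two GPUs, a call such as `launch_per_gpu([["rembg", "p", "in_0", "out_0"], ["rembg", "p", "in_1", "out_1"]])` would run two folder jobs side by side; the folder names here are placeholders.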

Okay, I was able to implement this in my front end with a --gpu-count 2 option.

If anyone wants to check that out, see the removebackground cmdlet in https://github.com/zackees/zcmds

To close the loop on this: despite following the install instructions, I could not get rembg to use the GPU on Windows with the pipx install. I think it's possible to set up an isolated environment and conditionally inject the needed torch files.

The workaround for performance is to recognize that the CPU version of the executor does not seem to use threads effectively, so you can get a lot more performance if you split the files into separate folders, run rembg p on each folder in parallel, and then merge the results back into one folder.
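A minimal sketch of that split, run, and merge workaround is below. It is not the code from zcmds; `split_round_robin` and `run_parallel_rembg` are hypothetical helpers, the scratch folders are created in a temporary directory, and the only rembg behaviour it relies on is the `rembg p <input_folder> <output_folder>` form of the CLI.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def split_round_robin(items, n_chunks):
    """Deal items into up to n_chunks lists; empty chunks are dropped."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, item in enumerate(items):
        chunks[i % n_chunks].append(item)
    return [chunk for chunk in chunks if chunk]


def run_parallel_rembg(input_dir, output_dir, workers=5):
    """Run one `rembg p` process per chunk, then merge the outputs."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in input_dir.iterdir() if p.is_file())

    with tempfile.TemporaryDirectory() as scratch:
        jobs = []
        for i, chunk in enumerate(split_round_robin(files, workers)):
            src = Path(scratch, f"in_{i}")
            dst = Path(scratch, f"out_{i}")
            src.mkdir()
            dst.mkdir()
            for f in chunk:
                shutil.copy2(f, src / f.name)
            # All workers are started first so they run concurrently.
            jobs.append((subprocess.Popen(["rembg", "p", str(src), str(dst)]), dst))

        for proc, dst in jobs:
            if proc.wait() != 0:
                raise RuntimeError(f"rembg failed for {dst}")
            # Merge this worker's results back into the single output folder.
            for result in dst.iterdir():
                shutil.move(str(result), str(output_dir / result.name))
```

Copying the inputs keeps the original folder untouched at the cost of extra disk I/O; moving or hard-linking them would be a reasonable alternative. The default of 5 workers is just the value that worked best on the machine described in this thread.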

On my 12-CPU machine, performance maxed out at about 5 processes. Each process reports about 4.5 seconds per iteration, so with 5 running in parallel that averages out to 4.5 / 5 = 0.9 seconds per image, which I consider a massive win. A single process takes about 2.8 seconds per image on my machine, so running rembg p in parallel gave me a speedup of about 3x (2.8 / 0.9 ≈ 3.1).