Seems GPU was not used?
SHuang-Broad opened this issue · 6 comments
So I've built racon (as a Docker image) with GPU support, and successfully tested the image with the provided `racon_test`. For example, when running `racon_test`, one GPU-specific test outputs:
```
[ RUN ] RaconPolishingTest.FragmentCorrectionWithQualitiesFullMhapCUDA
Using 1 GPU(s) to perform polishing
Initialize device 0
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.041138 s
[racon::Polisher::initialize] loaded sequences 0.039576 s
[racon::Polisher::initialize] loaded overlaps 0.009344 s
[racon::Polisher::initialize] aligning overlaps [====================] 4.996705 s
[racon::Polisher::initialize] transformed data into windows 0.053175 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 1.388905 s
[racon::CUDAPolisher::polish] polished windows on GPU 1.949515 s
[racon::CUDAPolisher::polish] polished remaining windows on CPU 0.007528 s
[racon::CUDAPolisher::polish] generated consensus 0.003166 s
[racon::Polisher::] total = 8.896212 s
[ OK ] RaconPolishingTest.FragmentCorrectionWithQualitiesFullMhapCUDA (8898 ms)
```
However, when running on actual data, this is the output I get, and the GPU doesn't seem to be used:
```
[racon::Polisher::initialize] loaded target sequences 0.198220 s
[racon::Polisher::initialize] loaded sequences 247.850106 s
[racon::Polisher::initialize] loaded overlaps 5.450893 s
[racon::Polisher::initialize] aligning overlaps [====================] 823.672149 s
[racon::Polisher::initialize] transformed data into windows 14.626365 s
[racon::Polisher::polish] generating consensus [====================] 5420.680712 s
[racon::Polisher::] total = 6514.470406 s
```
Thanks!
Hi Steve,
in order to run racon on the GPU, you need to specify the number of batches for the POA part and, optionally, the number of batches for the alignment part. You can also use a banded approach on the GPU if you want to. Below are the options you need to set. I have updated the README, as some of them were incomplete. POA batches take around 2.2 GB of memory each, so you have to set the count accordingly for your GPU. For the alignment batches I am not sure. Maybe @tijyojwad and @vellamike can help out.
```
-c, --cudapoa-batches <int>
    default: 0
    number of batches for CUDA accelerated polishing
-b, --cuda-banded-alignment
    use banding approximation for polishing on GPU. Only applicable when -c is used.
--cudaaligner-batches <int>
    default: 0
    number of batches for CUDA accelerated alignment
```
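For example, a GPU-enabled run might look like this (a sketch only; the input file names and batch counts are placeholders to be tuned to your data and your GPU's memory):

```shell
# Hypothetical example invocation; reads.fastq, overlaps.paf and
# assembly.fasta are placeholder names for your own inputs.
# With ~2.2 GB per POA batch, -c 4 needs roughly 9 GB of GPU memory.
racon -t 16 -c 4 -b --cudaaligner-batches 2 \
      reads.fastq overlaps.paf assembly.fasta > polished.fasta
```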
Best regards,
Robert
Thanks Robert.
My understanding is that I need to set both `-c` and `--cudaaligner-batches` to non-zero values to make full use of the GPU, right?
And the values I can set are limited, at the least, by the amount of GPU memory available.
Thanks,
Steve
Hi Steve, you got that right. Setting `-c` and `--cudaaligner-batches` to non-zero values will enable GPU acceleration for the corresponding steps.
For accelerated POA, racon uses 90% of available GPU memory and then distributes that among the specified batches.
For accelerated alignment, each batch takes up a fixed amount of memory, so memory usage scales up as more batches are added.
Thanks Joyjit!
I'm monitoring my GPU (a P100) now on my test data (Malaria, ~300X coverage ONT reads).
It seems that the memory usage per GPU-accelerated alignment batch is around 700 MB?
Or is that related to other variables like read N50, GC content, etc., and hence a case-by-case fixed amount?
Thanks!
The batch size for alignment on the GPU is currently hard-coded, so each batch will take around 700 MB irrespective of the properties of the data.
The CUDA-accelerated alignment is, I'd say, still in a beta phase and undergoing some significant improvements. We'll update the racon integration to be more aware of coverage levels and average read lengths, but for now it's static.
The batch size can be updated in code (https://github.com/lbcb-sci/racon/blob/master/src/cuda/cudapolisher.cpp#L91), and the upper limit on read length per batch is specified at https://github.com/lbcb-sci/racon/blob/master/src/cuda/cudapolisher.cpp#L169
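To get a feel for the numbers, here is a purely illustrative back-of-envelope sketch (not racon code) combining the figures quoted in this thread: ~2.2 GB per CUDA POA batch, ~700 MB per CUDA aligner batch, and the ~90% memory budget for POA. The budgeting order (reserving the aligner batches first) is my assumption; the actual allocation logic lives in cudapolisher.cpp.

```shell
# Estimate how many CUDA POA batches fit on a GPU after reserving
# memory for the aligner batches. All figures are from this thread
# and are hard-coded in the racon version discussed here.
gpu_mem_gb=16        # e.g. a 16 GB P100
aligner_batches=4    # value you plan to pass to --cudaaligner-batches
awk -v mem="$gpu_mem_gb" -v ab="$aligner_batches" 'BEGIN {
    free = mem - ab * 0.7            # GB left after aligner batches
    printf "%d\n", 0.9 * free / 2.2  # POA batches that should fit
}'
# prints 5, a candidate value for -c
```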
Thanks Joyjit!
I've indeed observed that the CUDA alignment step is slightly slower than the CPU version, so it's reassuring that you are working on it!
Closing, since my questions were answered, and quickly!
Thanks all!