efficiency of the projection algorithm / G-arm
WilsonFok2020 opened this issue · 14 comments
Hello,
I am writing to see if you can give me a hand with improving the efficiency of the projection algorithm. Here is how I think the projection algorithm works:
First, the c-arm is moved to a specific location/orientation in space, and the camera coordinates are updated.
Second, the direction of the x-ray is computed.
Third, the path of each ray is traced and the ray is cast through the CT objects.
Fourth, the energy of the ray is absorbed by the material along its path.
Fifth, whatever is left of it is computed at the detector/sensor.
Sixth, any relative difference is shown as a difference in intensity in the projection image (ranging from 0 to 1).
In the case where we have to move the c-arm to multiple locations or orientations, each projection is performed sequentially (steps 1-6). However, if the hardware is capable, i.e. several CT volumes can fit on a GPU, I would like to perform multiple projections together.
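Concretely, the sequential pattern I am describing looks roughly like this (a sketch only; constructor arguments and volume loading are omitted, and the import path is assumed):

```python
import numpy as np
from deepdrr import MobileCArm, Projector  # import path assumed

# The CT volume is uploaded to the GPU once; only the pose changes per shot.
carm = MobileCArm()                        # geometry limits/distances omitted here
projector = Projector(volume, carm=carm)   # `volume` is loaded elsewhere
projector.initialize()

images = []
for alpha in np.linspace(-180, 180, 36):
    carm.move_to(alpha=alpha, beta=0, degrees=True)  # steps 1-2: reposition the c-arm
    images.append(projector())                       # steps 3-6: cast rays, attenuate, read out
projector.free()
```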
As the CT data is not large, hardware does not appear to be the issue here. More crucially, I am unsure how to parallelize the existing algorithm.
What I have succeeded in doing is dividing the input argument space and parallelizing the task across separate processes (each with its own unique PID), up to 4 at a time. The method is to run 4 separate Python scripts simultaneously, each carrying its own mutually exclusive set of input parameters. The shortcomings are that the same CT volume is copied 4 times on the GPU, that these processes do not talk to each other, and that this method falls short of resolving my next problem.
My next problem is to emulate a G-arm, a device that is like 2 c-arms joined at a right angle to each other. A G-arm can take the frontal and lateral views simultaneously because it carries 2 sets of source and sensor, at a right angle to each other.
One approach is to run steps 1-6 twice, with one setting orthogonal to the other. However, a complete projection takes too long. In other words, it is too slow for a "real-time" widget in a Python Jupyter notebook that illustrates how changes in alpha, beta, distances, and offsets affect the appearance of the projected images.
My thinking is that if I can keep a main process and spawn 2 subprocesses, each carrying a separate projector object, then I can pass 2 sets of orthogonal settings from the main process to the objects in the subprocesses and get back the frontal and lateral views simultaneously in the main process.
While Process from the standard Python multiprocessing library lets me spawn processes, I face numerous errors concerning pycuda, memory initialization, etc. These errors do not arise in a single/main process. As a note, it is possible to create multiple projector objects in the SAME/main process; this is verified by the increased amount of memory usage on the GPU under an identical PID. Of course, this is of no use whatsoever except to show that the errors do not arise from having multiple objects alone.
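For reference, the pattern I have been attempting looks roughly like the sketch below. My (possibly wrong) understanding is that each child needs to create its own CUDA context, e.g. via the 'spawn' start method, rather than inherit one from the parent; `make_projector` is a hypothetical helper that would build the projector inside the worker.

```python
import multiprocessing as mp

def worker(pose_queue, image_queue):
    # Create the CUDA context inside the child; a context created in the
    # parent process cannot safely be reused after fork.
    import pycuda.autoinit  # noqa: F401
    projector = make_projector()  # hypothetical: builds a Projector and uploads the volume
    for alpha, beta in iter(pose_queue.get, None):  # None is the shutdown sentinel
        projector.carm.move_to(alpha=alpha, beta=beta, degrees=True)
        image_queue.put(projector())
    projector.free()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # do not fork an already-initialized CUDA context
    poses, images = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=worker, args=(poses, images)) for _ in range(2)]
    for w in workers:
        w.start()
    poses.put((0, 0))    # frontal pose
    poses.put((90, 0))   # lateral pose
    results = [images.get() for _ in range(2)]  # order not guaranteed; tag by pose if needed
    for _ in workers:
        poses.put(None)
    for w in workers:
        w.join()
```

Note that each spawned worker still uploads its own copy of the volume, so this does not remove the duplication I mentioned above.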
Any ideas on how this parallelization could be achieved for the notebook?
Thanks
Hi @WilsonFok2020,
Thank you for your interest in using our tools. Here are a few points, though I am not sure I will be able to address all of them. @benjamindkilleen and @gaocong13 may answer as well, though we are currently working around a few important deadlines, so thank you for your patience.
- The way you describe the algorithm is approximately correct at a high level; there are several more things happening. Depending on the volume size you work with and your GPU, GPU memory may in fact be scarce. But again, this depends on your specific setup.
- Because the above point was much worse when we first released the tool, we did not consider working on multiple volumes simultaneously. Rather, we were trying to minimize memory transfer times, i.e., having one volume (a 3D scene) uploaded to the GPU and then interacting with it by manipulating the geometry (which only requires the transfer of a new projection matrix). The exact reason for this has not been publicized yet, but keep an eye out ;)
- For simulating a G-arm, I would recommend simply extending some of the functionality we provide: either, rather than calling the projector once with one matrix, call it twice with the two matrices of the two cameras (as sketched below this list); or re-write the projector to perform both projections at once. The memory footprint of the latter will not be much larger, but it will also not be much faster, since you really only save one memory transfer (unless of course you parallelize over the pixels of both detector planes, which would require you to spawn 2x NumPixel threads; this may be possible with contemporary hardware).
- Other than that, the bugs I think you are describing seem to be associated with memory management, where your threads try to allocate the same memory on the GPU. So I think a different approach is necessary, and I think you could do this even without multiple CPU threads.
- Your interface with sliders for adjusting spatial parameters is neat; we wanted to do something similar but never got around to it. Do you want to contribute it to the public repo?
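To illustrate the first option above, a rough sketch (untested; it simply reuses the carm.move_to / projector() calling pattern) that renders the frontal and lateral views back to back:

```python
def g_arm_shot(projector, alpha, beta):
    """Emulate a G-arm with one projector: same volume, two orthogonal poses."""
    projector.carm.move_to(alpha=alpha, beta=beta, degrees=True)
    frontal = projector()
    # The second source/detector pair sits at 90 degrees to the first; whether the
    # offset belongs on alpha or beta depends on your mounting convention.
    projector.carm.move_to(alpha=alpha + 90, beta=beta, degrees=True)
    lateral = projector()
    return frontal, lateral
```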
Cheers,
m
Hi @mathiasunberath ,
As I have been busy working on other projects, I have not managed to try out the above suggestions.
After reading your reply, I reckon there is not much I can re-write to speed up the projection, as the possible ways of speeding it up require more CPU cores or a GPU with a much larger memory capacity. Neither is available for this project at the moment.
That said, I have not given up on making it run in "real time".
My latest idea to speed up the projection is to do less of it.
Let me explain.
No matter which scanner we are talking about (i.e. c-arm or g-arm), we need to transfer the CT volume onto the GPU. It is hard to save time there.
When using a single volume, the easiest way is for the camera to move to a different position and orientation to take each shot. This is what I have been trying, and it is what is mentioned in ITEM 3.
Parallelization may give us some speedup if and only if we can afford to load multiple copies of the same volume on the GPU and have 2x numPixel threads.
In addition, the resolution of a projected image contributes significantly to the speed of the projection.
Empirically, the time taken to project an image drops quadratically with shorter height/width (for a square detector panel), because the number of pixels scales with area = height x width.
That implies that whenever we increase the pixel spacing or decrease the resolution, projection runs much faster, at the expense of resolution.
Occasionally, low-resolution images can be enhanced via a neural network. An example of such work can be found at https://arxiv.org/abs/2008.08767.
I have read several such works and come away with the hunch that these networks may be suitable for enhancing a low-resolution projection into an acceptable-resolution projection.
These networks appear to be good at filling in repetitive patterns, which are abundant in spine CT because of the spinal anatomy.
Therefore, a "quick" projection may come from combining such a network with a low-resolution projection.
Of course, this is at best a bridging solution. It would be great to see a solution to the fundamental projection algorithm itself.
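As a heavily simplified sketch of what I mean, with plain bicubic upsampling standing in for the (separately trained) super-resolution network:

```python
import torch
import torch.nn.functional as F

def quick_projection(projector_lowres, upscale=4, sr_model=None):
    """Project at low resolution, then upsample to the target size.
    sr_model is a hypothetical trained super-resolution network; bicubic
    interpolation is used as a stand-in when none is supplied."""
    image = projector_lowres()                           # low-resolution projection
    x = torch.from_numpy(image)[None, None].float()      # -> (1, 1, H, W)
    with torch.no_grad():
        if sr_model is not None:
            y = sr_model(x)
        else:
            y = F.interpolate(x, scale_factor=upscale, mode="bicubic", align_corners=False)
    return y[0, 0].numpy()
```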
Cheers,
Changing the width and height of a projection image will not in fact accelerate projection because the algorithm is parallelized over this dimension, i.e.: provided your GPU has enough threads (which all contemporary ones should), the projection time will not improve by doing this.
You can change the resolution of sampling the rays during casting, but I'd not necessarily recommend doing that.
Adding to what Mathias has suggested, there is one area where further parallelization may enable what you are envisioning. I would not recommend loading the volume twice. However, there's no reason you couldn't generalize the kernel over NUM_VIEWS projection matrices. This might accomplish what you want for simulating a G-arm in a single call.
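Roughly what that generalization might look like on the host side (a sketch only; `project_multiview_kernel`, `P_frontal`, `P_lateral`, and the block/grid variables are placeholders, and the kernel itself would need to select its projection matrix via `blockIdx.z`):

```python
import numpy as np
import pycuda.driver as cuda

num_views = 2  # frontal + lateral for a G-arm

# Stack the per-view projection matrices into one array and upload it once.
proj_matrices = np.stack([P_frontal, P_lateral]).astype(np.float32)   # (num_views, 3, 4)
proj_matrices_gpu = cuda.mem_alloc(proj_matrices.nbytes)
cuda.memcpy_htod(proj_matrices_gpu, proj_matrices)

# One output image per view, all written by a single launch.
intensity_gpu = cuda.mem_alloc(num_views * sensor_h * sensor_w * 4)    # float32

# The third grid dimension enumerates views, so both views come back
# from a single kernel launch instead of two separate calls.
project_multiview_kernel(
    proj_matrices_gpu, np.int32(num_views), intensity_gpu,
    block=block, grid=(blocks_w, blocks_h, num_views),
)
```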
I have done a small experiment to work out the impact of resolution on the time taken to generate a projection image.
Two settings were tested on a Tesla K80 with CUDA 10.2. A large one (1024 x 1024, pixel size 0.194 mm) takes about 1 min 21 s ± 157 ms per loop (mean ± std. dev. of 5 runs, 1 loop each); a small one (102 x 102, pixel size 1.9 mm) takes about 4.65 s ± 19.2 ms per loop (mean ± std. dev. of 5 runs, 1 loop each).
It appears that lowering the resolution speeds up the projection. How could we maintain the speed as the resolution rises?
P.S. Here is the testing code:
alpha_list = np.linspace(180, 0, 36)
beta = 0

def setup_carm(factor):
    min_alpha = -180
    max_alpha = 180
    min_beta = -180
    max_beta = 180
    pixel_size = 0.194 * factor
    sensor = int(1024 / factor)
    carm = MobileCArm(min_alpha=min_alpha,
                      max_alpha=max_alpha,
                      min_beta=min_beta,
                      max_beta=max_beta,
                      source_to_detector_distance=1110,
                      source_to_isocenter_vertical_distance=600,
                      source_to_isocenter_horizontal_offset=0,
                      sensor_width=sensor,
                      sensor_height=sensor,
                      pixel_size=pixel_size)
    carm.reposition(dataset.volume.center_in_world)
    print('factor {}, pixel_size {}, sensor {}'.format(factor, pixel_size, sensor))
    return carm

def filming(alpha_list, beta, projector):
    for alpha in alpha_list:
        projector.carm.move_to(alpha=alpha, beta=beta, degrees=True)
        projection = projector()
    return None

def setup_projector(dataset, carm):
    projector = Projector(dataset.volume, carm=carm,
                          add_noise=False,
                          spectrum=dataset.spectrum)
    projector.initialize()
    return projector

carm = setup_carm(1)
projector = setup_projector(dataset, carm)
%timeit -r 5 filming(alpha_list, beta, projector)
projector.free()
The test is supposed to measure only the time required for the projection itself, excluding the volume transfer to the GPU.
You are performing multiple projections, and while indeed the volume is only transferred once, the memory for the projection has to be transferred multiple times. So you are still measuring the transfer, which with a larger projection will clearly take longer. I'm not fully aware of whether we currently mix the materials in the detector or not (@benjamindkilleen ?), but if we do not, and we transfer one projection for every material, then that could be an option for speedup.
I just want to be clear that I understand what you mean.
The huge discrepancy between the two duration measurements is due to the time required to transfer a 1024x1024 (float32) array from GPU memory to CPU memory being far longer than for a 102x102 array. The speed at which this occurs depends on the size of the array. Thus, the test measures the projection + the transfer of the projection images.
With that said, based on the specifications from NVIDIA for the Tesla K80, I find it hard to figure out why 480 GB/s of memory bandwidth is insufficient to make the transfer "very fast".
P.S. https://www.nvidia.com/en-gb/data-center/tesla-k80/
- 4992 NVIDIA CUDA cores with a dual-GPU design
- Up to 2.91 teraflops double-precision performance with NVIDIA GPU Boost
- Up to 8.73 teraflops single-precision performance with NVIDIA GPU Boost
- 24 GB of GDDR5 memory
- 480 GB/s aggregate memory bandwidth
- ECC protection for increased reliability
- Server-optimised to deliver the best throughput in the data center
What I've been saying so far is that the time you have measured does include the memory transfer. Before we can conclude whether smaller projections speed things up beyond the memory transfer, we need to benchmark differently.
Further, there may be subsequent processing (i.e. noise etc.) performed after projection, which will also scale with image size.
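One way to separate the two (a generic pycuda pattern, not code from this repo; the kernel arguments here are placeholders) is to bracket the kernel launch and the device-to-host copy with CUDA events:

```python
import numpy as np
import pycuda.driver as cuda

start, after_kernel, after_copy = cuda.Event(), cuda.Event(), cuda.Event()

start.record()
project_kernel(*args, block=block, grid=(blocks_w, blocks_h))  # placeholder launch
after_kernel.record()

out = np.empty(output_shape, dtype=np.float32)
cuda.memcpy_dtoh(out, intensity_gpu)
after_copy.record()
after_copy.synchronize()

print("kernel  : %.3f ms" % start.time_till(after_kernel))
print("transfer: %.3f ms" % after_kernel.time_till(after_copy))
```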
The current setup mixes materials in the kernel, and the subsequent processing is turned off by default. Thus, your test is likely capturing the transfer of the final image, as Mathias suggests. Again, this might be sped up by generalizing to multiple projections in a single kernel call.
What is the bandwidth of your RAM? That is likely a bottleneck if it is less than the GPU's.
For some unknown reason, I could not identify the bandwidth of the system's RAM. As our code runs on AWS, perhaps AWS does not expose the underlying hardware details to the virtual machine?
Here is the printout:
sudo dmidecode --type 17
dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: 0x0000
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: Not Specified
Type: RAM
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Clock Speed: Unknown
Handle 0x1101, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: 0x0000
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: Not Specified
Type: RAM
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Clock Speed: Unknown
Handle 0x1102, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: 0x0000
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 2
Bank Locator: Not Specified
Type: RAM
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Clock Speed: Unknown
Handle 0x1103, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: 0x0000
Total Width: 64 bits
Data Width: 64 bits
Size: 13312 MB
Form Factor: DIMM
Set: None
Locator: DIMM 3
Bank Locator: Not Specified
Type: RAM
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Clock Speed: Unknown
sudo lshw -C memory
*-firmware
description: BIOS
vendor: Xen
physical id: 0
version: 4.2.amazon
date: 08/24/2006
size: 96KiB
capabilities: pci edd
*-memory
description: System Memory
physical id: 1000
size: 61GiB
capabilities: ecc
configuration: errordetection=multi-bit-ecc
*-bank:0
description: DIMM RAM
physical id: 0
slot: DIMM 0
size: 16GiB
width: 64 bits
*-bank:1
description: DIMM RAM
physical id: 1
slot: DIMM 1
size: 16GiB
width: 64 bits
*-bank:2
description: DIMM RAM
physical id: 2
slot: DIMM 2
size: 16GiB
width: 64 bits
*-bank:3
description: DIMM RAM
physical id: 3
slot: DIMM 3
size: 13GiB
width: 64 bits
I have timed the projection and the transfer of the projection image separately.
For simplicity's sake, I used the same camera pose for the two "orthogonal" views (stored in intensity_gpu / intensity_gpu2) in this test.
start_time = time.time()
self.project_kernel(
    *args, offset_w, offset_h, block=block, grid=(blocks_w, blocks_h)
)
context.synchronize()
print('taken {} s'.format(time.time() - start_time))

start_time = time.time()
self.project_kernel(
    *args2, offset_w, offset_h, block=block, grid=(blocks_w, blocks_h)
)
context.synchronize()
print('taken {} s'.format(time.time() - start_time))
With context.synchronize() enforced, a projection takes about 2 s.
taken 1.801316499710083 s
taken 1.806384563446045 s
Is it supposed to take 2 s? Could it be faster?
Memory transfer is very fast.
transfer taken 0.0005948543548583984 s
transfer taken 0.0005393028259277344 s
intensity2 = np.empty(self.output_shape, dtype=np.float32)
start_time = time.time()
cuda.memcpy_dtoh(intensity2, self.intensity_gpu2)
print ('transfer taken {} s'.format(time.time() - start_time))
intensity = np.empty(self.output_shape, dtype=np.float32)
start_time = time.time()
cuda.memcpy_dtoh(intensity, self.intensity_gpu)
print ('transfer taken {} s'.format(time.time() - start_time))
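As a back-of-the-envelope check on these numbers (my own arithmetic, not part of the measurement): a 1024x1024 float32 image is about 4 MiB, so a ~0.6 ms device-to-host copy corresponds to roughly 7 GB/s, which is in the range of a PCIe host copy rather than the 480 GB/s on-device GDDR5 bandwidth quoted above.

```python
# Effective device-to-host bandwidth implied by the measured transfer time.
bytes_transferred = 1024 * 1024 * 4      # float32 image, ~4 MiB
measured_seconds = 0.0006                # ~0.6 ms, as printed above
print('{:.1f} GB/s'.format(bytes_transferred / measured_seconds / 1e9))  # ~7.0 GB/s
```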
Thanks,
Just for the record, you are absolutely right. I have done further experiments showing that the spatial resolution indeed has nothing to do with the speed of projection, because CUDA parallelizes that computation over the blocks and grids.
Good to know - the caveat being that it will affect render time once the number of pixels in the projection outnumbers the threads.
Did you further investigate the memory transfer time issue? Is it initialization related in the first call?
From my experience, it takes longer if the array is larger. However, for this application, the time required to transfer a 1024x1024 array is very short (< 0.001 s).