ahendriksen/tomosipo

tomosipo runtime vs odl


Hello,

Thanks for developing this interesting project! I am particularly curious about using tomosipo's operators in PyTorch, much like what ODL can do. Correct me if I'm wrong, but as far as I understand it, one of tomosipo's "selling points" is that data stays on the GPU as much as possible. This can improve training runtime, since CPU-GPU transfers add unnecessary overhead. In the example below, I'm using a GPU (Tesla T4 on Google Colab) and comparing the runtime of applying the operator in ODL, in tomosipo, and via the tomosipo autograd wrapper.

from time import time
import odl
import tomosipo as ts
from tomosipo.odl import from_odl
from tomosipo.torch_support import (
    to_autograd,
)
import torch


# ODL setup: 300x300 reconstruction space and a 360-angle parallel-beam geometry
space = odl.uniform_discr(
    min_pt=[-20, -20], 
    max_pt=[20, 20], 
    shape=[300, 300],
    dtype='float32'
)
geometry = odl.tomo.parallel_beam_geometry(space, 360)
operator_odl = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

# Equivalent tomosipo operator, converted from the ODL space and geometry
space_ts = from_odl(space)
geometry_ts = from_odl(geometry)
operator_ts = ts.operator(space_ts, geometry_ts)

op = to_autograd(operator_ts, is_2d=True, num_extra_dims=2)

# Test inputs: two extra leading dims for the autograd wrapper, (1, H, W) for the
# raw tomosipo operator (both on the GPU), and a plain CPU tensor for ODL
x_ts_ag = torch.ones((1, 1, 300, 300), device='cuda')
x_ts = torch.ones((1, 300, 300), device='cuda')
x_odl = torch.ones((300, 300))

t1 = time()
for _ in range(10000):
    op(x_ts_ag)
t2 = time()

print('autograd ts op:', t2-t1)

t1 = time()
for _ in range(10000):
    operator_ts(x_ts)
t2 = time()

print('ts op:', t2-t1)

t1 = time()
for _ in range(10000):
    y = operator_odl(x_odl) 
t2 = time()

print('odl op:', t2-t1)

I get the following results:

autograd ts op: 12.633586168289185
ts op: 10.378157615661621
odl op: 12.525085687637329

My understanding of these results is that although the tomosipo operator is slightly faster over 10,000 iterations (presumably due to the CPU-GPU communication latency mentioned above?), adding autograd support to the operator (a necessary addition at that) makes the difference between ODL and tomosipo negligible. Hopefully I'm missing something, because the results from your learned_PD notebook look impressive!
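
(As an aside, CUDA kernel launches are asynchronous, so a slightly more careful variant of the timing loops would call torch.cuda.synchronize() before reading the clock; the numbers above were measured without this. A sketch of one synchronized loop, reusing op and x_ts_ag from the script above:)

torch.cuda.synchronize()
t1 = time()
for _ in range(10000):
    op(x_ts_ag)
torch.cuda.synchronize()
t2 = time()

print('autograd ts op (synced):', t2 - t1)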

Hi,

I don't have an environment that can run ODL at the moment, but I did some benchmarking comparing the ordinary tomosipo operator and the autograd version on both the GPU and CPU at different sizes.

The script was as follows:

from time import time
import tomosipo as ts
from tomosipo.torch_support import (
    to_autograd,
)
import torch

repetitions = 10000

print("size     ts     ts(cuda)     ag     ag(cuda)")
for size in [100, 200, 300, 400, 500, 750, 1000]:
    # Slice volume and parallel-beam geometry, both scaled with `size`
    vg = ts.volume(shape=(1, size, size))
    pg = ts.parallel(angles=size, shape=(1, size))
    
    op_ts = ts.operator(vg, pg)
    op_ag = to_autograd(op_ts, is_2d=True, num_extra_dims=2)
    
    # CPU and CUDA test inputs for the raw operator and the autograd wrapper
    x_ts = torch.ones((1,size,size))
    x_ag = torch.ones((1,1,size,size))
    x_ts_c = torch.ones((1,size,size), device="cuda")
    x_ag_c = torch.ones((1,1,size,size), device="cuda")
    
    t1 = time()
    for _ in range(repetitions):
        op_ts(x_ts)
    d_ts = time()-t1
    
    t1 = time()
    for _ in range(repetitions):
        op_ts(x_ts_c)
    d_ts_c = time()-t1
    
    t1 = time()
    for _ in range(repetitions):
        op_ag(x_ag)
    d_ag = time()-t1
    
    t1 = time()
    for _ in range(repetitions):
        op_ag(x_ag_c)
    d_ag_c = time()-t1
    
    print(f"{size:4d}:{d_ts:8.3f}  {d_ts_c:8.3f}  {d_ag:8.3f}  {d_ag_c:8.3f}")

And the results were:

size     ts     ts(cuda)     ag     ag(cuda)
 100:   4.344     2.749     4.921     3.211
 200:   5.126     3.191     5.610     3.611
 300:   6.544     4.119     7.010     4.610
 400:   9.231     6.099     9.734     6.636
 500:  13.894     9.543    14.421    10.002
 750:  36.014    26.272    36.906    27.240
1000:  73.030    59.331    73.697    59.928

These results show that the performance gain from avoiding CPU-GPU copies scales with the size of the problem. The overhead of the autograd version, on the other hand, is roughly constant (about 0.5 s over the 10,000 iterations here).
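
(For intuition only, and not the actual to_autograd implementation: the wrapper can be thought of as a torch.autograd.Function that applies the operator in the forward pass and its transpose in the backward pass. The per-call bookkeeping does not depend on the problem size, which matches the roughly constant overhead above. A minimal sketch, with a made-up geometry:)

import tomosipo as ts
import torch

class ProjectionFunction(torch.autograd.Function):
    # Sketch: forward applies the linear operator A, backward applies A.T
    @staticmethod
    def forward(ctx, x, A):
        ctx.A = A
        return A(x)

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.A.T(grad_output), None

vg = ts.volume(shape=(1, 300, 300))
pg = ts.parallel(angles=360, shape=(1, 300))
A = ts.operator(vg, pg)

x = torch.ones((1, 300, 300), device="cuda", requires_grad=True)
y = ProjectionFunction.apply(x, A)
y.sum().backward()  # gradient of sum(A x) w.r.t. x, computed via A.T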

Avoiding CPU-GPU copies is not the only way that PyTorch and CUDA can improve performance. The other computations between the projection operators can also be performed on the GPU using PyTorch, which may speed things up further. In the paper, we showed that running the SIRT reconstruction algorithm entirely on the GPU gave roughly a twofold speedup compared to performing only the projection operators on the GPU.
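
(To make that concrete, here is a rough sketch of a SIRT loop that keeps every intermediate on the GPU; the geometry, iteration count, and clamping epsilon are made up for illustration and this is not the code from the paper:)

import tomosipo as ts
import torch

vg = ts.volume(shape=(1, 256, 256))
pg = ts.parallel(angles=256, shape=(1, 256))
A = ts.operator(vg, pg)

dev = "cuda"
y = torch.zeros(A.range_shape, device=dev)   # measured sinogram (placeholder)
x = torch.zeros(A.domain_shape, device=dev)  # reconstruction estimate

# SIRT scaling factors: reciprocals of the row and column sums of A
R = 1.0 / torch.clamp(A(torch.ones(A.domain_shape, device=dev)), min=1e-6)
C = 1.0 / torch.clamp(A.T(torch.ones(A.range_shape, device=dev)), min=1e-6)

for _ in range(100):
    # residual, backprojection, and update all stay on the GPU
    x += C * A.T(R * (y - A(x)))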

I hope this helped! If you have any further questions or interesting benchmarking results, please let me know!

Kind regards,
Dirk

Thanks! That makes a lot of sense. On reflection, I also think my test wasn't entirely fair, since I didn't make the ODL operator PyTorch-compatible with odl.contrib.torch.OperatorModule, which probably would have added some overhead as well. I'll let you know if I have any other thoughts in the meantime!
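
(For reference, a hedged sketch of what that fairer comparison might look like, reusing operator_odl from the first script and assuming odl.contrib.torch is available:)

from odl.contrib.torch import OperatorModule
import torch

odl_module = OperatorModule(operator_odl)  # wraps the ODL ray transform as a torch.nn.Module

x = torch.ones((1, 1, 300, 300), requires_grad=True)
y = odl_module(x)       # differentiable forward projection (leading dims treated as batch)
y.sum().backward()      # gradients computed via the operator's adjoint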