tomosipo runtime vs odl
Hello,
Thanks for developing this interesting project! I am particularly curious about using tomosipo's operators in PyTorch, much like what ODL can do. Correct me if I'm wrong, but as far as I understand it, one of tomosipo's "selling points" is that data stays on the GPU as much as possible, which can improve training runtime because unnecessary CPU-GPU transfers would otherwise add to the computation time. In the example below, I use a GPU (a Tesla T4 on Google Colab) and compare the runtime of applying the operator in ODL, in tomosipo, and via the tomosipo autograd operator.
from time import time
import odl
import tomosipo as ts
from tomosipo.odl import (
    from_odl,
    discretized_space_2d_to_3d,
    parallel_2d_to_3d_geometry,
    fan_to_cone_beam_geometry,
)
from tomosipo.torch_support import (
    to_autograd,
)
import torch

# 2D reconstruction space and parallel-beam geometry in ODL
space = odl.uniform_discr(
    min_pt=[-20, -20],
    max_pt=[20, 20],
    shape=[300, 300],
    dtype='float32',
)
geometry = odl.tomo.parallel_beam_geometry(space, 360)
operator_odl = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

# Equivalent tomosipo operator, converted from the ODL objects
space_ts = from_odl(space)
geometry_ts = from_odl(geometry)
operator_ts = ts.operator(space_ts, geometry_ts)
op = to_autograd(operator_ts, is_2d=True, num_extra_dims=2)

x_ts_ag = torch.ones((1, 1, 300, 300), device='cuda')
x_ts = torch.ones((1, 300, 300), device='cuda')
x_odl = torch.ones((300, 300))

t1 = time()
for _ in range(10000):
    op(x_ts_ag)
t2 = time()
print('autograd ts op:', t2 - t1)

t1 = time()
for _ in range(10000):
    operator_ts(x_ts)
t2 = time()
print('ts op:', t2 - t1)

t1 = time()
for _ in range(10000):
    y = operator_odl(x_odl)
t2 = time()
print('odl op:', t2 - t1)
I get the following results:
autograd ts op: 12.633586168289185
ts op: 10.378157615661621
odl op: 12.525085687637329
My reading of these results is that although the plain tomosipo operator is somewhat faster over 10,000 iterations (presumably because it avoids the CPU-GPU transfers mentioned above?), the autograd wrapper, which is necessary for training, adds enough overhead to make the difference between ODL and tomosipo negligible.
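(For context, the reason I need the autograd wrapper at all is to backpropagate through the projector when training a network; a minimal sketch using the op defined above, where the loss is just a stand-in:)

x = torch.zeros((1, 1, 300, 300), device='cuda', requires_grad=True)
y = op(x)                 # forward projection, tracked by autograd
loss = y.square().mean()  # any scalar training loss would do here
loss.backward()           # gradient w.r.t. x via the back-projection

Hopefully I'm missing something, because the results from your learned_PD notebook look impressive!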
Hi,
I don't have an environment that can run ODL at the moment, but I did some benchmarking comparing the ordinary tomosipo operator and the autograd version on both the GPU and CPU at different sizes.
The script was as follows:
from time import time
import tomosipo as ts
from tomosipo.torch_support import (
    to_autograd,
)
import torch

repetitions = 10000

print("size       ts ts(cuda)       ag ag(cuda)")
for size in [100, 200, 300, 400, 500, 750, 1000]:
    # Single-slice volume and parallel-beam geometry of matching size
    vg = ts.volume(shape=(1, size, size))
    pg = ts.parallel(angles=size, shape=(1, size))
    op_ts = ts.operator(vg, pg)
    op_ag = to_autograd(op_ts, is_2d=True, num_extra_dims=2)

    x_ts = torch.ones((1, size, size))
    x_ag = torch.ones((1, 1, size, size))
    x_ts_c = torch.ones((1, size, size), device="cuda")
    x_ag_c = torch.ones((1, 1, size, size), device="cuda")

    t1 = time()
    for _ in range(repetitions):
        op_ts(x_ts)
    d_ts = time() - t1

    t1 = time()
    for _ in range(repetitions):
        op_ts(x_ts_c)
    d_ts_c = time() - t1

    t1 = time()
    for _ in range(repetitions):
        op_ag(x_ag)
    d_ag = time() - t1

    t1 = time()
    for _ in range(repetitions):
        op_ag(x_ag_c)
    d_ag_c = time() - t1

    print(f"{size:4d}:{d_ts:8.3f} {d_ts_c:8.3f} {d_ag:8.3f} {d_ag_c:8.3f}")
And the results were:
size       ts ts(cuda)       ag ag(cuda)
 100:   4.344    2.749    4.921    3.211
 200:   5.126    3.191    5.610    3.611
 300:   6.544    4.119    7.010    4.610
 400:   9.231    6.099    9.734    6.636
 500:  13.894    9.543   14.421   10.002
 750:  36.014   26.272   36.906   27.240
1000:  73.030   59.331   73.697   59.928
These results show that the performance gain from avoiding CPU-GPU copies scales with the size of the problem, whereas the overhead of the autograd wrapper is roughly constant (about 0.5 s over the 10,000 iterations here).
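That constant overhead makes sense: the autograd wrapper essentially wraps the same projection call in a custom torch.autograd.Function whose backward applies the transpose operator. A simplified sketch of the idea (not tomosipo's exact internals; op_ts is the operator from the script above):

import torch

class ProjectorFunction(torch.autograd.Function):
    # Simplified: the real wrapper also handles extra batch dimensions etc.
    @staticmethod
    def forward(ctx, x):
        return op_ts(x)              # forward projection

    @staticmethod
    def backward(ctx, grad_output):
        return op_ts.T(grad_output)  # back-projection (the transpose)

# y = ProjectorFunction.apply(x)

So per call you only pay for the Function bookkeeping, independent of the problem size.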
Avoiding CPU-GPU copies is not the only way that PyTorch and CUDA can improve performance. The other computations between the projection operators can also be performed on the GPU with PyTorch, which may speed things up further. In the paper we showed that running the SIRT reconstruction algorithm fully on the GPU gave roughly a two times speedup compared to only performing the projection operators on the GPU.
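To illustrate what "fully on the GPU" means, here is a minimal SIRT sketch (not the exact code from the paper; the geometry sizes, iteration count, and eps clamp are just placeholders):

import torch
import tomosipo as ts
import tomosipo.torch_support  # registers torch tensors with tomosipo

vg = ts.volume(shape=(1, 300, 300))
pg = ts.parallel(angles=360, shape=(1, 300))
A = ts.operator(vg, pg)

y = A(torch.ones(A.domain_shape, device="cuda"))  # stand-in sinogram

# SIRT row/column weightings, clamped to avoid division by zero
eps = 1e-6
R = 1 / torch.clamp(A(torch.ones(A.domain_shape, device="cuda")), min=eps)
C = 1 / torch.clamp(A.T(torch.ones(A.range_shape, device="cuda")), min=eps)

x = torch.zeros(A.domain_shape, device="cuda")
for _ in range(100):
    # All element-wise arithmetic stays on the GPU in torch;
    # only A and A.T call into ASTRA.
    x = x + C * A.T(R * (y - A(x)))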
I hope this helped! If you have any further questions or interesting benchmarking results, please let me know!
Kind regards,
Dirk
Thanks! That makes a lot of sense. On reflection, I also think my test wasn't entirely fair, since I didn't make the ODL operator PyTorch-compatible with odl.contrib.torch.OperatorModule, which presumably would have added some overhead of its own.
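For reference, the fairer comparison I have in mind would look something like this (a sketch using the operator_odl from my first post; I haven't timed it):

from odl.contrib.torch import OperatorModule

op_odl_torch = OperatorModule(operator_odl)  # wraps the ODL operator as an nn.Module
x = torch.ones((1, 1, 300, 300))             # extra leading dims are treated as batch dims
y = op_odl_torch(x)                          # differentiable, like to_autograd

I'll let you know if I have any other thoughts in the meantime!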