Inference Time
Opened this issue · 16 comments
Hi, I've run the evaluation script and measured the inference time for 1000 samples for 1 target pose (num_samples set to 1000, num_references to 1). The time is 1.46 s on my hardware (Intel i7 CPU and T1000 GPU). Even after switching to a server with an RTX 8000 GPU, the time is still 1.1 s, which seems much higher than the figure in the paper (85 ms for 1000 samples). Is there some trick to reduce the time?
Sorry, it's been a while, so I will have to look into it and rerun on my hardware just to confirm.
Off the top of my head, that time is really large. Can you make sure you are counting the time of inference for one solution and not the total time of multiple samples during trajectory planning?
> Sorry, it's been a while, so I will have to look into it and rerun on my hardware just to confirm.
> Off the top of my head, that time is really large. Can you make sure you are counting the time of inference for one solution and not the total time of multiple samples during trajectory planning?
Even though I set num_samples and num_references to 1 in evaluation_panda_urdf.py, the time for nodeik.inverse_kinematics(pose_sets) is still 0.7 s.
How are you timing this?
If you are getting 1.1 s on 1000 samples and 0.7 s on 1 sample, it sounds like you are including the time it takes to initialize and/or transfer to the GPU, not just the inference time.
Maybe @psh117 can provide more context; otherwise I'll take a look this week (I have different hardware, OS, etc. than when we did this paper).
Also, did you use the fast inference setting on nodeik/examples/evaluation_panda_urdf.py, line 29 in f510ff7?
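For context, if that setting refers to the ODE solver tolerances (atol/rtol) discussed later in this thread, here is a minimal, hypothetical sketch of the trade-off in torchdiffeq; the dynamics function and shapes below are made up for illustration and are not the model's actual flow:

```python
import torch
from torchdiffeq import odeint

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Stand-in vector field; the real model integrates a learned flow instead.
def dynamics(t, z):
    return -z

z0 = torch.randn(1000, 7, device=device)      # batch of 1000 joint-space states
t_span = torch.tensor([0.0, 1.0], device=device)

# Tight tolerances: more solver steps, more accurate, slower.
z_accurate = odeint(dynamics, z0, t_span, atol=1e-5, rtol=1e-5)

# Loose tolerances ("fast inference"): fewer steps, faster, slightly less accurate.
z_fast = odeint(dynamics, z0, t_span, atol=1e-3, rtol=1e-3)
```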
> How are you timing this?
> If you are getting 1.1 s on 1000 samples and 0.7 s on 1 sample, it sounds like you are including the time it takes to initialize and/or transfer to the GPU, not just the inference time.
> Maybe @psh117 can provide more context; otherwise I'll take a look this week (I have different hardware, OS, etc. than when we did this paper).
I just put time.time() before and after line 59 in model_wrapper.py: ik_q, delta_logp = self.model(x, c, zero, rev=True), so the time counted should be the model inference time. Also, atol is set to 1e-3, but the time cost is only slightly reduced.
Sorry for the difficulty in these timings. We will need Suhan to provide a definitive answer on how he benchmarked the GPU versions (nodeik and ikflow).
I think it might have been using wandb. Generally there are specific ways to benchmark, e.g.:
https://pytorch.org/tutorials/recipes/recipes/benchmark.html
https://deci.ai/blog/measure-inference-time-deep-neural-networks/
Thanks for raising this issue though. Once I get more clarity, I'll add this info to the readme.
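Along those lines, here is a minimal sketch of timing the IK call with torch.utils.benchmark, which handles CUDA synchronization and warm-up for you; nodeik and pose_sets are assumed to be set up as in the evaluation script:

```python
import torch.utils.benchmark as benchmark

# Timer synchronizes CUDA around each run and does warm-up, so the
# measurement should exclude one-time initialization and transfer costs.
timer = benchmark.Timer(
    stmt="nodeik.inverse_kinematics(pose_sets)",
    globals={"nodeik": nodeik, "pose_sets": pose_sets},
)
print(timer.timeit(10))  # reports the time per run over 10 runs
```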
Have you executed the model more than twice within a script? I believe the timing includes the model initialization time, as mentioned by @cadop. Performing a dummy inference after creating the model could improve the measured inference time.
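For example, a minimal sketch of that warm-up pattern, assuming the same nodeik / pose_sets setup as in the evaluation script:

```python
import time
import torch

# Dummy inference: triggers CUDA context creation, kernel loading, and other
# lazy initialization so they are not counted in the timed run below.
_ = nodeik.inverse_kinematics(pose_sets)
torch.cuda.synchronize()

t0 = time.time()
ik_q, _ = nodeik.inverse_kinematics(pose_sets)
torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
t1 = time.time()
print(f'inference time: {(t1 - t0) * 1000:.1f} ms')
```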
> Have you executed the model more than twice within a script? I believe the timing includes the model initialization time, as mentioned by @cadop. Performing a dummy inference after creating the model could improve the measured inference time.
I've tried running it 10 times within a script and the time is reduced to 0.5 s for 1000 samples. However, that is still larger than in the paper.
The inference time may vary, but 0.5 seconds seems excessively slow to me. Could you please confirm the version of torchdiffeq you are using? I conducted my testing with torchdiffeq==0.2.3.
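One quick way to check the installed version, for reference:

```python
from importlib.metadata import version  # Python 3.8+
print(version("torchdiffeq"))  # e.g. 0.2.3
```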
I made some modifications to the evaluation script. Can you do the test using these changes?
nodeik.eval()

# First call: includes one-time warm-up / initialization cost.
t_start = time.time()
ik_q, _ = nodeik.inverse_kinematics(pose_sets)
t_end = time.time()
print('time:', (t_end - t_start)*1000, 'ms')

# Second call: steady-state inference time.
t_start = time.time()
ik_q, _ = nodeik.inverse_kinematics(pose_sets)
t_end = time.time()
print('time:', (t_end - t_start)*1000, 'ms')

fk_sets = nodeik.forward_kinematics(ik_q)
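(In this modified script, the first timed call includes the one-time initialization and warm-up cost discussed above, while the second call reflects the steady-state inference time to compare against the paper.)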
And here is the result on my system (RTX 4080), with atol=1e-3 and rtol=1e-3:
Warp initialized:
Version: 0.2.0
Using CUDA device: NVIDIA GeForce RTX 4080
Using CPU compiler: /usr/bin/g++
0 panda_joint1
1 panda_joint2
2 panda_joint3
3 panda_joint4
4 panda_joint5
5 panda_joint6
6 panda_joint7
link_index {'panda_link0': 0, 'panda_link0_sc': 1, 'panda_link1_sc': 2, 'panda_link1': 3, 'panda_link2_sc': 4, 'panda_link2': 5, 'panda_link3_sc': 6, 'panda_link3': 7, 'panda_link4_sc': 8, 'panda_link4': 9, 'panda_link5_sc': 10, 'panda_link5': 11, 'panda_link6_sc': 12, 'panda_link6': 13, 'panda_link7_sc': 14, 'panda_link7': 15, 'panda_link8': 16, 'panda_hand': 17}
Lightning automatically upgraded your loaded checkpoint from v1.6.0 to v2.0.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../model/panda_loss-20.ckpt`
Module warp.sim.articulation load took 1.23 ms
torch.Size([1024, 7]) torch.Size([1024, 7]) torch.Size([1024, 1])
time: 615.9012317657471 ms
time: 76.89261436462402 ms
mean position error: 0.009137176
mean orientation error: 0.016926718011890746
My torchdiffeq is 0.2.3 and I tried the same as yours. Here is the log:
Warp 0.7.2 initialized:
CUDA Toolkit: 11.5, Driver: 12.0
Devices:
"cpu" | x86_64
"cuda:0" | Quadro T1000 (sm_75)
Kernel cache: /home/.cache/warp/0.7.2
0 panda_joint1
1 panda_joint2
2 panda_joint3
3 panda_joint4
4 panda_joint5
5 panda_joint6
6 panda_joint7
link_index {'panda_link0': 0, 'panda_link0_sc': 1, 'panda_link1_sc': 2, 'panda_link1': 3, 'panda_link2_sc': 4, 'panda_link2': 5, 'panda_link3_sc': 6, 'panda_link3': 7, 'panda_link4_sc': 8, 'panda_link4': 9, 'panda_link5_sc': 10, 'panda_link5': 11, 'panda_link6_sc': 12, 'panda_link6': 13, 'panda_link7_sc': 14, 'panda_link7': 15, 'panda_link8': 16, 'panda_hand': 17}
Lightning automatically upgraded your loaded checkpoint from v1.6.0 to v1.9.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file model/panda_loss-20.ckpt`
Module warp.sim.articulation load on device 'cpu' took 33.29 ms
torch.Size([1024, 7]) torch.Size([1024, 7]) torch.Size([1024, 1])
time: 1119.408369064331 ms
time: 581.2697410583496 ms
mean position error: 0.0065971585
mean orientation error: 0.011576415242703343
Here is the log on the server:
Warp 0.7.2 initialized:
CUDA Toolkit: 11.5, Driver: 12.0
Devices:
"cpu" | x86_64
"cuda:0" | Quadro RTX 8000 (sm_75)
Kernel cache: /home/junfeng/.cache/warp/0.7.2
0 panda_joint1
1 panda_joint2
2 panda_joint3
3 panda_joint4
4 panda_joint5
5 panda_joint6
6 panda_joint7
link_index {'panda_link0': 0, 'panda_link0_sc': 1, 'panda_link1_sc': 2, 'panda_link1': 3, 'panda_link2_sc': 4, 'panda_link2': 5, 'panda_link3_sc': 6, 'panda_link3': 7, 'panda_link4_sc': 8, 'panda_link4': 9, 'panda_link5_sc': 10, 'panda_link5': 11, 'panda_link6_sc': 12, 'panda_link6': 13, 'panda_link7_sc': 14, 'panda_link7': 15, 'panda_link8': 16, 'panda_hand': 17}
Lightning automatically upgraded your loaded checkpoint from v1.6.0 to v1.9.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file model/panda_loss-20.ckpt`
Module warp.sim.articulation load on device 'cpu' took 31.56 ms
torch.Size([1024, 7]) torch.Size([1024, 7]) torch.Size([1024, 1])
time: 654.2432308197021 ms
time: 169.31962966918945 ms
mean position error: 0.006597163
mean orientation error: 0.011576218433427071
Also, I noticed that the time cost is still around 100 ms with 100 samples. It seems not to scale linearly?
> Also, I noticed that the time cost is still around 100 ms with 100 samples.

This is due to batch inference, so the inference time for fewer than 100 samples would be relatively consistent.
As for your results, I think something is off. Please check whether there are sufficient GPU resources available, as other programs might be affecting the performance.
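To see this batching effect directly, one could time the warmed-up solver across a few batch sizes; a rough sketch, assuming pose_sets can be sliced into smaller batches:

```python
import time
import torch

# After a warm-up call, time inference for several batch sizes. Until the GPU
# is saturated, the per-call time should stay roughly constant.
_ = nodeik.inverse_kinematics(pose_sets)
for n in (10, 100, 1000):
    batch = pose_sets[:n]
    torch.cuda.synchronize()
    t0 = time.time()
    ik_q, _ = nodeik.inverse_kinematics(batch)
    torch.cuda.synchronize()
    print(f'{n:5d} samples: {(time.time() - t0) * 1000:.1f} ms')
```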
> This is due to batch inference, so the inference time for fewer than 100 samples would be relatively consistent.

Could you please tell me the time cost for 100 samples in your environment?
> Could you please tell me the time cost for 100 samples in your environment?

In my environment, it took about 68 ms with 100 samples:
time: 599.5767116546631 ms (first inference)
time: 67.74044036865234 ms (second inference)