Could you release a small/tiny/nano version of the detector and descriptor?
zhongqiu1245 opened this issue · 12 comments
Hello, thank you for your amazing work!
I'm really interested in your work and want to deploy DeDoDe on mobile devices (laptop, even CPU) for some self-driving work.
But I find DeDoDeDescriptorB and DeDoDeDetectorL too heavy to run on a mobile device.
On my computer (RTX 4060 Mobile, 8 GB), I get only 5.4 fps with 640*480 inputs (tensorrt_fp16).
Could you release a small/tiny/nano version of the detector and descriptor?
Thank you in advance!
Sure. The easiest approach, I guess, would be to use VGG11 and reduce the number of layers further. It should be doable, but I'm not sure how much performance will degrade.
About 30 fps on the RTX 4060 Mobile (8 GB) is my target.
@zhongqiu1245 could you try out the small detector in the branch that references this issue?
Weights can be found here: https://github.com/Parskatt/DeDoDe/releases/tag/v2
It uses a VGG11 backbone and I reduced the number of layers at each scale from 8 -> 4 and cut the dimensionality in half. I think it should be about 3-4X faster than the _L detector. Could you verify?
Depending on your application, it might also be possible to increase the frame rate by batching. Is this an option for you?
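As a rough back-of-envelope check on the speedup to expect from the layer and width reduction described above: the cost of a dense convolution scales with the product of its input and output channels, so halving the width cuts the per-layer cost by about 4x, and halving the number of layers cuts it by another 2x. The sketch below assumes dense 3x3 convolutions and uses hypothetical feature-map and channel sizes (not the actual DeDoDe configuration). It covers only the reduced layers; the backbone and memory-bound operations are not modeled, which is one reason the end-to-end gain can be smaller than this estimate.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates for one dense kernel x kernel convolution
    (stride 1, output has the same spatial size as the input)."""
    return height * width * kernel * kernel * c_in * c_out


def stack_macs(height, width, channels, num_layers):
    """Cost of num_layers convolutions that all keep `channels` channels."""
    return num_layers * conv_macs(height, width, channels, channels)


# Hypothetical numbers for illustration only, not the real DeDoDe sizes.
H, W = 120, 160        # assumed feature-map size at one scale
BASE_CHANNELS = 256    # assumed width before the reduction

large = stack_macs(H, W, BASE_CHANNELS, num_layers=8)
small = stack_macs(H, W, BASE_CHANNELS // 2, num_layers=4)

reduction = large / small
print(f"theoretical reduction for this stack: {reduction:.1f}x")
```

The ratio does not depend on the assumed sizes: half the layers times a quarter of the per-layer cost gives 8x for the reduced stack alone.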
@Parskatt
Sorry for the late reply.
I will verify this.
Thank you!
@Parskatt
Thank you for your DetectorS!
The fps increased a lot, but it is still lower than 30 fps (15.9 fps, DetectorS + DescriptorB, 640*480, tensorrt fp16).
So I reduced the image size to 320*240, which gives 25 fps. Almost there.
Could you release a small version of the descriptor, like DescriptorS?
Maybe this can help DeDoDe break through the 30 fps barrier.
Thank you!
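The two measurements above hint at how much of the frame time actually depends on resolution. Here is a minimal sketch, assuming the frame time is a fixed overhead plus a cost proportional to the pixel count. That is a simplification, since GPU utilization at small inputs does not scale linearly. It uses only the two reported fps values:

```python
def fit_fixed_plus_per_pixel(m1, m2):
    """Fit t = fixed + per_pixel * pixels through two (pixels, ms) points."""
    (p1, t1), (p2, t2) = m1, m2
    per_pixel = (t1 - t2) / (p1 - p2)
    fixed = t1 - per_pixel * p1
    return fixed, per_pixel


# (pixel count, frame time in ms), derived from the reported fps values.
full = (640 * 480, 1000.0 / 15.9)
half = (320 * 240, 1000.0 / 25.0)

fixed_ms, per_pixel_ms = fit_fixed_plus_per_pixel(full, half)
pixel_ratio = full[0] / half[0]
speedup = full[1] / half[1]

print(f"pixel ratio: {pixel_ratio:.1f}x, observed speedup: {speedup:.2f}x")
print(f"estimated resolution-independent time: {fixed_ms:.1f} ms per frame")
```

Under this model, a large share of the frame time would not shrink with resolution, which is consistent with 4x fewer pixels giving well under 2x the frame rate.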
Sure, then I think we can also reduce the descriptor size. Does 128 sound better? Is descriptor dimensionality a concern?
Thank you for your reply!
128 sounds better.
Yes, the dimension is an important factor in the network's inference time: the smaller the dimension, the faster the inference. However, if the dimension is too small, performance will suffer. I considered dim=64 before, but thought it might be too small. 128 may be better :)
Thank you for your generosity!
some details:
resolution: (480, 640)
preprocess: 19.606828689575195ms
detectorS: 16.09945297241211ms
descriptorB: 29.36267852783203ms
dualsoftmaxmatcher: 0.6873607635498047ms
postprocess: 0.14138221740722656ms
total: 65.89770317077637ms fps: 15.207663468720314
detectorS & descriptorB are trt_fp16
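To see where the time goes, the reported per-stage timings can be turned into shares of the frame time. This is a small sketch using only the numbers posted above; the "networks only" figure is a hypothetical scenario in which everything except the two networks is moved off the critical path.

```python
# Per-stage timings in milliseconds, copied from the report above.
timings_ms = {
    "preprocess": 19.606828689575195,
    "detectorS": 16.09945297241211,
    "descriptorB": 29.36267852783203,
    "dualsoftmaxmatcher": 0.6873607635498047,
    "postprocess": 0.14138221740722656,
}

total_ms = sum(timings_ms.values())
fps = 1000.0 / total_ms

# Print stages from slowest to fastest with their share of the total.
for stage, ms in sorted(timings_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>20}: {ms:7.2f} ms ({100.0 * ms / total_ms:5.1f}%)")
print(f"{'total':>20}: {total_ms:7.2f} ms -> {fps:.2f} fps")

# Hypothetical: frame rate if only the two networks were on the critical path.
network_ms = timings_ms["detectorS"] + timings_ms["descriptorB"]
network_fps = 1000.0 / network_ms
print(f"networks only: {network_ms:.2f} ms -> {network_fps:.2f} fps")
```

Preprocessing accounts for roughly 30% of the frame time here, so it may be worth optimizing alongside the networks. Note that 1000 / total gives a value slightly below the reported 15.21 fps, which appears to correspond to the total without the postprocess step.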
Okay, so it seems like around 20 fps is at least possible with the current sizes.
Are you able to extract the times for the encoder/decoder parts of the network? Depending on which part takes the most time, we might need to change the encoder architecture.
The final thing I guess would be to distill both networks into a single network.
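One way to get the encoder/decoder split asked for above is to time each stage separately. Below is a minimal sketch in plain Python. `StageTimer`, `encoder`, and `decoder` are hypothetical stand-ins for illustration, not DeDoDe's actual modules. On a GPU, each measurement would also need a device synchronization (for example `torch.cuda.synchronize()` in PyTorch) before reading the clock, since kernel launches are asynchronous.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals_ms = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # On a GPU, synchronize the device here before stopping the clock.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0
            self.counts[name] += 1

    def mean_ms(self, name):
        return self.totals_ms[name] / self.counts[name]


# Hypothetical stand-ins for the two halves of a network.
def encoder(x):
    return [v * 2 for v in x]


def decoder(features):
    return sum(features)


timer = StageTimer()
frame = list(range(1000))
for _ in range(20):
    with timer.stage("encoder"):
        feats = encoder(frame)
    with timer.stage("decoder"):
        out = decoder(feats)

for name in ("encoder", "decoder"):
    print(f"{name}: {timer.mean_ms(name):.4f} ms per call")
```

Averaging over several iterations, and discarding the first few as warm-up on real hardware, tends to give more stable numbers than a single measurement.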
OK, I will try it later.