ANTsX/ANTsPy

Speed difference between binaries and python wrappers

Closed this issue ยท 14 comments

Describe the bug
There is a massive difference of computation time between Terminal CLI and the Python wrappers for N4BiasFieldCorrection and DenoiseImage

From the terminal :
N4BiasFieldCorrection : 22s
DenoiseImage : 76s

From Python:
N4BiasFieldCorrection : 33s
DenoiseImage : 602s

To reproduce
I use a classic 3DT2 in 0.8mm iso:

************************************************
Image name:          "in.nii"
************************************************
  Dimensions:        256 x 320 x 320
  Voxel size:        0.8 x 0.8 x 0.8
  Data strides:      [ 1 2 3 ]
  Format:            NIfTI-1.1
  Data type:         signed 16 bit integer (little endian)
  Intensity scaling: offset = 0, multiplier = 1
  Transform:                    1           0           0      -106.6
                                0           1           0       -93.8
                                0           0           1        -136
  comments:          TE=4.2e+02;Time=164035.610;phase=1

From the terminal :

cd /tmp/test_antspy
N4BiasFieldCorrection -i in.nii -o n4_in.nii -v 1
DenoiseImage -i n4_in.nii -o dn_n4_in.nii -v 1 -n Rician

From Python :

import ants

img = ants.image_read('/tmp/test_antspy/in.nii')
img = ants.n4_bias_field_correction(img, verbose=True)
img = ants.denoise_image(img, v=1)
ants.image_write(img, '/tmp/test_antspy/out.nii')

Expected behavior
Since I built from source ANTs and ANTsPy, I would expect roughly the same computation.
x1.5 N4BiasFieldCorrection is unexpected but ok, however x8 for DenoiseImage is very weird.

ANTsPy installation (please complete the following information):

  • Hardware [ PC ] i9-11900K is 8 cores (x2 threads) CPU
  • OS: [ Linux 4.15.0-20-generic x86_64 // Linux Mint 19 ]
  • System details [ None ]
  • Sub-system: [ Built from source // commit 751fec7 ]
  • ANTsPy version: [ 0.5.4 ]
  • Installation type: [ git clone + built from source with python -m pip install . ]

Additional context
When running both tests, I can see in htop that all 16 CPUs are running at 100%, with both Terminal CLI and Python wrappers. So it's not an obvious multi-threading problem.

For DenoiseImage, it's a difference in defaults, the search radius is 2 in the CLI program and 3 in ANTsPy. If I call ants.denoise_image(img, r=2, v=1), the difference in performance is similar to that for N4.

@ntustison @stnava shall we harmonize defaults, which one to adopt? I'll go with faster (2) unless you have a preference to make antspy the standard.

Yeah, I'm all for harmonizing the defaults and I'd go with what's in the original ANTs DenoiseImage.

For denoiseimage in particular, I would hope you could normalize with the OG implementation minc_anlm from minc-toolkit-v2

Is there a usage with defaults you could paste here?

DenoiseImage was ported from Jose's original Matlab code. Minc code was not referenced at all.

Well minc_anlm was written by/with Jose, given the citation has L Collins of the MNI as one of the senior authors ๐Ÿ‘๐Ÿป

$ minc_anlm
This program implements adaptative non-local denoising algorithm published in
Jose V. Manjon, Pierrick Coupe, Luis Marti-Bonmati, D. Louis Collins, Montserrat Robles "Adaptive non-local means denoising of MR images with spatially varying noise levels" Journal of Magnetic Resonance Imaging Volume 31, Issue 1, pages 192โ€“203, January 2010
 DOI: 10.1002/jmri.22003

I profiled n4_bias_correction with the line_profiler / kernprof, the library function execution accounts for 99.4% of the execution time. So there's not a lot of work happening at the wrapper level.

Yes, I realize that. But you were referring to a specific implementation in the context of defaults and that's why I clarified that it was Jose's original Matlab code.

Just a bit more historical context---I happened to be invited by an MNI-adjacent friend for a get-together during MICCAI 2013 in Nagoya, Japan. Fortunately, I sat right across the table from Jose and, after discussing common interests (such our enjoyment of Luis Miguel), he realized I was "one of the ANTs guys" and he asked me if I would like to put his denoising algorithm in ANTs. I said sure and he pointed me to his Matlab code which I eventually ported to ITK-style. After the FreeSurfer folk began using the implementation in their pipeline a couple years ago, I asked Jose about the possibility of making it an ITK module and he was all for it.

I think there might also be optimization differences in the ITK / ANTs compilation.

It seems CMAKE_BUILD_TYPE=Release is only set on Mac?

if [[ `uname` -eq Darwin ]] ; then
CMAKE_BUILD_TYPE=Release
fi
if [[ $TRAVIS -eq true ]] ; then
CMAKE_BUILD_TYPE=Release
JTHREADS=2
fi

Also the ITK compilation has -O2 where an ANTs superbuild has -O3

-DCMAKE_C_FLAGS="${CMAKE_C_FLAGS} -Wno-c++11-long-long -fPIC -O2 -DNDEBUG "\
-DCMAKE_CXX_FLAGS="${CMAKE_CXX_FLAGS} -Wno-c++11-long-long -fPIC -O2 -DNDEBUG "\

For DenoiseImage, it's a difference in defaults, the search radius is 2 in the CLI program and 3 in ANTsPy. If I call ants.denoise_image(img, r=2, v=1), the difference in performance is similar to that for N4.

Correct, I did not notice the difference with r default parameter.

Here is what I have now :

DenoiseImage r=2 r=3
CLI 76s 179s
Python 248s 602s

Thanks for testing, @benoitberanger

Would you mind trying out #705 ? If you have the Github CLI, you can do

gh pr checkout 705

It appears to close the gap on my Mac.

Using 550400a in #705 here is the new computation times :

DenoiseImage r=2
CLI 76s
Python 73s

So it went from 248s to 73s !

Wow! Thanks for reporting this