smarco/WFA-paper

SIMD implementation of WFA

yangao07 opened this issue · 5 comments

Hi @smarco,

First of all, WFA is so amazing!

I notice that in the bioinformatics paper, you mentioned SIMD implementation should be easy for WFA.
But I did not see any SIMD instructions in the source code, did I miss anything?

Thanks!

Yan

Hi,

Thanks! Glad you like it. I believe it has a lot of potential.

> you mentioned SIMD implementation should be easy for WFA.
> But I did not see any SIMD instructions in the source code, did I miss anything?

Yes, indeed. But there is no need to use intrinsics or other explicit vectorization methods. Modern compilers (like GCC or Clang) can auto-vectorize the critical kernel functions. You can enable the vectorization report in the Makefile and check the compiler's output:

gcc -Wall -g -fPIC -O3 -march=native -fopt-info-vec-optimized -I.. -c affine_wavefront_align.c -o ../build/affine_wavefront_align.o
affine_wavefront_align.c:166:3: note: loop vectorized
affine_wavefront_align.c:216:3: note: loop vectorized
affine_wavefront_align.c:216:3: note: loop versioned for vectorization because of possible aliasing
affine_wavefront_align.c:216:3: note: loop vectorized
affine_wavefront_align.c:245:3: note: loop vectorized
affine_wavefront_align.c:271:3: note: loop vectorized

I hope this helps. Let me know.
Cheers,

Thanks! Now I get it. I actually hadn't noticed the -march=native flag before 😓 .

Then I tried different compile flags (AVX2/SSE4.1/SSE2) and also no SIMD flags at all. However, align_benchmark gives me similar run times.
I expected the difference to be larger. Do you have any thoughts on this?

The benchmark dataset was generated using ./bin/generate_dataset -n 5000000 -l 500 -e 0.05 -o sample.dataset.seq.
And the benchmark program was run as ./bin/align_benchmark -i sample.dataset.seq -a gap-affine-wfa.
All I modified was the CC_XFLAGS in ./gap_affine/Makefile.

...
gcc -Wall -g -fPIC -O3  -I.. -c affine_wavefront_align.c -o ../build/affine_wavefront_align.o
...
==> NOsse.log <==
...processed 5000000 reads (benchmark=42467.246 reads/s;alignment=53163.305 reads/s)
[Benchmark]
=> Total.reads            5000000
=> Time.Benchmark         1.96 m  (    1   call, 117.74  s/call {min117.74s,Max117.74s})
  => Time.Alignment       1.57 m  ( 79.88 %) (    5 Mcalls,  18.81 us/call {min6.46us,Max183.62us})

...
gcc -Wall -g -fPIC -O3 -msse2  -I.. -c affine_wavefront_align.c -o ../build/affine_wavefront_align.o
...
==> sse2.log <==
...processed 5000000 reads (benchmark=41654.164 reads/s;alignment=52148.012 reads/s)
[Benchmark]
=> Total.reads            5000000
=> Time.Benchmark         2.00 m  (    1   call, 120.04  s/call {min120.04s,Max120.04s})
  => Time.Alignment       1.60 m  ( 79.88 %) (    5 Mcalls,  19.18 us/call {min6.51us,Max137.83us})

...
gcc -Wall -g -fPIC -O3 -msse4.1  -I.. -c affine_wavefront_align.c -o ../build/affine_wavefront_align.o
...
==> sse41.log <==
...processed 5000000 reads (benchmark=39762.520 reads/s;alignment=49046.926 reads/s)
[Benchmark]
=> Total.reads            5000000
=> Time.Benchmark         2.10 m  (    1   call, 125.75  s/call {min125.75s,Max125.75s})
  => Time.Alignment       1.70 m  ( 81.07 %) (    5 Mcalls,  20.39 us/call {min6.58us,Max246.53us})

...
gcc -Wall -g -fPIC -O3 -mavx2  -I.. -c affine_wavefront_align.c -o ../build/affine_wavefront_align.o
...
==> avx2.log <==
...processed 5000000 reads (benchmark=41662.008 reads/s;alignment=53654.875 reads/s)
[Benchmark]
=> Total.reads            5000000
=> Time.Benchmark         2.00 m  (    1   call, 120.01  s/call {min120.01s,Max120.01s})
  => Time.Alignment       1.55 m  ( 77.65 %) (    5 Mcalls,  18.64 us/call {min6.50us,Max156.74us})

Well, I might have some thoughts. Take them with a pinch of salt because (1) there are many architectural tradeoffs at this fine level of optimization and (2) I don't know which processor/model you are using. Have a look at this example (WFA-kernel).

First, note that on a Skylake-512 architecture (for example), all you care about is the vector width you want the compiler to use. In that case, you may want to evaluate the performance using 128-, 256-, and 512-bit instructions.

Then, it all boils down to having enough computational workload to make the most of these SIMD instructions. In your example, sequences of 500nt with error rate e=5% will carry ~25 nominal errors/differences. In turn, this means that the WFA wavefronts will not be very large (it strongly depends on the score penalties used). Hence, the vectorial part will not be dominant and will have little effect on the overall running time.

Lastly, take into account that, depending on the processor/architecture/model, using wider vectors is not always better. It is well known that using 512-bit SIMD instructions on some Intel processors can be counterproductive. Also, using wider vectors puts more pressure on the memory hierarchy, and memory-latency penalties start to appear.

Just some thoughts...

Thanks, Santiago!
I think the small difference is largely due to the dataset itself.
Compiler auto-vectorization is way easier to use than hand-written SIMD instructions; I wasn't aware of this. 😢

Also (maybe this deserves a new issue): have you thought about using a non-zero match score/penalty in WFA?
I feel like this is doable in theory, and I have been trying to add some code to yours to make it work.

Let me know if you are interested.

Cheers,
Yan

Good!

Feel free to open a new discussion thread on a non-zero match penalty. Indeed, the idea crossed my mind, and I also believe it is perfectly doable. The algorithm would just generate wavefronts consecutively without "extending" them. That said, I have always had the feeling that this would kill the main advantage of the WFA, which quickly progresses through exact-matching regions. But again, it is perfectly viable and may deliver good performance for low-scoring alignments.

In any case, it is an interesting topic.
Cheers,