pq-crystals/kyber

Valgrind gives tons of errors

thomwiggers opened this issue · 3 comments

I'm running Valgrind on the latest version of the codebase (AVX2 version), and getting tons of errors like the following:

==22435== Conditional jump or move depends on uninitialised value(s)
==22435==    at 0x109BA0: pqcrystals_kyber512_avx2_gen_matrix (indcpa.c:253)
==22435==    by 0x109EC3: pqcrystals_kyber512_avx2_indcpa_enc (indcpa.c:581)
==22435==    by 0x10930E: pqcrystals_kyber512_avx2_enc (kem.c:69)
==22435==    by 0x1D9602: test_keys (test_kyber.c:23)
==22435==    by 0x1D98A1: main (test_kyber.c:98)
==22435==  Uninitialised value was created by a stack allocation
==22435==    at 0x10976C: pqcrystals_kyber512_avx2_gen_matrix (indcpa.c:210)
...
==23591== Use of uninitialised value of size 8
==23591==    at 0x110093: _mm_storeu_si128 (emmintrin.h:727)
==23591==    by 0x110093: pqcrystals_kyber512_avx2_rej_uniform_avx (rejsample.c:351)
==23591==    by 0x109A3A: pqcrystals_kyber512_avx2_gen_matrix (indcpa.c:251)
==23591==    by 0x109CC0: pqcrystals_kyber512_avx2_indcpa_keypair (indcpa.c:497)
==23591==    by 0x1091D9: pqcrystals_kyber512_avx2_keypair (kem.c:26)
==23591==    by 0x1D95E5: test_keys (test_kyber.c:20)
==23591==    by 0x1D98A1: main (test_kyber.c:98)
==23591== 

All errors go away by exchanging _mm256_xor_si256(s,s) by _mm256_setzero_si256() in fips202x4.c, see commit 8440574. I'm surprised valgrind doesn't understand this standard pattern.

Interesting. That should be reported upstream.

Hrm, I'm having a ton of trouble reproducing this in smaller examples. I wonder if there's some undefined behavior showing up again, as, of course, it's still technically reading an uninitialized value.

e.g. https://godbolt.org/z/463sPc reduces all of these to vpxor xmm0, xmm0 xmm0

The disassembly of keccak_absorb (objdump) shows:

...
    aca6:       48 c7 44 24 40 00 00    movq   $0x0,0x40(%rsp)
    acad:       00 00 
    acaf:       eb 76                   jmp    ad27 <keccakx4_absorb+0xbd>
    acb1:       48 8b 44 24 40          mov    0x40(%rsp),%rax
    acb6:       48 c1 e0 05             shl    $0x5,%rax
    acba:       48 89 c2                mov    %rax,%rdx
    acbd:       48 8b 44 24 38          mov    0x38(%rsp),%rax
    acc2:       48 01 d0                add    %rdx,%rax
    acc5:       c5 fd 6f 00             vmovdqa (%rax),%ymm0
    acc9:       48 8b 44 24 40          mov    0x40(%rsp),%rax
    acce:       48 c1 e0 05             shl    $0x5,%rax
    acd2:       48 89 c2                mov    %rax,%rdx
    acd5:       48 8b 44 24 38          mov    0x38(%rsp),%rax
    acda:       48 01 d0                add    %rdx,%rax
    acdd:       c5 fd 6f 08             vmovdqa (%rax),%ymm1
    ace1:       48 8b 44 24 40          mov    0x40(%rsp),%rax
    ace6:       48 c1 e0 05             shl    $0x5,%rax
    acea:       48 89 c2                mov    %rax,%rdx
    aced:       48 8b 44 24 38          mov    0x38(%rsp),%rax
    acf2:       48 01 d0                add    %rdx,%rax
    acf5:       c5 fd 7f 8c 24 e0 00    vmovdqa %ymm1,0xe0(%rsp)
    acfc:       00 00 
    acfe:       c5 fd 7f 84 24 00 01    vmovdqa %ymm0,0x100(%rsp)
    ad05:       00 00 
    ad07:       c5 fd 6f 8c 24 e0 00    vmovdqa 0xe0(%rsp),%ymm1
    ad0e:       00 00 
    ad10:       c5 fd 6f 84 24 00 01    vmovdqa 0x100(%rsp),%ymm0
    ad17:       00 00 
    ad19:       c5 f5 ef c0             vpxor  %ymm0,%ymm1,%ymm0
    ad1d:       90                      nop
    ad1e:       c5 fd 7f 00             vmovdqa %ymm0,(%rax)
    ad22:       48 ff 44 24 40          incq   0x40(%rsp)
    ad27:       48 83 7c 24 40 18       cmpq   $0x18,0x40(%rsp)
    ad2d:       76 82                   jbe    acb1 <keccakx4_absorb+0x47>
...

which does look like a whole bunch of loads and stores are happening that are not strictly necessary and maybe coming from werid places. Switching compiler flags from -O0 (included in your fix for #21) to -O2 also patches this problem: the loop gets unrolled to vpxor %xmm0, %xmm0, %xmm0 and a bunch of stores.

Perhaps valgrind is just setting more memory as "uninitialised", and as it compiles the xor using two separate registers (ymm1 and ymm0) it doesn't keep track anymore that the values located there are the same.

The loop looks like this with __m256_setzero_si256():

...
   acb1:       48 8b 44 24 40          mov    0x40(%rsp),%rax
   acb6:       48 c1 e0 05             shl    $0x5,%rax
   acba:       48 89 c2                mov    %rax,%rdx
   acbd:       48 8b 44 24 38          mov    0x38(%rsp),%rax
   acc2:       48 01 d0                add    %rdx,%rax
   acc5:       c5 f9 ef c0             vpxor  %xmm0,%xmm0,%xmm0
   acc9:       c5 fd 7f 00             vmovdqa %ymm0,(%rax)
   accd:       48 ff 44 24 40          incq   0x40(%rsp)
   acd2:       48 83 7c 24 40 18       cmpq   $0x18,0x40(%rsp)
   acd8:       76 d7                   jbe    acb1 <keccakx4_absorb+0x47>
...

Not sure if it's worth reporting this to Valgrind; semantically, setzero is more clear anyway.