OpenNMT/CTranslate2

Can batch translation on CPU result in different output?

robertBrnnn opened this issue · 8 comments

I have a CPU model that produces different outputs for the same strings at different times.

I think it could be related to the bug from #546, where batch translation yielded different results on GPU. I'm currently using CTranslate2 1.20.1, so I'm missing a lot of updates.

Alternatively, I recall that batch translation on GPU can produce slightly different numerical results. Can the same happen with CPU models and batch translation?

Yes, the same string can have different outputs in batch translation on CPU.

I know this can happen with Intel MKL (default backend on Intel CPU) and oneDNN (default on AMD CPU). The numerical result of the dot product attention can be slightly different depending on the number of padding positions in the input.
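To illustrate why tiny numerical differences are possible at all: floating-point addition is not associative, so accumulating the same values in a different order can change the last bits of the result. The sketch below only shows this general effect in plain Python. It is an assumption that something similar (for example a different accumulation order being chosen for a different padded shape) is what happens inside MKL or oneDNN; the thread does not confirm the exact mechanism.

```python
def sum_in_order(values):
    """Accumulate the values left to right, as a simple loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, two different accumulation orders.
forward = sum_in_order([0.1, 0.2, 0.3])
backward = sum_in_order([0.3, 0.2, 0.1])

print(forward == backward)       # → False
print(abs(forward - backward))   # a difference in the last bits only
```

A difference this small is usually harmless, but during decoding it can be enough to flip the choice between two candidate tokens with nearly equal scores.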

If you are running on an Intel CPU, it is possible to work around this issue by enabling strict numerical reproducibility. Try setting this environment variable:

MKL_CBWR=AUTO,STRICT
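A minimal sketch of one way to apply this setting from Python: build a copy of the environment with MKL_CBWR set and start the translation process with it, so the variable is in place before MKL is loaded. The helper name `cbwr_env` and the script name `translate.py` are hypothetical and only used for illustration.

```python
import os


def cbwr_env(mode, base=None):
    """Return a copy of the environment with MKL_CBWR set to `mode`.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_CBWR"] = mode
    return env


# Hypothetical usage: run a translation script under the chosen setting.
# import subprocess, sys
# subprocess.run([sys.executable, "translate.py"],
#                env=cbwr_env("AUTO,STRICT"), check=True)
```

Setting the variable in the shell before launching the process has the same effect; the helper is only convenient when comparing several settings from one driver script.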

I actually use AMD for deployment, which is unfortunate 😔

If I set CT2_USE_MKL=1 on an AMD CPU, will CTranslate2 use MKL?

With CT2_USE_MKL=1 and MKL_CBWR=AUTO,STRICT, I'm guessing results would be reproducible, with the caveat that it will be slower because of how MKL handles AMD CPUs.

If I set CT2_USE_MKL=1 on an AMD CPU, will CTranslate2 use MKL?

Yes.

I requested to add a similar flag in oneDNN but they don't plan to implement it.

I've run a couple of tests with AMD and Intel CPUs, and MKL_CBWR=AUTO,STRICT doesn't seem to work with either. I can get reproducible output from both Intel and AMD using MKL_CBWR=COMPATIBLE. Surprisingly, the AMD CPUs perform much better than the Intel ones with this flag.
Are the ctranslate2 wheels built with MKL 2019 Update 3 or an earlier version? I'm guessing they're built against an earlier MKL version, given that MKL_CBWR=AUTO,STRICT doesn't seem to work and that 2019 Update 3 is when the MKL_CBWR=AUTO,STRICT setting was introduced.

I see the wheels are built using very recent versions now.

According to the Intel document, MKL_CBWR=COMPATIBLE is indeed the only configuration that is supported for non-Intel CPUs:

Only the MKL_CBWR_COMPATIBLE option is supported on non-Intel CPUs.

which would explain why MKL_CBWR=AUTO,STRICT does not work on AMD. However, it should still work as expected on Intel. Can you double-check it was correctly set in your test on Intel?

Are the ctranslate2 wheels built with MKL 2019 Update 3 or an earlier version?

They use recent MKL versions. For example CTranslate2 1.20.1 wheels were already using Intel MKL 2021.2.

This is the output with MKL_VERBOSE=1 set on an Intel CPU. CNR is being set to AUTO,STRICT:

MKL_VERBOSE SAXPBY(10,0x7ff56cb8eeb8,0x7ff56019f700,1,0x7ff56cb8eec0,0x7ff56019f700,1) 350ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ISAMAX(512,0x7ff56814c6c0,1) 276ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:1  NThr:1
MKL_VERBOSE SGEMM(N,N,1,10,10,0x7ff56cb8ed38,0x7ff560005440,1,0x7ff54c013880,10,0x7ff56cb8ed40,0x7ff5600e91c0,1) 9.57us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SAXPBY(10,0x7ff56cb8eeb8,0x7ff5600e91c0,1,0x7ff56cb8eec0,0x7ff5600e91c0,1) 194ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE GEMM_S8U8S32(T,N,C,1536,20,512,0x7ff56d38f4c0,0x562f12bf7d40,512,0x7ff56d38f530,0x7ff5682860c0,512,0x7ff56d38f518,0x7ff56d38f4c8,0x7ff56818fd00,1536,0x562f14807000) 125.56us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SGEMM_BATCH_STRIDED(T,N,2,1,64,0x7ff56d38f658,0x7ff56810f580,64,128,0x7ff5681edf00,64,64,0x7ff56d38f660,0x7ff568147ec0,2,2,160) 11.22us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SGEMM_BATCH_STRIDED(N,N,64,1,2,0x7ff56d38f658,0x7ff56818fd00,64,128,0x7ff5681826c0,2,2,0x7ff56d38f660,0x7ff568147ec0,64,64,160) 12.36us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SAXPBY(5120,0x7ff56cb8ec08,0x7ff5601c0300,1,0x7ff56cb8ec10,0x7ff5601c0300,1) 3.32us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ISAMAX(512,0x7ff568160fc0,1) 328ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:2  NThr:1
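For longer logs, the CNR field can be checked programmatically instead of by eye. The sketch below collects every distinct `CNR:` value found on `MKL_VERBOSE` lines. It assumes the line layout shown in the log above, and the function name `cnr_modes` is hypothetical.

```python
import re

# Matches the "CNR:<value>" field as it appears in the MKL_VERBOSE lines above.
CNR_PATTERN = re.compile(r"\bCNR:(\S+)")


def cnr_modes(log_text):
    """Return the set of CNR values reported on MKL_VERBOSE lines."""
    modes = set()
    for line in log_text.splitlines():
        if not line.startswith("MKL_VERBOSE"):
            continue  # ignore unrelated program output
        match = CNR_PATTERN.search(line)
        if match:
            modes.add(match.group(1))
    return modes


sample = (
    "MKL_VERBOSE ISAMAX(512,0x7ff56814c6c0,1) 276ns "
    "CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:1  NThr:1"
)
print(cnr_modes(sample))  # → {'AUTO,STRICT'}
```

If the returned set contains more than one value, or a value other than the one requested, the environment variable was probably not applied to every call.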

When it's set to COMPATIBLE on Intel, the output is consistent but significantly slower.

In one Intel doc I see:

Intel and Intel compatible CPUs have a few instructions, such as approximation instructions rcpps/rsqrtps, that may return different results. Setting the branch to MKL_CBWR_COMPATIBLE ensures that Intel® oneAPI Math Kernel Library does not use these instructions and forces a single Intel SSE2-only code path to be executed.

which seems to suggest that even STRICT CNR doesn't guarantee consistent results, only COMPATIBLE mode will.

Thanks for the feedback. I have not seen a case where MKL_CBWR=AUTO,STRICT is not sufficient to get the same outputs on the same CPU. Not sure it matters, but are you running a vanilla or relative Transformer?

In any case, guaranteeing consistent results is generally hard. The easiest is to accept that translations can have slight variations, but I understand it is hard to explain that to end users.

Right now I'm not aware of another workaround without a performance penalty but I will keep exploring.

Thanks Guillaume.

Only a very small subset of content experiences this with MKL_CBWR=AUTO,STRICT. We mainly notice it for short numeric strings like currency patterns, but there are some short phrases too.

it is hard to explain that to end users.

Definitely! The most noticeable issue we get is with currency strings: €18.10 could be translated to French as 18,10 € the first time and as 18h10 the next.
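One way to narrow down which strings are affected is to compare the output of a string translated alone with its output when it shares a batch with other inputs. The sketch below assumes a hypothetical `translate_batch` callable that takes a list of inputs and returns one output per input, in order; it is not the CTranslate2 API itself, although a thin wrapper around a real translator could play that role.

```python
def differs_when_batched(translate_batch, sentence, companions):
    """Return True if `sentence` translates differently inside a larger batch.

    `translate_batch` is any callable mapping a list of inputs to a list of
    outputs in the same order (hypothetical interface for this sketch).
    """
    alone = translate_batch([sentence])[0]
    batched = translate_batch([sentence] + list(companions))[0]
    return alone != batched


# Stand-in translator whose output depends on the longest input in the
# batch, mimicking sensitivity to the amount of padding.
def fake_translate_batch(batch):
    longest = max(len(text) for text in batch)
    return [f"{text}|{longest}" for text in batch]


print(differs_when_batched(fake_translate_batch, "€18.10", ["a much longer sentence"]))
```

Running such a check over a sample of production strings with realistic companions would give a rough list of inputs that are sensitive to batch composition.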

Not sure it matters, but are you running a vanilla or relative Transformer?

It's a vanilla Transformer.

In any case, guaranteeing consistent results is generally hard.

Yeah, I completely understand. It's not an easy thing to fix.

We've switched from batch to synchronous translation on CPU without much of a performance impact, if any, so I'm quite happy to stick with synchronous translation. We actually did the same for our GPU deployments previously. Consistent output is more of a priority for us, so we're willing to use synchronous translation over batch if it guarantees results.
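A minimal sketch of the one-at-a-time approach, assuming "synchronous translation" here means each string is sent as its own batch of size one so that no padding is introduced by other inputs. As before, `translate_batch` is a hypothetical callable that maps a list of inputs to a list of outputs; the thread does not show the actual deployment code.

```python
def translate_one_by_one(translate_batch, sentences):
    """Translate each sentence in its own batch of size one.

    Every input is processed without padding from other inputs, at the cost
    of giving up batch-level parallelism.
    """
    return [translate_batch([sentence])[0] for sentence in sentences]


# Stand-in translator that records the batch sizes it receives.
seen_batch_sizes = []


def recording_translate_batch(batch):
    seen_batch_sizes.append(len(batch))
    return [text.upper() for text in batch]


outputs = translate_one_by_one(recording_translate_batch, ["hello", "world"])
print(outputs)           # → ['HELLO', 'WORLD']
print(seen_batch_sizes)  # → [1, 1]
```

Whether this costs throughput depends on the workload; the poster reports little or no impact on CPU in their deployment.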