Deterministic output (Sailfish 0.7.6_Linux-x86-64)

Question

Closed this issue 8 years ago · 2 comments

Hi,

it seems that the sailfish output (TPM and number of reads) is not deterministic when using multithreading (it is when using only one thread). The difference is small (Pearson correlation coefficient > 0.99), but it exists. I guess this comes from the non-deterministic nature of the parallel processing, as one cannot control the exact order in which the k-mers are processed by multiple threads.

Is there any chance this can be fixed? Or has this possibly already been addressed in salmon?

Thanks!

-Martin

Answer 1 · 2016-08-22T13:56:02.000Z

This is an intrinsic feature of the parallel algorithm, and I suspect it cannot be easily fixed. If you really need 100% determinism, the best way is probably to run sailfish with a single thread (though of course this won’t take advantage of sailfish’s very good use of multiple threads).

Carl
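
To illustrate the general effect being discussed (this is a toy sketch, not sailfish's actual code): floating-point addition is not associative, so accumulating the same contributions in a different order, as can happen when threads finish at different times, may give a slightly different total. The function name `accumulate` and the values below are made up for the illustration.

```python
import math

def accumulate(contributions):
    """Add contributions one by one, in the order they arrive."""
    total = 0.0
    for c in contributions:
        total += c
    return total

# Hypothetical contributions, chosen to exaggerate the rounding effect.
contributions = [1e16, 1.0, -1e16, 1.0]

# Arrival order A: the first 1.0 is lost when added to 1e16.
order_a = accumulate(contributions)

# Arrival order B: same values, but the large terms cancel first.
order_b = accumulate([contributions[0], contributions[2],
                      contributions[1], contributions[3]])

print(order_a, order_b)           # the two totals differ
print(math.fsum(contributions))   # exactly rounded sum, independent of order
```

In a real quantification run the discrepancies are far smaller than in this exaggerated case, but they can propagate through an iterative algorithm and show up as small differences in the final estimates.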

Answer 2 · 2016-08-22T15:28:05.000Z

Hi Martin,

In addition to what Carl mentions above, I will note that the precision implied by fully deterministic output is almost certainly beyond the measurement accuracy of the inference algorithm. What I mean by this is that the variability you get from run to run in the multithreaded setting is the result of minor differences in the initialization or auxiliary parameters of the inference algorithm, such that, when run to convergence, you get slightly different output. Variance in the output below this threshold (i.e., below what you get between different runs of the software) represents a level of precision that is not actually achievable via the inference algorithm used in the software (and likely not achievable in theory, at least while maintaining the same level of accuracy, due to the complex nature of the likelihood function). In fact, the Gibbs sampling and bootstrap options that we provide exist specifically to allow one to assess the variance in the point estimates returned by sailfish (or salmon).

--Rob
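
One way to act on this point is to check whether the difference between two multithreaded runs is small relative to the spread of the bootstrap (or Gibbs) estimates for the same transcript. The sketch below is only an illustration: the helper `within_sampling_noise`, the two-standard-deviation cutoff, and all numbers are assumptions, and it does not read any sailfish or salmon output format.

```python
import statistics

def within_sampling_noise(run_a, run_b, bootstrap_samples, n_sd=2.0):
    """Return True if |run_a - run_b| is at most n_sd bootstrap standard deviations.

    n_sd=2.0 is an arbitrary cutoff chosen for this illustration.
    """
    spread = statistics.stdev(bootstrap_samples)
    return abs(run_a - run_b) <= n_sd * spread

# Made-up TPM values for a single transcript, for illustration only.
tpm_run_a = 12.31   # point estimate from one multithreaded run
tpm_run_b = 12.36   # point estimate from a second multithreaded run
bootstraps = [11.8, 12.9, 12.1, 12.6, 11.9, 12.7, 12.4, 12.0]

print(within_sampling_noise(tpm_run_a, tpm_run_b, bootstraps))
```

If the run-to-run difference falls well inside the bootstrap spread, as in this made-up case, the non-determinism is smaller than the uncertainty already present in the point estimate.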