RECETOX/recetox-aplcms

Irreproducible results on another machine

xtrojak opened this issue · 12 comments

It seems that both the hybrid and unsupervised functions are deterministic but machine-specific: the same machine produces consistent results, but the results differ between machines. After some digging I conclude that this has nothing to do with the number of cores used.

Therefore the current tests are failing.

It might be related to the operating system.

@martenson Both @xtrojak and I use Ubuntu 20.04 based systems (Linux Mint 20 / Ubuntu 20.04 WSL) and the tests pass (the expected outputs were created on the Mint 20 machine); on abiff (Debian 10) they fail and the results are different (±5 detected features, so not just numerical or ordering differences).

Any ideas why the OS (I assume) has an influence? They all run in conda.
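
One way to narrow this down could be to compare the numeric/library stack on the two machines; below is a minimal sketch using only base R and utils (the randomForest line at the end is just an example of pinning a package version, not a claim that it is the culprit):

```r
# Environment fingerprint to diff between the Ubuntu and Debian machines.
# Differences in the BLAS/LAPACK build or in package versions are a common
# source of small numeric discrepancies that can push features across
# detection thresholds.
sessionInfo()                    # R version, OS, attached packages,
                                 # and (in recent R) the BLAS/LAPACK paths
extSoftVersion()                 # versions of linked libraries (zlib, PCRE, ICU, ...)
La_version()                     # LAPACK version in use
.Machine$double.eps              # floating-point epsilon, as a sanity check
packageVersion("randomForest")   # example: compare package versions too
```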

We suspected that this is caused by the stochastic behavior of the randomForest function used in the eic.pred.R file. We should try to get rid of the randomness by setting the same seed.

> We suspected that this is caused by the stochastic behavior of the randomForest function used in the eic.pred.R file. We should try to get rid of the randomness by setting the same seed.

This is not the case: the randomness is actually used in the semi.sup.learn function, which is never called by the hybrid and unsupervised functions. Just to be sure, we tried to set the seed using

set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")

but it had no effect on the result.
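
For completeness, the pinned RNG itself can be sanity-checked directly: with kind, normal.kind and sample.kind all fixed, sample() and rnorm() should return identical values on any machine running R >= 3.6, so any remaining difference between machines cannot come from the random number generator. A minimal check:

```r
# With kind, normal.kind and sample.kind fixed, the RNG stream is fully
# specified; the printed values should be identical on every machine.
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")
print(sample(seq_len(1000), 5))
print(rnorm(3))
# If these match across machines (they should), randomness is ruled out as
# the source of the discrepancy.
```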

This line calls a sampling function in the RT alignment stage, which is the stage showing visible differences:

if(length(this.d)>100) this.d<-sample(this.d,100)

> This line calls a sampling function in the RT alignment stage, which is the stage showing visible differences:
>
> if(length(this.d)>100) this.d<-sample(this.d,100)

This is another dead end: the sample function there is used in adjust.time and recover.weaker, but neither of those is used to compute the recovered_feature_sample_table that is compared in the tests (they are used to compute extracted_features and corrected_features).

Also, changing the seed before a test run has no influence on the test result (within a single machine). From this I would conclude that randomness is not the reason why results differ across machines. What do you think @hechth?

EDIT: recover.weaker is actually used to compute recovered_feature_sample_table in unsupervised, but it still holds that setting the seed has no effect.
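
One way to localize where the runs start to diverge would be to hash the intermediate tables on each machine and compare them stage by stage. A sketch, assuming the digest package is available and that extracted_features, corrected_features and recovered_feature_sample_table are the data frames produced by the pipeline:

```r
library(digest)

# Round numeric columns first so that pure floating-point noise does not
# mask the stage where the real (feature-level) divergence appears.
round_numeric <- function(df, digits = 6) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], round, digits = digits)
  df
}

hash_stage <- function(x) digest(round_numeric(x), algo = "sha256")

cat("extracted: ", hash_stage(extracted_features), "\n")
cat("corrected: ", hash_stage(corrected_features), "\n")
cat("recovered: ", hash_stage(recovered_feature_sample_table), "\n")
```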

> It might be related to the operating system.

This is confirmed: on the same OS (Ubuntu 20 based), the tests pass (tried on 3 separate machines).

This is indeed weird, since Debian and Ubuntu belong to the same OS family. However, we probably do not need to spend time on this now, since UMSA is Ubuntu and we will run the tool in a container anyway (betting on this not being caused by the kernel).

Since we get the biocontainers, do we actually know which underlying OS they will serve us? What will happen if that changes?

biocontainers use BusyBox as a minimalist base image and I doubt that will change.

Using the latest biocontainer image (docker pull quay.io/biocontainers/mulled-v2-dceaa6ce3d797c3153f327c7fc15983c4e882d4d:6584615d53d9bbb9d333d3928bdd6074d82193ce-0) I get identical results from a run on UMSA and a run on my macOS dev machine.

I'd treat that image as the canonical source of truth -- at least for now.