popgenmethods/SINGER

`-resume` flag fails because `ARG.bin_size` is never set

nspope opened this issue ยท 9 comments

Currently using -resume triggers the assertion:

assert(ws > 0);

AFAICT the underlying reason is that when using the -resume flag, the array ARG.rhos contains only zeros. This is because ARG::compute_rhos_thetas multiplies the recombination rate by ARG.bin_size here:

rhos.push_back(r*bin_size);

... and ARG.bin_size is initialized to 0 but never set during:

void Sampler::load_resume_arg() {

(whereas, when running from the top, Sampler::build_singleton_arg sets ARG.bin_size here).

It'd be awesome to fix this; currently the inability to resume MCMC after hitting a numerical issue really slows things down.

Hi @nspope , sorry about this. I will fix this right away. I am pulling together a new 0.1.8 version to incorporate other fixes of some bugs. But I will work on this right away.

No worries, thanks for the quick response!

Also if I could make a quick suggestion -- maybe in singer_master, there could be a flag --auto-resume or something similar, that adds the -resume flag to:

seeded_debug_cmd = debug_cmd + f" -seed {random_seeds[attempts]}"

so that the restarts can pick up from the last iteration rather than from the top?

@nspope try now the 0.1.8 version and I think the "empty bin"problem should be gone. Version 0.1.8 isn't finished on all the fixes yet, but this specific problem should be resolved.

Thanks Yun! So use the main branch from github.com/yundeng98/SINGER in the interim?

The suggestion I have for now is that, as you can tell, singer_master is all about resuming when things failed, and make that into an automatic workflow. However, sometimes it behaved outside my imagination. In this project of yours, you probably have more ideas than me about what is needed. So maybe go ahead and build your own workflow, so that my slow fixes won't delay you. I will let you know when 0.1.8 is completely finished and by then the control logic should be improved, including the resuming from last point logic. Just look for the binaries in releases/singer-0.1.8xxxxxx, and the main branch it is.

I'm sure you've got your own preferred way to do this ... but FWIW the most stable approach I've found is to restart from the second-to-last MCMC iteration, as in this Snakemake workflow.

The reason to not restart at the current MCMC iteration is that there are (rarely) instances where the ARG from the last iteration is invalid (seems like this might be floating point error wherein child age > parent age?). So, it works to delete the last "rethread" line of the log and restart from there.

Thanks @nspope I believe you are right, and something must be sub-optimal with the ARG re-scaling that causes the numerical issue of child/parent node age problem. I have enabled a function to set back a given number of iterations back, and it seems that I should simply set the number to be 1? I can go ahead and implement this into a new singer_master.

This has worked without a hitch for me, so seems like a good plan! Sounds like you've already got something implemented, but feel free to adapt the code snippet in that snakemake workflow if it looks useful.