kishwarshafin/helen

marginpolish docker stuck at 99%

Closed this issue · 13 comments

Hi,

I am running your new docker container to streamline assembly polishing and ran into some trouble with marginPolish. It looks like MP is stalling at the very end.

singularity run /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/helen_latest20200519.sif marginpolish ../SimonFlye27_15K.ONTremap.0x904.bam SimonFlye27_15K.fasta /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/MP_r941_guppy344_human.json -t 64 -o . -f
Running OpenMP with 64 threads.
> Parsing model parameters from file: /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/MP_r941_guppy344_human.json
> Parsing reference sequences from file: SimonFlye27_15K.fasta
> Going to write polished reference in : ./output.fa
> Set up bam chunker with chunk size 5000 and overlap 50 (for region=all), resulting in 546365 total chunks
> Polishing  1% complete (5623/546365).  Estimated time remaining: 31h 25m
> Polishing  2% complete (10934/546365).  Estimated time remaining: 25h 50m
> Polishing  3% complete (16427/546365).  Estimated time remaining: 23h 52m
> Polishing  4% complete (21903/546365).  Estimated time remaining: 22h 37m
> Polishing  5% complete (27374/546365).  Estimated time remaining: 22h 48m
> Polishing  6% complete (32813/546365).  Estimated time remaining: 22h 32m
> Polishing  7% complete (38250/546365).  Estimated time remaining: 22h 2m
> Polishing  8% complete (43711/546365).  Estimated time remaining: 21h 55m
> Polishing  9% complete (49186/546365).  Estimated time remaining: 22h 18m
> Polishing 10% complete (54652/546365).  Estimated time remaining: 22h 35m
> Polishing 11% complete (60114/546365).  Estimated time remaining: 22h 50m
> Polishing 12% complete (65596/546365).  Estimated time remaining: 22h 55m
> Polishing 13% complete (71045/546365).  Estimated time remaining: 22h 54m
> Polishing 14% complete (76500/546365).  Estimated time remaining: 22h 52m
> Polishing 15% complete (81977/546365).  Estimated time remaining: 22h 50m
> Polishing 16% complete (87432/546365).  Estimated time remaining: 22h 48m
> Polishing 17% complete (92924/546365).  Estimated time remaining: 22h 43m
.....
> Polishing 91% complete (497222/546365).  Estimated time remaining: 2h 41m
> Polishing 92% complete (502673/546365).  Estimated time remaining: 2h 23m
> Polishing 93% complete (508160/546365).  Estimated time remaining: 2h 5m
> Polishing 94% complete (513630/546365).  Estimated time remaining: 1h 47m
> Polishing 95% complete (519110/546365).  Estimated time remaining: 1h 29m
> Polishing 96% complete (524517/546365).  Estimated time remaining: 1h 12m
> Polishing 97% complete (530110/546365).  Estimated time remaining: 54m 3s
> Polishing 98% complete (535445/546365).  Estimated time remaining: 35m 59s
> Polishing 99% complete (541585/546365).  Estimated time remaining: 17m 57s

The H5 files have been created and written to, but no further writing has happened for the last few hours.

The process is still running, but it has only been using 1 thread for the last 5 hours.
Is this expected? Does marginpolish do some final wrap-up at the end that takes longer than expected?

368136 michelmo  20   0  121.9g 119.1g   1484 S 100.0  3.9 112466:10 marginPolish                                 
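
One way to confirm the single-thread behavior is top's per-thread mode on the PID shown above (a generic sketch, not output from this run):

top -H -p 368136                 # -H lists the individual threads of the marginPolish process
grep Threads /proc/368136/status # total number of threads the process has spawned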

Thank you,
Michel

Hi @MichelMoser ,

This is unexpected behavior. Did you let it run to completion? Maybe the coverage of one region is causing this. What is the coverage of your bam file?

I will discuss this with @tpesout to see if it's something we've seen before.

Hi @kishwarshafin ,

I let it run for an additional 24 hours after it reached

> Polishing 99% complete (541585/546365).  Estimated time remaining: 17m 57s

but the h5 files have not changed since then.

Average coverage is about 60x.
I thought downsampling was handled by the "maxDepth" setting in the .json when generating images?

Best,
Michel

@MichelMoser ,

I ran two polishing runs since last night with docker and both finished correctly. Would it be possible to prune the images and run it one more time?

If you are spending too much time on this, you could share the files and I can try to see what is causing the issue.

You are right, the maxDepth controls the downsampling.
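
For anyone checking their own parameters file, the current value can be pulled out with a recursive jq query, so the exact nesting inside the JSON does not matter (a sketch assuming jq is installed; the key name maxDepth is taken from the discussion above):

jq '.. | .maxDepth? // empty' MP_r941_guppy344_human.json    # prints every maxDepth value found anywhere in the file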

What do you mean by pruning the images?
Yes, I will rerun marginpolish one more time and report back.

docker rmi <helen_docker_image>

Remove the existing docker image and pull it again. It would be crazy if this works.
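
Since this run uses Singularity rather than Docker, the equivalent would be to delete the cached .sif and rebuild it fresh from the Docker image. A rough sketch (the Docker Hub image name and tag are assumptions here, please check the README for the exact ones):

rm /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/helen_latest20200519.sif
singularity pull helen_latest.sif docker://kishwarshafin/helen:latest    # image name/tag assumed, verify against the README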

I assume using Singularity instead of Docker is not the source of the error.

Hmm, it has been running for 1.5 days straight now and is still at "Polishing 99% complete". I gave it 96 threads.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
177576 michelmo  20   0  374.1g 369.8g   1488 S 291.8 12.2 125225:18 marginPolish

How long do marginpolish runs normally take for human genomes?

@MichelMoser , sorry about this. Usually, a human genome takes about 10-15 hours; on 96 threads it should take less than that. This would be your second run where it got stuck, is that right?

Yes, it's the second run, and it's still at "Polishing 99% complete". You said it might be a coverage problem?

I ran polishing with PEPPER simultaneously and it ran through without a hitch within 23.5 hours (on GPUs).

@MichelMoser , at this point, with all the improvements in the basecaller, you should be able to see similar results with PEPPER and MarginPolish-HELEN.

If it's not inconvenient for you, I'd like to keep this issue open and get back to it to see if it happens to any other assemblies. This is very unusual and should be looked into.

Hi @MichelMoser, I'm sorry you're having issues running this. I have seen something like this happen (though never for 24 extra hours) with human reads aligned to GRCh38 with minimap2, in a very deep region flanked by very shallow regions (generally satellite DNA). There are some ways to verify this, which I'm happy to do if you're willing to share your data. Also, running MarginPolish with the -a info flag will produce log messages that can help diagnose the problem.
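
Applied to the command from the top of this thread, a rerun with extra logging would look roughly like this (same command as before, with only the -a info flag added):

singularity run /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/helen_latest20200519.sif marginpolish ../SimonFlye27_15K.ONTremap.0x904.bam SimonFlye27_15K.fasta /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/MP_r941_guppy344_human.json -t 64 -o . -f -a info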

Hi @tpesout and @kishwarshafin ,
Thank you for the help, and I am happy to share files. But before transferring the 180 GB file, I could generate some coverage stats with mosdepth and send you the results if that's helpful.
Also, I can rerun with the logging option and send you the output.
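
A sketch of what I have in mind for the mosdepth run (window size picked to match the 5 kb chunk size marginpolish reported above; -t, -n and --by are mosdepth's standard options):

mosdepth -t 8 -n --by 5000 SimonFlye27_15K ../SimonFlye27_15K.ONTremap.0x904.bam    # -n skips per-base output, --by gives 5 kb windows
zcat SimonFlye27_15K.regions.bed.gz | sort -k4,4nr | head                           # deepest windows, i.e. candidate stall regions

The genome-wide mean would then be in SimonFlye27_15K.mosdepth.summary.txt.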

@MichelMoser , coverage plots and the log would be great!