marbl/verkko

verkko does not always continue its run correctly after crashes in hi-c

Closed this issue · 22 comments

i.e. MissingInputException in rule alignBWA in file /data/antipovd2/devel/verkko/lib/verkko/Snakefiles/8-hicPipeline.sm, line 199:
Missing input files for rule alignBWA:
output: 8-hicPipeline/mapped002.bam
wildcards: nnnn=002
affected files:
8-hicPipeline/split/hic2002.fasta.gz
ERROR!
Not running final consensus since no rukki paths provided!

(The crash itself was not caused by verkko; it was a disk quota issue.)

Solved by: cp /conda_envs/lib/verkko/bin/rukki ~/conda_envs/bin/

Hello!
@HaominLyu 's solution did not seem to work for me. Do you know if this crash is related to the disk quota, or to neither verkko nor the disk quota? Have you found a different solution by any chance?

Best,
Dustin

Comparing my output to the snakemake script assigned to the Hi-C step, I noticed that the "runRDNAMashMap" portion of the script does not run; could this be throwing the error? My hi-c pipeline directory seemed to successfully run mashmap and split the Hi-C file. In other words, is the root of this error unknown to you, or is it disk quota related? If you do know it, can you please suggest how to get around it?

Thanks again.


This issue is not related. One of your runs crashed for an unknown reason, right? Let's create a separate issue for it.

The missing runRDNAMashMap rule is optional and is not run without the --rdna-scaff option; we have two different mashmap-based rules.

The original problem is about snakemake's checkpoints; the simplest workaround is to delete the corresponding stage folder, 8-hicPipeline, completely. At that time there was no rdna scaffolding and no runRDNAMashMap rule in verkko.
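For reference, the workaround amounts to something like the sketch below. The path and the verkko invocation are illustrative assumptions; rerun with exactly the same arguments as your original command.

```shell
# Sketch of the checkpoint workaround, assuming the current directory is the
# verkko assembly (output) directory.
rm -rf ./8-hicPipeline   # drop the stale stage so snakemake re-creates its checkpoints

# Then re-run the original verkko command unchanged, e.g. (illustrative):
# verkko -d . --hifi hifi.fq.gz --nano ont.fq.gz --hic1 hic_R1.fq.gz --hic2 hic_R2.fq.gz
```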

Hey!

I am not worried about runRDNAMashMap, I was just going through the snakemake script to try and guess why the "Missing input files for rule alignBWA" error occurs. I'll delete 8-hicPipeline, rerun, and let you know if the issue comes back up.

Hey Dmitry,

Thanks again for your quick reply yesterday. I re-ran everything and there is a different error in the snakemake log:

Submitted job 320 with external jobid 'removed user hold of job 98966760'.
Waiting at most 30 seconds for missing files.
MissingOutputException in rule splitHIC in file /.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/envs/verkko/lib/verkko/Snakefiles/8-hicPipeline.sm, line 193:
Job 320 completed successfully, but some output files are missing. Missing files after 30 seconds.
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
8-hicPipeline/splitHIC.finished
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-25T161455.282514.snakemake.log

I tried to rerun the main script (without deleting the 8-hicPipeline directory) and was able to reproduce the error I originally reported:

MissingInputException in rule alignBWA in file /.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/envs/verkko/lib/verkko/Snakefiles/8-hicPipeline.sm, line 268:
Missing input files for rule alignBWA:
output: 8-hicPipeline/mapped005.bam
wildcards: nnnn=005
affected files:
8-hicPipeline/split/hic2005.fasta.gz

Furthermore, there is no 8-hicPipeline/splitHIC.finished file in my directory, further suggesting that my system is getting confused while splitting the hi-c files.

I'm deleting 8-hicPipeline and adding --snakeopts "--latency-wait 3".
I'll update to see if that's the fix.

We already use --latency-wait 30 as the default snakemake option for cluster runs, so adding --snakeopts "--latency-wait 3" should not help.
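For completeness, this is roughly how a longer latency wait would be passed through; the verkko invocation below is illustrative, and note that the snakemake flag takes two dashes. Since the cluster default is already 30 s, only a value larger than that could make any difference.

```shell
# Illustrative only: forward a longer latency wait to snakemake via verkko.
# Input file names are placeholders.
verkko -d asm --hifi hifi.fq.gz --nano ont.fq.gz \
       --hic1 hic_R1.fq.gz --hic2 hic_R2.fq.gz \
       --snakeopts "--latency-wait 300"
```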

I see. Do you have any suggestions for what else could be going on?

Quick update here. It looks like splitHIC.sh is consistently dying while trying to switch from splitting hic1 to hic2. Do you know if the splitHIC script submits any jobs directly? I should be running on /bin/bash, which all of my batch-scripts are. That said, the scripts outside of this directory are still /bin/sh.

The snakemake log is still by and large the same.


Could you show 8-hicPipeline/splitHIC.err and 8-hicPipeline/splitHIC.sh ?
Not sure that I get the question about submitting jobs directly. Snakemake submits splitHIC.sh script, which is generated at the splitHIC checkpoint (and there'll be #!/bin/sh hardcoded in it). Do you expect problems because of running those scripts with sh and not bash?

splitHIC.zip

Please see the attached folder with the files you asked.

The thought did cross my mind, though I don't see how that could happen given that steps 1-7 ran happily with #!/bin/sh hardcoded.

Hey Dmitry,

Quick update. I ran the HiC split script on its own and it ran successfully; alignment is now occurring.

We have a huge amount of HiC data, since we are using the same data to call TADs, ABC enhancers, CTCF-bound loops, etc. (we also generated RNA-seq/ChIP-seq/ATAC-seq etc. for functional annotation), so perhaps the memory/time limits for the split HiC step just weren't enough for the volume of data we have.

HiC1 got to 114 and HiC2 got to 113

I think I asked about the memory flag acronyms earlier, do you know the flag for this step?

splitHIC.zip

I was more interested in files from the 8-hicPipeline/ folder; 8-hicPipeline/splitHIC.err is not copied to batch-scripts.
If it survived the rerun, we can try to find out what the initial source of the problem was.

I think I asked about the memory flag acronyms earlier, do you know the flag for this step?

ahc_ for the alignment itself, fhc_ for anything else in the Hi-C pipeline.
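Those prefixes would be used along the lines of the sketch below. The exact flag syntax is an assumption based on verkko's per-stage resource convention (--&lt;prefix&gt;-run &lt;cpus&gt; &lt;mem-gb&gt; &lt;time-h&gt;); verify the names and argument order against verkko --help for your version before using them.

```shell
# Assumed flag names built from the ahc_/fhc_ prefixes mentioned above;
# input files and resource values are placeholders.
verkko -d asm \
    --hifi hifi.fq.gz --nano ont.fq.gz \
    --hic1 hic_R1.fq.gz --hic2 hic_R2.fq.gz \
    --fhc-run 16 64 48 \
    --ahc-run 24 96 48
# --fhc-run: non-alignment Hi-C steps (including the split), here 16 CPUs / 64 GB / 48 h
# --ahc-run: the Hi-C alignment step itself
```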

Hey Dmitry,

Sorry about that! Running HICsplit on its own successfully finished the pipeline. I am running it a few more times with variable numbers of ONT runs to see how that impacts assembly. I'll add the "fhc" flag with the parameters that I used to split, and will send the correct file if the same error gets thrown regardless.

Hey Dmitry,

Verkko_HiC_split.zip

I ran verkko-HiC again with the two extra ONT runs and reproduced the same error. Please find the scripts and .err files attached.

Best

So, an empty .err file, and it looks like the splitHiC process did not even start here...
Does this problem happen with all the verkko HiC runs on your system, or are some able to finish without crashing?
Is there an option to get some information about that process's fate from your cluster? As far as I can tell, the process id should be 99019439. If you run splitHiC.sh from 8-hicPipeline on its own, does it work correctly with both bash and sh, or only with bash?

Hey Dmitry,

I do not think that I can find the info on the crash, though it at least partially runs, since it makes it halfway through Hic1 without crashing.

If I run splitHiC.sh on its own with a bash shebang, it works fine on my (sge) system; if I qsub it as #!/bin/sh instead, it crashes immediately with:
/opt/uge-2023/default/spool/ucn107-32/job_scripts/99450442: 9: source: not found
I also can't seem to find the metadata associated with the "split" step in .snakemake/metadata, despite seeing the steps before and after.

That said, if it is a problem with our system not allocating enough memory, or with switching to sh, then I'm happy to run verkko with the HiC step manually. Having a phased assembly in 72 hours (rather than the 48 it would otherwise take) is still unbelievable in my books.

Thanks!
Dustin

If there is a problem with sh, my question is why it didn't crash before...

Is it possible that the sh interpreter is not available as /bin/sh on some subset (but not all) of the nodes on your system?

I think this is something due to an incorrectly set up sh; the actual error message (source: not found) isn't part of the verkko code, it is coming from the SGE initialization. I suspect sh on your system is csh or some other non-bash variant. Does the script work if you edit it to have #!/bin/bash and submit?
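The bashism behind that error can be reproduced in isolation. A minimal demo (the file names here are made up):

```shell
# "source" is a bash builtin; POSIX sh only guarantees the equivalent ". file" form.
echo 'echo ok' > /tmp/env_demo.sh
printf 'source /tmp/env_demo.sh\n' > /tmp/job_demo.sh

bash /tmp/job_demo.sh          # prints "ok": bash has the source builtin
sh /tmp/job_demo.sh || true    # may fail with "source: not found" when sh is dash
                               # or another strict POSIX shell
```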

We had special-case code in canu to work around some of the startup weirdness in SGE (you can check with qconf -sq <queue you're using> and look for the shell entries). If it mentions posix_behavior, it's possible that adding -S /bin/bash to your profiles/slurm-sge-submit.sh submit command in verkko would work, changing:

112   jobid=$(qsub -h -terse -cwd -V -pe thread ${n_cpus} -l memory=${mem_per_thread}g -j y -o batch-scripts/${jobidx}.${rule_n}.${jobidx}.out "$@")

to

112   jobid=$(qsub -S /bin/bash -h -terse -cwd -V -pe thread ${n_cpus} -l memory=${mem_per_thread}g -j y -o batch-scripts/${jobidx}.${rule_n}.${jobidx}.out "$@")

Hello Sergey and Dmitry,

When running outside of snakemake, the script works just fine with #!/bin/bash

I will definitely add the -S /bin/bash to the header, that makes a lot of sense.

When running the splitHiC command within the full pipeline, it crashes when switching from HiC1 to HiC2 rather than throwing the sh issue. This could be a memory/runtime error, though, since we have a huge amount of HiC data and it takes ~20 hours with 16 threads to split the data when I run the script on its own. Thank you again!

Best,
Dustin


Sorry Sergey, I should have updated.

I ended up just running the "split" script separately and restarting after updating the header, and it worked great.

Best,
Dustin