NextPolish slurmstepd: error: JOB CANCELLED DUE TO TIME LIMIT
martinmau1 opened this issue · 6 comments
Describe the bug
Hello,
I ran into a timeout problem with the SLURM routine within NextPolish. How can I allow more computing time? Is there a SLURM config file within NextPolish, or do I need to alter the Paralleltask cluster.cfg for that, and if so, which setting would that be?
Many thanks for your help.
Best wishes
Martin
Error message
cd /globalhome/mam880/HPC/nextpolish_out/ES612/run1/00.lgs_polish/04.polish.ref.sh.work/polish_genome01
time /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx/Core/python/3.8.10/bin/python /globalhome/mam880/HPC/NextPolish/lib/nextpolish2.py -sp -p 3 -g /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/input.genome.fasta -b /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/input.genome.fasta.blc -i 0 -l /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/lgs.sort.bam.list -r ont -o genome.nextpolish.part000.fasta
[64432 INFO] 2023-01-25 16:45:01 Corrected step options:
[64432 INFO] 2023-01-25 16:45:01
split: 0
process: 3
auto: True
read_type: 1
block_index: 0
window: 5000000
uppercase: False
alignment_score_ratio: 0.8
alignment_identity_ratio: 0.8
out: genome.nextpolish.part000.fasta
genome: /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/input.genome.fasta
bam_list: /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/lgs.sort.bam.list
block: /globalhome/mam880/HPC/nextpolish_out/ES612/./run1/00.lgs_polish/input.genome.fasta.blc
[64440 INFO] 2023-01-25 16:45:01 Start a corrected worker in 64440 from parent 64432
[64442 INFO] 2023-01-25 16:45:01 Start a corrected worker in 64442 from parent 64432
[64444 INFO] 2023-01-25 16:45:01 Start a corrected worker in 64444 from parent 64432
[bam_merge_iter_init] Region "ctg000000:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000010:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000020:0-15000000" specifies an unknown reference name
[64464 INFO] 2023-01-25 16:45:02 Start a corrected worker in 64464 from parent 64432
[64465 INFO] 2023-01-25 16:45:02 Start a corrected worker in 64465 from parent 64432
[64467 INFO] 2023-01-25 16:45:02 Start a corrected worker in 64467 from parent 64432
[bam_merge_iter_init] Region "ctg000030:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000040:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000050:0-15000000" specifies an unknown reference name
[64472 INFO] 2023-01-25 16:45:04 Start a corrected worker in 64472 from parent 64432
[64473 INFO] 2023-01-25 16:45:04 Start a corrected worker in 64473 from parent 64432
[64474 INFO] 2023-01-25 16:45:04 Start a corrected worker in 64474 from parent 64432
[bam_merge_iter_init] Region "ctg000060:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000070:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000080:0-15000000" specifies an unknown reference name
[64485 INFO] 2023-01-25 16:45:05 Start a corrected worker in 64485 from parent 64432
[64486 INFO] 2023-01-25 16:45:05 Start a corrected worker in 64486 from parent 64432
[64487 INFO] 2023-01-25 16:45:05 Start a corrected worker in 64487 from parent 64432
[bam_merge_iter_init] Region "ctg000090:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000100:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000110:0-15000000" specifies an unknown reference name
[64500 INFO] 2023-01-25 16:45:07 Start a corrected worker in 64500 from parent 64432
[64501 INFO] 2023-01-25 16:45:07 Start a corrected worker in 64501 from parent 64432
[bam_merge_iter_init] Region "ctg000120:0-15000000" specifies an unknown reference name
[64502 INFO] 2023-01-25 16:45:07 Start a corrected worker in 64502 from parent 64432
[bam_merge_iter_init] Region "ctg000130:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000140:0-15000000" specifies an unknown reference name
[64508 INFO] 2023-01-25 16:45:08 Start a corrected worker in 64508 from parent 64432
[64509 INFO] 2023-01-25 16:45:08 Start a corrected worker in 64509 from parent 64432
[bam_merge_iter_init] Region "ctg000150:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000160:0-15000000" specifies an unknown reference name
[64511 INFO] 2023-01-25 16:45:08 Start a corrected worker in 64511 from parent 64432
[bam_merge_iter_init] Region "ctg000170:0-15000000" specifies an unknown reference name
[64514 INFO] 2023-01-25 16:45:09 Start a corrected worker in 64514 from parent 64432
[bam_merge_iter_init] Region "ctg000180:0-15000000" specifies an unknown reference name
[64515 INFO] 2023-01-25 16:45:09 Start a corrected worker in 64515 from parent 64432
[bam_merge_iter_init] Region "ctg000190:0-15000000" specifies an unknown reference name
[64517 INFO] 2023-01-25 16:45:09 Start a corrected worker in 64517 from parent 64432
[bam_merge_iter_init] Region "ctg000200:0-15000000" specifies an unknown reference name
[64519 INFO] 2023-01-25 16:45:10 Start a corrected worker in 64519 from parent 64432
[64520 INFO] 2023-01-25 16:45:10 Start a corrected worker in 64520 from parent 64432
[bam_merge_iter_init] Region "ctg000210:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000220:0-15000000" specifies an unknown reference name
[64523 INFO] 2023-01-25 16:45:10 Start a corrected worker in 64523 from parent 64432
[bam_merge_iter_init] Region "ctg000230:0-15000000" specifies an unknown reference name
[64526 INFO] 2023-01-25 16:45:11 Start a corrected worker in 64526 from parent 64432
[bam_merge_iter_init] Region "ctg000240:0-15000000" specifies an unknown reference name
[64529 INFO] 2023-01-25 16:45:11 Start a corrected worker in 64529 from parent 64432
[64530 INFO] 2023-01-25 16:45:11 Start a corrected worker in 64530 from parent 64432
[bam_merge_iter_init] Region "ctg000250:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000260:0-15000000" specifies an unknown reference name
[64533 INFO] 2023-01-25 16:45:12 Start a corrected worker in 64533 from parent 64432
[64534 INFO] 2023-01-25 16:45:12 Start a corrected worker in 64534 from parent 64432
[bam_merge_iter_init] Region "ctg000280:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000270:0-15000000" specifies an unknown reference name
[64535 INFO] 2023-01-25 16:45:12 Start a corrected worker in 64535 from parent 64432
[bam_merge_iter_init] Region "ctg000290:0-15000000" specifies an unknown reference name
[64539 INFO] 2023-01-25 16:45:13 Start a corrected worker in 64539 from parent 64432
[64540 INFO] 2023-01-25 16:45:13 Start a corrected worker in 64540 from parent 64432
[bam_merge_iter_init] Region "ctg000300:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000310:0-15000000" specifies an unknown reference name
[64545 INFO] 2023-01-25 16:45:13 Start a corrected worker in 64545 from parent 64432
[bam_merge_iter_init] Region "ctg000320:0-15000000" specifies an unknown reference name
[64546 INFO] 2023-01-25 16:45:13 Start a corrected worker in 64546 from parent 64432
[64547 INFO] 2023-01-25 16:45:14 Start a corrected worker in 64547 from parent 64432
[bam_merge_iter_init] Region "ctg000330:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000340:0-15000000" specifies an unknown reference name
[64549 INFO] 2023-01-25 16:45:14 Start a corrected worker in 64549 from parent 64432
[bam_merge_iter_init] Region "ctg000350:0-15000000" specifies an unknown reference name
[64553 INFO] 2023-01-25 16:45:14 Start a corrected worker in 64553 from parent 64432
[64554 INFO] 2023-01-25 16:45:15 Start a corrected worker in 64554 from parent 64432
[bam_merge_iter_init] Region "ctg000360:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000370:0-15000000" specifies an unknown reference name
[64557 INFO] 2023-01-25 16:45:15 Start a corrected worker in 64557 from parent 64432
[bam_merge_iter_init] Region "ctg000380:0-15000000" specifies an unknown reference name
[64559 INFO] 2023-01-25 16:45:15 Start a corrected worker in 64559 from parent 64432
[bam_merge_iter_init] Region "ctg000390:0-15000000" specifies an unknown reference name
[64562 INFO] 2023-01-25 16:45:16 Start a corrected worker in 64562 from parent 64432
[bam_merge_iter_init] Region "ctg000400:0-15000000" specifies an unknown reference name
[64563 INFO] 2023-01-25 16:45:16 Start a corrected worker in 64563 from parent 64432
[bam_merge_iter_init] Region "ctg000410:0-15000000" specifies an unknown reference name
[64566 INFO] 2023-01-25 16:45:17 Start a corrected worker in 64566 from parent 64432
[bam_merge_iter_init] Region "ctg000420:0-15000000" specifies an unknown reference name
[64571 INFO] 2023-01-25 16:45:18 Start a corrected worker in 64571 from parent 64432
[64572 INFO] 2023-01-25 16:45:18 Start a corrected worker in 64572 from parent 64432
[bam_merge_iter_init] Region "ctg000430:0-15000000" specifies an unknown reference name
[64573 INFO] 2023-01-25 16:45:18 Start a corrected worker in 64573 from parent 64432
[bam_merge_iter_init] Region "ctg000440:0-15000000" specifies an unknown reference name
[bam_merge_iter_init] Region "ctg000450:0-15000000" specifies an unknown reference name
[64578 INFO] 2023-01-25 16:45:19 Start a corrected worker in 64578 from parent 64432
[64579 INFO] 2023-01-25 16:45:19 Start a corrected worker in 64579 from parent 64432
[64581 INFO] 2023-01-25 16:45:20 Start a corrected worker in 64581 from parent 64432
slurmstepd: error: *** JOB 948575 ON plato209 CANCELLED AT 2023-01-25T17:05:09 DUE TO TIME LIMIT ***
Operating system
HPC; Gentoo Base System release 2.6
GCC
gcc/11.3.0
Python
python/3.8.10/
NextPolish
v1.4.1
To Reproduce (Optional)
The run.cfg I used:
[General]
job_type = SLURM
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 12
multithread_jobs = 5
genome = /datastore/SEED1/Martin/Apomixis_Breeding/Analyses_output/ONT/nextdenovo_out/ES612/assembly_3/03.ctg_graph/01.ctg_graph.sh.work/nd.asm.fasta
genome_size = auto
workdir = ${HOME}/nextpolish_out/ES612/run1
polish_options = -p 35
[sgs_options]
sgs_fofn = ./sgs1.fofn
sgs_options = -max_depth 100 -bwa
[lgs_options]
lgs_fofn = ./lgs1.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
Try using your own alignment pipeline, and then use NextPolish only to polish the genome; see here.
PS: I don't know how to allow more computing time; you may need to ask your system administrator for help.
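For illustration, a rough sketch of the kind of standalone pipeline meant here, pieced together from the flags visible in the log above (file names, paths and thread counts are placeholders; check nextpolish2.py --help in your installation for the exact options):
# align the long reads yourself, then let NextPolish do only the polishing
minimap2 -ax map-ont -t 20 input.genome.fasta ont_reads.fastq.gz | samtools sort -@ 10 -o lgs.sort.bam -
samtools index lgs.sort.bam
ls `pwd`/lgs.sort.bam > lgs.sort.bam.list
# run the polishing step directly, analogous to the command shown in the error message above
python /globalhome/mam880/HPC/NextPolish/lib/nextpolish2.py -sp -p 10 -g input.genome.fasta -l lgs.sort.bam.list -r ont -o genome.nextpolish.fasta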
Hello,
thank you for your advice. I was able to fix at least the timeout issue in the Paralleltask cluster.cfg file, and NextPolish with the standard run.cfg for short and long reads now continues to run. But the underlying problem seems to be that, especially in the "polish_genome" step, the program does not run properly, or at least runs very slowly: almost no CPU is used and the memory usage is also quite low (please see attachment).
Do you have any idea how I can fix this in NextPolish or Paralleltask?
Again many thanks for your help
Best wishes
Martin
First, try killing all tasks and rerunning; the run will only continue with the unfinished tasks.
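As an illustration (not the exact commands from this thread), killing and resuming could look roughly like this, assuming the usual nextPolish run.cfg invocation; adjust the job filter to your own cluster:
squeue -u $USER        # list your queued/running jobs
scancel <jobid>        # cancel the NextPolish-related jobs (or scancel -u $USER for all of them)
nextPolish run.cfg     # rerun; only unfinished tasks are executed again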
Hello,
I have killed the jobs before every new run attempt, and upon restart I run into the same issue: there is now plenty of time, but when checking the node usage it seems to be basically idling during the polishing step, although new files are created in each task (n = 12). It seems the step that would move it into the next round of polishing is missing (I use 5,5,1,2,1,2).
Could it be that a program uses almost no CPU because one worker in a parallel computation is waiting for another? That could be the case here: perhaps the controller program (running on the login node) is not communicating properly with the submitted jobs.
Is there a verbose/debug log that I can check, and where can I find it / what is its name?
Many thanks
Martin
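One way to check from the SLURM side whether the submitted polishing tasks are actually consuming CPU is to query the scheduler's accounting data (standard SLURM commands; the job ID is a placeholder):
sstat -j <jobid> --format=JobID,AveCPU,MaxRSS                            # CPU time and memory of a running job's steps
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS    # the same once the job has finished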
Hi, if you are sure it is not caused by the system/computer, then you have hit a bug where some sub-processes have crashed, which blocks the main process. Could you extract the unfinished scaffold/contig sequences and the BAM file and send them to me? I need to reproduce this bug to fix it.
Hello,
it seems I have solved the issue, and the program now runs through and produces all the expected output files. To get there I again had to alter the Paralleltask cluster.cfg file and define fixed values for 'memory-per-cpu' and 'cpus-per-task', overriding the resources requested by the NextPolish subroutines when that cluster.cfg is left at its defaults (e.g. memory-per-cpu = 8G instead of memory-per-cpu = {mem}). I also shortened the computing-time request from 48 h to 1 h in the Paralleltask cluster.cfg, as the initial request was probably too large and some jobs were stuck in 'PD' (pending).
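For reference, a hypothetical excerpt of the kind of cluster.cfg change described above; the exact section and option names depend on the installed Paralleltask version, so treat it only as an illustration of replacing the placeholders with fixed values and lowering the walltime:
memory-per-cpu = 8G     # fixed value instead of the {mem} placeholder
cpus-per-task = 5       # fixed value instead of a placeholder; 5 is an example matching multithread_jobs
time = 01:00:00         # example walltime key; 1 h instead of 48 h so jobs leave the 'PD' state sooner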
Thanks for your help anyway!
Martin