run compleasm as array job
Hi there, I'm conducting several parallel tests with hifiasm using specific parameters. Everything works fine, but along with some assembly metrics, I also want to assess the completeness of these genomes.
Basically, I'm working with a diploid plant and using fabales_odb10 as the lineage database, running 12 experiments at a time with an array job. For some reason, compleasm can only process a maximum of 5 simultaneously, and I cannot figure out the reason behind this...
Below are the two commands I run:
source /opt/miniconda3/bin/activate genomes
compleasm run -a $d/${ID}.hap1.fasta -o $d/completeness_hap1 -l fabales_odb10 -L hap1_${TEMP} -t 8 && compleasm run -a $d/${ID}.hap2.fasta -o $d/completeness_hap2 -l fabales_odb10 -L hap2_${TEMP} -t 8; rm -rf hap*_${TEMP}
After activating the conda environment for the tool, I generate the summary.txt for the FASTA sequence of each experiment (stored in different folders), naming the database folder hap1 or hap2 followed by the experiment number; I do this for both hap1 and hap2. The tool seems to complete only five of the experiments per run; for the others it returns the following:
[M::worker_pipeline::135.1767.70] mapped 2856 sequences
[M::worker_pipeline::141.9747.71] mapped 2291 sequences
[M::main] Version: 0.12-r237
[M::main] CMD: /path/to/softwares/condaenvs/genomes/bin/miniprot --trans -u -I --outs=0.95 -t 8 --gff /mnt/nvme/ungaro/INLUP_00233/0.15/INLUP00233.hap1.fasta /mnt/nvme/ungaro/INLUP_00233/mb_downloads/fabales_odb10/refseq_db.faa.gz
[M::main] Real time: 142.047 sec; CPU: 1094.231 sec; Peak RSS: 3.087 GB
Searching for miniprot in the path where compleasm.py is located
Searching for miniprot in the current execution path
Searching for miniprot in $PATH
Searching for hmmsearch in the path where compleasm.py is located
Searching for hmmsearch in the current execution path
Searching for hmmsearch in $PATH
miniprot execute command:
/path/to/softwares/condaenvs/genomes/bin/miniprot
lineage: /mnt/nvme/ungaro/INLUP_00233/mb_downloads/fabales_odb10
Traceback (most recent call last):
File "/path/to/softwares/condaenvs/genomes/bin/compleasm", line 10, in <module>
sys.exit(main())
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 2744, in main
args.func(args)
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 2624, in run
mr.Run()
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 2171, in Run
miniprot_alignment_parser.Run()
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 1174, in Run
self.Run_busco_mode()
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 1195, in Run_busco_mode
for items in reader:
File "/path/to/softwares/condaenvs/genomes/lib/python3.9/site-packages/compleasm.py", line 915, in parse_miniprot_records
assert target_id == items.target_id and seq_id == items.contig_id
AssertionError
which I'm not really sure how to address... thanks in advance!
Hi @Overcraft90,
The error is quite unexpected. My guess is that different tasks are processing hap1_${TEMP} and hap2_${TEMP} at the same time. My suggestion is to download fabales_odb10 only once and reuse it for all experiments.
# pre-download
compleasm download -L /path/to/lineages_folder fabales_odb10
# exp1
compleasm run -a $d/${ID}.hap1.fasta -o $d/completeness_hap1 -l fabales_odb10 -L /path/to/lineages_folder -t 8
compleasm run -a $d/${ID}.hap2.fasta -o $d/completeness_hap2 -l fabales_odb10 -L /path/to/lineages_folder -t 8
# exp2
...
# exp3
...
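For an array job, the same idea can be wrapped in a small helper so that every task points at one shared, pre-downloaded lineage folder. The following is only a sketch: `run_hap`, `run_experiment`, and `DRY_RUN` are hypothetical names introduced for illustration, the folder path is the placeholder used above, and the compleasm flags are the ones already shown in this thread.

```shell
#!/bin/sh
# Assumed values: adjust to your own setup.
LINEAGE_DIR="/path/to/lineages_folder"   # shared folder, downloaded once beforehand
LINEAGE="fabales_odb10"
THREADS=8
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands (default here)

# Build the compleasm command for one haplotype, then print or execute it.
run_hap() {
    asm_dir=$1; sample=$2; hap=$3
    set -- compleasm run -a "$asm_dir/$sample.$hap.fasta" \
        -o "$asm_dir/completeness_$hap" \
        -l "$LINEAGE" -L "$LINEAGE_DIR" -t "$THREADS"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run hap1 and then hap2 for one experiment; stop at the first failure.
run_experiment() {
    for h in hap1 hap2; do
        run_hap "$1" "$2" "$h" || return 1
    done
}
```

Inside each array task one would set `DRY_RUN=0` and call `run_experiment "$d" "$ID"`. No task downloads or deletes a lineage folder, so concurrent tasks only read from the shared copy.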
@huangnengCSU Indeed, thanks a lot, it worked. I've done a few experiments, setting up compleasm both to download the lineage database and to access an already downloaded one; the problem seems to be the && used to run hap1 first and hap2 afterwards.
In principle, downloading the database once is more efficient, especially when dealing with the same organism; however, I was advised to download it every time, since results might differ. Aside from that, once the commands for hap1 and hap2 are issued on two separate lines, everything runs smoothly.
I recall having to do something similar with BUSCO, where I had to set up independent multistep processes within the same job to run it on multiple assemblies, but because this is my first time with compleasm I wanted to double-check whether what I was doing with the tool was formally correct. Thanks again!
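For reference, the difference between chaining with && and using separate lines can be sketched with stand-in commands. This only illustrates general shell semantics and is not a diagnosis of the AssertionError above; `step`, `chained`, and `separate` are hypothetical helpers, not part of compleasm.

```shell
#!/bin/sh
# Stand-in for a real command: prints its name and returns the given exit status.
step() { echo "run $1"; return "$2"; }

# "A && B; C": B runs only if A succeeds, while C (after ';') always runs.
chained() { step hap1 "$1" && step hap2 0; echo "cleanup"; }

# Separate lines: each command runs in turn, whatever the previous exit status
# (unless the script enables 'set -e').
separate() {
    step hap1 "$1"
    step hap2 0
    echo "cleanup"
}
```

With a failing first step, `chained 1` skips hap2 but still reaches the cleanup. That is how the `rm -rf` at the end of the original one-liner behaves: it runs even when one of the compleasm runs fails.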