How to use Bgzip
mictadlo opened this issue · 32 comments
Description of feature
Hi,
I have 5 genomes and 5 FASTA files. How to use Bgzip?
Thank you in advance,
Best wishes,
Michal
I did the following steps:
cat ../genomes/ChinaLab/NbeHZ1_genome_1.0.fa ../genomes/JapanLab/Nbe_v1_scf.fa ../genomes/LAB360/NbLab360.genome.fasta ../genomes/UsaLab/Niben261_genome.fasta > ../genomes/chnVSjapVSauVSusa.fasta
bgzip -@ 4 ../genomes/chnVSjapVSauVSusa.fasta
samtools faidx ../genomes/chnVSjapVSauVSusa.fasta.gz
nextflow run nf-core/pangenome \
-r 1.0.0 \
--wfmash_chunks 20 \
--input ../genomes/chnVSjapVSauVSusa.fasta.gz \
--n_haplotypes 1 \
--outdir results \
-profile singularity \
-resume
Unfortunately, I got the following error:
error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)
Feb-12 17:49:57.268 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)'
Caused by:
Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)
Command executed:
wfmash \
chnVSjapVSauVSusa.fasta.gz \
chnVSjapVSauVSusa.fasta.gz \
\
--threads 6 \
\
-n 0 -s 5000 -p 90 -X -l 25000 -k 19 -H 0.001 -m > chnVSjapVSauVSusa.fasta.gz.paf
cat <<-END_VERSIONS > versions.yml
"NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP":
wfmash: $(echo $(wfmash --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
[wfmash] ERROR, skch::parseandSave, the number of mappings to retain for each segment has to be grater than 0.
what did I miss?
Hi @mictadlo,
from my understanding you have 5 different haplotypes in your FASTA.
So please set --n_haplotypes 5
. This should do the trick.
Hi,
Thank you, I noticed that my FASTA files do not follow. PanSN-spec naming scheme. For each FASTA file I increased haplotype_id := number
. Are those changes correct?
> grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | head
>Chr01
>Chr02
>Chr03
>Chr04
>Chr05
>Chr06
>Chr07
>Chr08
>Chr09
>Chr10
grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | tail
>scaffold_990
>scaffold_991
>scaffold_992
>scaffold_993
>scaffold_994
>scaffold_995
>scaffold_996
>scaffold_997
>scaffold_998
>scaffold_999
Changing to:
[NbeHZ1][#][1][#][Chr01]
...
[NbeHZ1][#][1][#][scaffold_999]
> grep ">" JapanLab/Nbe_v1_scf.fa | head
>Nbe.v1.s00010
>Nbe.v1.s00020
>Nbe.v1.s00030
>Nbe.v1.s00040
>Nbe.v1.s00050
>Nbe.v1.s00060
>Nbe.v1.s00070
>Nbe.v1.s00080
>Nbe.v1.s00090
>Nbe.v1.s00100
> grep ">" JapanLab/Nbe_v1_scf.fa | tail
>Nbe.v1.s16590
>Nbe.v1.s16600
>Nbe.v1.s16610
>Nbe.v1.s16620
>Nbe.v1.s16630
>Nbe.v1.s16640
>Nbe.v1.s16650
>Nbe.v1.s16660
>Nbe.v1.s16670
>Nbe.v1.s16680
Changing to:
[Nbe_v1][#][2][#][Nbe.v1.s00010]
...
[Nbe_v1][#][2][#][Nbe.v1.s16680]
> grep ">" LAB360/NbLab360.genome.fasta | head
>NbLab360C00 64506568 NbLab350C00
>NbLab360C01 NbLab350C08
>NbLab360C02 NbLab350C02
>NbLab360C03 NbLab350C06
>NbLab360C04 NbLab350C03
>NbLab360C05 NbLab350C05
>NbLab360C06 NbLab350C04
>NbLab360C07 NbLab350C07
>NbLab360C08 NbLab350C01
>NbLab360C09 NbLab350C09
> grep ">" LAB360/NbLab360.genome.fasta | tail
>NbLab360C10 NbLab350C10
>NbLab360C11 NbLab350C18
>NbLab360C12 NbLab350C12
>NbLab360C13 NbLab350C16
>NbLab360C14 NbLab350C13
>NbLab360C15 NbLab350C15
>NbLab360C16 NbLab350C14
>NbLab360C17 NbLab350C17
>NbLab360C18 NbLab350C11
>NbLab360C19 NbLab350C19
Changing to:
[NbLab360][#][3][#][NbLab360C01]
...
[NbLab360][#][3][#][NbLab360C00]
> grep ">" UsaLab/Niben261_genome.fasta | head
>Niben261Chr01
>Niben261Chr02
>Niben261Chr03
>Niben261Chr04
>Niben261Chr05
>Niben261Chr06
>Niben261Chr07
>Niben261Chr08
>Niben261Chr09
>Niben261Chr10
> grep ">" UsaLab/Niben261_genome.fasta | tail
>Niben261Scf17630
>Niben261Scf17631
>Niben261Scf17632
>Niben261Scf17633
>Niben261Scf17634
>Niben261Scf17635
>Niben261Scf17636
>Niben261Scf17637
>Niben261Scf17638
>Niben261Scf17639
Changing to:
[Niben261][#][4][#][Niben261Chr01]
...
[Niben261][#][4][#][Niben261Scf17639]
> abyss-fac ChinaLab/NbeHZ1_genome_1.0.fa JapanLab/Nbe_v1_scf.fa LAB360/NbLab360.genome.fasta UsaLab/Niben261_genome.fasta
n n:500 L50 min N75 N50 N25 E-size max sum name
2831 2831 10 1000 136e6 143.2e6 148.2e6 142.1e6 183.5e6 2.929e9 ChinaLab/NbeHZ1_genome_1.0.fa
1668 1668 10 1000 132.1e6 141.7e6 145e6 132.2e6 184.4e6 2.926e9 JapanLab/Nbe_v1_scf.fa
20 20 10 58.94e6 132.9e6 140.6e6 145.2e6 144.4e6 180.8e6 2.806e9 LAB360/NbLab360.genome.fasta
17639 17627 10 522 130.9e6 137.1e6 142.2e6 138.5e6 176.9e6 2.751e9 UsaLab/Niben261_genome.fasta
Thank you in advance,
Michal
Your abstract renaming schemes look good to me :)
Thank you, but I am little confused with For instance, HG002#1#ctg1234 names ctg1234 on the first haplotype or phase group of HG002, while HG002#2#ctg9876 is contig ctg9876 on the other.
. Does it mean that the haplotype_id
should be 1 for each genome?
- [sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
- [NbeHZ1][#][1][#][Chr01]
- [Nbe_v1][#][1][#][Nbe.v1.s00010]
- [NbLab360][#][1][#][NbLab360C01]
- [Niben261][#][4][#][Niben261Chr01]
Assuming you only have one haplotype (with bacteria for example, we always assume this), then setting the id to 1
should be fine here. Seeing your examples, I would do:
- [NbeHZ1][#][1][#][Chr01]
- [Nbe_v1][#][1][#][s00010]
- [NbLab360][#][1][#][C01]
- [Niben261][#][1][#][Chr01]
Does it make sense?
Hi,
Thank you. I could run the pipeline, but unfortunately, I did not get any VG
results.
nextflow run nf-core/pangenome \
-r 1.0.0 \
--wfmash_chunks 200 \
--input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
--n_haplotypes 5 \
--outdir results \
-profile singularity \
-resume
Reading the documentation I noticed that I need to use --vcf_spec
. However, I do not understand how to change "REF:DELIM[:LEN][,REF:DELIM:[LEN]]*"
for my below genomes
NbeHZ1#1#Chr02
Nbe_v1#1#s00170
GCA_034376525_1#1#CM067333.1
NbLab360#1#C11
Could you please help me?
Michal
Hi @mictadlo ,
which haplotype would you like have as the reference? I assume NbeHz1
?
Then I would set `--vcf-spec "NbeHz1". Please tell me if that worked for you.
Some more details from the PGGB Docs:
[vg]
-V, --vcf-spec SPEC specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
the paths matching ^REF are used as a reference, while the sample haplotypes
are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
If LEN is specified and greater than 0, the VCFs are decomposed, filtering
sites whose max allele length is greater than LEN. [default: off]
Thank you, it finished successfully, but I can't see any VG
(Variant Graph Output):
> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map
What did I miss?
Hi @mictadlo,
it could be that the VG_DECONSTRUCT process is buggy in the v1.0.0 (and v1.1.0). I recently pushed a fix to dev #183, but as it turns out, I still need to update bcftools to v1.19 in that process.
I will do that tomorrow. And directly make a new release. I will let you know, once this is done.
To get a glimpse of the fix, you can try out the current dev
branch. Else a new release is close: #186.
Hi, Thank you. I'm just running it and will let you know how it went.
Hi,
I ran the dev version but I can't see any VG results:
> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map
I have got some tower errors:
Mar-15 13:04:45.953 [Tower-thread] WARN i.s.tower.plugin.TowerJsonGenerator - Tower request field `tasks.script` exceeds expected size
| offending value: `
seqwish \
--threads 6 \
--paf-alns=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_173.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_174.paf,ch
nVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_176.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_175.paf,chnVSjapVSkorVSauVSusa.
PanSN-1-19.fasta.gz.chunk_178.paf
Susa.PanSN-1-19.fasta.gz.chunk_90.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_94.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.g
z.chunk_87.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_96.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_99.paf,chnVSjap
VSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_23.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_80.paf \
--seqs=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
--gfa=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.seqwish.gfa \
-k 19 -f 0.0 -B 10000000 -P
cat <<-END_VERSIONS > versions.yml
"NFCORE_PANGENOME:PANGENOME:PGGB:SEQWISH":
seqwish: $(echo $(seqwish --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
END_VERSIONS
`, size: 11710 (max: 10240)
Mar-15 13:04:46.083 [Actor Thread 30] INFO nextflow.container.SingularityCache - Pulling Singularity image https://depot.galaxyproject.org/singularity/smoothxg:0.7.2--h40c17d1_0 [cache /home/lorencm/.nextflow/NXF_SINGULARITY_CACHEDIR/depot.galaxyproject.org-singularity-smoothxg-0.7.2--h40c17d1_0.img]
I am attaching the
nextflow.log.txt
What did I miss?
That's strange. Can you please share the content of your FASTA index, so I can get a better idea if you set --vcf_spec
in the right way.
As far a I can see tower only gives out a warning ,right?
Where can I find the Fasta index?
Yes, tower seems to have given only warnings. However, when I logged into tower it showed successful but when I clicked on the run it shows that some tasked failed.
Would it be possible for you to share the tower run somehow?
As documented in https://nf-co.re/pangenome/1.1.1/docs/output#input-check, it should be in the folder samtools_faidx
.
It seems I don't have the folder
results> find . -name "samtools_faidx"
I will ask how to share the tower run with our admins.
So you created FASTA.gz.fai
before running the pipeline?
Completly, forgottten:
> ls ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz*
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gzi
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai
Please find chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai.txt
nextflow run nf-core/pangenome \
-r dev \
--wfmash_chunks 200 \
--input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
--n_haplotypes 5 \
--outdir results \
--vcf-spec NbLab360 \
-profile singularity \
-resume
I think you have to do --vcf_spec 'NbLab360'
. There is an underscore, not a hyphen. Didn't you get a warning?
I haven't seen any warnings about underscores.
I mean something like
WARN: The following invalid input values have been detected:
* --vcf-spec: gi|345525392:5000-18402
Yes, now I see it:
Mar-19 23:46:58.789 [main] WARN nextflow.validation.SchemaValidator - The following invalid input values have been detected:
* --vcf-spec: NbLab360
* --vcfSpec: NbLab360
Maybe you can extract the whole log file from tower, so no need to share more.
I will try. I ran it again with --vcf-spec 'NbLab360'
but I still got the following warning:
Mar-20 00:00:31.128 [main] WARN nextflow.validation.SchemaValidator - The following invalid input values have been detected:
* --vcf-spec: NbLab360
* --vcfSpec: NbLab360
I think you have to do
--vcf_spec 'NbLab360'
. There is an underscore, not a hyphen. Didn't you get a warning?
Please use a _
and not a -
.
--vcf
_spec 'NbLab360'
Hi, Thank you. Now, I got a new error:
The exit status of the task that caused the workflow execution to fail was: 1
Error executing process > 'NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)'
Caused by:
Process `NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)` terminated with an error exit status (1)
Command executed:
ref=$(echo "NbLab360" | cut -f 1 -d:)
delim=$(echo "NbLab360" | cut -f 2 -d:)
pop_length=$(echo "NbLab360" | cut -f 3 -d:)
if [[ -z $pop_length ]]; then
pop_length=0
fi
vcf="chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa".$(echo $ref | tr '/|' '_').vcf
vg deconstruct -P $ref -H $delim -e -a -t "6" "chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa" > $vcf
bcftools stats $vcf > $vcf.stats
if [[ $pop_length -gt 0 ]]; then
vcf_decomposed=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa.final.$(echo $ref | tr '/|' '_').decomposed.vcf
vcf_decomposed_tmp=$vcf_decomposed.tmp.vcf
bgzip -c -@ 6 $vcf > $vcf.gz
vcfbub -l 0 -a $pop_length --input $vcf.gz | vcfwave -I 1000 -t 6 > $vcf_decomposed_tmp
#TODO: to remove when vcfwave will be bug-free
# The TYPE info sometimes is wrong/missing
# There are variants without the ALT allele
bcftools annotate -x INFO/TYPE $vcf_decomposed_tmp | awk '$5 != "."' > $vcf_decomposed
rm $vcf_decomposed_tmp $vcf.gz
bcftools stats $vcf_decomposed > $vcf_decomposed.stats
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT":
pggb: $(pggb --version 2>&1 | grep -o 'pggb .*' | cut -f2 -d ' ')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
Warning [vg deconstruct]: -H is deprecated, and will be ignored
.command.sh: line 13: NbLab360: unbound variable
Work dir:
/mnt/hpccs01/scratch/waterhouse_team/pangenome/nf-dev02/work/09/6d70b3cf40e26972132b3d245452e0
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Hi @mictadlo I think this is because you use an older version of the pipeline which was buggy with respect to the variant calling. I would recommend to use the latest one 1.1.2. Sorry for this.
Hi @subwaystation, I got this error
Cannot find revision `1.1.2` -- Make sure that it exists in the remote repository `https://github.com/nf-core/pangenome
I don't know why, but this also always happens to me when I run a new revision of a pipeline. Please try again. Nextflow is confusing things.
Hi, Thank you for all your help. It worked successfully.
Fantastic news, thanks for letting me know @mictadlo .