nf-core/pangenome

How to use Bgzip

mictadlo opened this issue · 32 comments

Description of feature

Hi,
I have 5 genomes and 5 FASTA files. How to use Bgzip?

Thank you in advance,

Best wishes,

Michal

I did the following steps:

cat ../genomes/ChinaLab/NbeHZ1_genome_1.0.fa ../genomes/JapanLab/Nbe_v1_scf.fa ../genomes/LAB360/NbLab360.genome.fasta ../genomes/UsaLab/Niben261_genome.fasta > ../genomes/chnVSjapVSauVSusa.fasta

bgzip -@ 4 ../genomes/chnVSjapVSauVSusa.fasta
samtools faidx ../genomes/chnVSjapVSauVSusa.fasta.gz

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 20 \
      --input ../genomes/chnVSjapVSauVSusa.fasta.gz \
      --n_haplotypes 1 \
      --outdir results \
      -profile singularity \
      -resume

Unfortunately, I got the following error:

  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)
Feb-12 17:49:57.268 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)

Command executed:

  wfmash \
      chnVSjapVSauVSusa.fasta.gz \
      chnVSjapVSauVSusa.fasta.gz \
       \
      --threads 6 \
       \
      -n 0 -s 5000 -p 90  -X  -l 25000 -k 19 -H 0.001   -m > chnVSjapVSauVSusa.fasta.gz.paf
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP":
      wfmash: $(echo $(wfmash --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [wfmash] ERROR, skch::parseandSave, the number of mappings to retain for each segment has to be grater than 0.

what did I miss?

Hi @mictadlo,

from my understanding you have 5 different haplotypes in your FASTA.
So please set --n_haplotypes 5. This should do the trick.

Hi,
Thank you, I noticed that my FASTA files do not follow. PanSN-spec naming scheme. For each FASTA file I increased haplotype_id := number. Are those changes correct?

> grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | head
>Chr01
>Chr02
>Chr03
>Chr04
>Chr05
>Chr06
>Chr07
>Chr08
>Chr09
>Chr10
grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | tail
>scaffold_990
>scaffold_991
>scaffold_992
>scaffold_993
>scaffold_994
>scaffold_995
>scaffold_996
>scaffold_997
>scaffold_998
>scaffold_999

Changing to:

[NbeHZ1][#][1][#][Chr01]
...
[NbeHZ1][#][1][#][scaffold_999]

> grep ">" JapanLab/Nbe_v1_scf.fa | head
>Nbe.v1.s00010
>Nbe.v1.s00020
>Nbe.v1.s00030
>Nbe.v1.s00040
>Nbe.v1.s00050
>Nbe.v1.s00060
>Nbe.v1.s00070
>Nbe.v1.s00080
>Nbe.v1.s00090
>Nbe.v1.s00100
> grep ">" JapanLab/Nbe_v1_scf.fa | tail
>Nbe.v1.s16590
>Nbe.v1.s16600
>Nbe.v1.s16610
>Nbe.v1.s16620
>Nbe.v1.s16630
>Nbe.v1.s16640
>Nbe.v1.s16650
>Nbe.v1.s16660
>Nbe.v1.s16670
>Nbe.v1.s16680

Changing to:

[Nbe_v1][#][2][#][Nbe.v1.s00010]
...
[Nbe_v1][#][2][#][Nbe.v1.s16680]
> grep ">" LAB360/NbLab360.genome.fasta | head
>NbLab360C00 64506568 NbLab350C00
>NbLab360C01 NbLab350C08
>NbLab360C02 NbLab350C02
>NbLab360C03 NbLab350C06
>NbLab360C04 NbLab350C03
>NbLab360C05 NbLab350C05
>NbLab360C06 NbLab350C04
>NbLab360C07 NbLab350C07
>NbLab360C08 NbLab350C01
>NbLab360C09 NbLab350C09
> grep ">" LAB360/NbLab360.genome.fasta | tail
>NbLab360C10 NbLab350C10
>NbLab360C11 NbLab350C18
>NbLab360C12 NbLab350C12
>NbLab360C13 NbLab350C16
>NbLab360C14 NbLab350C13
>NbLab360C15 NbLab350C15
>NbLab360C16 NbLab350C14
>NbLab360C17 NbLab350C17
>NbLab360C18 NbLab350C11
>NbLab360C19 NbLab350C19

Changing to:

[NbLab360][#][3][#][NbLab360C01]
...
[NbLab360][#][3][#][NbLab360C00]
> grep ">" UsaLab/Niben261_genome.fasta | head
>Niben261Chr01
>Niben261Chr02
>Niben261Chr03
>Niben261Chr04
>Niben261Chr05
>Niben261Chr06
>Niben261Chr07
>Niben261Chr08
>Niben261Chr09
>Niben261Chr10
> grep ">" UsaLab/Niben261_genome.fasta | tail
>Niben261Scf17630
>Niben261Scf17631
>Niben261Scf17632
>Niben261Scf17633
>Niben261Scf17634
>Niben261Scf17635
>Niben261Scf17636
>Niben261Scf17637
>Niben261Scf17638
>Niben261Scf17639

Changing to:

[Niben261][#][4][#][Niben261Chr01]
...
[Niben261][#][4][#][Niben261Scf17639]
> abyss-fac ChinaLab/NbeHZ1_genome_1.0.fa JapanLab/Nbe_v1_scf.fa  LAB360/NbLab360.genome.fasta UsaLab/Niben261_genome.fasta

n	n:500	L50	min	N75	N50	N25	E-size	max	sum	name
2831	2831	10	1000	136e6	143.2e6	148.2e6	142.1e6	183.5e6	2.929e9	ChinaLab/NbeHZ1_genome_1.0.fa
1668	1668	10	1000	132.1e6	141.7e6	145e6	132.2e6	184.4e6	2.926e9	JapanLab/Nbe_v1_scf.fa
20	20	10	58.94e6	132.9e6	140.6e6	145.2e6	144.4e6	180.8e6	2.806e9	LAB360/NbLab360.genome.fasta
17639	17627	10	522	130.9e6	137.1e6	142.2e6	138.5e6	176.9e6	2.751e9	UsaLab/Niben261_genome.fasta

Thank you in advance,

Michal

Your abstract renaming schemes look good to me :)

Thank you, but I am little confused with For instance, HG002#1#ctg1234 names ctg1234 on the first haplotype or phase group of HG002, while HG002#2#ctg9876 is contig ctg9876 on the other. . Does it mean that the haplotype_id should be 1 for each genome?

  • [sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
  • [NbeHZ1][#][1][#][Chr01]
  • [Nbe_v1][#][1][#][Nbe.v1.s00010]
  • [NbLab360][#][1][#][NbLab360C01]
  • [Niben261][#][4][#][Niben261Chr01]

Assuming you only have one haplotype (with bacteria for example, we always assume this), then setting the id to 1 should be fine here. Seeing your examples, I would do:

  • [NbeHZ1][#][1][#][Chr01]
  • [Nbe_v1][#][1][#][s00010]
  • [NbLab360][#][1][#][C01]
  • [Niben261][#][1][#][Chr01]

Does it make sense?

Hi,
Thank you. I could run the pipeline, but unfortunately, I did not get any VG results.

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      -profile singularity \
      -resume

Reading the documentation I noticed that I need to use --vcf_spec. However, I do not understand how to change "REF:DELIM[:LEN][,REF:DELIM:[LEN]]*" for my below genomes

NbeHZ1#1#Chr02
Nbe_v1#1#s00170
GCA_034376525_1#1#CM067333.1
NbLab360#1#C11

Could you please help me?
Michal

Hi @mictadlo ,

which haplotype would you like have as the reference? I assume NbeHz1?
Then I would set `--vcf-spec "NbeHz1". Please tell me if that worked for you.

Some more details from the PGGB Docs:

   [vg]
    -V, --vcf-spec SPEC         specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
                                the paths matching ^REF are used as a reference, while the sample haplotypes
                                are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
                                a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
                                If LEN is specified and greater than 0, the VCFs are decomposed, filtering 
                                sites whose max allele length is greater than LEN. [default: off]

Thank you, it finished successfully, but I can't see any VG (Variant Graph Output):

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

What did I miss?

Hi @mictadlo,
it could be that the VG_DECONSTRUCT process is buggy in the v1.0.0 (and v1.1.0). I recently pushed a fix to dev #183, but as it turns out, I still need to update bcftools to v1.19 in that process.
I will do that tomorrow. And directly make a new release. I will let you know, once this is done.

To get a glimpse of the fix, you can try out the current dev branch. Else a new release is close: #186.

Hi, Thank you. I'm just running it and will let you know how it went.

Hi,
I ran the dev version but I can't see any VG results:

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

I have got some tower errors:

Mar-15 13:04:45.953 [Tower-thread] WARN  i.s.tower.plugin.TowerJsonGenerator - Tower request field `tasks.script` exceeds expected size
 | offending value: `
    seqwish \
        --threads 6 \
        --paf-alns=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_173.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_174.paf,ch
nVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_176.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_175.paf,chnVSjapVSkorVSauVSusa.
PanSN-1-19.fasta.gz.chunk_178.paf

Susa.PanSN-1-19.fasta.gz.chunk_90.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_94.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.g
z.chunk_87.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_96.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_99.paf,chnVSjap
VSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_23.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_80.paf \
        --seqs=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
        --gfa=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.seqwish.gfa \
        -k 19 -f 0.0 -B 10000000 -P

    cat <<-END_VERSIONS > versions.yml
    "NFCORE_PANGENOME:PANGENOME:PGGB:SEQWISH":
        seqwish: $(echo $(seqwish --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
    END_VERSIONS
    `, size: 11710 (max: 10240)
Mar-15 13:04:46.083 [Actor Thread 30] INFO  nextflow.container.SingularityCache - Pulling Singularity image https://depot.galaxyproject.org/singularity/smoothxg:0.7.2--h40c17d1_0 [cache /home/lorencm/.nextflow/NXF_SINGULARITY_CACHEDIR/depot.galaxyproject.org-singularity-smoothxg-0.7.2--h40c17d1_0.img]

I am attaching the
nextflow.log.txt

What did I miss?

That's strange. Can you please share the content of your FASTA index, so I can get a better idea if you set --vcf_spec in the right way.

As far a I can see tower only gives out a warning ,right?

Where can I find the Fasta index?

Yes, tower seems to have given only warnings. However, when I logged into tower it showed successful but when I clicked on the run it shows that some tasked failed.

Would it be possible for you to share the tower run somehow?

As documented in https://nf-co.re/pangenome/1.1.1/docs/output#input-check, it should be in the folder samtools_faidx.

It seems I don't have the folder

results> find . -name "samtools_faidx"

I will ask how to share the tower run with our admins.

So you created FASTA.gz.fai before running the pipeline?

Completly, forgottten:

> ls ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz*
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz      
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gzi
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai

Please find chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai.txt

nextflow run nf-core/pangenome \
      -r dev \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      --vcf-spec NbLab360 \
      -profile singularity \
      -resume

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

I haven't seen any warnings about underscores.

I mean something like

WARN: The following invalid input values have been detected:

* --vcf-spec: gi|345525392:5000-18402

Yes, now I see it:

Mar-19 23:46:58.789 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

Maybe you can extract the whole log file from tower, so no need to share more.

I will try. I ran it again with --vcf-spec 'NbLab360' but I still got the following warning:

Mar-20 00:00:31.128 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

Please use a _ and not a -.

--vcf_spec 'NbLab360'

Hi, Thank you. Now, I got a new error:

 The exit status of the task that caused the workflow execution to fail was: 1

Error executing process > 'NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)` terminated with an error exit status (1)

Command executed:

  ref=$(echo "NbLab360" | cut -f 1 -d:)
  delim=$(echo "NbLab360" | cut -f 2 -d:)
  pop_length=$(echo "NbLab360" | cut -f 3 -d:)
  
  if [[ -z $pop_length ]]; then
      pop_length=0
  fi
  
  vcf="chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa".$(echo $ref | tr '/|' '_').vcf
  vg deconstruct -P $ref -H $delim -e -a -t "6" "chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa" > $vcf
  bcftools stats $vcf > $vcf.stats
  if [[ $pop_length -gt 0 ]]; then
      vcf_decomposed=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa.final.$(echo $ref | tr '/|' '_').decomposed.vcf
      vcf_decomposed_tmp=$vcf_decomposed.tmp.vcf
      bgzip -c -@ 6 $vcf > $vcf.gz
      vcfbub -l 0 -a $pop_length --input $vcf.gz | vcfwave -I 1000 -t 6 > $vcf_decomposed_tmp
      #TODO: to remove when vcfwave will be bug-free
      # The TYPE info sometimes is wrong/missing
      # There are variants without the ALT allele
      bcftools annotate -x INFO/TYPE $vcf_decomposed_tmp  | awk '$5 != "."' > $vcf_decomposed
      rm $vcf_decomposed_tmp $vcf.gz
      bcftools stats $vcf_decomposed > $vcf_decomposed.stats
  fi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT":
      pggb: $(pggb --version 2>&1 | grep -o 'pggb .*' | cut -f2 -d ' ')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Warning [vg deconstruct]: -H is deprecated, and will be ignored
  .command.sh: line 13: NbLab360: unbound variable

Work dir:
  /mnt/hpccs01/scratch/waterhouse_team/pangenome/nf-dev02/work/09/6d70b3cf40e26972132b3d245452e0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Hi @mictadlo I think this is because you use an older version of the pipeline which was buggy with respect to the variant calling. I would recommend to use the latest one 1.1.2. Sorry for this.

Hi @subwaystation, I got this error
Cannot find revision `1.1.2` -- Make sure that it exists in the remote repository `https://github.com/nf-core/pangenome

I don't know why, but this also always happens to me when I run a new revision of a pipeline. Please try again. Nextflow is confusing things.

Hi, Thank you for all your help. It worked successfully.

Fantastic news, thanks for letting me know @mictadlo .