How to use Bgzip

Question

How to use Bgzip

mictadlo opened this issue a year ago · 32 comments

mictadlo commented a year ago

Description of feature

Hi,
I have 5 genomes and 5 FASTA files. How to use Bgzip?

Thank you in advance,

Best wishes,

Michal

Answer 1 · 2024-02-12T07:57:23.000Z

I did the following steps:

cat ../genomes/ChinaLab/NbeHZ1_genome_1.0.fa ../genomes/JapanLab/Nbe_v1_scf.fa ../genomes/LAB360/NbLab360.genome.fasta ../genomes/UsaLab/Niben261_genome.fasta > ../genomes/chnVSjapVSauVSusa.fasta

bgzip -@ 4 ../genomes/chnVSjapVSauVSusa.fasta
samtools faidx ../genomes/chnVSjapVSauVSusa.fasta.gz

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 20 \
      --input ../genomes/chnVSjapVSauVSusa.fasta.gz \
      --n_haplotypes 1 \
      --outdir results \
      -profile singularity \
      -resume

Unfortunately, I got the following error:

  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)
Feb-12 17:49:57.268 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)

Command executed:

  wfmash \
      chnVSjapVSauVSusa.fasta.gz \
      chnVSjapVSauVSusa.fasta.gz \
       \
      --threads 6 \
       \
      -n 0 -s 5000 -p 90  -X  -l 25000 -k 19 -H 0.001   -m > chnVSjapVSauVSusa.fasta.gz.paf
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP":
      wfmash: $(echo $(wfmash --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [wfmash] ERROR, skch::parseandSave, the number of mappings to retain for each segment has to be grater than 0.

what did I miss?

Answer 2 · 2024-02-12T08:08:55.000Z

Hi @mictadlo,

from my understanding you have 5 different haplotypes in your FASTA.
So please set --n_haplotypes 5. This should do the trick.

Answer 3 · 2024-02-12T23:23:48.000Z

Hi,
Thank you, I noticed that my FASTA files do not follow. PanSN-spec naming scheme. For each FASTA file I increased haplotype_id := number. Are those changes correct?

> grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | head
>Chr01
>Chr02
>Chr03
>Chr04
>Chr05
>Chr06
>Chr07
>Chr08
>Chr09
>Chr10
grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | tail
>scaffold_990
>scaffold_991
>scaffold_992
>scaffold_993
>scaffold_994
>scaffold_995
>scaffold_996
>scaffold_997
>scaffold_998
>scaffold_999

Changing to:

[NbeHZ1][#][1][#][Chr01]
...
[NbeHZ1][#][1][#][scaffold_999]

> grep ">" JapanLab/Nbe_v1_scf.fa | head
>Nbe.v1.s00010
>Nbe.v1.s00020
>Nbe.v1.s00030
>Nbe.v1.s00040
>Nbe.v1.s00050
>Nbe.v1.s00060
>Nbe.v1.s00070
>Nbe.v1.s00080
>Nbe.v1.s00090
>Nbe.v1.s00100
> grep ">" JapanLab/Nbe_v1_scf.fa | tail
>Nbe.v1.s16590
>Nbe.v1.s16600
>Nbe.v1.s16610
>Nbe.v1.s16620
>Nbe.v1.s16630
>Nbe.v1.s16640
>Nbe.v1.s16650
>Nbe.v1.s16660
>Nbe.v1.s16670
>Nbe.v1.s16680

Changing to:

[Nbe_v1][#][2][#][Nbe.v1.s00010]
...
[Nbe_v1][#][2][#][Nbe.v1.s16680]

> grep ">" LAB360/NbLab360.genome.fasta | head
>NbLab360C00 64506568 NbLab350C00
>NbLab360C01 NbLab350C08
>NbLab360C02 NbLab350C02
>NbLab360C03 NbLab350C06
>NbLab360C04 NbLab350C03
>NbLab360C05 NbLab350C05
>NbLab360C06 NbLab350C04
>NbLab360C07 NbLab350C07
>NbLab360C08 NbLab350C01
>NbLab360C09 NbLab350C09
> grep ">" LAB360/NbLab360.genome.fasta | tail
>NbLab360C10 NbLab350C10
>NbLab360C11 NbLab350C18
>NbLab360C12 NbLab350C12
>NbLab360C13 NbLab350C16
>NbLab360C14 NbLab350C13
>NbLab360C15 NbLab350C15
>NbLab360C16 NbLab350C14
>NbLab360C17 NbLab350C17
>NbLab360C18 NbLab350C11
>NbLab360C19 NbLab350C19

Changing to:

[NbLab360][#][3][#][NbLab360C01]
...
[NbLab360][#][3][#][NbLab360C00]

> grep ">" UsaLab/Niben261_genome.fasta | head
>Niben261Chr01
>Niben261Chr02
>Niben261Chr03
>Niben261Chr04
>Niben261Chr05
>Niben261Chr06
>Niben261Chr07
>Niben261Chr08
>Niben261Chr09
>Niben261Chr10
> grep ">" UsaLab/Niben261_genome.fasta | tail
>Niben261Scf17630
>Niben261Scf17631
>Niben261Scf17632
>Niben261Scf17633
>Niben261Scf17634
>Niben261Scf17635
>Niben261Scf17636
>Niben261Scf17637
>Niben261Scf17638
>Niben261Scf17639

Changing to:

[Niben261][#][4][#][Niben261Chr01]
...
[Niben261][#][4][#][Niben261Scf17639]

> abyss-fac ChinaLab/NbeHZ1_genome_1.0.fa JapanLab/Nbe_v1_scf.fa  LAB360/NbLab360.genome.fasta UsaLab/Niben261_genome.fasta

n	n:500	L50	min	N75	N50	N25	E-size	max	sum	name
2831	2831	10	1000	136e6	143.2e6	148.2e6	142.1e6	183.5e6	2.929e9	ChinaLab/NbeHZ1_genome_1.0.fa
1668	1668	10	1000	132.1e6	141.7e6	145e6	132.2e6	184.4e6	2.926e9	JapanLab/Nbe_v1_scf.fa
20	20	10	58.94e6	132.9e6	140.6e6	145.2e6	144.4e6	180.8e6	2.806e9	LAB360/NbLab360.genome.fasta
17639	17627	10	522	130.9e6	137.1e6	142.2e6	138.5e6	176.9e6	2.751e9	UsaLab/Niben261_genome.fasta

Thank you in advance,

Michal

Answer 4 · 2024-02-13T05:40:58.000Z

Your abstract renaming schemes look good to me :)

Answer 5 · 2024-02-13T11:37:57.000Z

Thank you, but I am little confused with For instance, HG002#1#ctg1234 names ctg1234 on the first haplotype or phase group of HG002, while HG002#2#ctg9876 is contig ctg9876 on the other. . Does it mean that the haplotype_id should be 1 for each genome?

[sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
[NbeHZ1][#][1][#][Chr01]
[Nbe_v1][#][1][#][Nbe.v1.s00010]
[NbLab360][#][1][#][NbLab360C01]
[Niben261][#][4][#][Niben261Chr01]

Answer 6 · 2024-02-13T17:44:23.000Z

Assuming you only have one haplotype (with bacteria for example, we always assume this), then setting the id to 1 should be fine here. Seeing your examples, I would do:

[NbeHZ1][#][1][#][Chr01]
[Nbe_v1][#][1][#][s00010]
[NbLab360][#][1][#][C01]
[Niben261][#][1][#][Chr01]

Does it make sense?

Answer 7 · 2024-03-01T05:11:00.000Z

Hi,
Thank you. I could run the pipeline, but unfortunately, I did not get any VG results.

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      -profile singularity \
      -resume

Reading the documentation I noticed that I need to use --vcf_spec. However, I do not understand how to change "REF:DELIM[:LEN][,REF:DELIM:[LEN]]*" for my below genomes

NbeHZ1#1#Chr02
Nbe_v1#1#s00170
GCA_034376525_1#1#CM067333.1
NbLab360#1#C11

Could you please help me?
Michal

Answer 8 · 2024-03-01T09:33:08.000Z

Hi @mictadlo ,

which haplotype would you like have as the reference? I assume NbeHz1?
Then I would set `--vcf-spec "NbeHz1". Please tell me if that worked for you.

Some more details from the PGGB Docs:

   [vg]
    -V, --vcf-spec SPEC         specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
                                the paths matching ^REF are used as a reference, while the sample haplotypes
                                are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
                                a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
                                If LEN is specified and greater than 0, the VCFs are decomposed, filtering 
                                sites whose max allele length is greater than LEN. [default: off]

Answer 9 · 2024-03-12T11:24:49.000Z

Thank you, it finished successfully, but I can't see any VG (Variant Graph Output):

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

What did I miss?

Answer 10 · 2024-03-12T12:35:58.000Z

Hi @mictadlo,
it could be that the VG_DECONSTRUCT process is buggy in the v1.0.0 (and v1.1.0). I recently pushed a fix to dev #183, but as it turns out, I still need to update bcftools to v1.19 in that process.
I will do that tomorrow. And directly make a new release. I will let you know, once this is done.

Answer 11 · 2024-03-14T10:33:33.000Z

To get a glimpse of the fix, you can try out the current dev branch. Else a new release is close: #186.

Answer 12 · 2024-03-14T22:15:31.000Z

Hi, Thank you. I'm just running it and will let you know how it went.

Answer 13 · 2024-03-18T08:08:08.000Z

Hi,
I ran the dev version but I can't see any VG results:

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

I have got some tower errors:

Mar-15 13:04:45.953 [Tower-thread] WARN  i.s.tower.plugin.TowerJsonGenerator - Tower request field `tasks.script` exceeds expected size
 | offending value: `
    seqwish \
        --threads 6 \
        --paf-alns=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_173.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_174.paf,ch
nVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_176.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_175.paf,chnVSjapVSkorVSauVSusa.
PanSN-1-19.fasta.gz.chunk_178.paf

Susa.PanSN-1-19.fasta.gz.chunk_90.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_94.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.g
z.chunk_87.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_96.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_99.paf,chnVSjap
VSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_23.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_80.paf \
        --seqs=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
        --gfa=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.seqwish.gfa \
        -k 19 -f 0.0 -B 10000000 -P

    cat <<-END_VERSIONS > versions.yml
    "NFCORE_PANGENOME:PANGENOME:PGGB:SEQWISH":
        seqwish: $(echo $(seqwish --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
    END_VERSIONS
    `, size: 11710 (max: 10240)
Mar-15 13:04:46.083 [Actor Thread 30] INFO  nextflow.container.SingularityCache - Pulling Singularity image https://depot.galaxyproject.org/singularity/smoothxg:0.7.2--h40c17d1_0 [cache /home/lorencm/.nextflow/NXF_SINGULARITY_CACHEDIR/depot.galaxyproject.org-singularity-smoothxg-0.7.2--h40c17d1_0.img]

I am attaching the
nextflow.log.txt

What did I miss?

Answer 14 · 2024-03-18T08:59:11.000Z

That's strange. Can you please share the content of your FASTA index, so I can get a better idea if you set --vcf_spec in the right way.

As far a I can see tower only gives out a warning ,right?

Answer 15 · 2024-03-18T23:59:06.000Z

Where can I find the Fasta index?

Yes, tower seems to have given only warnings. However, when I logged into tower it showed successful but when I clicked on the run it shows that some tasked failed.

Answer 16 · 2024-03-19T09:35:29.000Z

Would it be possible for you to share the tower run somehow?

As documented in https://nf-co.re/pangenome/1.1.1/docs/output#input-check, it should be in the folder samtools_faidx.

Answer 17 · 2024-03-19T13:17:41.000Z

It seems I don't have the folder

results> find . -name "samtools_faidx"

I will ask how to share the tower run with our admins.

Answer 18 · 2024-03-19T13:21:06.000Z

So you created FASTA.gz.fai before running the pipeline?

Answer 19 · 2024-03-19T13:33:42.000Z

Completly, forgottten:

> ls ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz*
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz      
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gzi
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai

Please find chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai.txt

nextflow run nf-core/pangenome \
      -r dev \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      --vcf-spec NbLab360 \
      -profile singularity \
      -resume

Answer 20 · 2024-03-19T13:35:55.000Z

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

Answer 21 · 2024-03-19T13:44:55.000Z

I haven't seen any warnings about underscores.

Answer 22 · 2024-03-19T13:47:42.000Z

I mean something like

WARN: The following invalid input values have been detected:

* --vcf-spec: gi|345525392:5000-18402

Answer 23 · 2024-03-19T13:56:15.000Z

Yes, now I see it:

Mar-19 23:46:58.789 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

Answer 24 · 2024-03-19T14:43:16.000Z

Maybe you can extract the whole log file from tower, so no need to share more.

Answer 25 · 2024-03-20T02:30:27.000Z

I will try. I ran it again with --vcf-spec 'NbLab360' but I still got the following warning:

Mar-20 00:00:31.128 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

Answer 26 · 2024-03-20T09:03:51.000Z

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

Please use a _ and not a -.

--vcf_spec 'NbLab360'

Answer 27 · 2024-03-24T02:27:50.000Z

Hi, Thank you. Now, I got a new error:

 The exit status of the task that caused the workflow execution to fail was: 1

Error executing process > 'NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)` terminated with an error exit status (1)

Command executed:

  ref=$(echo "NbLab360" | cut -f 1 -d:)
  delim=$(echo "NbLab360" | cut -f 2 -d:)
  pop_length=$(echo "NbLab360" | cut -f 3 -d:)
  
  if [[ -z $pop_length ]]; then
      pop_length=0
  fi
  
  vcf="chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa".$(echo $ref | tr '/|' '_').vcf
  vg deconstruct -P $ref -H $delim -e -a -t "6" "chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa" > $vcf
  bcftools stats $vcf > $vcf.stats
  if [[ $pop_length -gt 0 ]]; then
      vcf_decomposed=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa.final.$(echo $ref | tr '/|' '_').decomposed.vcf
      vcf_decomposed_tmp=$vcf_decomposed.tmp.vcf
      bgzip -c -@ 6 $vcf > $vcf.gz
      vcfbub -l 0 -a $pop_length --input $vcf.gz | vcfwave -I 1000 -t 6 > $vcf_decomposed_tmp
      #TODO: to remove when vcfwave will be bug-free
      # The TYPE info sometimes is wrong/missing
      # There are variants without the ALT allele
      bcftools annotate -x INFO/TYPE $vcf_decomposed_tmp  | awk '$5 != "."' > $vcf_decomposed
      rm $vcf_decomposed_tmp $vcf.gz
      bcftools stats $vcf_decomposed > $vcf_decomposed.stats
  fi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT":
      pggb: $(pggb --version 2>&1 | grep -o 'pggb .*' | cut -f2 -d ' ')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Warning [vg deconstruct]: -H is deprecated, and will be ignored
  .command.sh: line 13: NbLab360: unbound variable

Work dir:
  /mnt/hpccs01/scratch/waterhouse_team/pangenome/nf-dev02/work/09/6d70b3cf40e26972132b3d245452e0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Answer 28 · 2024-03-27T14:53:47.000Z

Hi @mictadlo I think this is because you use an older version of the pipeline which was buggy with respect to the variant calling. I would recommend to use the latest one 1.1.2. Sorry for this.

Answer 29 · 2024-03-28T02:03:26.000Z

Hi @subwaystation, I got this error
Cannot find revision `1.1.2` -- Make sure that it exists in the remote repository `https://github.com/nf-core/pangenome

Answer 30 · 2024-03-28T08:47:20.000Z

I don't know why, but this also always happens to me when I run a new revision of a pipeline. Please try again. Nextflow is confusing things.

Answer 31 · 2024-04-10T23:54:33.000Z

Hi, Thank you for all your help. It worked successfully.

Answer 32 · 2024-04-11T08:42:12.000Z

Fantastic news, thanks for letting me know @mictadlo .