nf-core/pangenome

Modules required for the nf-core compliant DSL2 implementation of the pipeline

subwaystation opened this issue · 12 comments

FINAL TASK:

OPTIONAL:

multiqc is already present at version multiqc:1.11--pyhdfd78af_0

seqwish/induce is already present at version seqwish:0.7.1--h2e03b76_0

samtools/faidx is already present at version samtools:1.13--h8c37831_0

Cool!
I updated the list at the top to better reflect which subcommands we need from each tool. More than expected xD

seqwish needs to be updated to v0.7.2. Also the folder structure can go from seqwish/induce to just seqwish. I see no reason to do the first one. And it needs appropriate test data. https://github.com/nf-core/test-datasets/tree/modules/data/genomics/homo_sapiens/genome is not sufficient. I will ask for an additional pangenome folder, so we can put all test data relevant for this pipeline there.
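To make the version bump concrete, here is a minimal sketch that splits the container tags listed above into tool, version, and build string, then checks whether a tag is older than a wanted version. The helper names `parse_tag` and `needs_update` are hypothetical, and the sketch assumes versions are purely numeric and dot-separated, which holds for the three tags in this thread but not for every Bioconda package.

```python
from typing import NamedTuple


class ContainerTag(NamedTuple):
    tool: str
    version: tuple
    build: str


def parse_tag(tag: str) -> ContainerTag:
    """Split 'tool:version--build' into its three parts.

    Assumption: the version consists only of dot-separated integers.
    """
    tool, _, rest = tag.partition(":")
    version_text, _, build = rest.partition("--")
    version = tuple(int(part) for part in version_text.split("."))
    return ContainerTag(tool, version, build)


def needs_update(tag: str, wanted: str) -> bool:
    """Return True when the tagged version is older than the wanted one."""
    wanted_version = tuple(int(part) for part in wanted.lstrip("v").split("."))
    return parse_tag(tag).version < wanted_version


if __name__ == "__main__":
    # Tags copied from the list at the top of this issue.
    present = [
        "multiqc:1.11--pyhdfd78af_0",
        "seqwish:0.7.1--h2e03b76_0",
        "samtools:1.13--h8c37831_0",
    ]
    for tag in present:
        print(parse_tag(tag))
    print(needs_update("seqwish:0.7.1--h2e03b76_0", "v0.7.2"))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "1.9" sorting after "1.11".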

Also the folder structure can go from seqwish/induce to just seqwish. I see no reason to do the first one.

That was the pattern to use at the time: create a tool directory (e.g. induce) even if the software has only one function. This may no longer be the recommendation.

https://github.com/nf-core/test-datasets/tree/modules/data/genomics/homo_sapiens/genome is not sufficient. I will ask for an additional pangenome folder, so we can put all test data relevant for this pipeline there

In nf-core/modules the test data are only useful for smoke testing the modules (i.e. making sure they run with the correct inputs and outputs and don't explode). There are GFA files at https://github.com/nf-core/test-datasets/tree/modules/data/genomics/sarscov2/illumina/gfa to use. What else might we need?

Test data for the workflow itself are in this branch https://github.com/nf-core/test-datasets/tree/pangenome

There are still quite a few single tool directories present in https://github.com/nf-core/modules/tree/master/modules, e.g.

bamtools/split
bamutil/trimbam
bandage/image
checkm/lineagewf
cmseq/polymut
cnvkit/batch

I'll ask on Slack what the current recommendation is.
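As a rough way to find such cases in bulk, the sketch below groups module paths by tool and reports the tools that have exactly one subtool directory. The function name `single_subtool_modules` is hypothetical, and the `exampletool/...` entries are made up purely to show a tool with more than one subtool. The other paths are the ones listed above.

```python
from collections import defaultdict


def single_subtool_modules(paths):
    """Return tools that have exactly one subtool directory, sorted by name."""
    subtools = defaultdict(set)
    for path in paths:
        tool, _, subtool = path.partition("/")
        if subtool:  # skip flat modules such as a bare "seqwish"
            subtools[tool].add(subtool)
    return sorted(tool for tool, subs in subtools.items() if len(subs) == 1)


if __name__ == "__main__":
    module_paths = [
        "bamtools/split",
        "bamutil/trimbam",
        "bandage/image",
        "checkm/lineagewf",
        "cmseq/polymut",
        "cnvkit/batch",
        "seqwish/induce",
        # Hypothetical entries, only to illustrate a tool with two subtools.
        "exampletool/one",
        "exampletool/two",
    ]
    print(single_subtool_modules(module_paths))
```

A tool showing up in this list only means one subtool is packaged as a module so far, not that the software itself has a single function.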

vg/deconstruct is no longer a valid vg command, at least as of the most recent Bioconda version:

$ docker run -it quay.io/biocontainers/vg:1.36.0--h9ee0642_0 /bin/bash
root@3a42cf5ce1c3:/# vg help
vg: variation graph tool, version v1.36.0 "Cibottola"

usage: vg <command> [options]

main mapping and calling pipeline:
  -- autoindex     mapping tool-oriented index construction from interchange formats
  -- construct     graph construction
  -- rna           construct splicing graphs and pantranscriptomes
  -- index         index graphs or alignments for random access or mapping
  -- map           MEM-based read alignment
  -- giraffe       fast haplotype-aware short read alignment
  -- mpmap         splice-aware multipath alignment of short reads
  -- augment       augment a graph from an alignment
  -- pack          convert alignments to a compact coverage index
  -- call          call or genotype VCF variants
  -- help          show all subcommands

For more commands, type `vg help`.
For technical support, please visit: https://www.biostars.org/t/vg/

It is! vg is just hiding lots of commands. Just type vg help.

vg: variation graph tool, version v1.36.0 "Cibottola"

usage: vg <command> [options]

main mapping and calling pipeline:
  -- autoindex     mapping tool-oriented index construction from interchange formats
  -- construct     graph construction
  -- rna           construct splicing graphs and pantranscriptomes
  -- index         index graphs or alignments for random access or mapping
  -- map           MEM-based read alignment
  -- giraffe       fast haplotype-aware short read alignment
  -- mpmap         splice-aware multipath alignment of short reads
  -- augment       augment a graph from an alignment
  -- pack          convert alignments to a compact coverage index
  -- call          call or genotype VCF variants
  -- help          show all subcommands

useful graph tools:
  -- deconstruct   create a VCF from variation in the graph
  -- gbwt          build and manipulate GBWTs
  -- ids           manipulate node ids
  -- minimizer     build a minimizer index or a syncmer index
  -- mod           filter, transform, and edit the graph
  -- prune         prune the graph for GCSA2 indexing
  -- sim           simulate reads from a graph
  -- snarls        compute snarls and their traversals
  -- stats         metrics describing graph and alignment properties
  -- view          format conversions for graphs and alignments

specialized graph tools:
  -- align         local alignment
  -- annotate      annotate alignments with graphs and graphs with alignments
  -- chunk         split graph or alignment into chunks
  -- circularize   circularize a path within a graph
  -- clip          remove BED regions (other other nodes from their snarls) from a graph
  -- combine       merge multiple graph files together
  -- convert       convert graphs between handle-graph compliant formats as well as GFA
  -- depth         estimate sequencing depth
  -- dotplot       generate the dotplot matrix from the embedded paths in an xg index
  -- filter        filter reads
  -- gamcompare    compare alignment positions
  -- gampcompare   compare multipath alignment positions
  -- gamsort       Sort a GAM file or index a sorted GAM file.
  -- genotype      Genotype (or type) graphs, GAMS, and VCFs.
  -- inject        lift over alignments for the graph
  -- paths         traverse paths in the graph
  -- simplify      graph simplification
  -- surject       map alignments onto specific paths
  -- trace         trace haplotypes
  -- vectorize     transform alignments to simple ML-compatible vectors
  -- viz           render visualizations of indexed graphs and read sets

developer commands:
  -- benchmark     run and report on performance benchmarks
  -- cluster       find and cluster mapping seeds
  -- find          use an index to find nodes, edges, kmers, paths, or positions
  -- mcmc          Finds haplotypes based on reads using MCMC methods
  -- test          run unit tests
  -- validate      validate the semantics of a graph or gam
  -- version       version information

For technical support, please visit: https://www.biostars.org/t/vg/
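Since the short listing hides most subcommands, a module author may want to check the full help text programmatically instead of reading it by eye. The sketch below parses text in the `  -- name   description` layout shown above and reports whether a subcommand is listed. The names `parse_subcommands` and `has_subcommand` are hypothetical, the embedded excerpt is a shortened copy of the output above, and capturing the text from the container is left out.

```python
import re

# Shortened excerpt of the `vg help` output quoted above (v1.36.0).
HELP_EXCERPT = """\
main mapping and calling pipeline:
  -- construct     graph construction
  -- call          call or genotype VCF variants

useful graph tools:
  -- deconstruct   create a VCF from variation in the graph
  -- view          format conversions for graphs and alignments
"""

# Matches lines such as "  -- deconstruct   create a VCF from ...".
COMMAND_LINE = re.compile(r"^\s*--\s+(\S+)\s+(.*)$")


def parse_subcommands(help_text):
    """Map each section heading to the subcommand names listed under it."""
    sections = {}
    current = None
    for line in help_text.splitlines():
        match = COMMAND_LINE.match(line)
        if match and current is not None:
            sections[current].append(match.group(1))
        elif line.rstrip().endswith(":") and not line.startswith(" "):
            # An unindented line ending in a colon starts a new section.
            current = line.rstrip()[:-1]
            sections[current] = []
    return sections


def has_subcommand(help_text, name):
    """Return True if any section of the help text lists the subcommand."""
    return any(name in names for names in parse_subcommands(help_text).values())


if __name__ == "__main__":
    print(parse_subcommands(HELP_EXCERPT))
    print(has_subcommand(HELP_EXCERPT, "deconstruct"))
```

Keeping the section headings in the result makes it easy to see that deconstruct sits under "useful graph tools", which is why it was missing from the short listing.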

I think the discussion about test data deserves its own issue. Let's continue at #74.

vg is just hiding lots of commands.

Ah got it, thanks!

For now, we don't need the optional module. Happy Easter!