nf-core/pangenome

Adding input files for Panache

mictadlo opened this issue · 3 comments

Description of feature

Hi,
I found Panache, a web-based interface designed for the visualization of linearized pangenomes. It can be used to show presence/absence information of pangenomic blocks of sequence or genes in a browser-like display. This documentation shows how to create the input files for Panache.

Thank you for your consideration.

Michal

Hi @mictadlo,

I am aware of Panache, but it does not seem straightforward to get the files right. See SouthGreenPlatform/panache#32. It depends on the input data.
If @SingingMeerkat or @brettChapman can share all the steps necessary from a pggb graph to the actual visualization, I would have a starting point, though.

@mictadlo There is also https://github.com/chfi/waragraph. You can directly plug in the 2D TSV layout from nf-core/pangenome and interactively explore the graph, including a 1D viz!

Hi @mictadlo and @subwaystation,

Thanks for your interest in Panache! I agree that for now the bridge between pggb and Panache is difficult to cross, especially as Panache was built to be usable for pangenome graphs and pan-gene atlases alike, and assumes nothing about how the input blocks are obtained.

Unfortunately I cannot dedicate as much time to Panache as I would like anymore, but I would be happy to help make it more accessible. I may have more time in 2 weeks. In the meantime, I opened a dedicated issue at SouthGreenPlatform/panache#38 to keep it in mind.

Hi @mictadlo

I'd be happy to share my steps for creating a PAV matrix here, generated from the PGGB graphs:

reference_prefix="REFERENCE_ID"  # placeholder: set this to your reference ID name
odgi paths -i pangenome_chr1.og -f | grep -A 1 "${reference_prefix}" > reference.fa
samtools faidx reference.fa
cut -f 1,2 reference.fa.fai > genome.txt
bedtools makewindows -g genome.txt -w 1000 > pangenome_chr1.w1000bp.bed
odgi pav -i pangenome_chr1.og -b pangenome_chr1.w1000bp.bed -M -B 0.5 > pangenome_chr1.pavs.txt
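
To make the windowing step concrete, here is a minimal Python sketch of what `bedtools makewindows -g genome.txt -w 1000` produces: consecutive, 0-based half-open windows per sequence, with the last window clipped at the sequence end. The function name `make_windows` and the example path name and length are hypothetical; this is an illustration, not a replacement for bedtools.

```python
def make_windows(genome_sizes, window=1000):
    """Yield (chrom, start, end) BED windows for each sequence.

    genome_sizes: iterable of (name, length) pairs, i.e. the two columns
    of genome.txt cut from the .fai index.
    """
    for chrom, length in genome_sizes:
        for start in range(0, length, window):
            # The last window is clipped to the sequence length.
            yield chrom, start, min(start + window, length)


# Hypothetical reference path name and length, for illustration only.
windows = list(make_windows([("ref#1#chr1", 2500)]))
for chrom, start, end in windows:
    print(chrom, start, end, sep="\t")
```

Each of these windows becomes one row of the PAV matrix written by odgi pav.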

You then correct any header names, remove the reference column from the PAV file (odgi pav reports PAV for every path in the graph), add the additional columns described in the Panache Wiki (https://github.com/SouthGreenPlatform/panache/wiki/Files-&-formats), and merge the PAV matrices across all chromosomes. I usually use pandas to merge the dataframes, since some matrices have their columns in different positions, and write the result to a BED file called pav.bed. Then you want to intersect it with the gene coordinates so that only PAV windows overlapping genes are shown. This reduces the file size, as a large PAV matrix can hurt Panache's performance.
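
The merge step can be sketched as follows. The workflow above uses pandas; this sketch swaps in Python's standard `csv` module to show the same idea of aligning genome columns by name rather than by position. It assumes each matrix has a header row starting with `chrom`, `start`, `end`, `name`. The function name `merge_pav`, the `NA` fill value, and the example genome names are hypothetical, and the extra Panache columns from the Wiki are not added here.

```python
import csv
import io

# Assumed leading columns of each PAV matrix; adjust to your actual header.
FIXED = ["chrom", "start", "end", "name"]


def merge_pav(handles, reference):
    """Merge tab-separated PAV tables whose genome columns differ in order.

    handles: open text handles, one per chromosome, each with a header row.
    reference: name of the reference column to drop.
    Returns (header, rows) sorted by chromosome and start.
    """
    genomes = []  # genome columns in first-seen order
    records = []
    for handle in handles:
        reader = csv.DictReader(handle, delimiter="\t")
        for col in reader.fieldnames or []:
            if col not in FIXED and col != reference and col not in genomes:
                genomes.append(col)
        records.extend(reader)
    header = FIXED + genomes
    # A genome missing from one table is filled with "NA" (an assumption).
    rows = [[rec.get(col, "NA") for col in header] for rec in records]
    rows.sort(key=lambda r: (r[0], int(r[1])))
    return header, rows


# Two toy matrices with the same genomes in different column positions.
chr1 = io.StringIO("chrom\tstart\tend\tname\tref\tgenomeA\tgenomeB\n"
                   "chr1\t0\t1000\t.\t1\t1\t0\n")
chr2 = io.StringIO("chrom\tstart\tend\tname\tgenomeB\tref\tgenomeA\n"
                   "chr2\t0\t1000\t.\t1\t1\t0\n")
header, rows = merge_pav([chr1, chr2], reference="ref")
for line in [header] + rows:
    print("\t".join(line))
```

With pandas, concatenating the per-chromosome dataframes aligns columns by name in the same way.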

bedtools intersect -wa -a pav.bed -b genome.gff | sort | uniq | sort -k1,1 -k2,2n > overlaps.bed
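
As a rough illustration of what this filter keeps, the Python sketch below retains each PAV window once if it overlaps at least one GFF feature. Without `-u`, `bedtools intersect -wa` reports a window once per overlapping feature, which is why the command above pipes through `sort | uniq`. The function name `overlapping_windows` and the toy coordinates are hypothetical; the sketch only tracks coordinates, not the PAV columns.

```python
def overlapping_windows(windows, features):
    """Return the windows that overlap at least one feature, deduplicated.

    windows: (chrom, start, end) tuples in BED coordinates (0-based, half-open).
    features: (chrom, start, end) tuples in GFF coordinates (1-based, inclusive).
    """
    by_chrom = {}
    for chrom, start, end in features:
        # Convert GFF coordinates to 0-based half-open before comparing.
        by_chrom.setdefault(chrom, []).append((start - 1, end))
    kept = set()
    for chrom, w_start, w_end in windows:
        for f_start, f_end in by_chrom.get(chrom, []):
            if f_start < w_end and w_start < f_end:
                kept.add((chrom, w_start, w_end))
                break
    # Same ordering as `sort -k1,1 -k2,2n`: chromosome, then numeric start.
    return sorted(kept, key=lambda w: (w[0], w[1]))


windows = [("chr1", 0, 1000), ("chr1", 1000, 2000), ("chr1", 2000, 3000)]
# A gene and one of its exons: both hit the same window, which is kept once.
features = [("chr1", 1500, 1800), ("chr1", 1500, 1600)]
kept = overlapping_windows(windows, features)
print(kept)
```

If genome.gff holds more than gene records, every feature type counts as an overlap here, as it does in the bedtools command.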

The resulting BED file and the GFF file can then be converted to JSON format using the Panache conversion script.

You'll also want to generate a Newick file of all genomes except the reference genome used, which is added to allow sorting by phylogeny. I use mashtree for this.