pgge

the pangenome graph evaluator

This pangenome graph evaluation pipeline measures the reconstruction accuracy of a pangenome graph (in the variation graph model). Its goal is to give guidance in finding the best pangenome graph construction tool for a given input data and task.

It has six phases:

  1. SplitSamples: (sample preparation) -- SHORT DESCRIPTION. TODO.

  2. splitfa: (split sequences) -- SHORT DESCRIPTION. TODO.

  3. SubSampling: (pick a subset of random sequences) -- SHORT DESCRIPTION. TODO.

  4. GraphAligner: (alignment) -- SHORT DESCRIPTION. TODO.

  5. peanut: (alignment evaluation) -- SHORT DESCRIPTION. TODO.

  6. beehave.R: (plot evaluation results) -- SHORT DESCRIPTION. TODO.

general usage

Clone this repository:

git clone --recursive https://github.com/pangenome/pgge
cd pgge

Create a pangenome graph and its consensus graphs using pggb, storing the results in the pggb_yeast directory.

⚠️ This step assumes you have correctly installed pggb:

pggb -i data/yeast/cerevisiae.pan.fa.gz -t 16 -s 50000 -p 90 -n 5 -Y "#" -k 8 -B 10000000 -w 30000 -I 0.7 -o pggb_yeast -W

Evaluate the consensus graphs stored in the pggb_yeast directory:

./pgge -g "pggb_yeast/*consensus*.gfa" -f data/yeast/cerevisiae.pan.fa.gz -t 16 -r scripts/beehave.R -l 100000 -s 50000 -o pgge_yeast

Make sure to enclose the -g pattern in double quotes (") as shown; otherwise the shell expands the wildcard before pgge receives it and the pattern can't be resolved. For a single input GFA, the quotes are not required.
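The effect of the quotes can be seen in a small, self-contained sketch. The helper count_args and the scratch file names are hypothetical and only illustrate how the shell treats the pattern; they are not part of pgge.

```shell
# count_args prints how many arguments it received (hypothetical helper).
count_args() {
    echo "$#"
}

# Create a scratch directory with two empty files that match the pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.consensus@10.gfa" "$demo_dir/b.consensus@100.gfa"

# Unquoted wildcard: the shell expands it into one argument per matching file.
unquoted=$(count_args "$demo_dir"/*consensus*.gfa)

# Quoted wildcard: the pattern is passed through unchanged as a single argument.
quoted=$(count_args "$demo_dir/*consensus*.gfa")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -rf "$demo_dir"
```

With the quotes in place, -g receives the pattern itself rather than a list of already expanded file names.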

Optionally, you can set -b to write the unmapped regions to a BED file.

To reduce alignment time, you can enable random subsampling with either -p/--subsample-percentage or -u/--subsample-number.

⚠️ pgge summarizes results by sample name. If you have

S288C.chrI
S288C.chrII
S288C.chrIII

in your FASTA file, the results will contain only one line of metrics, in this case for S288C. This is useful if your FASTA contains contig sequences and you want to summarize by sample name. pgge always splits the sequence name at . and takes the first field of the resulting split as the sample name.
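The naming rule can be previewed on your own FASTA before running the pipeline. The following is a minimal sketch of the rule as described above, not pgge's actual implementation; the function names sample_name and list_samples are hypothetical.

```shell
# sample_name: print everything before the first "." of a sequence name.
# This mirrors the rule described above; it is not pgge's own code.
sample_name() {
    printf '%s\n' "$1" | cut -d. -f1
}

# list_samples: read FASTA text on stdin and print the unique sample names.
# Only header lines (starting with ">") are considered; any description
# after the first space is ignored.
list_samples() {
    grep '^>' | sed 's/^>//' | cut -d' ' -f1 | cut -d. -f1 | sort -u
}

# Small inline example with two samples spread over three sequences.
printf '>S288C.chrI\nACGT\n>S288C.chrII\nACGT\n>SK1.chrI\nACGT\n' | list_samples
```

For a compressed input such as data/yeast/cerevisiae.pan.fa.gz, the same function could be fed with zcat, for example zcat data/yeast/cerevisiae.pan.fa.gz | list_samples.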

⚠️ pgge was designed for processing the results of pggb. If you are evaluating your own data not originating from pggb, it is recommended to set the -n/--input-graph-names parameter to ensure the final PNG is labeled correctly. This parameter requires a TSV with 2 columns:

  1. The name of the original input graph.
  2. The name to display in the PNG.

The following is an example for the yeast data set:

cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10000::y:0:1000000.gfa	10k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@1000::y:0:1000000.gfa	1k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@100::y:0:1000000.gfa	0.1k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10::y:0:1000000.gfa	0.01k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10000:y.gfa	10k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@1000:y.gfa	1k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@100:y.gfa	0.1k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10:y.gfa	0.01k:y
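Such a file can be written by hand or generated. The sketch below is one way to generate a starting point, assuming all graphs live in one directory. The function name make_names_tsv and the labeling rule (the text after "consensus@", without ".gfa") are assumptions made for illustration; the labels it produces are not abbreviated as in the example above, and any short label of your choice works as well.

```shell
# make_names_tsv DIR: print "<file name><TAB><label>" for every GFA in DIR.
# The label rule is an assumption chosen for illustration only.
make_names_tsv() {
    for gfa in "$1"/*.gfa; do
        [ -e "$gfa" ] || continue        # skip if the pattern matched nothing
        name=$(basename "$gfa")
        label=${name##*consensus@}       # drop everything up to "consensus@"
        label=${label%.gfa}              # drop the file extension
        printf '%s\t%s\n' "$name" "$label"
    done
}
```

A possible use would be make_names_tsv pggb_yeast > graph_names.tsv, followed by passing graph_names.tsv to -n. Check the generated labels before use, since file names without "consensus@" keep their full name (minus the extension) as the label.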

output

The output is written to pgge_yeast/pgge-l100000-s50000.tsv in a tab-delimited format:

cat pgge_yeast/pgge-l100000-s50000.tsv | column -t
sample.name  cons.jump  aln.id    qsc                 uniq                multi                    nonaln
DBVPG6044    10000:y    0.994253  0.9882487336244542  0.9878719650655022  0.00037676855895196504   0.011751266375545851
DBVPG6044    1000:y     0.99429   0.9905872052401746  0.9902261572052402  0.00036104803493449783   0.009412794759825328
DBVPG6044    100:y      0.994346  0.9920169432314411  0.9917783406113537  0.00023860262008733625   0.007983056768558951
DBVPG6044    10:y       0.994804  0.9931444978165939  0.9930238427947599  0.00012065502183406113   0.0068555021834061135
DBVPG6765    10000:y    0.992895  0.984453537117904   0.9841169868995633  0.00033655021834061135   0.01554646288209607
DBVPG6765    1000:y     0.992816  0.9851402620087336  0.9847942358078603  0.00034602620087336247   0.014859737991266376
DBVPG6765    100:y      0.992857  0.9850624454148471  0.9848960262008734  0.00016641921397379914   0.014937554585152838
DBVPG6765    10:y       0.993555  0.9918473362445415  0.9916482969432314  0.00019903930131004368   0.008152663755458514
S288C        10000:y    0.993815  0.9840108085106383  0.9836786808510638  0.0003321276595744681    0.015989191489361704
S288C        1000:y     0.993819  0.9856704255319149  0.9854043829787235  0.0002660425531914894    0.014329574468085107
S288C        100:y      0.994008  0.9880367234042553  0.9878560425531915  0.0001806808510638298    0.011963276595744681
S288C        10:y       0.994503  0.9923237872340426  0.9922378723404255  0.00008591489361702127   0.007676212765957447
SK1          10000:y    0.994393  0.98889             0.9882832467532467  0.0006067532467532468    0.01111
SK1          1000:y     0.994355  0.9909501731601732  0.9903794805194805  0.0005706926406926407    0.00904982683982684
SK1          100:y      0.994531  0.9920734632034632  0.9916370995670996  0.00043636363636363637   0.007926536796536796
SK1          10:y       0.99508   0.9932187878787879  0.9930288311688311  0.00018995670995670995   0.006781212121212121
UWOPS034614  10000:y    0.993074  0.9854122807017544  0.9849935964912281  0.0004186842105263158    0.014587719298245615
UWOPS034614  1000:y     0.993131  0.9833553070175438  0.9829300438596491  0.00042526315789473683   0.01664469298245614
UWOPS034614  100:y      0.99331   0.9884982894736842  0.9883386842105263  0.00015960526315789473   0.01150171052631579
UWOPS034614  10:y       0.993955  0.9915775           0.991506403508772   0.00007109649122807018   0.0084225
Y12          10000:y    0.994867  0.9878221834061135  0.9873637554585153  0.00045842794759825325   0.012177816593886464
Y12          1000:y     0.994863  0.9892637554585153  0.9888601746724891  0.0004035807860262009    0.010736244541484715
Y12          100:y      0.994997  0.9919159388646288  0.9917003056768559  0.00021563318777292578   0.00808406113537118
Y12          10:y       0.995301  0.9941565065502184  0.9939880786026201  0.00016842794759825327   0.00584349344978166
YPS128       10000:y    0.995545  0.9904155895196507  0.9902230567685589  0.00019253275109170305   0.009584410480349345
YPS128       1000:y     0.99559   0.9909312663755458  0.9907755458515284  0.00015572052401746726   0.009068733624454149
YPS128       100:y      0.995676  0.993891615720524   0.9938559825327511  0.000035633187772925764  0.006108384279475983
YPS128       10:y       0.99591   0.995602576419214   0.995569519650655   0.000033056768558951964  0.004397423580786026

The first numeric column, aln.id, is derived from the alignment identity field in the GAF output of GraphAligner. All other metrics are described in the metrics section of peanut.
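Because the result file is plain tab-delimited text, it can be post-processed with standard tools. The sketch below shows one possible summary, the mean qsc per consensus graph across all samples. The function name mean_qsc_by_graph is hypothetical, and the column positions are taken from the header shown above.

```shell
# mean_qsc_by_graph FILE: average the qsc column (4th) for each value of
# the cons.jump column (2nd), skipping the header line.
mean_qsc_by_graph() {
    awk -F'\t' 'NR > 1 { sum[$2] += $4; n[$2]++ }
        END { for (g in sum) printf "%s\t%.6f\n", g, sum[g] / n[g] }' "$1" | sort
}
```

For the yeast example this would be called as mean_qsc_by_graph pgge_yeast/pgge-l100000-s50000.tsv. A plain mean is only one of several reasonable ways to compare graphs; per-sample minima or the nonaln column may be more informative for some tasks.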

pgge also generates a visualization of the results in pgge_yeast/pgge-l100000-s50000.tsv.png.

installation

required tools

  1. bash

  2. samtools

  3. splitfa

  4. GraphAligner

  5. peanut

  6. R with packages tidyverse, ggrepel, gridExtra installed.
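Before the first run, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper, and the executable names passed to it (in particular Rscript for R) are assumptions that may differ on your system. It does not check the R packages.

```shell
# check_tools: report every listed command that is not found on PATH.
# Returns 0 if all commands are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to your installation.
check_tools bash samtools splitfa GraphAligner peanut Rscript \
    || echo "install the missing tools before running pgge" >&2
```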

docker

To simplify installation and versioning, we have an automated GitHub action that pushes the current docker build to the GitHub registry. To use it, first pull the current image:

docker pull ghcr.io/pangenome/pgge:latest

Or if you want to pull a specific snapshot from https://github.com/orgs/pangenome/packages/container/package/pgge:

docker pull ghcr.io/pangenome/pgge:TAG

Clone the repository and change into the pgge directory:

git clone --recursive https://github.com/pangenome/pgge.git
cd pgge

You can then run the container using the example DRB1-3123 provided in this repo:

docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pgge:latest "pgge -g '/data/HLA/DRB1-3123/*.consensus*.gfa' -f /data/HLA/DRB1-3123/DRB1-3123.fa -r /scripts/beehave.R -t 16 -o /data/HLA/DRB1-3123/pgge_docker -l 1000 -s 1000 -p 100"

⚠️ In contrast to running pgge directly on the command line, the whole pgge command is passed to the docker container as one double-quoted string, so the pattern given to -g must be enclosed in ' instead of " to ensure that it is parsed properly.

The -v argument of docker run expects an absolute path when you pass a host directory. Using ${PWD} takes care of this.

If you want to experiment, you can build a docker image locally using the Dockerfile:

docker build -t ${USER}/pgge:latest .

Staying in the pgge directory, we can run pgge with the locally built image:

docker run -it -v ${PWD}/data/:/data ${USER}/pgge "pgge -g '/data/HLA/DRB1-3123/*.consensus*.gfa' -f /data/HLA/DRB1-3123/DRB1-3123.fa -r /scripts/beehave.R -t 16 -o /data/HLA/DRB1-3123/pgge_docker -l 1000 -s 1000 -p 100"

TODOs

  • pgge should accept a list of GFA files as input (path/to/files/*.consensus*.gfa) and output the summarized results in one PNG
  • Integrate https://github.com/ekg/splitfa as an option to prepare the input FASTA.
  • Add the possibility to split the input by sample name. Later re-use that information in the final result. THIS IS THE NEW DEFAULT.
  • Add R script to visualize the result.
  • Explain aln.id.
  • Add option to directly start from GAF file.
  • Add output-folder option.
  • Add possibility to input several GAF files. Make sure the user can input a list of samples for the GAFs.
  • The user should be able to select options for GraphAligner.
  • Add a toolchain that compares the query alignments with the exact nodes they aligned to in the graph.
  • Add Dockerfile.
  • Add a CI building the Dockerfile and emitting evaluation metrics for all tools using HLA-Zoo data.
  • Add usage examples for minigraph, cactus, and SibeliaZ.
  • Integrate into nf-core/pangenome pipeline. HALFWAY THERE.

authors

Simon Heumos, Andrea Guarracino, Erik Garrison, Christian Fischer