Michael Gruenstaeudl, Jul-18-2021
This workshop was held as part of the conference Botany 2021 Virtual!.
- Abstract
- Introduction
- Survey of IR annotations of archived plastid genomes
  - Installation of airpg
  - Application of airpg
- Visualization of sequencing depth and evenness of complete plastid genomes
  - Installation of PACVr
  - Mapping of sequence reads
  - Application of PACVr
This workshop illustrates computational methods for assessing the quality of complete plastid genome sequences. Specifically, it demonstrates two software tools that (i) evaluate inverted repeat annotations and (ii) evaluate and visualize the sequencing depth and evenness of genome records. The workshop is intended for researchers with prior experience in the assembly and annotation of plastid genomes and assumes that participants have already assembled and annotated at least one plastid genome themselves. Participants will be guided through the application of both tools step by step. All instructions have been designed for the UNIX command line (e.g., bash) and should be executed on a UNIX-compatible operating system (macOS or Linux).
The inverted repeats (IRs) are characteristic features of the great majority of land plant plastid genomes. High-quality plastid genome records should contain sequence annotations for these features. The software airpg has been designed to automatically evaluate the presence of complete and correct IR sequence annotations in complete plastid genome records archived on GenBank.
Software needed: Python 3, airpg
$ pip install airpg
- Objective: What proportion of all complete plastid genomes of all bryophyte lineages (i.e., liverworts, hornworts, and mosses) submitted to NCBI Nucleotide since the beginning of 2000 lacks complete IR annotations?
- Time needed: ca. 8 min.
Identify the plastid genomes:
$ airpg_identify.py -q "complete genome[TITLE] AND \
(chloroplast[TITLE] OR plastid[TITLE]) AND \
2000/01/01:2021/05/31[PDAT] NOT partial[TITLE] \
AND (Marchantiophyta[ORGN] OR Bryophyta[ORGN] \
OR Anthocerotophyta[ORGN])" \
-o airpg_SimpleExample_output1.tsv
Analyze their IR annotations:
$ airpg_analyze.py -i airpg_SimpleExample_output1.tsv \
-m john.smith@example.com -o airpg_SimpleExample_output2.tsv
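Once the second table exists, the question posed in the objective can be answered with a short tally. The following is a minimal sketch on a toy table, not part of airpg itself: the column names `IRa_REPORTED` and `IRb_REPORTED`, the `yes`/`no` values, and the helper `ir_summary` are assumptions made for illustration and should be checked against the header of the real output file before use.

```shell
# Toy stand-in for the table written by airpg_analyze.py.
# ASSUMPTION: column names and yes/no values are hypothetical; check the real header.
TOY_IR=$(mktemp)
printf 'ACCESSION\tIRa_REPORTED\tIRb_REPORTED\n' >  "$TOY_IR"
printf 'TOY_0001\tyes\tyes\n'                   >> "$TOY_IR"
printf 'TOY_0002\tno\tno\n'                     >> "$TOY_IR"
printf 'TOY_0003\tyes\tno\n'                    >> "$TOY_IR"
printf 'TOY_0004\tyes\tyes\n'                   >> "$TOY_IR"

# Report how many records lack at least one of the two IR annotations.
ir_summary() {
  awk -F'\t' '
    NR == 1 {                         # locate both columns by name
      for (i = 1; i <= NF; i++) {
        if ($i == "IRa_REPORTED") a = i
        if ($i == "IRb_REPORTED") b = i
      }
      if (!a || !b) { print "required columns not found" > "/dev/stderr"; bad = 1; exit 1 }
      next
    }
    { total++; if ($a != "yes" || $b != "yes") missing++ }
    END {
      if (bad) exit 1
      if (total == 0) { print "no records"; exit }
      printf "%d of %d records (%.1f%%) lack complete IR annotations\n", missing, total, 100 * missing / total
    }
  ' "$1"
}

ir_summary "$TOY_IR"
```

Looking the columns up by name rather than by position keeps the tally working if the column order differs from what is assumed here.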
Visualize the accumulation of plastid genomes of all bryophyte lineages with and without complete IR annotations over time:
# Get number of genome records
$ NL=$(wc -l airpg_SimpleExample_output1.tsv | awk '{print $1}')
$ echo "$NL-1" | bc
# 76
# Get submission dates of oldest and newest genome record
$ awk -F'\t' '{print $6}' airpg_SimpleExample_output1.tsv | \
grep "^2" | sort -n | awk 'NR==1; END{print}'
# 2003-02-04
# 2021-04-24
# Adjust script airpg_SimpleExample_visualization.R manually and then run
$ Rscript /path_to_git_folder/extras/airpg_SimpleExample_visualization.R
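The record count and the date range retrieved above can also be obtained in a single pass over the table. The sketch below runs on a toy file: only the position of the submission date (column 6) is taken from the commands above, while the header names, the values, and the helper `date_summary` are hypothetical.

```shell
# Toy stand-in for airpg_SimpleExample_output1.tsv: header plus three records.
# ASSUMPTION: only the date position (column 6) follows the workshop commands;
# header names and values are hypothetical.
TOY_DATES=$(mktemp)
printf 'c1\tc2\tc3\tc4\tc5\tDATE\n'   >  "$TOY_DATES"
printf 'a\tb\tc\td\te\t2010-01-15\n'  >> "$TOY_DATES"
printf 'a\tb\tc\td\te\t2004-07-09\n'  >> "$TOY_DATES"
printf 'a\tb\tc\td\te\t2019-11-02\n'  >> "$TOY_DATES"

# Count records and find the oldest and newest date in one pass.
date_summary() {
  awk -F'\t' '
    NR == 1 { next }                          # skip the header line
    $6 ~ /^[0-9][0-9][0-9][0-9]-/ {
      d = $6 ""                               # force string comparison
      n++
      if (min == "" || d < min) min = d
      if (max == "" || d > max) max = d
    }
    END { printf "records=%d oldest=%s newest=%s\n", n, min, max }
  ' "$1"
}

date_summary "$TOY_DATES"
```

Dates in the YYYY-MM-DD format sort correctly as plain strings, which is why no date parsing is needed here.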
More complex tutorials on the application of airpg can be found here and, in a platform-independent form, here.
The quality of complete plastid genome records often correlates with the records' sequence coverage. Specifically, high-quality plastid genome records often exhibit considerable sequencing depth and high sequencing evenness. The software PACVr has been designed to automatically evaluate and visualize sequencing depth and evenness of the genome records so that users can assess both coverage indicators for given plastid genome records.
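To make the two coverage indicators concrete, the sketch below computes the mean depth per window and a simple evenness proxy from a toy per-base depth table. It is an illustration only: the three-column input imitates the output of `samtools depth -a` (sequence name, position, depth), the coefficient of variation is used here as one possible measure of unevenness, and the helper `depth_summary` is hypothetical. PACVr may calculate its metrics differently.

```shell
# Toy per-base depth table: sequence name, position, depth (tab-separated).
# ASSUMPTION: values are invented; real input could come from `samtools depth -a`.
TOY_DEPTH=$(mktemp)
printf 'toy\t%d\t%d\n' 1 10 2 10 3 12 4 8 5 2 6 4 7 2 8 0 > "$TOY_DEPTH"

# Usage: depth_summary <depth table> <window size>
depth_summary() {
  awk -F'\t' -v w="$2" '
    {
      win = int(($2 - 1) / w)                 # zero-based window index
      wsum[win] += $3; wn[win]++
      if (win > last) last = win
      sum += $3; sumsq += $3 * $3; n++
    }
    END {
      if (n == 0) exit 1
      for (i = 0; i <= last; i++)
        if (wn[i] > 0)
          printf "window %d-%d mean_depth=%.1f\n", i * w + 1, (i + 1) * w, wsum[i] / wn[i]
      mean = sum / n
      var = sumsq / n - mean * mean           # population variance
      if (var < 0) var = 0
      cv = (mean > 0) ? sqrt(var) / mean : 0  # lower cv = more even coverage
      printf "overall mean_depth=%.1f cv=%.2f\n", mean, cv
    }
  ' "$1"
}

depth_summary "$TOY_DEPTH" 4
```

In the toy data the first window is covered five times more deeply than the second, so the overall mean alone would hide a poorly supported region; the per-window values and the coefficient of variation expose it.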
Software needed: R, PACVr
$ R
install.packages("PACVr")
To measure sequencing depth and evenness of a plastid genome, the sequence reads that were originally used to assemble that genome must be mapped against it. The process of read mapping is explained here.
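As a rough orientation, a typical mapping workflow can be sketched as follows. This is an assumption-laden outline rather than the procedure of the linked tutorial: Bowtie 2 and SAMtools are chosen here only as common examples, all file names are hypothetical, and the helpers `run` and `map_reads` exist purely for illustration. By default the script only prints the commands; set `DRY_RUN=0` to execute them once both tools are installed.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Usage: map_reads <reference FASTA> <forward reads> <reverse reads> <output prefix>
# ASSUMPTION: paired-end reads; Bowtie 2 and SAMtools available on the PATH.
map_reads() {
  ref=$1; r1=$2; r2=$3; out=$4
  run bowtie2-build "$ref" "$out.idx"                          || return 1  # index the genome
  run bowtie2 -x "$out.idx" -1 "$r1" -2 "$r2" -S "$out.sam"    || return 1  # map the reads
  run samtools view -b -F 4 -o "$out.mapped.bam" "$out.sam"    || return 1  # keep mapped reads only
  run samtools sort -o "$out.sorted.bam" "$out.mapped.bam"     || return 1  # sort by position
  run samtools index "$out.sorted.bam"                         || return 1  # index the sorted BAM
}

map_reads reference.fasta reads_R1.fastq reads_R2.fastq sample
```

The result of such a workflow is a sorted and indexed BAM file, the kind of alignment file passed to PACVr via `bam.file`.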
library(PACVr)
# Specify input files
gbkFile <- system.file("extdata", "NC_045072/NC_045072.gb", package="PACVr")
bamFile <- system.file("extdata", "NC_045072/NC_045072_PlastomeReadsOnly.sorted.bam", package="PACVr")
# Specify output file
outFile <- "NC_045072_AssemblyCoverage_viz.pdf"
# Run PACVr
PACVr.complete(gbk.file=gbkFile, bam.file=bamFile, windowSize=250,
               logScale=FALSE, threshold=0.5, syntenyLineType=3, relative=TRUE,
               textSize=0.5, output=outFile)
A more complex tutorial regarding the application of PACVr can be found here.