
Database of virus RdRP barcode sequences

License: Creative Commons Zero v1.0 Universal (CC0-1.0)

PALMdb

Introduction

PALMdb is a database of viral polymerase palmprint (barcode) sequences classified (1) by taxonomy and (2) by clustering sequences into species-like operational taxonomic units (OTUs) at 90% identity. PALMdb was created using the palmscan algorithm to mine public sequence databases.
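To make the 90% identity threshold concrete, here is a minimal sketch of how a pair of already-aligned palmprints could be compared. The identity definition used here (matching columns divided by alignment columns, ignoring columns where both sequences have a gap) is an assumption for illustration; the definition and clustering procedure actually used to build PALMdb are described in the references and may differ. The function names are hypothetical.

```python
def fractional_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns where both sequences carry the same residue.

    Assumption: inputs are rows of a pairwise alignment (equal length,
    gaps written as '-'). Columns where both rows are gaps are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column carries no information
        columns += 1
        if a == b:
            matches += 1  # identical residues (both-gap case already skipped)
    return matches / columns if columns else 0.0


def within_otu_threshold(aligned_a, aligned_b, threshold=0.90):
    """True if the pair meets the species-like OTU identity threshold."""
    return fractional_identity(aligned_a, aligned_b) >= threshold
```

In an OTU clustering of this kind, the comparison that matters is typically between a sequence and the OTU centroid rather than between arbitrary members.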

2021-03-14 update with 250,000+ new sequences, 130,000+ new species

Added new sequences generated by the Serratus project; see the bioRxiv paper by Edgar et al. (2021) listed under References.

Palmprint sequence


The palmprint is a segment of roughly 100 amino acids (aa) of the viral polymerase gene, delineated by the conserved catalytic motifs "A" and "C" in the palm sub-domain.

Files

Releases are posted in sub-directories named YYYY-MM-DD giving the date the release was posted. Files are in FASTA (for sequences) or tab-separated values (TSV, for annotations). Files and formats are subject to change between releases.

YYYY-MM-DD/
	+--- acc_taxid.tsv         # 1. source accession 2. NCBI TaxID
	+--- acc_u.tsv             # 1. source accession 2. unique sequence identifier u<nnn>
	+--- otu_centroids.fa      # OTU centroid sequences
	+--- sources.fa            # palmprints from all source databases
	+--- species_ncbi_ictv.tsv # 1. species name 2. bothdbs/onedb 3. NCBI TaxID 4. ICTV Version.SortID
	+--- taxon.tsv             # 1. NCBI TaxID 2. name 3. clade names superkingdom...species
	+--- u_otu.tsv             # 1. u<nnn> 2. u<nnn> of OTU centroid
	+--- uniques.fa            # unique sequences from sources.fa, relabeled as u<nnn>
	+--- u_tax.tsv             # approximate taxonomy assignments for uniques
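The files above can be chained to go from a source accession to its OTU centroid sequence. The following is a minimal sketch, not an official client. It assumes the TSV files have no header row, that each accession maps to a single `u<nnn>` identifier, and that records in `otu_centroids.fa` are labeled with the centroid's `u<nnn>` identifier. The helper names are hypothetical, and since formats may change between releases, check the files in the release you download.

```python
import csv
from pathlib import Path


def read_two_column_tsv(path):
    """Return {column 1: column 2} from a tab-separated file."""
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping


def read_fasta(path):
    """Return {label: sequence}; the label is the first word after '>'."""
    records = {}
    label = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if label is not None:
                    records[label] = "".join(chunks)
                parts = line[1:].split()
                label = parts[0] if parts else ""
                chunks = []
            else:
                chunks.append(line)
    if label is not None:
        records[label] = "".join(chunks)
    return records


def centroid_for_accession(release_dir, accession):
    """Follow accession -> u<nnn> -> OTU centroid u<nnn> -> sequence."""
    release = Path(release_dir)
    acc_to_u = read_two_column_tsv(release / "acc_u.tsv")
    u_to_otu = read_two_column_tsv(release / "u_otu.tsv")
    centroids = read_fasta(release / "otu_centroids.fa")
    otu_id = u_to_otu[acc_to_u[accession]]
    # Assumption: centroid FASTA labels are u<nnn> identifiers.
    return otu_id, centroids.get(otu_id)
```

For example, `centroid_for_accession("2021-03-14", accession)` would return the centroid identifier and its sequence, or `None` for the sequence if the labeling assumption does not hold for that release.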

References

A. Babaian and R. C. Edgar (2021), Ribovirus classification by a polymerase barcode sequence, bioRxiv, https://doi.org/10.1101/2021.03.02.433648

R. C. Edgar et al. (2021), Petabase-scale sequence alignment catalyses viral discovery, bioRxiv, https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2