AHRD_on_gene_clusters

Automated Assignment of Human Readable Descriptions on Gene Clusters generated by OrthoMCL

Introduction

“AHRD_on_gene_clusters” is a small R project that annotates Gene Clusters with Human Readable Descriptions. Gene Clusters are sets of amino acid sequences with significant similarity to each other. Typically they are generated by the following method:

1. Build a sequence database: take the proteome of the organism you are investigating, typically all newly sequenced amino acid sequences, and combine it with reference proteomes.
2. Run a BLAST “all vs all” search to find, for each protein, the proteins it is significantly similar to.
3. Based on the resulting BLAST scores, cluster the above proteins using a Markov clustering approach, as implemented in OrthoMCL or in plain MCL.
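
As a minimal sketch of what the input to the annotation step might look like, the R snippet below parses clustering output into a named list of protein identifiers. It assumes the plain MCL label output format, with one cluster per line and tab-separated protein identifiers. The function name `parse.mcl.clusters`, the `cluster_` name prefix, and the example identifiers are hypothetical and not part of this package.

```r
# Parse MCL-style cluster lines (one cluster per line, members separated by
# tabs) into a named list of character vectors.
# NOTE: hypothetical helper for illustration, not part of AHRD.on.gene.clusters.
parse.mcl.clusters <- function(cluster.lines, name.prefix = "cluster_") {
  # Drop blank lines so they do not turn into empty clusters
  cluster.lines <- cluster.lines[grepl("\\S", cluster.lines)]
  clusters <- strsplit(cluster.lines, "\t", fixed = TRUE)
  names(clusters) <- paste0(name.prefix, seq_along(clusters))
  clusters
}

# Made-up example input; with a real file, pass the result of readLines()
example.lines <- c(
  "Prot_A1\tProt_A2\tProt_B1",
  "",
  "Prot_C1\tProt_C2"
)
gene.clusters <- parse.mcl.clusters(example.lines)
str(gene.clusters)
```

Keeping the clusters as a named list makes it straightforward to loop over them with `lapply` in later steps.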

The resulting clusters can be interpreted as gene families and support several kinds of analysis, such as detecting gene family expansion or loss in a particular organism, or phylogenetic reconstruction of the evolutionary history of the genes of interest.

The problem is that no cluster comes with a short, concise, and trustworthy human readable description that gives a quick overview of the kind of gene family it represents. And typically there are many such clusters: in the Tomato genome, for instance, we had 17,490!

This version of AHRD (see https://github.com/groupschoof/AHRD for the original used to describe single query proteins) provides a simple method to annotate such gene clusters.

Algorithm

“AHRD_on_gene_clusters” first reads in the latest InterPro database and InterProScan annotations for the above proteins (see Introduction). Then it annotates each gene cluster with a Human Readable Description (HRD) as follows:

For each gene cluster, considering the InterPro annotations of all genes in the current cluster:

1. Find the most frequently annotated InterPro Family
2. Assign it as the cluster’s HRD
3. Compute the HRD’s score as the annotation frequency of the InterPro Family in the current cluster, that is, the fraction of the cluster’s proteins annotated with this InterPro Family.
4. If no InterPro Family reaches the custom threshold frequency of 0.5, use any other InterPro annotation, for instance a Domain.
5. Generate an HRD as a named list. Two slots are present for each Gene Family’s HRD: most.frequent.IPRs and frequency. The first one holds the most frequent InterPro entries as a list, each of which has the slots SHORT.NAME and ID, among others.
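
The steps above can be sketched in a few lines of R. This is an illustrative sketch under stated assumptions, not the package's actual implementation: the function name `annotate.cluster`, the layout of the annotation table (columns `protein`, `ipr.id`, `ipr.type`, `short.name`), and the example data with made-up InterPro identifiers are all hypothetical. Only the result slots `most.frequent.IPRs`, `frequency`, `SHORT.NAME`, and `ID` and the threshold of 0.5 are taken from the description. The fallback in step 4 is interpreted here as "consider all InterPro entries regardless of type", which is also an assumption.

```r
# Hypothetical sketch of the cluster annotation; not the package's own code.
# 'annos' is an assumed data.frame with one row per (protein, InterPro entry):
#   protein, ipr.id, ipr.type ("Family", "Domain", ...), short.name
annotate.cluster <- function(cluster.genes, annos, threshold = 0.5) {
  cluster.genes <- unique(cluster.genes)
  cols <- c("protein", "ipr.id", "ipr.type", "short.name")
  # Keep each (protein, entry) pair once so no protein is counted twice
  annos <- unique(annos[annos$protein %in% cluster.genes, cols, drop = FALSE])
  # Fraction of the cluster's proteins annotated with each InterPro entry
  tab <- table(as.character(annos$ipr.id))
  freqs <- setNames(as.numeric(tab), names(tab)) / length(cluster.genes)
  is.family <- names(freqs) %in% annos$ipr.id[annos$ipr.type == "Family"]
  fam.freqs <- freqs[is.family]
  # Prefer families; if none reaches the threshold, use any InterPro entry
  use.family <- length(fam.freqs) > 0 && max(fam.freqs) >= threshold
  candidates <- if (use.family) fam.freqs else freqs
  if (length(candidates) == 0) return(NULL)  # cluster without annotations
  best.ids <- names(candidates)[candidates == max(candidates)]
  list(
    most.frequent.IPRs = lapply(best.ids, function(id) {
      short <- as.character(annos$short.name[match(id, annos$ipr.id)])
      list(ID = id, SHORT.NAME = short)
    }),
    frequency = max(candidates)
  )
}

# Made-up annotations: identifiers and names are placeholders
example.annos <- data.frame(
  protein    = c("P1", "P2", "P3", "P1", "P4"),
  ipr.id     = c("IPR_FAM_X", "IPR_FAM_X", "IPR_FAM_X", "IPR_DOM_Y", "IPR_DOM_Y"),
  ipr.type   = c("Family", "Family", "Family", "Domain", "Domain"),
  short.name = c("Example_fam", "Example_fam", "Example_fam",
                 "Example_dom", "Example_dom"),
  stringsAsFactors = FALSE
)
hrd <- annotate.cluster(c("P1", "P2", "P3", "P4"), example.annos)
hrd$frequency                           # score of the chosen entry
hrd$most.frequent.IPRs[[1]]$SHORT.NAME  # short name of the chosen entry
```

Ties are kept on purpose: when several entries share the highest frequency, all of them end up in `most.frequent.IPRs`, which matches the description of that slot as a list.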

Installation

You need R version >= 3

require(devtools)
install_github("groupschoof/AHRD_on_gene_clusters@v0.2")

Or, if you do not want to use devtools:

git clone https://github.com/groupschoof/AHRD_on_gene_clusters.git
cd AHRD_on_gene_clusters
git checkout tags/v0.2
R CMD INSTALL .

Usage

In an interactive R shell type:

require( AHRD.on.gene.clusters )

And get a descriptive usage example with:

help( 'AHRD.on.gene.clusters-package' )

References

  • Van Dongen, Stijn. “Graph Clustering Via a Discrete Uncoupling Process.” SIAM Journal on Matrix Analysis and Applications 30, no. 1 (January 2008): 121–141. doi:10.1137/040608635.
  • Li, Li, Christian J Stoeckert Jr, and David S Roos. “OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes.” Genome Research 13, no. 9 (September 2003): 2178–2189. doi:10.1101/gr.1224503.
  • McGinnis, Scott, and Thomas L. Madden. “BLAST: At the Core of a Powerful and Diverse Set of Sequence Analysis Tools.” Nucleic Acids Research 32, no. Web Server issue (July 1, 2004): W20–W25. doi:10.1093/nar/gkh435.
  • Altschul, S F, T L Madden, A A Schaffer, J Zhang, Z Zhang, W Miller, and D J Lipman. “Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.” Nucleic Acids Research 25, no. 17 (September 1, 1997): 3389–3402.
  • The Tomato Genome Consortium. “The Tomato Genome Sequence Provides Insights into Fleshy Fruit Evolution.” Nature 485, no. 7400 (May 31, 2012): 635–641. doi:10.1038/nature11119.
  • Apweiler, R., T. K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, et al. “InterPro—an Integrated Documentation Resource for Protein Families, Domains and Functional Sites.” Bioinformatics 16, no. 12 (December 1, 2000): 1145–1150. doi:10.1093/bioinformatics/16.12.1145.