GeneAtlas

DESCRIPTION

GeneAtlas is a simple gene enrichment in normal tissue expression profiles. We have a selection of publically available data sets for normal human tissue profiling, all data sets are located in the data/ directory, after gene normalization and quantification, please contact me if you want more details about the data preprocessing. Input are a selection of tissues to compare, output is a text file with the weigthed gene list [0,1).

USAGE

  geneatlas.pl [PARAMETERS]
  
  Parameters       Description
  
  -d --dataset     Select data set to use, "all" shows all data sets available.
  
  -l --list        List samples in data set.
  
  -s --select      Select samples, the value is a list of numbers or the name 
                   of the samples, a basic RegEx is applied in text mode. 
                   Multiple sources can be separated by commas.
  
  -e --exclude     Exclude samples, similar to "select" option, but it will
                   skip the samples in the computation.
  
  -m --method      Method for enrichment, multiple values are collapsed using:
                   a) maximal [default], b) average

  -c --score       Method to compute the score: a) norlog, b) wilson [default].
                   "norlog" is the normalized value of the log-enriched-ratio
                   multiplied by the normalized expression level.
                   "wilson" is the lower bound of Wilson score confidence 
                   interval for a Bernoulli parameter (95% conf.)
  
  -t --transcript  Report genes at the transcript level (default is collapsing
                   transcripts values to the gene level).
  
  -o --out         Write ouput to this file [default: stdout]
  
  -h --help        Show this screen.
  
  -v --verbose     Activate verbose mode.
  
  --version        Print version and exit.

EXAMPLES

  1. Show available data sets
     perl bin/geneatlas.pl -d all
  
  2. Show samples in one data set
     perl bin/geneatlas.pl -d GSE1133 -l
  
  3. Compute enrichment in the Appendix in data set GSE1133_microrray
     perl bin/geneatlas.pl -d GSE1133_microrray -s 7
     perl bin/geneatlas.pl -d GSE1133_microrray -s Appendix # this also works
  
  4. Compute enrichment in the Appendix in data set GSE1133_microrray excluding CD* samples 
     perl bin/geneatlas.pl -d GSE1133_microrray -s 7 -e 10,15,20,29,36,38,43,61,66
     perl bin/geneatlas.pl -d GSE1133_microrray -s Appendix -e CD # this also works
  
  5. Change the method to average values
     perl bin/geneatlas.pl -d GSE1133_microrray -s 7 -m average

DATA SOURCES

Microarray data is processed from raw data (CEL files) with R:Bioconductor affy and limma packages, all samples are normalized with "rma".

RNAseq data is precessed from raw reads (FASTQ files), after removing low quality reads, repeats and ribosomal sequences, the reads are mapped to the reference genome hg19/GRCh37 with and optimized version of Blat, then converted to BAM and transcript quantification with Cufflinks using the gene models of Ensembl r64.

GSE1133_microrray was obtained from: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1133 [Su AI, Wiltshire T, Batalov S, Lapp H et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004 Apr 20;101(16):6062-7. PMID: 15075390.]

GSE2361_microarray was obtained from: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2361 [Ge X, Yamamoto S, Tsutsumi S, Midorikawa Y et al. Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. Genomics 2005 Aug;86(2):127-41. PMID: 15950434.]

GSE3526_microarray was obtained from: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3526 [Roth RB, Hevezi P, Lee J, Willhite D et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 2006 May;7(2):67-80. PMID: 16572319.]

bodymap1_rnaseq was obtained from: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12946 [Wang ET, Sandberg R, Luo S, Khrebtukova I et al. Alternative isoform regulation in human tissue transcriptomes. Nature 2008 Nov 27;456(7221):470-6.]

bodymap2_rnaseq was kindly provided by Gary Schroth (Illumina Co.) The reads are 75b single ends produced in HiSeq2000.

rnaseq_atlas_rev1 was obtained from: http://medicalgenomics.org/rna_seq_atlas [Castle JC, Armour CD, Löwer M, Haynor D, Biery M, et al. (2010) Digital Genome-Wide ncRNA Expression, Including SnoRNAs, across 11 Human Tissues Using PolyA-Neutral Amplification. PLoS ONE 5(7): e11779. doi:10.1371/journal.pone.0011779]

DATA TABLES

The data tables are simple tab-delimited text files, each row represents a transcript, the first column is the gene symbol, the second the transcript or probe id, the rest of the columns are the values (RPKM for RNAseq, UF for microarrays) in each tissue.

AUTHOR

Juan Caballero, Institute for Systems Biology @ 2012

caballero/GeneAtlas

GeneAtlas