/tdg09

Software and data used in Tamuri AU, dos Reis M, Hay AJ, Goldstein RA (2009) Identifying Changes in Selective Constraints: Host Shifts in Influenza. PLoS Comput Biol 5(11): e1000564 doi:10.1371/journal.pcbi.1000564

Primary LanguageJava

TdG09 program: identifying changes in selective constraints

This program is an implementation of the model described in:

Tamuri AU, dos Reis M, Hay AJ, Goldstein RA (2009) Identifying Changes in Selective Constraints: Host Shifts in Influenza. PLoS Comput Biol 5(11): e1000564. doi:10.1371/journal.pcbi.1000564

This phylogenetic model uses site-specific amino acid frequencies to distinguish patterns of substitution between two (or more) lineages or groups of taxa. We first estimate the site-specific amino acid frequencies at a given location in a protein alignment assuming that the pattern of substitution is homogeneous across all branches. We then estimate multiple sets of site-specific frequencies, allowing them to differ among different branches, producing a non-homogeneous model of evolutionary change. Using statistical tests, we then see whether the non-homogeneous model provides a significantly better fit to the data than the homogenous model.

Tutorial

  1. Install Java: The program requires a recent version of the Java Runtime (JRE), which, if not already installed, can be downloaded from Oracle. Linux packages are usually available in the distribution's repository (e.g. sudo apt-get install openjdk-7-jre for Debian, Ubuntu etc. distributions).

  2. Download the program: The latest version of the program is available for download from the repository. The download includes a compiled binary as well as a build.xml to compile from sources using ant. Unzip the download and check that the program works:

    $ ls -F
    README.md  build.xml  dist/  etc/  lib/  src/
    
    $ java -cp dist/tdg09.jar tdg09.Analyse
    Usage: java -cp tdg09.jar tdg09.Analyse [options]
     Options:
     * -alignment
          Alignment in PHYLIP format
     * -groups
          Group labels to partition tree e.g. Av Hu
       -threads
          Number of threads to use
          Default: 1
     * -tree
          Tree in NEWICK format
    
  3. Preparing your data: The program requires a protein sequence alignment in PHYLIP format and the corresponding tree in Newick format. The tree branch lengths should be optimised for an amino acid model, such as WAG, using a program such as RAxML or PAML. Most importantly, sequence names must be prefixed by a two-letter identifier that is used to indicate its lineage/grouping. For example, to identify changes in selective constraints between avian and human flu viral proteins, we can prefix every sequence name with 'Av' or 'Hu' to indicate its lineage:

    $ cat etc/H1.faa
    434 566
    Hu_HA_AAX56530_H1N1      MKAKLLVLLCAFTATYADTI...
    Hu_HA_AAY78939_H1N2      MKVKLLILLCTFTATYADTI...
    Av_HA_ABB19507_H1N6      MEAKLFVLFCTFTVLKADTI...
    Av_HA_ABB19518_H1N1      MEAKLFVLFCTFTALKADTI...
    ...
    

Both the sequence alignment and the tree must follow this convention.

```
$ cat etc/H1.tree
((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700):
0.0248320,((Av_HA_ABB19618_H1N1:...
```

This is the only way by which the tdg09 program determines which of the different non-homogeneous models a particular tree branch should use. Example data sets of flu viral proteins (used in Tamuri et al. 2009) are included in the etc/ directory.

  1. Running the program: The command-line options for the program are:

    • -alignment : the sequence alignment file in PHYLIP format e.g. etc/H1.faa
    • -tree : the tree file in Newick format e.g. etc/H1.tree
    • -groups : the two-letter identifiers used to partition the sequences e.g. Av Hu
    • -threads : specify the numbers of CPU cores/threads to utilise e.g. 2

    The program prints messages to standard out, so this should be captured using > or tee. We are now ready to run the program:

    $ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \
    -tree etc/H1.tree -groups Av Hu -threads 2 > H1_out.txt
    

    or if you have 'tee' installed:

    $ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \
    -tree etc/H1.tree -groups Av Hu -threads 2 | tee H1_out.txt
    
  2. Inspect the results: In this example, program output is captured in H1_out.txt:

    $ cat H1_out.txt
    StartTime: 2013-03-19 13:44:10.129
    WorkingDirectory: /Users/Tester/Documents/tdg09
    Options: -alignment etc/H1.faa -tree etc/H1.tree -groups Av Hu -threads 2 
    
    TreeFile: /Users/Tester/Documents/tdg09/etc/H1.tree
    AlignmentFile: /Users/Tester/Documents/tdg09/etc/H1.faa
    
    Alignment:
      SequenceCount: 434
      SiteCount: 566
    
    Groups: [Av, Hu]
    
    # The internal nodes of the tree are not labelled. Labelling...
    # Node 432 from [Av, Hu] resolved
    # Node 432 from [Av, Hu] resolved
    # Assuming that root of tree is in group [Av]
    # Switching from group [Av] to [Hu] at branch 432..431
    
    LabelledTree: >
        (((((((((((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700)
        Av:0.0248320,((Av_HA_ABB19618_H1N1:0.0910450,Av_HA_ABG88201_H1N1:0.0810480)
        ...     
    	Hu:0.0618410)Hu:0.0978830)Hu:0.0436020)Hu:0.0482480)Hu:0.0255390)Hu:0.0362960)
    	Hu:0.0696670)Hu:0.0620790)Hu:0.0268645);   
    

The output contains a "LabelledTree" that shows the inferred lineage for each ancestral node. This should be checked in a tree viewing program (such as Dendroscope) to make sure that the lineages are correct. If not, they can be modified and the analysis can be rerun with the new, custom-labelled, tree. The output continues:

```    
# 2013-03-19 22:44:10.33 - site 1 complete.
# 2013-03-19 22:44:11.366 - site 3 complete.
# 2013-03-19 22:44:11.366 - site 4 complete.    

...

LrtResults:
#   Site,  delta lnL,  dof, LRT,       FDR
- [  204,  21.850690,  3,   0.0000000, 0.0000003 ]
- [  169,  11.528011,  1,   0.0000016, 0.0001542 ]
- [  289,  10.733096,  2,   0.0000218, 0.0008550 ]
- [  252,  10.927903,  2,   0.0000180, 0.0008796 ]
- [    9,   8.225516,  1,   0.0000499, 0.0008895 ]
- [  300,  15.027774,  5,   0.0000144, 0.0009396 ]
- [   62,   8.261257,  1,   0.0000481, 0.0009423 ]
- [  303,   8.261836,  1,   0.0000480, 0.0010463 ]
- [  239,   9.647869,  2,   0.0000646, 0.0010545 ]
- [  315,   9.483944,  2,   0.0000761, 0.0011468 ]
- [  253,   8.262439,  1,   0.0000480, 0.0011764 ]

...

FullResults:
# Site, WAG+ssF params, WAG+ssF lnL, WAG+lssF params, WAG+lssF params, delta lnL, dof, LRT, FDR
- [    1,  NA,         NA, NA,          NA,         NA, NA,        NA,        NA ]
- [    2,  3,  -21.488149,  5,  -13.378828,   8.109321,  2, 0.0003007, 0.0025627 ]
- [    3,  2,  -48.561153,  3,  -47.763333,   0.797820,  1, 0.2065223, 0.3489515 ]
- [    4,  2,  -24.137478,  3,  -23.407924,   0.729554,  1, 0.2270721, 0.3708845 ]
- [    5,  2,  -12.611075,  3,  -12.456865,   0.154209,  1, 0.5786522, 0.6593943 ]
- [    6,  3,  -23.856049,  5,  -22.761440,   1.094609,  2, 0.3346706, 0.5084918 ]
- [    7,  2,  -44.839152,  3,  -43.761797,   1.077355,  1, 0.1421333, 0.2509740 ]
- [    8,  NA,         NA, NA,          NA,         NA, NA,        NA,        NA ]   

...

```
Of interest are the "LrtResults" and "FullResults" tables. 

The LrtResults lists polymorphic sites (on which the non-homogeneous model was estimated) and orders them by the false discovery rate (a correction on the likelihood ratio test P-value required by multiple hypothesis testing). At a given FDR cutoff (e.g. 0.05), these sites are those at which the non-homogeneous model provides a statistically significant improvement over the homogenous model, indicating that the patterns of substitution are different between the different groups/lineages. 

The FullResults table lists further results from all sites. This includes the log-likelihood for the WAG+ssF (site-specific frequencies or *homogeneous model*) and WAG+lssF (lineage and site-specific frequencies or *non-homogeneous model*). Conserved locations are not analysed, so their entries are 'NA'. Finally, after the FullResults table, the output file contains details of the per-location results, such as the amino acid frequencies estimated by the ssF and lssF models.
  1. Analysing the results: The output file is in YAML format, which means it can be read by any other programming language that has a YAML parsing library. Here we show an example of analysing the results using the programming language R. The code is available in the src/R/example.R file. Start the R console, install and load the yaml library, then load the output file using the yaml.load_file function:

    $ R
    R version 2.15.3 (2013-03-01) -- "Security Blanket"
    ...
    > install.packages("yaml")
    Installing package(s) into ‘/Users/Tester/Library/R/2.15/library’
    (as ‘lib’ is unspecified)
    ...
    > library(yaml)
    > out <- yaml.load_file(input='/Users/Tester/Documents/tdg09/H1_out.txt')
    > summary(out)
                       Length Class  Mode
    StartTime            1    -none- character
    WorkingDirectory     1    -none- character
    Options              1    -none- character
    TreeFile             1    -none- character
    AlignmentFile        1    -none- character
    Alignment            2    -none- list
    Groups               2    -none- character
    LabelledTree         1    -none- character
    LrtResults         196    -none- list
    FullResults        566    -none- list
    ConservedPositions   2    -none- list
    SiteResults        566    -none- list
    EndTime              1    -none- character
    

    To convert the R list objects into a type easier to work with, we convert the LrtResults and FullResults objects into data.frames:

    > lrt_results <- as.data.frame(matrix(unlist(out$LrtResults), ncol=5, byrow=T))
    > names(lrt_results) <- c("site", "deltaLnL", "dof", "lrt", "fdr")
    > head(lrt_results)
      site  deltaLnL dof      lrt       fdr
    1  204 21.850690   3 0.00e+00 0.0000003
    2  169 11.528011   1 1.60e-06 0.0001542
    3  289 10.733096   2 2.18e-05 0.0008550
    4  252 10.927903   2 1.80e-05 0.0008796
    5    9  8.225516   1 4.99e-05 0.0008895
    6  300 15.027774   5 1.44e-05 0.0009396
    > sum(lrt_results$fdr <= 0.05) # how many sites identified with FDR < 0.05?
    [1] 55
    > full_results <- as.data.frame(matrix(unlist(out$FullResults), ncol=9, byrow=T))
    > names(full_results) <- c("site", "ssfParams", "ssfLnL", "lssfParams", "lssfLnL", "deltaLnL", "dof", "lrt", "fdr")
    > head(full_results)
      site ssfParams     ssfLnL lssfParams    lssfLnL deltaLnL dof       lrt       fdr
    1    1        NA         NA         NA         NA       NA  NA        NA        NA
    2    2         3 -21.488149          5 -13.378828 8.109321   2 0.0003007 0.0025627
    3    3         2 -48.561153          3 -47.763333  0.79782   1 0.2065223 0.3489515
    4    4         2 -24.137478          3 -23.407924 0.729554   1 0.2270721 0.3708845
    5    5         2 -12.611075          3 -12.456865 0.154209   1 0.5786522 0.6593943
    6    6         3 -23.856049          5  -22.76144 1.094609   2 0.3346706 0.5084918
    

    To produce a plot of FDR values by site:

    > fdr <- as.numeric(levels(full_results$fdr)[full_results$fdr]) # FDR column should be numeric
    Warning message:
    NAs introduced by coercion
    > fdr[is.na(fdr)] <- 1.0 # conserved locations implicitly have no evidence of non-homogeneity
    > sites <- out$Alignment$SiteCount
    > plot_ranges <- split(seq(1, sites), cut(seq(1, sites), 5)) # split sites in plot into 5 rows
    > par(mfrow=c(5,1), mar=c(2.0,0.5,0.5,0.5))
    for (p in 1:5) {
    	plot(1 - fdr,  
    	xlim=c(plot_ranges[[p]][1], tail(plot_ranges[[p]], n=1)), 
    	ty='h', lwd=1, main='', xlab='', ylab='', yaxt='n', col="#1B9E77")
    	lines(which(fdr <= 0.20), 1 - fdr[fdr <= 0.20], col="#D95F02", ty='h')
    	abline(h=0.95, lty='dashed')
    	points(which(fdr <= 0.05), 1 - fdr[fdr <= 0.05], pch=20, col="#DE2D26")
    }
    

    This produces the plot show below. Bars drawn in orange indicate locations with FDR < 0.20, and those locations with FDR < 0.05 have a red dot at their value (we drew the plot with "1 - fdr" values so that smaller FDRs are taller). We can see from this plot a cluster of identified sites at locations 200-210.

    image

    The R code in src/R/example.R also provides an example of using the phangorn and ape packages to simulate data under the homogeneous model for a particular site. We can then analyse the synthetic data with the TdG09 software, the parametric bootstrap providing significance of the non-homogeneous model on the original data. For example, here is the Monte Carlo distribution of Δ for 1000 parametric bootstrap replicates for flu protein HA (H1) site 204, showing that the homogenous model can be rejected in favour of the non-homogeneous model with P-value < 0.001.

    image