phylosub: A Python repository from shruti-gupta

This Python code is the accompanying software for the paper:
Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein and Quaid Morris,
Inferring clonal evolution of tumors from single nucleotide somatic mutations,
BMC Bioinformatics 15:35, 2014.

It is built on top of the the code associated with the paper:
Ryan Prescott Adams, Zoubin Ghahramani and Michael I. Jordan,
Tree-Structured Stick Breaking for Hierarchical Data,  
Advances in Neural Information Processing Systems (NIPS) 23, 2010
http://hips.seas.harvard.edu/files/tssb.tgz

######################################################################

DEPENDENCIES:
SciPy - http://www.scipy.org/
ETE2 - http://ete.cgenomics.org/
CVXOPT - http://cvxopt.org/
PyGraphviz - http://networkx.lanl.gov/pygraphviz/

#######################################################################

PREPARING THE INPUT FILE:
Input file format:
The input to evolve.py should be a tab-delimited text file with 7 columns. The required column headers are:
-- gene: identifier for each somatic variant (each row should have a unique identifier)
-- a: number of reference allele read counts on the variant locus (if the read counts are from multiple tumour samples, the numbers from each sample should be separated by a comma ',', e.g. for three samples:  500, 342, 423)
-- d: total number of reads at the locus (if the read counts are from multiple tumor samples, they should be separated by a comma ',')
-- mu_r: fraction of expected reference allele sampling from reference population (e.g. if it is an A->T somatic mutation at the locus, the genotype of the reference population should be AA, so the mu_r should be 1-sequencing error rate)
-- mu_v: fraction of expected reference allele sampling from variant population (e.g. if it is an A->T somatic mutation at the locus, copy number is 2 and the expected genotype is AT for the variant population, then the expected fraction of expected reference should be 0.5)
-- delta_r and delta_v: pseudo-count for the Dirichlet prior on genotype probabilities (recommended value is 1, if no other prior information is available)

#######################################################################

RUNNING THE SOFTWARE:
1. To run the code on the sample data set "data.txt," execute the following command:

python evolve.py 'data.txt' 'trees' 'top_k_trees' 'clonal_frequencies' 'llh_trace' 100 100 1

# 'data.txt': file name of the input data
# 'trees': output folder name where the MCMC trees/samples are saved 
# 'top_k_trees': output file name to save top-k trees in text format. 
# 'clonalFrequencies': output file to save clonal frequencies
# 'lllhtrace': output file name to save log likelihood trace
The last three arguments are the number of MCMC samples, number of Metropolis-Hastings iterations, and a random seed number for initializing the MCMC sampler.

2. To generate the partial order plot for the trees (MCMC samples) generated by the above code, execute the following command:

 python porder.py 'trees' 'data.txt' 'partial_order.pdf'
  
# 'trees': the folder name where the MCMC trees/samples are located (saved from the above command)
# 'data.txt': file name of the input data
# 'partial_order.pdf': partial order plot output in pdf format

#######################################################################
shruti-gupta/phylosub