This Python code is the accompanying software for the paper: Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein and Quaid Morris, Inferring clonal evolution of tumors from single nucleotide somatic mutations, BMC Bioinformatics 15:35, 2014. It is built on top of the the code associated with the paper: Ryan Prescott Adams, Zoubin Ghahramani and Michael I. Jordan, Tree-Structured Stick Breaking for Hierarchical Data, Advances in Neural Information Processing Systems (NIPS) 23, 2010 http://hips.seas.harvard.edu/files/tssb.tgz ###################################################################### DEPENDENCIES: SciPy - http://www.scipy.org/ ETE2 - http://ete.cgenomics.org/ CVXOPT - http://cvxopt.org/ PyGraphviz - http://networkx.lanl.gov/pygraphviz/ ####################################################################### PREPARING THE INPUT FILE: Input file format: The input to evolve.py should be a tab-delimited text file with 7 columns. The required column headers are: -- gene: identifier for each somatic variant (each row should have a unique identifier) -- a: number of reference allele read counts on the variant locus (if the read counts are from multiple tumour samples, the numbers from each sample should be separated by a comma ',', e.g. for three samples: 500, 342, 423) -- d: total number of reads at the locus (if the read counts are from multiple tumor samples, they should be separated by a comma ',') -- mu_r: fraction of expected reference allele sampling from reference population (e.g. if it is an A->T somatic mutation at the locus, the genotype of the reference population should be AA, so the mu_r should be 1-sequencing error rate) -- mu_v: fraction of expected reference allele sampling from variant population (e.g. if it is an A->T somatic mutation at the locus, copy number is 2 and the expected genotype is AT for the variant population, then the expected fraction of expected reference should be 0.5) -- delta_r and delta_v: pseudo-count for the Dirichlet prior on genotype probabilities (recommended value is 1, if no other prior information is available) ####################################################################### RUNNING THE SOFTWARE: 1. To run the code on the sample data set "data.txt," execute the following command: python evolve.py 'data.txt' 'trees' 'top_k_trees' 'clonal_frequencies' 'llh_trace' 100 100 1 # 'data.txt': file name of the input data # 'trees': output folder name where the MCMC trees/samples are saved # 'top_k_trees': output file name to save top-k trees in text format. # 'clonalFrequencies': output file to save clonal frequencies # 'lllhtrace': output file name to save log likelihood trace The last three arguments are the number of MCMC samples, number of Metropolis-Hastings iterations, and a random seed number for initializing the MCMC sampler. 2. To generate the partial order plot for the trees (MCMC samples) generated by the above code, execute the following command: python porder.py 'trees' 'data.txt' 'partial_order.pdf' # 'trees': the folder name where the MCMC trees/samples are located (saved from the above command) # 'data.txt': file name of the input data # 'partial_order.pdf': partial order plot output in pdf format #######################################################################