TB-diversity-across-organs

Code and intermediate files for reproducing the figures from "Genomic diversity in autopsy samples reveals within-host dissemination of HIV-associated Mycobacterium tuberculosis", by Tami D Lieberman, Douglas Wilson, Reshma Misra, Lealia L Xiong, Prashini Moodley, Ted Cohen^, and Roy Kishony^ (^co-senior authors)

If you are applying these scripts to your own data, it is strongly recommended that you investigate your raw data carefully and adjust parameters to suit your genome, coverage, etc. If you are having trouble understanding these scripts or how to use them, would like additional functionality or flexibility, or have other questions, feel free to contact Tami Lieberman (her email address is easy to find elsewhere).

If you find any of these scripts helpful, please cite:
Lieberman TD, Wilson D, Misra R, Xiong LL, Moodley P, Cohen T*, Kishony R*. (2016). "Genomic diversity in autopsy samples reveals within-host dissemination of HIV-associated Mycobacterium tuberculosis." Nature Medicine. DOI: 10.1038/nm.4205

Option 1 for reproducing the figures (recommended): Start with pre-processed data:

  1. Download subject_folders, containing a directory for each subject, each containing candidate_mutation_table.mat
  2. Download the scripts and tools directories
  3. Run identify_de_novo_muts.m to call de novo mutations in each subject (and gather other information for each subject), producing de_novo_muts.mat for each subject
  4. Run analyze_de_novo_muts.m to analyze these mutations and generate the figures
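
Before running step 3, it can help to confirm that every subject directory actually contains its table. The following is a minimal sketch and not part of the repository; it assumes `subject_folders` sits in the current working directory, and all variable names are illustrative.

```matlab
% Sanity check before step 3 (illustrative; not part of the repository).
% Assumes subject_folders is in the current working directory.
root = fullfile(pwd, 'subject_folders');

% List subdirectories only, skipping '.' and '..'.
entries = dir(root);
entries = entries([entries.isdir] & ~ismember({entries.name}, {'.', '..'}));

% Collect the subjects whose candidate_mutation_table.mat is missing.
missing = {};
for k = 1:numel(entries)
    tablePath = fullfile(root, entries(k).name, 'candidate_mutation_table.mat');
    if exist(tablePath, 'file') ~= 2
        missing{end+1} = entries(k).name; %#ok<SAGROW>
    end
end

if isempty(missing)
    fprintf('Checked %d subject folders; no tables are missing.\n', numel(entries));
else
    fprintf('Missing candidate_mutation_table.mat in: %s\n', strjoin(missing, ', '));
end
```

The same check works for Option 2 if the file name is changed to de_novo_muts.mat.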

Option 2 for reproducing the figures: Start with called mutations:

Run only step 4 from Option 1, after making sure that de_novo_muts.mat has been downloaded for each subject in subject_folders.

Option 3: Generate the processed data

  1. Download raw fastq files for each sample from the SRA (BioProject PRJNA323744)
  2. Process each fastq file using the commands in for_making_vcf_files.txt, processing each sample in its own directory
  3. Modify the sample_names.csv file for each subject in subject_folders to point to the correct location of the processed files.
  4. Run build_candidate_mutation_table_tb(SUBJECTFOLDER_FULLPATH) for each subject
  5. Continue as in Option 1
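
Step 4 has to be repeated once per subject, which can be scripted. The loop below is a sketch under stated assumptions: the scripts and tools directories are already on the MATLAB path, `subject_folders` is in the current working directory, and the variable names are illustrative. Only the call `build_candidate_mutation_table_tb(SUBJECTFOLDER_FULLPATH)` comes from the instructions above.

```matlab
% Run step 4 for every subject (illustrative sketch).
% Assumes scripts/ and tools/ are on the path and that subject_folders
% is in the current working directory.
root = fullfile(pwd, 'subject_folders');

% List subject directories, skipping '.' and '..'.
entries = dir(root);
entries = entries([entries.isdir] & ~ismember({entries.name}, {'.', '..'}));

for k = 1:numel(entries)
    % The function takes the full path to the subject folder.
    subjectFolder = fullfile(root, entries(k).name);
    fprintf('Building candidate mutation table for %s\n', entries(k).name);
    build_candidate_mutation_table_tb(subjectFolder);
end
```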

Functions that you may find useful for your own data

build_candidate_mutation_table_tb.m
Starting with a .pileup file (summarizing the alignment file in a text format) and a .vcf file (which must contain FQ scores for EVERY position on the genome; see for_making_vcf_files.txt), extracts useful information for each potential variant position. Adjust the parameters at the top of this file to suit your needs.

identify_de_novo_muts.m
Starting with the output of build_candidate_mutation_table_tb, identifies candidate mutations.

clickable_snp_table.m
Use this to investigate the raw data at each genomic position. Requires data structures generated in identify_de_novo_muts.m.

find_genotypes.m and assign_genotypes.m
Assign genotypes given a matrix of derived mutation frequencies across samples.
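
To make the input concrete, here is a toy example of such a frequency matrix with a deliberately naive grouping step. This is not the algorithm implemented in find_genotypes.m or assign_genotypes.m. The layout (rows are mutations, columns are samples), the cutoff value, and all variable names are assumptions made for illustration only.

```matlab
% Toy illustration only; NOT the repository's genotype algorithm.
% Assumed layout: rows = mutations, columns = samples,
% values = derived allele frequency in [0, 1].
freqs = [1.0 1.0 0.0;
         0.9 1.0 0.0;
         0.0 0.0 1.0;
         0.0 0.4 0.0];

cutoff = 0.1;                        % hypothetical presence threshold
present = double(freqs > cutoff);    % presence/absence of each mutation per sample

% Naive stand-in: mutations with the same presence/absence pattern across
% samples are placed in the same group.
[patterns, ~, groupOfMutation] = unique(present, 'rows');

for g = 1:size(patterns, 1)
    fprintf('Group %d: mutations %s, present in samples %s\n', g, ...
        mat2str(find(groupOfMutation == g)'), mat2str(find(patterns(g, :))));
end
```

Here the first two mutations share a pattern and end up in one group, while the other two each form their own group.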