who-analysis

Code used for WHO analysis work

Create Environment

This project uses conda to manage packages (install Anaconda if you do not already have it). All packages are listed in the environment.yaml file. Change the first and last lines of this file to reflect the environment name you wish to use and the path to your Anaconda directory. Then run

conda env create -f environment.yaml

to create the environment, which should take no more than 30 minutes. Run conda activate <env_name> to activate it and conda deactivate to deactivate it when you are done.

Or if you prefer, run the following to install all the packages without creating an environment.

conda install --file requirements.txt -c conda-forge -c bioconda
pip install fast-lineage-caller==0.3.1

Required Computing Resources

  • The genotype matrices in this work are very large and, depending on the drug, can require up to 100 GB of RAM.
  • All models were run on 1 CPU core. A single model can take anywhere from a few minutes to a few hours to run.
  • Drugs with smaller datasets, such as Delamanid, will run much faster than drugs with many variants and samples, such as Ethambutol.

data/

  1. drug_CC.csv: Critical concentrations for each drug, used to binarize the MIC data.
  2. drug_gene_mapping.csv: Names of genes and tiers used to build models for each drug.
  3. drugs_loci.csv: Dataframe of resistance loci and their coordinates in H37Rv. The Start column is 0-indexed, and the End column is 1-indexed (see the coordinate sketch after this list).
  4. NotAwR by literature.xlsx: List of variants by drug that are not associated with resistance based on literature evidence.
  5. coll2014_SNP_scheme.tsv: Orthogonal lineage-defining SNPs for 62 lineages based on the Coll 2014 scheme.
  6. overview_MIC_data.xlsx: Overviews of the MIC data, including counts, sources, and numbers of resistant vs. susceptible isolates.
  7. samples_summary.csv: Dataframe of the number of samples across drugs. Includes columns for the numbers of samples with genotypes, binary phenotypes, SNP counts, MICs, lineages, and the numbers of (sample, gene) pairs with LOF and inframe mutations (to see how many get pooled).
  8. v1_to_v2_variants_mapping.csv: File mapping the variant names between the first and second versions of the catalog.
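
For instance, the coordinate convention in drugs_loci.csv (0-indexed Start, 1-indexed End) means each locus is a half-open interval, so lengths and sequence slices fall out directly. A minimal sketch in Python, assuming pandas is installed and the columns are named Start and End as described above:

import pandas as pd

# Start is 0-indexed and End is 1-indexed, i.e. each locus is the
# half-open interval [Start, End) in H37Rv coordinates.
loci = pd.read_csv("data/drugs_loci.csv")
loci["Length"] = loci["End"] - loci["Start"]

# Python slicing is also half-open, so genome[start:end] extracts a
# locus directly from a genome sequence string with no off-by-one shift.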

PCA/

  1. pca_explained_var.npy: Array of explained variances of the first 50 principal components.
  2. pca_explained_var_ratio.npy: Array of explained variance ratios (array sums to 1) of the first 50 principal components (PCs).
  3. eigenvec_50PC.csv: First 50 eigenvectors of the PCA run on 6,190 single nucleotide variant sites.
  4. minor_allele_counts.pkl.gz: 52,567 (isolates) × 6,938 (positions) matrix of minor allele counts (binary encoding) across the full genome. Uncompressed, this file is almost 3 GB, so ensure that you have enough RAM to read it in (see the loading sketch after this list).
  5. mixed_site_counts.xlsx: SNVs for PCA with the proportion of isolates containing an unfixed variant (25% < AF ≤ 75%). Used for filtering out sites at which more than 1% of isolates have an unfixed variant.
  6. Vargas_PNAS_2023_homoplasy.xlsx: List of 1,525 homoplasic sites in MTBC. Dataset S1 from Vargas et al., PNAS, 2023.
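
A minimal sketch of reading the largest of these files, assuming it is a gzipped pandas pickle as the extension suggests:

import pandas as pd

# read_pickle infers gzip compression from the .pkl.gz extension.
# Uncompressed this is almost 3 GB, so check available RAM first.
minor_allele_counts = pd.read_pickle("PCA/minor_allele_counts.pkl.gz")

# Expect 52,567 isolates (rows) x 6,938 positions (columns).
print(minor_allele_counts.shape)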

Running the Grading Algorithm

Raw Data

Due to their large size, the raw genotypes, phenotypes, and MICs are available for download from the releases page of this repository. Each drug has a separate folder containing genos_1.csv.gz, phenos_binary.csv, and phenos_mic.csv. These were created by concatenating individual CSV files from this repository.

To use these data, create a new directory (for example, "analysis_dir") and set the output_dir parameter in the config.yaml file (described below) to this directory name. Place each drug folder into this output directory, keeping the 3 files for each drug in separate drug-specific subfolders as in the release.
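
For example, with output_dir set to analysis_dir and assuming the drug folders keep the names they have in the release, the layout for two of the drugs mentioned above would be:

analysis_dir/
    Delamanid/
        genos_1.csv.gz
        phenos_binary.csv
        phenos_mic.csv
    Ethambutol/
        genos_1.csv.gz
        phenos_binary.csv
        phenos_mic.csv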

When you run the scripts, they write their results to the same output_dir that holds the raw data. The scripts skip analyses that have already been done, so if a run is interrupted, it will resume at the next incomplete step, detected from the files that have already been created.

0. Preprocessing (preprocessing/)

Before running any models, run the two scripts in the preprocessing directory in numerical order, as shown after this list.

  1. preprocessing/01_make_gene_drug_mapping.py creates the file data/drug_gene_mapping.csv, which maps gene names to each drug and facilitates constructing the model input matrices.
  2. preprocessing/02_samples_summary_andPCA.py generates the 50 eigenvectors used for population structure correction and saves them to PCA/eigenvec_50PC.csv. Along the way, it writes a dataframe of minor allele counts (PCA/minor_allele_counts.pkl.gz) and arrays of the explained variances and explained variance ratios of the 50 principal components (PCA/pca_explained_var.npy and PCA/pca_explained_var_ratio.npy).
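
For example, from the root of the repository (a sketch assuming the scripts take no arguments; if your checkout expects config.yaml or other arguments, add them accordingly):

python preprocessing/01_make_gene_drug_mapping.py
python preprocessing/02_samples_summary_andPCA.py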

1. Model Scripts (model/)

All samples with any missing or unfixed variants were excluded from models. This is done at the model level, so a sample can be excluded from one model but not another for the same drug. scikit-learn cannot fit models with NaNs, and imputation can introduce biases, so we decided to drop all samples with missingness, as in the sketch below.
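
A minimal illustration of this kind of row-wise exclusion, assuming a genotype matrix in which missing and unfixed calls are encoded as NaN (variable and column names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical genotype matrix: rows are samples, columns are variants;
# missing or unfixed calls are NaN.
genos = pd.DataFrame({"mut_A": [0, 1, np.nan], "mut_B": [1, 0, 1]})

# Drop every sample (row) with any NaN before model fitting, since
# scikit-learn estimators cannot handle NaNs.
genos_complete = genos.dropna(axis=0, how="any")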

The following model scripts each take two arguments: the config file (config.yaml) and the full name of the drug to analyze (first letter capitalized). Example invocations follow the list.

  1. 01_make_model_inputs.py: creates the input matrices for fitting a regression model.
  2. 02_run_regression.py: performs a Ridge (L2-penalized) regression and a permutation test to assess coefficient significance.
  3. 03_likelihood_ratio_test.py: performs the likelihood ratio test for every mutation in the model.
  4. 04_compute_univariate_statistics.py: computes statistics like sensitivity, specificity, and positive predictive value for each mutation, with 95% exact binomial confidence intervals.
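
For example, to run the full pipeline for Delamanid:

python model/01_make_model_inputs.py config.yaml Delamanid
python model/02_run_regression.py config.yaml Delamanid
python model/03_likelihood_ratio_test.py config.yaml Delamanid
python model/04_compute_univariate_statistics.py config.yaml Delamanid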

Parameters in the config file (an illustrative config follows the list):

  • input_dir: Directory where all input directories are stored. Contains the subfolders "grm", "phenotypes", and "full_genotypes". This parameter is not used here because the data have already been combined into the per-drug genos_1.csv.gz, phenos_binary.csv, and phenos_mic.csv files.
  • output_dir: Directory where model results will be written.
  • binary: boolean for whether to fit a binary or quantitative (MIC) model
  • atu_analysis: boolean for whether this is the normal binary analysis or a CC vs. CC-ATU analysis
  • atu_analysis_type: string "CC" or "CC-ATU" denoting which model to run in this analysis. Only used if atu_analysis = True. We did not run ATU analyses for this work.
  • tiers_lst: list of integers tiers to include in the model. We only run tier 1 models in this work.
  • pheno_category_lst: list of phenotype categories to include. The list can include the strings "WHO" and "ALL". Only used if binary = True and atu_analysis = False.
  • silent: boolean for whether silent variants (synonymous, initiator codon, and stop retained variants) should be included
  • pool_type: one of 2 strings (poolSeparate or unpooled). The first pools variants into 2 aggregate features, "LoF" and "inframe"; the second leaves all variants disaggregated.
  • amb_mode: how to handle mutations with intermediate AF. Options are DROP, AF, and BINARY.
  • AF_thresh: Only used if amb_mode = BINARY. Variants with AF > the threshold will be assigned to 1, the others to 0.
  • num_PCs: number of principal components (>= 0)
  • num_bootstrap: number of bootstrap samples
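
To make the parameters concrete, here is a sketch that writes an illustrative config.yaml from Python. Every value below is an example, not necessarily a setting used in this work; the checked-in files under /config_files hold the real parameters.

import yaml

config = {
    "input_dir": "analysis_dir",    # unused here; see note above
    "output_dir": "analysis_dir",   # where raw data and results live
    "binary": True,                 # binary phenotypes, not MIC
    "atu_analysis": False,          # no ATU analyses in this work
    "atu_analysis_type": "CC",      # ignored when atu_analysis is False
    "tiers_lst": [1],               # only tier 1 models in this work
    "pheno_category_lst": ["WHO"],  # and/or "ALL"
    "silent": False,                # exclude silent variants
    "pool_type": "unpooled",        # or "poolSeparate"
    "amb_mode": "DROP",             # or "AF" or "BINARY"
    "AF_thresh": 0.75,              # only used when amb_mode = "BINARY"
    "num_PCs": 50,
    "num_bootstrap": 1000,          # illustrative value
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)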

Some of the parameters above were kept constant throughout all analyses, but they remain parameters in case they need to be toggled in the future. We fit 9 models per drug:

  1. WHO, - silent variants, all variants unpooled
  2. WHO, - silent variants, pool LoF mutations
  3. WHO, + silent variants, all variants unpooled
  4. ALL, - silent variants, all variants unpooled
  5. ALL, - silent variants, pool LoF mutations
  6. ALL, + silent variants, all variants unpooled
  7. MIC, - silent variants, all variants unpooled
  8. MIC, - silent variants, pool LoF mutations
  9. MIC, + silent variants, all variants unpooled

Config files with the appropriate parameters for these models are in the /config_files directory. An example script that runs all drug models and the grading algorithm is given in run.sh.

2. Grading Algorithm (/grading)

After running the scripts in /model, run the two numbered scripts in /grading to apply the grading algorithm. It will be run on all the models found in the specified folders; example invocations follow the list.

  1. 01_get_single_model_results.py: Combines the permutation test, LRT, and univariate statistics results into a single table for each (model, drug) pair. Results of all logistic regression models for a single drug are written to a single Excel file in a new /results directory at the top level of the repository. The only required argument is config.yaml, which gives the directory in which the model results are stored.
  2. 02_combine_WHO_ALL_results.py: Integrates results from the different models, derives a consensus grading for each (drug, variant) pair, and writes it to an output file. Required arguments: config.yaml and the path of the output file for the regression-based catalog.
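
For example (the catalog output path is illustrative):

python grading/01_get_single_model_results.py config.yaml
python grading/02_combine_WHO_ALL_results.py config.yaml results/regression_catalog.csv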

The regression-based catalog produced by this work is in the file /results/Regression_Final_June2024_Tier1.csv.

3. Resistance Predictions (/prediction)

  1. catalog_model.py: Uses the final regression catalog (created by 02_combine_WHO_ALL_results.py) to make resistance predictions. Any isolate that contains a Group 1 or Group 2 mutation is predicted resistant (see the sketch after this list).
  2. catalog_model_SOLO.py: Does the same as the above script for the "SOLO INITIAL" results. The "SOLO FINAL" results were taken from Table 3 of the WHO report.
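
A minimal sketch of this prediction rule, with hypothetical variant names and an invented two-isolate table (the real script reads the catalog produced by 02_combine_WHO_ALL_results.py, and variant naming may differ):

import pandas as pd

# Hypothetical per-isolate variant calls and a hypothetical set of
# variants graded Group 1 or Group 2 in the regression-based catalog.
variants = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2"],
    "variant": ["rpoB_p.Ser450Leu", "katG_p.Ser315Thr", "gyrA_p.Ala90Val"],
})
group_1_2 = {"rpoB_p.Ser450Leu", "katG_p.Ser315Thr"}

# An isolate is predicted resistant if it carries any Group 1/2 mutation.
variants["hit"] = variants["variant"].isin(group_1_2)
predicted_resistant = variants.groupby("sample_id")["hit"].any()
print(predicted_resistant)  # s1: True, s2: False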