MIDAS-strains

Estimate strains from reads mapped to pan-genomes from the MIDAS database

Usage

Download the new package using github; one new python library scipy is required if not already installed:
git clone https://github.com/snayfach/MIDAS-strains

Note that the files reads_1.fa and and reads_2.fa are simulated FASTA files containing 20 strains from 5 different human gut species at equal relative abundances (5%). The abundances.tsv file lists the exact strains.

MIDAS should already be installed; run MIDAS as usual for the species module to estimate species abudnances:

export PYTHONPATH=$PYTHONPATH:/path/to/MIDAS
export PATH=$PATH:/path/to/MIDAS/scripts
export MIDAS_DB=/path/to/midas_db_v1.2

run_midas.py species midas_out -1 reads_1.fa -2 reads_2.fa -t 10

Now we use MIDAS to map reads against the pangenomes. Here we can specify the 5 species that we know are in the community to force MIDAS to map reads against these pangenomes regardless of their coverage in the test dataset. Also, I found slighly better performance in the test dataset when increasing the mapping % identity to 99%:

run_midas.py genes midas_out -1 MIDAS-strains/reads_1.fa -2MIDAS-strains/reads_2.fa -t 10 --mapid 99 --species_id Bacteroides_fragilis_54507,Bacteroides_vulgatus_57955,Bifidobacterium_longum_57796,Clostridium_perfringens_56840,Parabacteroides_distasonis_56985

Now, use the new script to estimate strain frequencies from the MIDAS output:
MIDAS-strains/midas_strains.py --indir midas_out

You should see some output that looks something like this:

estimating strain abundances for 5 species:
  Clostridium_perfringens_56840
  Bacteroides_vulgatus_57955
  Bifidobacterium_longum_57796
  Bacteroides_fragilis_54507
  Parabacteroides_distasonis_56985

1. Clostridium_perfringens_56840
reading read-depth (i.e. coverage) of pangenome genes from MIDAS output
  total genes: 17315
clustering reference strains at 90.0% gene content similarity
  total strains: 12, clustered strains: 12
selecting subset of genes found in <=50.0% of reference strains to use for strain estimation
  retained genes: 17048
obtaining initial estimation of strain frequencies
  total strains detected: 9
re-estimating frequencies after removing strains with <10.0% relative frequency
  total strains detected: 4
...

With an output file like this:

species_id      genome_name     genome_id       clustered_ids   relative_abundance      count_reads     coverage
Bacteroides_fragilis_54507      Bacteroides fragilis str. S6R6  1339321.3       1339321.3,1339320.3,1339322.3,1339319.3,1339317.3       0.0514508615857 19.3461962084
   0.194123465614
Bacteroides_fragilis_54507      Bacteroides fragilis str. S24L26        1339324.3       1339324.3,1339325.3,1339323.3   0.0480365693318 18.0623777101   0.181241383088
Bacteroides_fragilis_54507      Bacteroides sp. 2_1_56FAA       665938.3        665938.3        0.0521754004368 19.6186322832   0.196857141754
Bacteroides_fragilis_54507      Bacteroides fragilis str. 1007-1-F #7   1339292.3       1339292.3,1339293.3,1339340.3,1339295.3,1339294.3,1339339.3,1339338.3,1339337.3 0.0504578045777 18.9727937982   0.190376673782

Please notice the column clustered_ids. This is important!! This represents redundant strains that are very, very similar. For sample, the B. fragilis strain listed above (genome_name = Bacteroides fragilis str. S6R6 and genome_id = 1339321.3) is estimated to be present at 0.051 (5.1%) relative abundance in the test dataset. But this exact strain is not present in the test dataset (see abundances.tsv). However the strain with genome_id = 1339317.3 is listed in abundances.tsv and is clustered together with 1339321.3. So the output we get is consistent with what we expect.

For a full mapping of genome ids to genome names, see the MIDAS file: /path/to/midas_db_v1.2/genome_info.txt

Finally, you can use the -h flag to see a few options, but I've already set good defualt values, so this can be mostly ignored:

./midas_strains.py -h
description: use reads mapped to pangenomes to estimate strain frequencies

usage: midas_strains.py --indir <indir> [options]

optional arguments:
  -h, --help            show this help message and exit
  --indir CHAR          Path to input directory with MIDAS output for one sample
                        Should contain subdirectories: <indir>/species, <indir>/genes
  --midas_db CHAR       Path to reference database
                        By default, the MIDAS_DB environmental variable is used
  --min_strain_freq FLOAT
                        Minimum relative frequency of strains per species (default=0.10, range=0.0-1.0)
                        Useful for minimizing spurious, low-frequency predictions
  --max_similarity FLOAT
                        Maximum gene content between reference strains (default=0.90, range=0.0-1.0)
                        Used for grouping strains into non-redundant clusters
  --max_gene_freq FLOAT
                        Maximum frequency of genes across reference strains (default=0.50, range=0.0-1.0)
                        Used for removing genes present in all reference strains, which are not informative
  --max_genomes INT     Maximum number of genomes per species (default=300, range=0-inf)
                        Species with hundreds of genomes currently cause the program to run very slowly