Build Status codecov

introgression

Discovering yeast admixture through sequencing

Background

TBA

Installation

All required packages are specified in the conda environment located in code/environment.yml. The introgression environment can be generated with

conda env create -f environment.yml

To access the command line bindings of the main analyze class, install the setup file using pip with

conda activate introgression
pip install --editable .

while in the code directory.

Usage

Configuration

A set of initial parameters are provided in code/config.yaml which need to be set specifically for your system and dataset.

Strings of the form __KEY__ are substituted during execution and are used as a shortcut. For example, with output\_root set to /data/results, the value __OUTPUT_ROOT__/genes/ becomes /data/results/genes/.

Strings of the form {state} are used for wildcards within the code. Their location and surrounding characters can change, but the wildcard must be the same. For example, blocks_{state}.txt can be changed to {state}_with-block.txt but not blocks_{st}.txt.

Command Line

With the package installed and the conda environment activated, main methods are accessed with the introgression command. Some documentation is provided by adding the argument --help to introgression or any of its subcommands.

introgression

Options include:

  • --config: specify one or more configuration files. Files are evaluated in order. Conflicting values are overwritten by the newest file. This allows a base configuration for the system and analysis-specific configurations added as needed.
  • verbosity: set by varying the number of v's attached to the option, with -v indicating a log level of critical and -vvvvv indicating debug logging.
  • --log-file: Optional location to store log information. Default is stdout. If set and run on an interactive shell, some commands will display progress bars.

Most subcommand options will overwrite corresponding values in the config file. Leaving options unset without supplying a value in the config file will raise an error. Some values are only set through the config file including the list of chromosomes and the known states.

Available subcommands are:

predict

The predict subcommand uses an HMM to predict regions of introgression from alignment files. Several outputs are used in subsequent steps which refine the predicted introgressed regions.

Test strains on which to predict introgression can be supplied in the config file under the name strains or pulled from the directory structure of test_strains.

Available options are:

  • --alignment: input alignment file location with wildcards for {prefix} (optional), {strain} and {chrom}.
  • --prefix: An optional wildcard value for alignment files. If left blank, will default to the known states joined with an underscore. Leaving the {prefix} wildcard out of the alignment file will prevent its use as well.
  • --blocks: An output file containing the regions predicted to belong to the given state. Must contain the {state} wildcard which will be populated with a known state during analysis. Columns are the strain, chromosomes, the predicted state, start position, end position, and the number of sites supporting the assignment.
  • --test-strains. If strains are not provided in the config, this file with {strain} and {chrom} wildcards will be used to populate the strains for prediction.
  • --hmm-initial: Output file with the initial parameters of the HMM for each strain.
  • --hmm-trained: Output file with HMM parameters following Baum-Welch training.
  • --positions: Output file with indices of sites which are non-gapped sequences which differ between reference alignments.
  • --probabilities: Output file with the probability of each position belonging to the master reference strain.
  • --threshold: The threshold value to apply when filtering the predicted HMM path through the test sequence. Either a float, indicating cutoff probability, or 'viterbi' to indicate the Viterbi algorithm should be used to find the most likely sequence of states.
  • --only-poly-sites/--all-sites: A switch to indicate if all non-gapped, sequenced sites should be considered during HMM training, or only polymorphic sites. Default is only polymorphic sites.
id-regions

id-regions prepends a column to block files with a unique region id, of the form 'r#'. Regions are sorted by the start position of the region. Changing the states to label will affect the region numbers as a different set of regions will be considered.

Available options are:

  • --blocks: The input file to label with the wildcard {state}. This is the file produced by predict in the previous step.
  • --labeled: The output file, also containing {state} wildcard.
  • --state: May be specified multiple times to indicate which states to add labels to. Leaving unset will use the states in the config file (recommended).
summarize-regions

Analyzes the regions predicted to be introgressed. Several columns are added to the block file containing information about the region including the number of matching sites to each state.

Available options are:

  • --state: May be set multiple times for each state to summarize. Leaving unset will default to all states in the config file.
  • --labeled: the labeled block file with {state} wildcard created in the previous step.
  • --masks: Sequence mask files with {strain} and {chrom} wildcards.
  • --alignment: The input alignment file similar to the predict option.
  • --positions: The position file created during predict.
  • --quality: The output file with a {state} wildcard.
  • --region: The alignment for each region in the labeled file with {state} wildcard. Each state file contains all regions for the state.
  • --region-index: A pickled python dictionary used for random access into the region file. Must have {state} wildcard.
filter-regions

From the quality files produced in summarize-regions, filter regions based on several criteria including those with weak support for the alternative hypothesis and those which can be assigned to multiple alternative states.

Regions passing the 'introgressed filter' satisfy all of the following:

  • fraction of gaps masked in reference > 0.5
  • fraction of gaps masked in predicted state > 0.5
  • number of matches to predicted > 7
  • number of matches to predicted > number of matches to reference
  • sequence identity with predicted state is higher than reference
  • sequence identity with reference is > 0.7

Regions passing the 'ambiguous filter' match only the predicted state. No other state has:

  • sequence identity >= sequence identity with predicted state * threshold
  • matching bases >= matching bases with predicted state * threshold

Available options are:

  • --region: The region file from summarize-regions.
  • --region-index: The region index file from summarize-regions.
  • --quality: The quality file produced by summarize-regions with {state} wildcard.
  • --introgress-filter: The output file with only regions passing introgression filter. Must contain {state} wildcard.
  • --introgress-inter: An output file with all regions. Includes the reason for filtering by the introgression filter or blank if it passes. Must contain {state} wildcard.
  • --ambiguous-filter: Output file containing only regions which pass the ambiguous filter after passing the introgression filter. Must contain {state} wildcard.
  • --ambiguous-inter: Contains all regions from introgression filter with a column for the reason the region failed ambiguous filtering. Must contain {state} wildcard.
  • --thresh: The threshold to apply to the ambiguous filter.
  • --filter-sweep: If set and threshold values are supplied as arguments, will output summary information for applying the ambiguous filter with various threshold values.

filter-regions accepts multiple threshold values as arguments to test and output to the filter-sweep file. Sample usage would be

introgression --config config.yml \
    filter-regions
    --threshold 0.995 \
    --filter-sweep sweep.txt \
    0.99 0.98 0.8  # these are the sweep arguments

where 0.99, 0.98 and 0.8 are used as test threshold values as summarized in sweep.txt. Note that the ambiguous filter will only use the threshold 0.995 in this example.

summarize-strains

Summarize strains produces summary information for each test strain including the number of regions and bases assigned to each hidden state, filtered at each stage, and ambiguous between states.

Available options are:

  • --introgress-inter: The introgressed filter file as used in filter-regions.
  • --ambiguous-inter: The ambiguous filter file as used in filter-regions.
  • --strain-info: Tab separate table with information on the strain to include with the summary output. Columns should be the strain name, alternate name, location, environment, and population.
  • --state-counts: The summary output file.

License

TBD