Detecting archaic sequence segregation in modern human genomes
Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now extinct hominins like Neanderthals and Denisovans. We describe a novel probabilistic method called IBDmix to identify introgressed hominin sequences, which unlike existing approaches, does not use a modern reference population.
IBDmix consists of 3 steps:
- Generate custom genotype file
- Detect introgressed regions
- Sort and filter results
The provided CMake lists will automatically build the executables, but requires version 3.14 or greater, git and the repository to be cloned. To use:
git clone https://github.com/PrincetonUniversity/IBDmix.git
cd IBDmix
pwd
# /path/to/repo/IBDmix
mkdir build
cd build
cmake ..
cmake --build .
The executables generate_gt
and ibdmix
will be left in IBDmix/build/src
.
Note that all chromosomes must be integers in the input vcfs and masked bed files. Vcf and mask files should be split by chromosome. See the included snakefile as an example pipeline implementation.
generate_gt
has the following options:
- -h, --help Print the help information and exit.
- -a, --archaic The archaic vcf file. May contain multiple archaic samples but only one will be utilized for IBDmix. Must be uncompressed text.
- -m, --modern The modern vcf file. Must be uncompressed text.
- -o, --output The merged genotype file output. Written as uncompressed text. All files must be specified to run.
ibdmix
takes the following options:
- -h, --help Print the help information and exit.
- -g, --genotype
The input genotype file produced by
generate_gt
. Required. - -o, --output The output file. Format is tab-delimited text with columns for individual ID, chromosome, start, end, and LOD score. The regions are half open with [start, end), where the end position is the next position in the input genotype file. Required.
- -s, --sample File specifying which samples to consider. One individual per line, must match header in genotype file exactly. If not specified all samples will be utilized (all columns except archaic).
- -n, --archaic
Name of archaic individual. Must match column name.
If not specified, taken as the first column of the
genotype file (default output order for
generate_gt
). - -r, --mask Bed file of masked regions to include in the analysis. If not specified all sites are considered.
- -d, --LOD-threshold Threshold value of log(odds) for emitting regions. Default: 3.0
- -m, --minor-allele-count-threshold Threshold count for filtering minor alleles. For each site, the number of occurrences in the sample population is tabulated. If fewer than this count are found, the LOD score for the site is taken as 0. Default: 1
- -a, --archaic-error Allele error rate for archaic DNA. Default: 0.01
- -e, --modern-error-max Maximum allele error rate for modern humans. Default: 0.0025
- -c, --modern-error-proportion
Ratio between allele error rate and minor allele
frequency for modern humans. The error rate for a
given site is taken as the minimum of the
-e
option and the product of this value with the allele frequency. Default: 2 - -i, --inclusive-end A switch to change the default behavior of the end position. By default, the reported end will be the next position in the genotype file which produces a decreasing LOD. It corresponds to the region of [start, end). With inclusive end set, the reported end is the last position with increasing LOD, or [start, end].
- -t, --more-stats
A switch to report additional region-level statistics.
Additional columns include:
- sites: total number of sites in genotype file in region
- positive_lods: number of sites with positive LOD scores in the region
- negative_lods: number of sites with negative LOD scores in the region
- mask_and_maf: sites which are both in the masked region and fail MAF cutoff
- in_mask: sites in the mask but pass MAF
- maf_low: sites failing MAF with low frequency
- maf_high: sites failing MAF with high frequency
- rec_2_0: sites failing mask or MAF but still considered due to a genotype of 2 in archaic and 0 in modern
- rec_0_2: same as rec_2_0 but with 0 in archaic and 2 in modern
- -w, --write-snps Include positions with positive LOD scores as a comma-separated list.
- --write-lods Include LOD scores of positive sites as a comma-separated list. Same order as write-snps output (e.g. zip the two entries to get position/lod values).
Once a run of ibdmix
completes, it is informative to filter the results
on a range of LOD values and length cutoffs. It is faster to perform this
operation on the ibdmix
output than to rerun with different options.
summary.sh
takes up to five options in order:
- length cutoff The minimum length to emit a region. Use 0 for all.
- LOD cutoff The minimum LOD score to emit a region. Use 0 for all.
- population The population to label each row.
- input Optional, the uncompressed input file.
- output Optional, the uncompressed output file. Contains only the regions passing both filters, sorted by ID then start position of the region.
If neither input nor output is specified, standard input and output are
utilized. To specify only one value, use -
to indicate standard
input/output. For example, to filter length greater than 1000 and LOD greater
than 5 from ibd_output.txt and generate compressed output.gz
summary.sh 1000 5 population ibd_output.txt - | gzip > output.gz
The above workflow is automated through snakemake. In the provided snakefiles/Snakefile, all steps from compilation to summary generation are encoded. By default, summary files will be generated but the workflow can be customized by setting options in snakefiles/config.yaml. All files are compressed with gzip.
The easiest way to get started is to set your paths in the config file and
run snakemake
in the snakefiles directory. If mask statistics are generated,
bedtools needs to be installed and included in PATH, otherwise run snakemake
with the --use-singularity
flag to download and use a docker container instead.
If a cmake module is present, adding the --use-envmodules will activate it prior to compilation. Due to the dependence on envmodules, the snakefile requires snakemake >= 5.10.
A note on wildcards.
Wildcards are specified in snakemake by enclosing them in curly braces. When changing options in the config file, it is important that all wildcards are kept and spelled correctly. For example, say you want to change the summary output, which is currently
altai_1kg_{population}_{chrom}.gz
. As long as {chrom} and {population} are present, the rest can change. If you want to place each population in a separate directory, this could be changed to{population}/altai_1kg_chr_{chrom}.txt.gz
. The output files would look likeASN/altai_1kg_chr_20.txt.gz
. However,{pop}/altai_1kg_{chr}.gz
will throw an error because snakemake expects to find specific wildcards.
- compiler and options: These are utilized to make the executables from cpp source code. The compiler requires C++ 11 standard.
- exe_root: Where to place compiled executables.
- source_root: The path to the IBDmix directory.
- archaic_vcf, modern_vcf: Required inputs
- genotype_file, ibd_output: Temporary intermediate files
- ibd_summary: Final output. If no summary values are supplied, ibd_output is the final target instead.
- sample_file: The input sample files for
ibdmix
. If removed, all samples will be used and 'ALL' will be the population name. - mask_file: The bed file to use for each chromosome. If removed, no masking will be performed during IBDmix.
- IBDmix options: All parameters for performing ibdmix. The genotype, output, mask, and sample are handled automatically. Other options can be supplied here (including archaic sample name, more stats and inclusive end).
- IBDmix mask_stats: If set to True and a mask file is provided, additional columns will be determined including the total number of BP masked in each region and the largest masked area within each region. This values can be used for further filtering.
- IBDmix summary_lod: List of all LOD values for filter. If sorted output is desired, a value of 0 will keep all regions. Remove this entry to just produce the raw IBD files.
- IBDmix summary_length: List of all length values for filter. If sorted output is desired, a value of 0 will keep all regions. All permutations of length and lod are produced.
- --notemp: Without this option, genotype files and raw IBD calls will be automatically deleted when they are no longer needed by the workflow. If you expect to perform multiple executions utilize this flag to keep those files around.
- --delete-temp-output: Once you are done with temp files this will find and delete all specified by the current config.yaml
- --cores: Specify the number of threads available to snakemake. With more cores, more jobs can be run simultaneously.
- --dryrun: See what rules will run without running any of them.
GNU GPLv3