AD-LIBS is a local ancestry inference tool for low-coverage resequencing data. It is described in this paper. The goal of AD-LIBS is to take a low-coverage genome from one or more hybrid/admixed individuals, along with one or more low-coverage genomes from representatives of two ancestral populations, and produce a map of where in the genome the hybrid individual derives ancestry from each ancestral population. Maps consist of fixed-width windows across the genome, and each window is labeled as homozygous for ancestry from one or the other ancestral population, or heterozygous for ancestry. Because of this, AD-LIBS is only suited for analysis of diploid organisms.
To circumvent problems with identifying heterozygous sites in low-coverage samples, AD-LIBS reads in one FASTA file per individual. FASTA files must be aligned to the same reference genome and should be "pseudo-haploid," where every base is randomly sampled from one or the other parental haplotype, as has been done in studies like this one. To create pseudo-haploid FASTA files, we recommend using the pu2fa program after mapping reads.
Results are in BED format, with a fourth column denoting ancestry in a given window (AA for homozygous population A, AB for heterozygous, and BB for homozygous population B). BEDTools contains many programs helpful for downstream analysis of these files.
AD-LIBS requires the following Python packages:
An easy way to obtain the first three is to use Anaconda, which can install these packages and their dependencies, in an environment built specifically for this program. Alternatively, mamba is a faster alternative -- just replace "conda" with "mamba" in commands if it's installed.
To set up dependencies, you can just use the included environment.yml file:
conda env create -f environment.yml
Then activate the new conda environment, called adlibs, before running:
conda activate adlibs
If you're using Mac OS X, you also need to install the standalone version of argp through Homebrew: once Homebrew is installed, just run
brew install argp-standalone
AD-LIBS equires gcc for compiling C components, as well as zlib for reading compressed FASTA files.
To compile C programs, simply type make
in the main project directory.
Binary files will be stored in /bin
and run by the main Python program.
Note that src/kseq.h
is an excellent FASTA parser in C, taken from
HTSLib, part of the
SAMTools codebase.
The Python program bin/adlibs.py is the main program with which users should interact.
AD-LIBS works in pieces; this program interprets input, bin/adlibs_score
reads in FASTA files and prints out scores computed in windows across sequences,
and bin/adlibs_hmm.py
uses the Pomegranate package to build and run the
Viterbi algorithm on the HMM. Additionally, bin/allele_freq_drift.py
uses the
diffusion approximation to the genetic drift problem to compute an important
probability used by the HMM, and adlibs_math.py
and adlibs_dists.py
both
contain code used by adlibs_hmm.py
.
bin/adlibs.py
can use command-line input (run with -h for help) or, since
the many parameters can get cumbersome, a configuration file containing all
the information needed for one run. Whether or not you provide a configuration
file, AD-LIBS will create a new one with all given parameters along with your
output files, in case you want to run AD-LIBS again with all or most of the
same parameters.
The example directory contains an example configuration file, along with
(short) pseudo-haploid FASTA sequences of genomic scaffolds for two brown bears
(OFS01_s317.fa
and Swe_s317.fa
), two polar bears (PB12_s317.fa
and
PB105_s317.fa
), and two admixed ABC Islands brown bears (ABC01_s317.fa
and
ABC02_s317.fa
). adlibs.config
contains all the parameters needed to run
AD-LIBS on this data set, with polar bears as ancestral population "A" and
brown bears as ancestral population "B". All output will be prefixed with
"example." and the adlibs.config
file contains comments explaining the
parameters.
To run AD-LIBS on this example data set, run
bin/adlibs.py example/adlibs.config
while in the main project directory.
When complete, you will see
BED files
containing results in the example directory. Each line is a genomic window,
with start and end coordinates, along with a label (AA for homozygous
population A ancestry, AB for heterozygous ancestry, and BB for homozygous
population B ancestry). You can try altering parameters and re-running as you
wish to see how it affects results.
If AD-LIBS creates one or more empty BED files, the most likely problem is that one or more of your genomes has too many ambiguous bases (Ns). In this case, you can try to increase the skip threshold parameter -- this is the percent of all bases in each window allowed to be N before the window is skipped -- so setting it to 0.75 means AD-LIBS will still process a window that is up to 75% N bases. If this doesn't work, you can increase the window size to try to increase the number of valid bases in each window. Otherwise, you can re-generate your pseudohaploid FASTA files using lower quality filters to allow more bases to come through.
AD-LIBS was described in the article AD-LIBS: inferring ancestry across hybrid genomes using low-coverage sequence data, NK Schaefer, B Shapiro, and RE Green, BMC Bioinformatics, 2017. We'd appreciate a citation if you use it in your research.