/nQuire

A statistical framework for ploidy estimation using NGS short-read data

Primary LanguageCMIT LicenseMIT

nQuire Manual

About

nQuire implements a set of commands to estimate ploidy level of individuals from species, where recent polyploidization occurred and intraspecific ploidy variation is observed. Specifically, nQuire uses next-generation sequencing data to distinguish between diploids, triploids and tetraploids, on the basis of frequency distributions at variant sites where only two bases are segregating.

For more background see also the publication at BMC Bioinformatics.

Get it

To use nQuire, you will need a Linux machine where the gcc compiler, as well as the libz, libm and libpthread system libraries are installed. To acquire nQuire, first clone the repository recursively:

git clone --recursive https://github.com/clwgg/nQuire

This will clone both the nQuire code, as well as the htslib module, which is its only external dependency. After cloning, first compile the submodule, and then the nQuire code:

cd nQuire
make submodules
make

This will create the nQuire binary, which you can copy or move anywhere for subsequent use.

Updating

When updating to the current version, please make sure to also update the submodules:

git pull origin master
git submodule update
make submodules
make

Usage

Generation of .bin file

nQuire requires as input only a .bam alignment file, which contains NGS reads mapped to a suitable reference genome. The required information is extracted from the .bam alignments into a binary file (suffix .bin), which is generated using the create subcommand. By default, only sites where two bases are segregating at a minimum frequency of 0.2 are reported. These settings are recommended to use for the subsequent analyses of ploidy. The minimum frequency may be adjusted using the -f flag. It is also possible to establish filters for mapping quality and minimum coverage through the -q and -c options, respectively. By default, the minimum mapping quality is set to 1, and the minimum coverage to 10. It is important to establish a reasonable minimum coverage cutoff, as base frequencies at biallelic sites have to be assessed.

nQuire create -b input.bam -o base

The .bin file can also store reference IDs and positional information (genomic coordinates) when the create subcommand is used with the -x flag.

nQuire create -b input.bam -o base -x

Interaction with BED files

It is now possible for nQuire to interact with BED files in two ways. In both cases the BED file has to include the following columns:

  1. Sequence/Chromosome name
  2. Start coordinate (0-based)
  3. End coordinate (1-based)
  4. Region name

Per default when a BED file is supplied, nQuire will create a separate .bin file for each region. The output file will be named [base]-[region_name].bin. This allows running the model by region, chromosome or window, for example to detect aneuploidies.

nQuire create -b input.bam -o base -r region.bed

Alternatively, all BED regions can be concatenated into one file. This is useful for example to query specific annotations or regions targeted with capture experiments. The output file will have the name [base]-bedcc.bin.

nQuire create -b input.bam -o base -r region.bed -y

Visualization of .bin file

The file produced by the create subcommand can be visually inspected using either the view or histo subcommands. The histo subcommand will produce a ASCII histogram on the command line based on the base frequencies stored in the .bin file.

nQuire histo base.bin

The view subcommand allows to inspect the coverage and base counts at all positions in the file, as well as filter them to produce a new .bin file.

nQuire view base.bin

The columns in the output of the view subcommand represent the following:

  1. Coverage
  2. Base count 1
  3. Base count 2

If an extended .bin was created using the -x flag of the create subcommand, the view subcommand can interact with this file too. Since it is much easier to store reference IDs instead of reference names, the view subcommand allows to re-annotate these reference IDs with their corresponding names if the .bam file that the .bin file was created from is handed to it via the -a flag.

nQuire view base.bin -a input.bam

The columns in the output of the view subcommand used on an extended .bin file represent the following:

  1. Reference sequence (ID)
  2. Reference position (0-based)
  3. Coverage
  4. Base count 1
  5. Base count 2

Using the -f flag of the view subcommand one can query the type of the .bin, which so far is 0 for the default format, and 1 for the extended format.

nQuire view -f base.bin

Denoising

In many cases, the base frequency histogram contains a high baseline of noise, which results mostly from mismappings and is elevated in highly repetitive genomes. This can to some extend be handled using a stringent mapping quality cutoff in the creation of the .bin (e.g. -q 30). To tackle this problem more efficiently, nQuire also contains the subcommand denoise. It uses a Gaussian Mixture Model with Uniform noise component (GMMU, for more information please refer to the next section “Model” or the publication referenced above) to assess the extent of this uniform noise, and scales it down allowing to easily detect peaks in the histogram of base frequencies.

nQuire denoise base.bin -o base_denoised

The denoise subcommand also returns the percentage of information kept after the denoising procedure. If this value is suspiciously low, there might not be enough data left for subsequent testing. Please inspect the histogram also with the histo command before and after denoising to visually assess the shape of the distribution of base frequencies.

Assessing ploidy level

The main testing framework of nQuire utilizes a Gaussian Mixture Model (GMM, please refer to the next section “Model” as well as the publication referenced above), which describes the histogram as a mixture of Gaussians with varying means and mixture proportions. The likelihood of certain assumptions based on this model given the empirical data is maximized using an Expectation-Maximization (EM) algorithm.

The most important subcommand using the GMM is lrdmodel. This is a mixture of the three fixed models from modeltest and the free model in estmodel, as all four of those models are used. Subsequently, the maximized log-likelihood of the three fixed models are subtracted from the maximized log-likelihood of the free model to get three delta log-likelihoods. As the log-likelihood of the free model can basically be seen as the “optimum” for the empirical data under the assumptions of this model, the higher the delta log-likelihood of a fixed model, the further it is from the optimum and the lower is the support for the corresponding ploidy level.

nQuire lrdmodel base.bin

Since this is the major analysis step of the tool, it allows for multithreading over multiple input files. These may be different samples, or different regions of the same bam file split by BED regions (see section on the create subcommand).

nQuire lrdmodel -t n_threads file1.bin [file2.bin ...]

The output from lrdmodel contains 8 tab-separated columns:

  1. Filename
  2. Free model maximized log-likelihood
  3. Diploid fixed model maximized log-likelihood
  4. Triploid fixed model maximized log-likelihood
  5. Tetraploid fixed model maximized log-likelihood
  6. Diploid delta log-likelihood
  7. Triploid delta log-likelihood
  8. Tetraploid delta log-likelihood

The modeltest subcommand maximizes the likelihood under the assumption of either di-, tri- or tetraploidy where mean and mixture proportions are fixed, and only the standard deviation of the Gaussians is varied.

nQuire modeltest base.bin

It returns the log-likelihood for each of the assumed ploidy levels, together with the standard deviation of the Gaussians included in that model.

When running the estmodel subcommand no assumptions are made and the EM-algorithm maximizes the likelihood of a mixture of three Gaussians given the empirical data freely.

nQuire estmodel base.bin

The result is the maximized log-likelihood when parameters can be varied freely, as well as all parameter estimates for the three Gaussians (mixture proportion, mean and standard deviation).

The simpler framework just uses ideal histograms under the assumption of each of the ploidy levels (diploid: N(0.5,0.05); triploid: N(0.33,0.04) + N(0.67,0.04); tetraploid: N(0.25,0.04) + N(0.5,0.05) + N(0.75,0.04)) and does linear regression on the y-values of the empirical and the ideal histograms. The subcommand for that is histotest.

nQuire histotest base.bin

histotest reports for each ploidy level the sum of squared residuals (SSR) of empirical vs. ideal histograms, as well as the slope, its standard error and the R2 of the regression of y-values. A good fit between ideal and empirical histograms is characterized by low SSR, positive slope with low standard error, as well as a high R2.

Model

At the heart of nQuire is a Gaussian Mixture Model (GMM) which is used in the modeltest, estmodel and lrdmodel subcommands. For the denoise subcommand it is extended to a Gaussian Mixed Model with Uniform noise component (GMMU).

The GMM aims to model the read frequency histogram as a mixture of up to three Gaussian distributions between 0 and 1, that are scaled relatively to each other by some mixture proportion. This model can be used for parameter estimation through maximum likelihood estimation using an Expectation-Maximization (EM) algorithm, as well as model comparison when we have specific expectations about our data. We use up to three Gaussians, because the expected distributions of read frequencies at biallelic sites for each of our ploidy levels of interest are one Gaussian with mean 0.5 for diploid, two Gaussians with means 0.33 and 0.67 for triploid, and three Gaussians with means 0.25, 0.5 and 0.75 for tetraploid. We can fix these values in the GMM to assess the maximal log-likelihood under each of the three assumptions (three fixed models). Additionally we can estimate the parameters without constraints to get the maximal log-likelihood under complete freedom (one free model). The comparison of maximized log-likelihoods under the fixed models to the free model then allows us to assess how close each of these three ploidy assumptions are to the optimum under the GMM model.

For the denoise command there is a fourth component added to the three Gaussians, which has uniform probability density and only its mixture proportion can be varied. Together with a free model for the three Gaussians, the model under maximized likelihood allows us to assess the proportion of uniform noise in the histogram.