Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

VIB-UGent Center for Plant Systems Biology—Evolutionary Systems Biology Lab

ksrates

ksrates is a tool to position whole-genome duplications* (WGDs) relative to speciation events using substitution-rate-adjusted mixed paralog–ortholog distributions of synonymous substitutions per synonymous site (KS).

* or, more generally, whole-genome multiplications (WGMs), but we will simply use the more common WGD to refer to any multiplication

Quick overview

To position ancient WGD events with respect to speciation events in a phylogeny, the KS values of WGD paralog pairs in a species of interest are often compared with the KS values of ortholog pairs between this species and other species. For example, it is common practice to superimpose ortholog and paralog KS distributions in a mixed plot. However, if the lineages involved exhibit different substitution rates, such a direct, naive comparison of paralog and ortholog KS estimates can be misleading and result in phylogenetic misinterpretation of WGD signatures.

ksrates is a user-friendly command-line tool and Nextflow pipeline to compare paralog and ortholog KS distributions derived from genomic or transcriptomic sequences. ksrates estimates differences in synonymous substitution rates among the lineages involved and generates an adjusted mixed plot of paralog and ortholog KS distributions that allows one to assess the relative phylogenetic positioning of presumed WGD and speciation events.
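To give a feel for what a rate adjustment can look like, the sketch below uses a generic outgroup-based relative rate calculation on made-up numbers. It is an illustration only and is not taken from the ksrates source code: the helper name `adjust_ks`, the three input values and the choice of formula are all assumptions. Given ortholog KS estimates between the focal species A, a sister species B and an outgroup O, it splits the A-B distance into the two branch contributions and rescales the divergence to the focal species' own rate.

```shell
# adjust_ks KS_AB KS_AO KS_BO  (hypothetical helper, not a ksrates command)
# Prints: branch length to A, branch length to B, A-B divergence rescaled to the rate of A.
adjust_ks() {
    awk -v ab="$1" -v ao="$2" -v bo="$3" 'BEGIN {
        k_a = (ab + ao - bo) / 2   # KS accumulated on the branch leading to A
        k_b = (ab + bo - ao) / 2   # KS accumulated on the branch leading to B
        printf "%.2f %.2f %.2f\n", k_a, k_b, 2 * k_a
    }'
}

# Toy values: A evolves more slowly than B, so the raw A-B estimate of 0.80
# overstates the divergence as seen on the paralog KS scale of A.
adjust_ks 0.80 1.00 1.20   # prints: 0.30 0.50 0.60
```

In this toy case the unadjusted ortholog value (0.80) and the rescaled one (0.60) could fall on opposite sides of a paralog WGD peak at, say, 0.70, which is the kind of misinterpretation described above.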

For more details, see our preprint and the documentation below.

Documentation

Documentation
Tutorial
FAQ

Quick start

ksrates can be executed using either a Nextflow pipeline (recommended) or a manual command-line interface. The latter is available via Docker and Singularity containers, and as a Python package to integrate into existing genomics toolsets and workflows.

In the following sections we briefly describe how to install, configure and run the Nextflow pipeline and the basic usage of the command-line interface for the Docker or Singularity containers. For detailed usage information, a full tutorial and additional installation options, please see the full documentation.

Example datasets

To illustrate how to use ksrates, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (Elaeis guineensis) as the focal species.

  • example: a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computation, depending on the available computing resources. We advise running this dataset on a compute cluster; the ksrates Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.

  • test: a small test dataset that contains only a small subset of the sequence data for each of the species and takes only a few minutes to run. This is intended for a quick check of the tool only and can be run locally, e.g. on a laptop. The results are not very meaningful.

See the Usage sections below and the Tutorial for more detail.

Nextflow pipeline

Installation

  1. Install Nextflow. The official instructions are here, but briefly:

    1. If you do not have Java installed, install Java 8 or later, for example on Debian or Ubuntu:

      sudo apt-get install default-jdk
      
    2. Install Nextflow using either:

      wget -qO- https://get.nextflow.io | bash
      

      or:

      curl -fsSL https://get.nextflow.io | bash
      

      This creates the nextflow executable file in the current directory. You may want to move it to a folder on your $PATH, for example (writing to /usr/local/bin may require sudo):

      mv nextflow /usr/local/bin
      
  2. Install either Singularity (recommended, but see here) or Docker. This is needed to run the ksrates Singularity or Docker container, which contains all other required software dependencies, so nothing else needs to be installed.

  3. Install ksrates: When using Nextflow, ksrates and the ksrates Singularity or Docker container are downloaded automatically the first time you launch the ksrates pipeline, and are stored and reused for any further executions (see Nextflow pipeline sharing). It is therefore not necessary to install ksrates manually; simply continue with the Usage section below.
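Before launching the pipeline, a quick check along the following lines can confirm that the tools from the steps above are reachable from your shell. This is only a convenience sketch: `check_tool` is a hypothetical helper defined here, not something provided by Nextflow or ksrates.

```shell
# check_tool is a hypothetical helper, not part of Nextflow or ksrates.
# It reports whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Nextflow needs Java; the pipeline also needs Singularity or Docker.
check_tool java || true
check_tool nextflow || true
check_tool singularity || check_tool docker || true
```

If `nextflow` is reported as missing although it was downloaded, it is probably still in the directory where the installer was run and has not yet been moved to a folder on the PATH.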

Usage

We briefly illustrate here how to run the ksrates Nextflow pipeline on the test dataset.

  1. Get the example datasets.

    1. Clone the repository to get the test datasets:

      git clone https://github.com/VIB-PSB/ksrates
      
    2. You may want to copy the dataset folder you want to use to another location, for example your home folder, and then change to that folder:

      cp -r ksrates/test ~
      cd ~/test
      
  2. Launch the ksrates Nextflow pipeline. (If this is the first time you launch the pipeline, Nextflow will first download ksrates and the ksrates Singularity or Docker container.)

    • Running locally on a laptop/desktop:

      When using Singularity (recommended):

      nextflow run VIB-PSB/ksrates --config config_elaeis.txt -with-singularity docker://vibpsb/ksrates:latest
      

      Or when using Docker:

      nextflow run VIB-PSB/ksrates --config config_elaeis.txt -with-docker vibpsb/ksrates:latest
      

      The required --config parameter specifies the (path to the) pipeline configuration file for the ksrates analyses to be run. If the specified file does not exist (at the given path), a new template configuration file is generated and the pipeline exits. Edit and fill in the generated configuration file (see the full documentation for more detail) and then rerun the same command to relaunch the pipeline.

      The dataset directory already contains a pre-filled ksrates pipeline configuration file for the oil palm example use case, config_elaeis.txt, so the above Nextflow command should directly launch the pipeline.

    • Running on a compute cluster:

      nextflow run VIB-PSB/ksrates --config config_elaeis.txt -c custom_nextflow.config
      

      The --config parameter is the same as above.

      The -c parameter specifies a Nextflow configuration file. This file contains settings to configure the compute cluster to be used and the pipeline's resources on it, such as the number of CPUs and the amount of memory. In this case, it also configures whether to use the ksrates Singularity or Docker container. The dataset directory already contains a template Nextflow configuration file called custom_nextflow.config that can be adapted to your resources. Other general template Nextflow configuration files can be found in the doc directory in the repository.

      If the Nextflow configuration file is simply named nextflow.config, the configuration file will be automatically recognized and used without having to specify it using the -c parameter.

      Please see the full documentation and the Nextflow documentation for more detail on Nextflow configuration, e.g. for different HPC schedulers.
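As a rough illustration of the kind of settings such a Nextflow configuration file holds, the sketch below writes a minimal example using standard Nextflow options. The file name `example_cluster.config`, the SLURM executor and the CPU and memory values are placeholder assumptions; the custom_nextflow.config template shipped with the dataset remains the recommended starting point and may be organized differently.

```shell
# Write a minimal, hypothetical Nextflow configuration file.
# All values are placeholders; adapt them to your cluster.
cat > example_cluster.config <<'EOF'
// Use the ksrates container through Singularity
singularity.enabled = true

process {
    container = 'docker://vibpsb/ksrates:latest'
    executor = 'slurm'   // assumed scheduler; change to match your site
    cpus = 4             // placeholder resource request
    memory = '8 GB'      // placeholder resource request
}
EOF

# The pipeline could then be launched with:
#   nextflow run VIB-PSB/ksrates --config config_elaeis.txt -c example_cluster.config
```

Writing to a separate file name rather than nextflow.config avoids overwriting an existing configuration that Nextflow would otherwise pick up automatically.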

Command-line interface

Installation

Install either Singularity (recommended, but see here) or Docker. This is needed to run the ksrates Singularity or Docker container, which contains ksrates and all other required software dependencies, so nothing else needs to be installed. The container is downloaded automatically from the publicly accessible image the first time you execute a ksrates command, and is stored and reused for any further command executions.

Usage

We briefly illustrate here how to run ksrates using the Singularity or Docker container.

  • ksrates comes with a command-line interface. Its basic syntax is:

    ksrates [OPTIONS] COMMAND [ARGS]...
    
  • To execute a ksrates command using the Singularity container the syntax is:

    singularity exec docker://vibpsb/ksrates ksrates [OPTIONS] COMMAND [ARGS]...
    
  • Or to execute a ksrates command using the Docker container the syntax is:

    docker run --rm -v "$PWD":/temp -w /temp vibpsb/ksrates ksrates [OPTIONS] COMMAND [ARGS]...
    

Some example ksrates commands are:

Show usage and all available COMMANDs and OPTIONS:

ksrates -h

Generate a template configuration file for the focal species:

ksrates generate-config config_elaeis.txt

Show usage and ARGS for a specific COMMAND:

ksrates orthologs-ks -h

Run the ortholog KS analysis between two species using four threads/CPU cores:

ksrates orthologs-ks config_elaeis.txt elaeis oryza --n-threads 4

Please see the full documentation for more details and the complete set of commands.
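To avoid retyping the container prefix for each of the example commands, one option is a small shell function that forwards its arguments to whichever container runtime is installed. This is a convenience sketch only: the function name `ksrates_run` is hypothetical and not part of ksrates, and it simply reuses the Singularity and Docker invocations shown above.

```shell
# ksrates_run is a hypothetical wrapper, not part of ksrates.
# It runs a ksrates command in the container, preferring Singularity.
ksrates_run() {
    if command -v singularity >/dev/null 2>&1; then
        singularity exec docker://vibpsb/ksrates ksrates "$@"
    elif command -v docker >/dev/null 2>&1; then
        docker run --rm -v "$PWD":/temp -w /temp vibpsb/ksrates ksrates "$@"
    else
        echo "Neither Singularity nor Docker found on PATH" >&2
        return 127
    fi
}

# Example use (downloads the container on first run):
#   ksrates_run orthologs-ks config_elaeis.txt elaeis oryza --n-threads 4
```

Defining the function in a shell startup file would make it available in every session; the example call is left as a comment because it triggers the container download.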

Support

If you come across a bug or have any questions or suggestions, please open an issue.

Citation

If you publish results generated using ksrates, please cite:

Sensalari, C., Maere, S., and Lohaus, R. (2021) ksrates: positioning whole-genome duplications relative to speciation events using rate-adjusted mixed paralog–ortholog KS distributions. bioRxiv 2021.02.28.433234 doi: 10.1101/2021.02.28.433234

This article is a preprint and has not been certified by peer review.