Novel insertion detection with 10X reads

Novel-X: Novel sequence insertion detection using Linked-Reads

Novel-X detects and genotypes novel sequence insertions in 10X sequencing datasets using non-trivial read alignment signatures and barcode information.

Table of contents

  1. Installation
  2. Command Options
  3. Output Formats
  4. Example Commands
  5. Publications
  6. Contact & Support

Installation

To start working with Novel-X, please clone this repository recursively:

git clone --recursive git@github.com:1dayac/Novel-X.git

If you cloned the repository non-recursively, Novel-X will not work. To fix this, run the following from the Novel-X folder:

git submodule update --init --recursive

Novel-X is a multi-step pipeline built on the Snakemake workflow management system, and it requires several external programs.

First, install the following software (the version numbers used for testing are shown in parentheses, but other versions should also work):

  • Longranger (version 2.15) - Download Page
  • Velvet (commit 9adf09f) - GitHub Page - an outdated but still useful assembler that makes minimal assumptions about the data. Note that we use a k-mer length of 63 when assembling 10X data, so Velvet should be compiled with the following command:

make 'MAXKMERLENGTH=63'

For more information, refer to the Velvet manual.

All of these programs (except LongRanger) can be installed with conda. We provide a conda-env.yml file that installs them with the following command:

conda env create -f conda-env.yml

Paths to the executables (if they are not in $PATH) should be provided in the path_to_executables_config.json file.
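Before editing the config file, it can help to see which tools are already reachable through $PATH. The following is a minimal sketch and not part of Novel-X: the helper name find_missing_tools is hypothetical, and the tool list is an assumption (longranger, velveth, velvetg, and snakemake are the usual executable names; adjust the list to your setup).

```python
import shutil

# Assumed executable names; these are not read from the Novel-X configuration.
DEFAULT_TOOLS = ["longranger", "velveth", "velvetg", "snakemake"]


def find_missing_tools(tools=DEFAULT_TOOLS):
    """Return the tools from `tools` that cannot be found in $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        # These are the candidates for path_to_executables_config.json.
        print("Not found in $PATH:", ", ".join(missing))
    else:
        print("All listed tools were found in $PATH.")
```

Any tool reported as missing is a candidate for an explicit entry in path_to_executables_config.json.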

Python dependencies are listed in the requirements.txt file. They can be downloaded and installed with the following command:

pip install -r requirements.txt

Inside the bxtools folder, run the following commands (estimated execution time is around 2 minutes):

./configure
make
make install

We tested our tool on CentOS Linux 7, but it should work on any modern Unix-like system.

Then you are ready to go.

Command Options

Novel-X is run with the novel-x.py script, which has two modes:

  • run - run the pipeline from scratch
  • restart - if a previous run did not finish for some reason, you can try to resume it with the novel-x.py restart command.

A typical command to start Novel-X is

python novel-x.py run --bam my_bam.bam --genome my_genome.fasta --outdir my_dir

Optional arguments are:

  • --lr20 - needed if you run the pipeline on a bam file produced by the LongRanger 2.0 pipeline
  • --nt - optional filtering of non-human sequences from the orphan contigs

We added two option groups to handle different data types and their properties (molecule length, intra-molecule coverage, etc.).

Data option group:

  • --10x - for 10X Genomics data [Default]
  • --tellseq - for Tell-Seq data
  • --stlfr - for stLFR data

Tell-Seq and stLFR data should be converted to a LongRanger-compatible bam. For stLFR data, use this pipeline. For Tell-Seq data, refer to the Tell-Seq paper.

Coverage group:

  • --high-coverage - best for 60X coverage and higher [Default]
  • --low-coverage - best for 20X-40X coverage

You can display the help message by typing:

python novel-x.py run --help

or

python novel-x.py restart --help
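The options above can be combined programmatically. The following is a rough sketch under stated assumptions, not part of Novel-X: build_run_command and its parameters are hypothetical names, the coverage thresholds simply mirror the "60X and higher" and "20X-40X" guidance above, and --nt is assumed to take a database path as in the example command. For coverage values outside those ranges, the sketch adds no coverage flag and leaves the default in place.

```python
import shlex

# Flags taken from the "Data option group" above.
DATA_FLAGS = {"10x": "--10x", "tellseq": "--tellseq", "stlfr": "--stlfr"}


def build_run_command(bam, genome, outdir, data="10x", coverage=None,
                      lr20=False, nt_db=None):
    """Assemble an argument list for `novel-x.py run` (hypothetical helper)."""
    cmd = ["python", "novel-x.py", "run",
           "--bam", bam, "--genome", genome, "--outdir", outdir]
    cmd.append(DATA_FLAGS[data])
    if coverage is not None:
        if coverage >= 60:
            cmd.append("--high-coverage")
        elif 20 <= coverage <= 40:
            cmd.append("--low-coverage")
        # Otherwise: no guidance in the README, so keep the default.
    if lr20:
        cmd.append("--lr20")
    if nt_db is not None:
        # Assumption: --nt is followed by a database path.
        cmd.extend(["--nt", nt_db])
    return cmd


if __name__ == "__main__":
    command = build_run_command("my_bam.bam", "my_genome.fasta", "my_dir",
                                coverage=30)
    print(shlex.join(command))
```

Returning a list rather than a string keeps the command safe to pass to subprocess without shell quoting issues.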

Output Formats

Novel-X writes results to a vcf-file. If your bam-file is named HM2KYBBXX_NA18509.bam, the resulting vcf-file will be named HM2KYBBXX_NA18509.vcf and stored inside the outdir folder.
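The naming rule can be expressed in a few lines, which is handy when scripting downstream steps. This is a minimal sketch; expected_vcf_path is a hypothetical helper name, and it only restates the rule described above.

```python
import os


def expected_vcf_path(bam_path, outdir):
    """Return where the vcf-file for `bam_path` should appear inside `outdir`."""
    stem, _extension = os.path.splitext(os.path.basename(bam_path))
    return os.path.join(outdir, stem + ".vcf")


if __name__ == "__main__":
    print(expected_vcf_path("/data/HM2KYBBXX_NA18509.bam", "my_dir"))
```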

Example Commands

Run from the start:

python ~/Novel-X/novel-x.py run --bam /athena/ihlab/scratch/dmm2017/70_samples_data/HLF3WBBXX_NA12006_longranger.bam -t 8 -m 200 --nt /athena/ihlab/scratch/dmm2017/blast_database/  --genome /athena/ihlab/scratch/dmm2017/hg38/hg38.fa --outdir /athena/ihlab/scratch/dmm2017/70_samples/novelx_NA12006

Restart from the last stage:

python ~/Novel-X/novel-x.py restart --outdir novelx_NA12006

There is currently a problem at the filter_target_contig stage: it can exit with a non-zero exit code. We recommend commenting out the following line before using the restart option:

parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
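The run-then-restart workflow from this section can also be scripted. The following is only a sketch and not part of Novel-X: run_with_restart, max_restarts, and the injectable runner argument are hypothetical, and the sketch assumes that a non-zero exit code means the pipeline did not finish.

```python
import subprocess


def run_with_restart(run_cmd, restart_cmd, max_restarts=1,
                     runner=subprocess.call):
    """Run `run_cmd`; on failure, retry with `restart_cmd` up to `max_restarts` times.

    Returns the exit code of the last command that was executed.
    """
    code = runner(run_cmd)
    attempts = 0
    while code != 0 and attempts < max_restarts:
        attempts += 1
        code = runner(restart_cmd)
    return code


if __name__ == "__main__":
    run_cmd = ["python", "novel-x.py", "run", "--bam", "my_bam.bam",
               "--genome", "my_genome.fasta", "--outdir", "my_dir"]
    restart_cmd = ["python", "novel-x.py", "restart", "--outdir", "my_dir"]
    print("final exit code:", run_with_restart(run_cmd, restart_cmd))
```

The runner parameter exists so the retry logic can be exercised without launching the pipeline.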

Demo command

We placed a toy dataset in the demo folder so you can test that the software is installed correctly. You can run the command:

python ~/Novel-X/novel-x.py run --bam ~/Novel-X/demo/demo.bam -t 1 -m 20  --genome ~/Novel-X/demo/demo.fasta --outdir out

This command takes about 15 minutes to finish on our hardware. It produces a vcf-file with a single record:

chr1_25500000_25535000  29503   .       T       TGTATTGTGTGTATGAGGGTTGTGTGCTGTGTGTTGTGTATATATTGTATGTGTTATGTGTATGTATGTCGTATGAGTGTATTCTGTATATGTGTTTTGTGTGGTCTATTATGTATGTGGCATGTGTTGTGTATGTGTGTTGTGTGTGATGTGTTGTATGTGTGTTGTGCATATATGTTGTTTCTGTGTATGTATGTTATGTGTATGTGTATGTTGTGTTGTATGTATGGGTTGTGCCTATGTGCTGTGTTGTGTGCTGCATGCATGTTTGTGTGGTGTGTGTATTTAGGTTGTGTGCTATTTATGTGTCTATATTGTATGTGTTGTATGTGTGTTGTATGTATGTGTAGTGTATGTGTGTTGTGTGTGATGTGTATATGTGGTGTGTGTATGTCTGTTATGTGTATGTATGAGTGTATGTGTGTTGTGTGTGTTGTGTATATGTGTTGTGTGTGTTGTGTATGTGTGTTGTGAGTTGTGTATATGTGGTGTGAGTTGTGTTGTGTCATGTATGTGTGCATTGTGTATAGGTGTTGCATGTGTGTTGTGTTGTGTGTATGTGTTGTGTTGTGTATATGTGGTATGTGAATGTGTATGTTGTATGTTGTGTTGTATGTATATGTGTTATGTATATGTGATGTGTGTGTTGTGTATATGCTGGGTGTGTGTGTACATGTGTGTATGTGTGTTGTATGTATGTGTGTATGCATGTGTGTTGCGTATATGTGGTATGTGTGCATGTGTGTTGTCATGTGTATGTGTGTTGTGTATATGTGTGTGTTGTGTATATGTGTTGTGTGTATGTGTATCATGTTGTGTGTATGTGTTATGTTGTGTATATGTGGTGTGTGAATGTGTGTTGTGTGTATGTGTATGTTGTCTGTTTTGTGTGTGTATACGTGGTGTGTGTGTGTTGTGTTGTGTATATGTGTTGTGTGTGTTGCGTGTATGTGTTGTGTGTT      .       PASS    DP=100  NODE_1_length_4180_cov_43.887399        2776    276     347     1306

The output may differ slightly depending on your software versions.
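To inspect such a record without a VCF library, the leading columns can be split by hand. The following is a minimal sketch, assuming the first seven whitespace-separated columns follow the usual VCF order (CHROM, POS, ID, REF, ALT, QUAL, FILTER); the function names are hypothetical, and the extra trailing columns seen in the demo record are ignored.

```python
def parse_insertion_record(line):
    """Parse the leading columns of one vcf record into a dictionary."""
    fields = line.split()
    chrom, pos, _record_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        # Bases added by the insertion relative to the reference allele.
        "inserted_bases": len(alt) - len(ref),
        "filter": fields[6],
    }


def read_insertions(vcf_path):
    """Parse every non-header, non-empty line of a vcf-file."""
    with open(vcf_path) as handle:
        return [parse_insertion_record(line)
                for line in handle
                if line.strip() and not line.startswith("#")]


if __name__ == "__main__":
    example = "chr1_25500000_25535000\t29503\t.\tT\tTGTA\t.\tPASS\tDP=100"
    print(parse_insertion_record(example))
```

The example line is shortened for readability; on the demo output, read_insertions would be pointed at the vcf-file inside the out folder.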

Publications

The "Novel sequence insertion detection using Linked-Reads" preprint is available at https://www.biorxiv.org/content/10.1101/551028v1.

Contact & Support

Feel free to send any inquiries to meleshko.dmitrii@gmail.com