
Graph-based assembly phasing


CC BY-NC-SA 4.0

stRainy

stRainy is a graph-based phasing algorithm that takes a de novo assembly graph (in GFA format) and simplifies it by combining phasing information and graph structure.


Conda Installation

The recommended way of installing is through conda:

git clone https://github.com/katerinakazantseva/stRainy
cd stRainy
git submodule update --init
make -C submodules/Flye
conda env create -f environment.yml -n strainy

Note that if you use conda on an Apple M1 machine, you should run conda config --add subdirs osx-64 before installation.

Once installed, you will need to activate the conda environment before running stRainy:

conda activate strainy
./strainy.py -h

Quick usage example

After successful installation, you should be able to run:

conda activate strainy
./strainy.py -g test_set/toy.gfa -q test_set/toy.fastq.gz -o out_strainy -m hifi 

Limitations

stRainy is under active development! The current version is optimized for relatively simple bacterial communities (one or a few bacterial species, 2-5 strains each). Extending stRainy to larger metagenomes is a work in progress.

Input requirements

stRainy supports PacBio HiFi and Nanopore (Guppy5+) sequencing.

The two main inputs to stRainy are:

  1. GFA file (can be produced with metaFlye or minigraph) and
  2. FASTQ file (containing reads to be aligned to the fasta reference generated from the GFA file).
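The second input mentions a FASTA reference derived from the GFA file. As a rough illustration of what that derivation involves, the sketch below pulls segment sequences out of GFA 1 `S` lines and writes them as FASTA records. The function name `gfa_to_fasta` and the toy graph are hypothetical, and this is not stRainy's own conversion code.

```python
import io

def gfa_to_fasta(gfa_lines):
    """Yield one FASTA record per segment (S line) in a GFA 1 stream."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # S lines are: S <name> <sequence> [optional tags]; "*" means no sequence
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield f">{fields[1]}\n{fields[2]}\n"

# Toy graph (hypothetical): two segments joined by one link
toy_gfa = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGTACGT\n"
    "S\tedge_2\tTTGACC\n"
    "L\tedge_1\t+\tedge_2\t+\t0M\n"
)
fasta_text = "".join(gfa_to_fasta(toy_gfa))
print(fasta_text)
```

For a real graph, an open file handle can be passed in place of the `io.StringIO` object.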

Improving de novo metagenomic assemblies

We have tested stRainy using metaFlye metagenomic assembly graphs as input. The recommended set of metaFlye parameters is --meta --keep-haplotypes --no-alt-contigs -i 0.

Note that -i 0 disables metaFlye's polishing procedure; we found that skipping polishing improves read assignment to bubble branches during minimap2 realignment. --keep-haplotypes retains structural variations between strains on the assembly graph. --no-alt-contigs disables the output of "alternative" contigs, which can later confuse the read aligner.
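A minimal sketch of how the recommended parameters could be assembled into a metaFlye command line from Python. The file names, the helper `metaflye_command`, and the mapping from stRainy read modes to Flye read-type flags (`--pacbio-hifi`, `--nano-hq`) are assumptions made for illustration; check them against your data and Flye version.

```python
import shlex

# Assumed mapping from stRainy read modes to Flye read-type flags
FLYE_READ_FLAG = {"hifi": "--pacbio-hifi", "nano": "--nano-hq"}

def metaflye_command(reads, out_dir, mode, threads=4):
    """Build a metaFlye argument list with the parameters recommended above."""
    return [
        "flye", FLYE_READ_FLAG[mode], reads,
        "--out-dir", out_dir,
        "--threads", str(threads),
        "--meta", "--keep-haplotypes", "--no-alt-contigs",
        "-i", "0",  # zero polishing iterations
    ]

# Hypothetical input and output names
cmd = metaflye_command("reads.fastq.gz", "flye_out", "hifi")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. Flye normally writes the graph as `assembly_graph.gfa` in its output directory, which would then serve as the `-g` input.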

Usage and outputs

stRainy has two stages: phase and transform. With the command below, stRainy runs both stages by default. See the parameter description section for the full list of available arguments:

./strainy.py -g [gfa_file] -q [fastq_file] -m [mode] -o [output_dir]
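The stages can also be selected individually with the -stage option listed in the parameter description. The sketch below builds such a command line in Python; the helper `strainy_command` is hypothetical, and only flags that appear in the usage text are used.

```python
import shlex

def strainy_command(gfa, fastq, mode, output, stage=None, threads=None):
    """Build a strainy.py argument list from the documented options."""
    if mode not in ("hifi", "nano"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = ["./strainy.py", "-g", gfa, "-q", fastq, "-m", mode, "-o", output]
    if stage is not None:
        if stage not in ("phase", "transform", "e2e"):
            raise ValueError(f"unsupported stage: {stage}")
        cmd += ["-stage", stage]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd

# Run only the phase stage on the bundled toy data with 8 threads
phase_cmd = strainy_command(
    "test_set/toy.gfa", "test_set/toy.fastq.gz", "hifi", "out_strainy",
    stage="phase", threads=8,
)
print(shlex.join(phase_cmd))
```

To execute it, pass the list to `subprocess.run(phase_cmd, check=True)` from the repository root with the conda environment active.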

1. The phase stage clusters reads according to SNP positions using a community detection approach. It produces CSV files with read names and their corresponding cluster names, and a BAM file that visualises the read clustering.



2. The transform stage transforms and simplifies the initial assembly graph, producing the final GFA file: strainy_final.gfa
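One simple way to see what the transform stage changed is to compare basic statistics of the input graph and strainy_final.gfa. The sketch below counts segments and links in a GFA 1 stream; the helper `summarize_gfa` and the toy graph are hypothetical and not part of stRainy.

```python
import io

def summarize_gfa(gfa_lines):
    """Count segments and links and sum segment lengths in a GFA 1 stream."""
    stats = {"segments": 0, "links": 0, "total_bp": 0}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            stats["segments"] += 1
            if fields[2] != "*":  # "*" marks an omitted sequence
                stats["total_bp"] += len(fields[2])
        elif fields[0] == "L":
            stats["links"] += 1
    return stats

# Toy graph (hypothetical): two segments joined by one link
toy = io.StringIO(
    "S\tedge_1\tACGTACGT\n"
    "S\tedge_2\tTTGACC\n"
    "L\tedge_1\t+\tedge_2\t+\t0M\n"
)
summary = summarize_gfa(toy)
print(summary)
```

Running the same function on open file handles for the input graph and the final graph gives two dictionaries that can be compared side by side.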

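The read clustering idea behind the phase stage can be pictured with a toy example. The sketch below links reads that agree at their shared SNP positions and reports connected components as clusters. This is a deliberately simplified stand-in: stRainy uses a community detection algorithm, and the function `cluster_reads`, the `min_shared` threshold, and the toy reads are all hypothetical.

```python
from itertools import combinations

def cluster_reads(read_alleles, min_shared=2):
    """Group reads whose alleles agree at all shared SNP positions.

    Connected components are used as a simple stand-in for a
    community detection algorithm.
    """
    neighbours = {name: set() for name in read_alleles}
    for a, b in combinations(read_alleles, 2):
        shared = read_alleles[a].keys() & read_alleles[b].keys()
        if len(shared) >= min_shared and all(
            read_alleles[a][pos] == read_alleles[b][pos] for pos in shared
        ):
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in read_alleles:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Toy reads: read name -> {SNP position: observed allele}
toy_reads = {
    "read1": {100: "A", 250: "C", 400: "G"},
    "read2": {100: "A", 250: "C"},
    "read3": {100: "T", 250: "G", 400: "G"},
    "read4": {250: "G", 400: "G"},
}
clusters = cluster_reads(toy_reads)
print(clusters)
```

Here read1 and read2 share the alleles A and C, while read3 and read4 share G and G, so the reads fall into two groups that would correspond to two strains.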

Parameter description

usage: strainy.py [-h] -o OUTPUT -g GFA -m {hifi,nano} -q FASTQ [-stage {phase,transform,e2e}] [-s SNP] [-t THREADS] [-f FASTA] [-b BAM] [--unitig-split-length UNITIG_SPLIT_LENGTH]

options:
  -h, --help            show this help message and exit
  -stage {phase,transform,e2e}
                        stage to run: either phase, transform or e2e (phase + transform) (default: e2e)
  -s SNP, --snp SNP     vcf file (default: None)
  -t THREADS, --threads THREADS
                        number of threads to use (default: 4)
  -f FASTA, --fasta FASTA
                        fasta file (default: None)
  -b BAM, --bam BAM     bam file (default: None)
  --unitig-split-length UNITIG_SPLIT_LENGTH
                        The length (in kb) which the unitigs that are longer will be split, set 0 to disable (default: 50)

Required named arguments:
  -o OUTPUT, --output OUTPUT
                        directory that will contain the output files (default: None)
  -g GFA, --gfa GFA     gfa file (default: None)
  -m {hifi,nano}, --mode {hifi,nano}
                        type of reads (default: None)
  -q FASTQ, --fastq FASTQ
                        fastq file containing reads to perform alignment, used to create a .bam file (default: None)

Acknowledgements

The consensus function of stRainy is provided by Flye

The community detection algorithm is provided by Karate Club

Credits

stRainy was originally developed at the Kolmogorov lab at NCI

Code contributors:

  • Ekaterina Kazantseva
  • Ataberk Donmez
  • Mikhail Kolmogorov

Citation

Ekaterina Kazantseva, Ataberk Donmez, Mihai Pop, Mikhail Kolmogorov. "stRainy: assembly-based metagenomic strain phasing using long reads" bioRxiv 2023, https://doi.org/10.1101/2023.01.31.526521

License


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
