/colorSV

somatic SV calling on matched tumor-normal co-assembly graphs

Primary LanguageC++

colorSV (coassembly long-range SV caller) identifies long-range somatic structural variation by examining the local topology of matched tumor-normal co-assembly graphs. The input is a joint assembly graph in .gfa format (generated by an assembler such as hifiasm), and the output is a set of SV calls.

More information about the method can be found in the preprint "Long-range somatic structural variation calling from matched tumor-normal co-assembly graphs" (link).

Table of Contents

Software Requirements

Installation

colorSV can be installed by downloading the required executables and scripts from the downloads page

wget https://github.com/mktle/colorSV/releases/download/v0.1.0/colorSV-0.1.0.tar.gz
tar xvzf colorSV-0.1.0.tar.gz

or by building from source (in which case minimap2 and k8 must already be installed)

git clone git@github.com:mktle/colorSV.git
cd colorSV && make

Example Usage

A set of example data is available on the downloads page. The reads are a subset of the publicly available COLO829/COLO829BL PacBio Revio datasets, where the tumor datasets have the IDs m84039_230312_025934_s1 and m84039_230328_000836_s3. The CHM13 reference can be downloaded here.

Since the co-assembly step can take some time, we have also uploaded a sample co-assembly graph generated from the example read data. The following sections contain instructions for running colorSV on both the raw reads themselves and the graph after co-assembling the raw reads.

From Example Reads

# download demo FASTA data
wget https://github.com/mktle/colorSV/releases/download/v0.1.0/demo_fasta.tar.gz
tar xvzf demo_fasta.tar.gz

# Perform co-assembly
hifiasm -o demo.asm ./demo_fasta/tumor.fa ./demo_fasta/normal.fa 2> demo.log

# Identify and align tumor-only unitigs
colorSV preprocess -o ./demo_results --graph demo.asm.bp.r_utg.gfa --reference chm13v2.fa --tumor-ids m84039_230312_025934_s1 m84039_230328_000836_s3 --read-sep /

# Perform SV calling
colorSV call -o ./demo_results --graph demo.asm.bp.r_utg.gfa --filter ./demo_fasta/chm13v2.cen-mask.bed

# View example translocation call
less -S ./demo_results/translocations_region_filtered.sv

From An Example Graph

# download demo co-assembly graph
wget https://github.com/mktle/colorSV/releases/download/v0.1.0/demo_graph.tar.gz
tar xvzf demo_graph.tar.gz

# Identify and align tumor-only unitigs
colorSV preprocess -o ./demo_results --graph ./demo_graph/demo.asm.bp.r_utg.gfa --reference chm13v2.fa --tumor-ids m84039_230312_025934_s1 m84039_230328_000836_s3 --read-sep /

# Perform SV calling
colorSV call -o ./demo_results --graph ./demo_graph/demo.asm.bp.r_utg.gfa --filter ./demo_graph/chm13v2.cen-mask.bed

# View example translocation call
less -S ./demo_results/translocations_region_filtered.sv

Pipeline

The colorSV method performs SV calling in three steps: 1) generation of a co-assembly graph using an external assembler, 2) identification and alignment of tumor-only graph nodes, and 3) filtering of candidate breakpoints by performing a local topology search on the co-assembly graph. The second and third steps are separated so that the SV calling step can be performed on the same graph multiple times (e.g., using different parameters) without having to rerun tumor-only node identification and sequence alignment.

1) Joint Assembly

colorSV requires a matched tumor-normal co-assembly graph as input. This graph should be generated by pooling reads from matched tumor and normal samples, then combining them into a single assembly. For instance, assembling the reads with hifiasm will generate multiple co-assembly graphs in the .gfa format. We would recommend using the raw unitig graph (with the .asm.bp.r_utg.gfa suffix), as using the processed unitig or contig graphs (with the p_utg or p_ctg suffixes) may reduce sensitivity due to the removal of small bubbles.

2) Preprocessing

The next step of the pipeline identifies and aligns tumor-only nodes of the co-assembly graph (nodes that are only supported by reads from the tumor sample) to the reference genome. The command can be run with

./colorSV preprocess -o /path/to/output/directory/ --graph coassembly_graph.gfa --reference ref.fa --tumor-ids id1,id2 --read-sep <sep> [optional flags]

This command will create the directory /path/to/output/directory/intermediate_output/, which will be used to store the IDs and alignments of the tumor-only nodes. More information about the options:

  • Required arguments
    • --graph: path to a joint assembly graph in .gfa format (see the section on joint assembly)
    • --reference: path to the reference genome in .fa format (we would recommend using T2T-CHM13)
    • --tumor-ids: a list of the tumor sample IDs (--tumor-ids ID1 ID2)
      • For example, the tumor reads might come from a sample with the identifier m63080_182826_000329_s2, and a given read might have an ID that looks something like m63080_182826_000329_s2/43058412/ccs. The ID to specify for this example would be m63080_182826_000329_s2.
      • If pooling tumor reads from multiple runs, they can be specified in a space or comma-delimited list.
    • --read-sep: the separator used within read IDs, with the assumption that the text before the first separator will match either a tumor sample ID or a normal sample ID
      • For example, suppose you have reads IDs of the format m63080_182826_000329_s2/43058412/ccs. You would specify / as the --read-sep, and m63080_182826_000329_s2 should either be specified in --tumor-ids or be the identifier of a normal sample
  • Optional arguments
    • --min-reads: minimum number of reads a tumor-only node must be supported by to be considered a candidate (default 2)
    • --min-mapq: minimum MAPQ a tumor-only node alignment must have to be considered a candidate (default 10)
    • -t: number of threads used by minimap2 during alignment (default 3)

3) SV Calling

The last step of the pipeline uses the alignments generated from the preprocessing step and the original co-assembly graph to call somatic breakpoints. The command can be run with

./colorSV call -o /output/directory/ --graph coassembly_graph.gfa --filter mask_regions.bed [optional flags]

where the directory specified in -o must contain an intermediate_outputdirectory generated from a preprocessing step. This command will save the call sets for translocations and all SVs in the output directory as translocations_region_filtered.sv and sv_calls_region_filtered.sv, respectively. The command will also save versions of the call sets prior to filtering with the --filter file as translocations.sv and sv_calls.sv.

More information about the options:

  • Required arguments
    • --graph: path to a joint assembly graph in .gfa format (should be the same as the one used in the preprocessing step)
    • --filter: a BED file containing regions to remove from the call set (an example with the CHM13 centromere regions is included in the examples directory)
  • Optional arguments
    • -k: maximum number of layers to search during the breadth-first search for local connectedness (default 10)
      • i.e., a graph will be considered locally disconnected if its neighbors have a distance of more than k
    • -q: minimum MAPQ of alignments when extracting breakpoints from tumor-only node alignments
    • -Q: minimum MAPQ for alignment ends when extracting breakpoints from tumor-only node alignments
    • --index-bin-size: number of nodes per chunk when indexing assembly graph file (default 100)
      • affects runtime but not final results

Limitations

  1. colorSV does not perform well for small intrachromosomal events. This is because our filtering relies on checking the whether the co-assembly graph is still locally connected after removing tumor-only nodes, but smaller somatic events would likely still have a connected co-assembly graph due to close genomic proximity. Our testing has therefore focused on translocations and intrachromosomal events on the scale of 1Mb.

  2. colorSV is also reliant on the quality of the input co-assembly graph. In complex regions where the assembler has difficulty resolving the final assembly graph (such as if there are multiple large genomic events happening in close proximity), the accuracy of the final call set will likely be lower.