SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression

tool logo

SV-HotSpot is a tool for detection and visualization of genome hotspots targeted by recurrent structural variants (SVs) associated with gene expression. To so so, SV-HotSpot seamlessly integrates SV calls, gene expression, genome annotations (inluding genes and other functional elements such as regulatory elements) to identify and annotate recurrent SVs, and assess their potential consequences on the expression of nearby genes.

SV-HotSpot is developed at Christopher Maher Lab at Washington University in St. Louis.

SV-HotSpot Manual

Installation

Clone the SV-Hotspot repository (with PATH_TO_SV_HOTSPOT is the local directory where you want to install SV-HotSpot):

cd PATH_TO_SV_HOTSPOT
git clone https://github.com/ChrisMaherLab/SV-Hotspot.git

Install required R packages by running the following command inside an R session:

install.packages(peakPick)
install.packages(ggplot2)
install.packages(plyr)
install.packages(reshape2)
install.packages(grid)
install.packages(gridExtra)
install.packages(gtable)
install.packages(data.table)
install.packages(RCircos)
install.packages(stringr)
install.packages(ggsignif)

SV-HotSpot uses bedTools (https://github.com/arq5x/bedtools2) for various overlapping and counting procedures, which should also be installed and made available as command lines.

Preparing Input

The tool requires as an input the following:

  1. Genome assembly name (e.g. hg38, hg19, mm9, mm10, etc.) which is used to extract chromosomes sizes file (a tab-delimited file with two columns, chrom and size). Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). SV-HotSpot built-in genomes are: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
  1. Structural variants file in BEDPE format (Example of how to prepare this file is available in test_data folder). In case only VCF files are available, please try to convert them to BEDPE format using existing tools such vcftobedpe form svtools. Instructions on how to install it and use it can be found here.

Optional files:

  1. Expression data in a matrix format where the first column represents the feature (e.g. gene) and columns represents samples. See the example provided at test_data folder. Please note that features in this file have to be unique.

  2. Copy number segments in BED format. See the example at test_data on how to prepare this file.

  3. An annotation file in BED format. If the user didn't provide this file, a built-in annotation file based on the genome name will be used. Please note that the feature in the expression file has to match the one in this file.

  4. Region of interest file(s) (e.g. enhancers, transcription binding sites, etc) in BED format. An example of this file is given in test_data folder. Multiple files can be provided but have to be separated by comma (make sure no space between the file names).

  5. ChIP-Seq coverage in BED format. See the example at test_data on how to prepare this file.

All other parameters are optional and a default value was assigned to each (run sv-hotspot.pl --help for more details).


IMPORTANT NOTES:

  • To prepare all your files, please look at the examples provided in test_data folder and do accordingly.
  • Structural variant types must be in the format of THREE letters and should only include the following types: BND, DEL, DUP, INS, INV.
  • All files headers should be the same as the ones in the examples files.
  • The chromosome names in all files should be consistent and in the format of chr#.
  • The feature name in the annotation file and the expression file should be the same.
  • The "name" column in the SVs file should be in the format of sample/type (e.g. Sample1/INV)

Running SV-HotSpot

Assume PATH_TO_SV_HOTSPOT is the local directory where SV-HotSpot was installed, the following command runs SV-HotSpot on the mCRPC test data provided by SV-HotSpot. The test data are specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.

perl PATH_TO_SV_HOTSPOT/src/sv-hotspot.pl -o OUTPUT \
	-g hg38 -C chrX \
	--sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \
	-e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
	-c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
	-r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
	--chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
	--chip-cov-lbl H3K27ac \
	-w 100000 -s 30000 -d 50000 \
	--t-amp 2.99 --t-del 1.35 \
	--stat-test wilcox.test --pval 0.05 \
	--plot-top-peaks 10 --left-ext 100000 --right-ext 100000

More SV-HotSpot command line paramters:

USAGE:
      sv-hotspot.pl [OPTIONS] -g/--genome <genomeName> --sv <structuralVariants>

      NOTE:
	(1) Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). 
            - Built-in Genomes: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
            - Please refer to the documentation in case your genome is not listed above. 
	(2) Structutal variants file should be in "BEDPE" format.
	(3) Gene expression data and copy number segments are required to visualize hotspot regions.
	(4) Region of interest file(s) (e.g. promoters, enhancers, chip-seq, etc.) should be in "BED" format

OPTIONS:
	-w/--sliding-win-size		sliding window size	<int>		[ sliding window size. default: 100kb ]
	-s/--sliding-win-step		sliding window step 	<int>		[ sliding window step. default: 1kb ]
	-a/--annot			annotation file 	<filename>	[ an annotation file in "BED" format ]
	-e/--exp			expression file 	<filename>	[ expression file in a "matrix" format ]
	-c/--cn				Copy number file 	<filename>	[ copy number segments in BED format ]
	-W/--peakPick-window-size	peak calling window 	<int>		[ peakPick window size. default: 100bp ]
	-m/--peakPick-min-sd		peak calling min. SD	<int>		[ peakPick standard deviation cutoff. default: 5 ]
	-t/--pct-samples		percentage of samples	<int>		[ percentage of samples cutoff to call peaks. default: 10 ]
	-o/--output			output directory	<string>	[ results output directory. default: ./ ]
	-p/--pval			pvalue cuttoff		<float>		[ pvalue threshold used for significance. default: 0.05 ]
	-G/--genes-of-int		list of genes		<filename>	[ list of genes of interest to be used for visualization ]
	-r/--region-of-int		region(s) of interest	<filename>	[ region of interest file(s) in "BED" format separated by comma ]
	-C/--chrom			chromosome name 	<string>	[ chromosome name used to detect hotspots. default: ALL ]
	-S/--sv-type			structural variant type	<string>	[ SV type used to detect hotspots. default: ALL ]
	-d/--merge-dist-size		distance size		<int>		[ distance cutoff used to merge adjacent peaks. default: 10kb ]
	--merge-pct-samples		percentage of samples	<int>		[ percentage of samples cutoff to merge similar peaks. default: 5 ]
	--stop-merge-num-peaks		number of peaks		<int>		[ number of peaks cutoff to stop merging adjacent peaks. default: 0 ]
	-k/--num-nearby-genes		Number nearby genes	<int>		[ number of up/downstream genes to the peak. default: 1 ]
	--t-amp				amplification threshold	<float/int>	[ amplification cutoff. default: 2.8 ]
	--t-del				deletion threshold	<float/int>	[ deletion cutoff. default: 0.5 ]
	--stat-test			statistical test	<string>	[ statistical test used for comparison (wilcox.test or t.test). default: wilcox.test ]
	--chip-cov			chip-seq coverage	<filename>	[ chip-seq coverage file in "BED" foramt ]
	--chip-cov-lbl			chip-seq coverage label	<string>	[ label used for chip-seq coverage ]
	--plot-top-peaks		plot top # peaks	<int>		[ number of top peaks to plot. default: top 10 ]
	--left-ext			size of left extension	<int>		[ size of the left extension of the peak. default: 0bp ]
	--right-ext			size of right extension	<int>		[ size of the right extension of the peak. default: 0bp ]
	--genes-to-plot			genes names		<string>	[ names of genes to show on the plot. default: none.
										  If no genes were provided, all genes in the peak will be plotted ]
	--plot-layout			plot orientation	<string>	[ orientation of the peak plot (wide or narrow). default: narrow ]

Results will be written to a subdirectory named sv-hotspot-output inside OUTPUT directory specified in the command.

Output

There are two main output files:

  1. annotated_peaks_summary.tsv: this file has all information about identified peaks.

  2. genes.associated.with.SVs.tsv: this file contains statisitcal information for all genes affected by SVs.

  3. In addition, SV-HotSpot provides various visualizations composed of overlaying tracks representing copy number aggregation, SV aggregation, and gene/regulatory/region of interest annotation tracks.

Additional files:

  1. Peaks files for each chromosome and their corresponding figures. These files are located in the peaks folder.
  2. UCSC custom track files. These files are located in peaks/ucsc_custom_track_files. These files can be viewed on the UCSC Genome Browser.

Plot Peaks (hotspot sites)

In some cases when the number of detected peaks is high, it is impractical to plot all peaks as this process takes long time. Thus, we set the tool to plot only the top # of peaks (default is 10). In case you need to increase/decrease this number, you need to provide this parameter --plot-top-peaks=# with the number of peaks required. Set this parameter to 0 in case you do not want to plot any peaks.

To plot peak(s), we provided a script for this process. You just need to provide the peak name(s), SV file, the results directory, the expression, and copy number data with the remaining parameters shown above. Peak names must be separated by comma. To show the usage page of this command type the following command:

plot-peak.pl

USAGE:
      plot-peak.pl [OPTIONS] -p <peakName1,peakName2,...> --sv <structuralVariants> --res-dir <resultsDirectory> -e/--expr <expression> -c/--cn <copynumber>

      NOTE:
	(1) Results directory should be the same as the output directory used with sv-hotspot.pl

OPTIONS:
	-a/--annot	Annotation file 	<filename>	[ an annotation file in BED format ]
	-o/--output	output directory	<string>	[ default: ./ ]
	--t-amp		amplification threshold	<float/int>	[ threshold for copy number amplifications. default: 2.8 ]
	--t-del		deletion threshold	<float/int>	[ threshold for copy number deletions. default: 0.5 ]
	--chip-cov	chip-seq coverage	<filename>	[ If ChIP-Seq coverage file is provided, peaks will be overlapped with this file ]
	--chip-cov-lbl	chip-seq coverage label	<string>	[ the chip-seq coverage label used in the plot (e.g. histone name) ]
	--left-ext	size of left extension	<int>		[ number of extended bases from the left side of the peak. default: 0 ]
	--right-ext	size of right extension	<int>		[ number of extended bases from the right side of the peak. default: 0 ]

As an example, the following command plots peak "pX.55.1" from the above test.

PATH_TO_SV_HOTSPOT/src/plot-peak.pl -p pX.55.1 \
             --sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \ 
             --res-dir /RESULTS/PATH \
             -e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
             -c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
             --chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
             -r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
             -o /SOME/PATH --t-amp 2.99 --t-del 1.35 --chip-cov-lbl H3K27ac \
             --left-ext 100000 --rigth-ext 100000

Example of SV-HotSpot visualization

(1) Recurrent SVs targeting a non-coding region located upstream of AR gene:

visualization example

(2) Circos plot of recurrent SV hotspots identified by SV-HotSpot across the genome:

visualization example

SV-HotSpot Docker Instructions

To use SV-HotSpot, a docker image has been created and tested on Linux and Mac. To run SV-HotSpot, you need to have Docker installed on your machine.

Docker Installation

To obtain the latest docker image, run the following in your command line:

docker pull chrismaherlab/sv-hotspot

To test the image, run the following command which shows the usage of this tool:

docker run chrismaherlab/sv-hotspot sv-hotspot

How to run SV-HotSpot using Docker?

We have provided an example data available in "test_data" folder specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.

To run SV-HotSpot on this test data, use the following command:

docker run -v /local/folder:/data chrismaherlab/sv-hotspot sv-hotspot -g hg38 -C chrX \
              --sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
              -c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
              -r /data/test_data/enhancers.bed -o /data/SV-HOTSPOT-TEST \
              -w 100000 -s 30000 -d 50000 --merge-pct-samples 5 \
              -t-amp 2.99 --t-del 1.35 --stat-test wilcox.test --pval 0.05 \
              --chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000 \
              --plot-top-peaks 1 --genes-to-plot AR
  • Note, the -v flags map your local filesystem locations to a “location” within the Docker image. Therefore, you need to change /local/folderto your local folder on your machine. This folder must contain the "test_data" folder. The final output will be sent to this folder as well. Use "$PWD" in case you want to use the current directroy.
  • Please note that you need to provide the absolute path for /local/folder.
  • /data is a folder on the image container used to receive the input data that was mapped by -v flag. The SV-HotSpot pipeline is configured with this folder and it has to be provided the same way as in the command above.

Post-processing and filtering

Once hotspots of SVs are identified, users should perform post-processing and filtering. SV-HotSpot provides a script to perform post-filtering of the results based on multiple criteria including filtering of genes with low expression and/or weak expression association (p-value, logFC, mean group expression), removal of wide peaks not associated with known cancer genes (eg. using COSMIC census genes), removal of peaks with low number of samples harboring hotspot SVs. To performed filtering, run the filter.r command as follows:

Usage:  Rscript filter.r
            <SV-HotSpot_result_dir>
            <max.p-value-to-infer-expression-association>
            <min.logfc>
            <min.group.expr>
            <max.number.of.associated.gene.per.peak, peak w/ more genes associated are ignored>
            <max.peak.length, peak wider than this w/o assoc. w/ known genes are ignored>
            <min.peak.percent.samples, 0-100, peaks with percent of samples smaller than this are ignored>
            <tsv_file_of_cosmic_census_genes>
            <csv_other_known_genes_to_keep>

        Example: 
           Rscript filter.r /path/to/sv-hotspot-output 0.05 0.22 10 9 500000 15 \
                            data/cosmic_census_genes.tsv AR,ERG,PTEN,TP53

Output will be written to SV-HotSpot result directory which includes the following two files:

1) annotated_peaks_summary.filtered.tsv
2) genes.associated.with.SVs.filtered.tsv

Visualization of hotpots (peaks)

The top hotspots are automatically visualized when running SV-Hotspot detection (parameter --plot-top-peaks=# when running SV-HotSpot detection). Additionally, users can plot hotspots identified by SV-HotSpot independently of SV-Hotspot detection. To plot hotspot sites (or peaks), run the following command:

To show the usage page of this script, run the following command:

docker run chrismaherlab/sv-hotspot plot-peak

To plot peaks, use the following command which plots pX.172 and pX.173 peaks (peak name(s) taken from "annotated_peaks_summary.tsv" output file). You will also need to provide the SV calls file used to run SV-Hotspot, path to SV-HotSpot results directory, expression and copy number data, and the remaining parameters shown in SV-HotSpot detection command above. Peak names must be separated by comma and no space between them.

docker run -v "$PWD":/data chrismaherlab/sv-hotspot plot-peak \
           -p pX.55.1 --genes-to-plot AR \
           --res-dir /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           -o /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           --sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
           -c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
           -r /data/test_data/enhancers.bed --t-amp 2.99 --t-del 1.35 \
           --chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000
  • res-dir must refer to the folder containing the results generated from running sv-hospot command with -o option. Please note that you need to include "sv-hotspot-output" at the end of res-dir path since SV-HotSpot always creates this folder which is used to write all output results.

Visualization of whole genome and individual chromsomes

SV-HotSpot also provides whole genome level visualization of hotspot sites. The script plot_whole_genome.r can be used to generate the histogram aggregation of the counts of samples harboring structural variants at whole genome level (circos plot) or individual chromosome level. This script can be run as follows.

   Usage: Rscript plot_whole_genome.r
             <SV-Hotspot_result_dir>
             <annotation_dir>
             <genome_assembly_version, eg. hg38>
             <plot_circos, eg. TRUE, FALSE>
             <chromosomes_to_plot, eg. "ALL", "chr1,chrX"> 
             <genes_to_show, eg. "ERG,PTEN,ETV1">
             <color_genes_by_association_direction_with_SV, eg. "TRUE", "FALSE">
             <output_figure_format, eg. "png", "pdf">
             <output_figure_dir>

   Example:
           Rscript plot_whole_genome.r /path/to/sv-hotspot-output /path/to/annotations \
                                        hg38 TRUE chr1,chrX,chr21 ERG,AR,PTEN TRUE png out

References

  1. Quigley, D. A., Dang, H. X., Zhao, S. G., Lloyd, P., Aggarwal, R., Alumkal, J. J., ... & Playdle, D. (2018). Genomic hallmarks and structural variation in metastatic prostate cancer. Cell, 174(3), 758-769.

How to cite SV-HotSpot?

Eteleeb, A.M., Quigley, D.A., Zhao, S.G. et al. SV-HotSpot: detection and visualization of hotspots targeted by structural variants associated with gene expression. Sci Rep 10, 15890 (2020). https://doi.org/10.1038/s41598-020-71168-7

Contact

Abdallah Eteleeb: eteleeb@wustl.edu