SPLASH is an unsupervised and reference-free unifying framework to discover regulated sequence variation through statistical analysis of k-mer composition in both DNA and RNA sequences. SPLASH leverages our observation that detecting sample-regulated sequence variation, such as alternative splicing, RNA editing, gene fusions, V(D)J recombination, transposable element mobilization, allele-specific splicing, and genetic variation in a population, among many other regulated events, can be unified, in theory and in practice. This is achieved with a simple model, SPLASH, that analyzes the k-mer composition of raw sequencing reads (Chaung et al. 2022). SPLASH finds constant sequences (anchors) that are followed by a set of sequences (targets) showing sample-specific variation, and provides valid p-values. SPLASH is reference-free, sidestepping the computational challenges associated with alignment: it is significantly faster and more efficient than alignment-based approaches and enables discovery and statistical precision not currently available, even from pseudo-alignment.
The first version of the SPLASH pipeline proved its usefulness. It was implemented mainly in Python, with the use of Nextflow. Here we provide a new and improved implementation based on C++ and Python (Kokot et al. 2023). This new version is much more efficient and allows for the analysis of datasets of over 1 TB in size in hours on a workstation or even a laptop.
A key concept of SPLASH is the analysis of the composition of anchor–target pairs of substrings across many samples. The two substrings can be adjacent in a read or separated by a gap.
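To make the anchor–target decomposition concrete, here is a minimal sketch in Python (illustrative only, not SPLASH code; the parameter names mirror the `--anchor_len`, `--gap_len`, and `--target_len` switches described below):

```python
# Sketch: enumerate (anchor, target) substring pairs from a single read.
def anchor_target_pairs(read, anchor_len=27, gap_len=0, target_len=27):
    span = anchor_len + gap_len + target_len
    pairs = []
    for i in range(len(read) - span + 1):
        anchor = read[i : i + anchor_len]
        target = read[i + anchor_len + gap_len : i + span]
        pairs.append((anchor, target))
    return pairs

# Toy example with short lengths: each anchor is paired with the
# substring that follows it (here with no gap).
print(anchor_target_pairs("ACGTACGTACGT", anchor_len=4, gap_len=0, target_len=4))
```

For each anchor, SPLASH then counts targets per sample and tests whether the target distribution varies across samples.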
The image below presents the SPLASH pipeline at a high level.
The easiest way to get SPLASH is to use a precompiled release. To get version 2.3.0 and run the example, it is sufficient to do the following:
curl -L https://github.com/refresh-bio/SPLASH/releases/download/v2.3.0/splash-2.3.0.linux.x64.tar.gz | tar xz
cd example
./run-example.sh
It is possible to run splash using the docker container available through GitHub Packages (https://github.com/orgs/refresh-bio/packages/container/package/splash). To pull the image, run:
sudo docker pull ghcr.io/refresh-bio/splash:2.3.0 # replace version number if needed
Example of how to run splash with docker. Prerequisites:
- there is an `input.txt` file in the current directory,
- all input splash files defined in `input.txt` are in the same or a child directory.

sudo docker run -v `pwd`:/home/ubuntu ghcr.io/refresh-bio/splash:2.3.0 splash input.txt
Sometimes, `sudo` is unavailable (for example, on HPC systems). In such a case, the docker container may be transformed into a singularity container (https://docs.sylabs.io/guides/latest/user-guide/). To pull the singularity version of splash, use:

singularity pull docker://ghcr.io/refresh-bio/splash:2.3.0 # replace version as needed

This will result in a `splash_2.3.0.sif` file created in the current directory. To execute splash using this file, run:

./splash_2.3.0.sif splash input.txt

It is also possible to run without pulling first:

singularity run docker://ghcr.io/refresh-bio/splash:2.3.0 splash input.txt

It may be necessary to configure a bind for singularity (the `-B` parameter), depending on your configuration.
SPLASH is implemented as several applications written in the C++ programming language and a Python wrapper to run the whole pipeline. Currently, the software may be used only under Linux. A compiler supporting C++17 is needed to compile the code. You can use the following snippet to compile SPLASH:
git clone https://github.com/refresh-bio/splash
cd splash
make -j
After this, splash may be run as follows:
bin/splash # this will print help
The simplest way of installing splash after compilation is to run:
sudo make install
If the `PREFIX` environment variable is not defined, the above will install splash in `/usr/local/bin`, which requires `sudo`. It is possible to override the install location so that `sudo` is not required.
For example, to install in the user's home directory, one may run:
make install PREFIX=~/splash
or
export PREFIX=~/splash
make install
To uninstall splash one may run:
sudo make uninstall
The same `PREFIX` should be used as for installation (depending on it, `sudo` may not be required).
To verify the installation on a small example, one may perform the following:
cd example
./download.py # download example data
splash input.txt # run the pipeline with default parameters
The result consists of two TSV files, namely:
- `result.after_correction.all_anchors.tsv`
- `result.after_correction.scores.tsv`
The first file contains all unfiltered anchors found by the pipeline. The second file contains only anchors whose corrected p-value is below 0.05.
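Since the scores file is an ordinary TSV, the top hits can be inspected with a few lines of Python (a sketch assuming pandas is installed; the column names follow the output description later in this document):

```python
import pandas as pd

# Load anchors that passed the corrected p-value threshold (< 0.05).
scores = pd.read_csv("result.after_correction.scores.tsv", sep="\t")

# Rank by the Benjamini-Yekutieli corrected p-value and inspect
# the most significant anchors together with their effect sizes.
top = scores.sort_values("pval_opt_corrected").head(10)
print(top[["anchor", "pval_opt_corrected", "effect_size_bin"]])
```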
There are many parameters allowing one to customize the pipeline. They can be grouped into several categories. All parameters are displayed when running splash without parameters (or with `--help`).
Input configuration:
- `--input_file` — path to the file defining the input samples (see the description of `input.txt` below)
- `--anchor_list` — list of accepted anchors; a path to a plain text file with one anchor per line, without any header (default: accept all anchors)
- `--anchor_len` — anchor length (default: 27)
- `--gap_len` — gap length; if 'auto', it will be inferred from the data, otherwise it must be an int (default: 0)
- `--target_len` — target length (default: 27)
Filtering:
- `--poly_ACGT_len` — filter out all anchors containing poly(ACGT) of length at least <poly_ACGT_len>; 0 means no filtering (default: 8)
- `--artifacts` — path to a file with artifact sequences; each anchor containing an artifact will be filtered out (default: none)
- `--dont_filter_illumina_adapters` — if used, anchors containing Illumina adapters will not be filtered out (default: False)
- `--anchor_unique_targets_threshold` — filter out all anchors for which the number of unique targets is <= anchor_unique_targets_threshold (default: 1)
- `--anchor_count_threshold` — filter out all anchors for which the total count is <= anchor_count_threshold (default: 50)
- `--anchor_samples_threshold` — filter out all anchors for which the number of unique samples is <= anchor_samples_threshold (default: 1)
- `--anchor_sample_counts_threshold` — filter out an anchor from a sample if its count in this sample is <= anchor_sample_counts_threshold (default: 5)
- `--n_most_freq_targets_for_stats` — use at most n_most_freq_targets_for_stats targets for each contingency table; 0 means use all (default: 0)
- `--min_hamming_threshold` — keep only anchors with a pair of targets that differ by >= min_hamming_threshold (default: 0)
P-value correction:
- `--pvals_correction_col_name` — the column to which the correction should be applied (default: pval_opt)
- `--fdr_threshold` — keep anchors having a corrected p-value below this value (default: 0.05)
Output configuration:
- `--outname_prefix` — prefix of output file names (default: result)
- `--dump_Cjs` — output Cjs (default: False)
- `--max_pval_opt_for_Cjs` — dump Cjs only for anchors that have pval_opt <= max_pval_opt_for_Cjs (default: 0.1)
- `--n_most_freq_targets` — number of most frequent targets printed per anchor in stats mode (default: 2)
- `--with_effect_size_cts` — if set, effect_size_cts will be computed (default: False)
- `--with_pval_asymp_opt` — if set, pval_asymp_opt will be computed (default: False)
- `--dump_sample_anchor_target_count_txt` — if set, contingency tables will be generated in text format (default: False)
- `--dump_sample_anchor_target_count_binary` — if set, contingency tables will be generated in binary (SATC) format; to convert them to text format later, the `satc_dump` program may be used, optionally with a mapping from id to sample name (`--sample_names` param) (default: False)
- `--dont_clean_up` — if set, intermediate files will not be removed (default: False)
Technical:
- `--sample_name_to_id` — name of the file with the mapping sample name <-> sample id (default: sample_name_to_id.mapping.txt)
- `--bin_path` — path to a directory containing the satc, satc_dump, satc_merge, sig_anch, kmc, and kmc_tools binaries (if any of them is not found there, splash will check if it is installed and use the installed one) (default: ./)
- `--tmp_dir` — path to a directory where temporary files will be stored (default: let splash decide)
- `--logs_dir` — directory where the run logs of each thread will be stored (default: logs)
Performance:
- `--n_threads_stage_1` — number of threads for the first stage; a very large value is not recommended because of intensive disk access at this stage, but it may be profitable if the input contains many small samples (default: 4)
- `--n_threads_stage_1_internal` — number of threads per each stage-1 thread (default: 8)
- `--n_threads_stage_2` — number of threads for the second stage; a high value is recommended if possible, as a single thread processes a single bin (default: 32)
- `--n_bins` — the data will be split into this number of bins, which will be merged later (default: 128)
- `--kmc_use_RAM_only_mode` — if set, may increase performance but also RAM usage (default: False)
- `--kmc_max_mem_GB` — maximal amount of memory (in GB) that KMC will try not to exceed (default: 12)
Statistics:
- `--opt_num_inits` — the number of altMaximize random initializations (default: 10)
- `--opt_num_iters` — the maximum number of iterations in altMaximize (default: 50)
- `--num_rand_cf` — the number of random c and f used for pval_base (default: 50)
- `--num_splits` — the number of contingency table splits (default: 1)
- `--opt_train_fraction` — in calc_stats mode, use this fraction of the contingency table to create training data (default: 0.25)
- `--without_alt_max` — if set, alt max and related stats will not be computed (default: False)
10X and supervised-testing parameters (run splash with `--help` for details):
- `--run_10X`
- `--cbc_len`
- `--umi_len`
- `--soft_cbc_umi_len_limit`
- `--cbc_filtering_thr`
- `--cell_type_samplesheet`
- `--supervised_test_samplesheet`
- `--supervised_test_anchor_sample_fraction_cutoff`
- `--supervised_test_num_anchors`
The resulting TSV files contain the following columns:
| Column | Meaning | Notes |
|---|---|---|
| anchor | anchor | |
| pval_opt | p-value from alternating maximization | the number of iterations and other parameters can be configured with switches; optimization formulation and statistical exposition in (Baharav et al. 2023) |
| effect_size_bin | measure of the anchor's effect size | bounded in [0,1]; indicates how well the data is divided between 2 groups of columns; 0 indicates no difference between groups, 1 indicates two groups of columns with disjoint row supports (Baharav et al. 2023) |
| pval_asymp_opt | asymptotically valid p-value based on alternating maximization | computed with the optimizing c and f used for pval_opt, evaluating the approximate p-value given by asymptotic normality (Baharav et al. 2023) |
| pval_base | p-value based on random partitionings | compute the base p-value (using num_rand_cf random c, f), with Bonferroni correction to yield a valid p-value |
| M | anchor count | number of anchor occurrences in the input (= the sum of elements in the contingency table) |
| anch_uniqTargs | number of unique targets for the anchor | (= number of rows in the contingency table) |
| number_nonzero_samples | number of samples containing the anchor | (= number of columns in the contingency table) |
| target_entropy | measure of diversity of the target distribution | compute the empirical target distribution and output its entropy |
| avg_hamming_distance_max_target | average Hamming distance to the most abundant target | for each observed target (count), compute the Hamming distance between it and the most abundant target (highest counts), and average this over all targets; useful in identifying / filtering SNPs (this measure will be low) |
| avg_hamming_distance_all_pairs | average Hamming distance between all pairs of targets | for the M counts in the matrix, compute the Hamming distance between all pairs of targets and output the average |
| avg_edit_distance_max_target | average Levenshtein distance to the most abundant target | same as avg_hamming_distance_max_target, but with Levenshtein distance instead of Hamming distance; if an anchor has a high avg_hamming_distance_max_target but a low avg_edit_distance_max_target, this indicates that the discrepancy between the top targets is due to an insertion or deletion; if the two quantities are similar, the discrepancy is most likely due to a SNP |
| avg_edit_distance_all_pairs | average Levenshtein distance between all pairs of targets | same as avg_hamming_distance_all_pairs, but with Levenshtein distance instead of Hamming distance |
| pval_opt_corrected | Benjamini-Yekutieli corrected pval_opt | only present in the *.scores.tsv file |
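To make two of these statistics concrete, the following sketch (illustrative only, not SPLASH code; it assumes numpy, and the count weighting is one straightforward reading of the definitions above) computes target_entropy and avg_hamming_distance_max_target for a toy contingency table:

```python
import numpy as np

# Toy contingency table for one anchor: rows = targets, columns = samples.
targets = ["AAAA", "AAAT", "CCCC"]
counts = np.array([[5, 0, 2],
                   [1, 3, 0],
                   [0, 0, 4]])

# target_entropy: entropy of the empirical target distribution.
p = counts.sum(axis=1) / counts.sum()
p = p[p > 0]
target_entropy = -(p * np.log2(p)).sum()

# avg_hamming_distance_max_target: count-weighted average Hamming
# distance from each observed target to the most abundant target.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

row_counts = counts.sum(axis=1)
max_target = targets[int(np.argmax(row_counts))]
avg_hd = sum(n * hamming(t, max_target)
             for t, n in zip(targets, row_counts)) / row_counts.sum()

print(f"target_entropy = {target_entropy:.3f}")
print(f"avg_hamming_distance_max_target = {avg_hd:.3f}")
```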
In the example, the `input.txt` file was used. This file defines the set of input samples for the algorithm. Its format is one sample per line; each line should contain the name of a sample and (after a space) the path to the input sample. Important note: if a relative path is specified, it is relative to the current working directory, not to the directory of `input.txt`.
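For illustration, an `input.txt` for two samples could look like this (sample names and paths are made up):

```
sample_1 data/sample_1.fastq.gz
sample_2 data/sample_2.fastq.gz
```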
Given the broad applications of SPLASH, in the `analysis_notebooks` folder we provide downstream analyses showing how to interpret SPLASH results for a few major applications in which SPLASH has been applied so far. We should note that SPLASH is quite general and continues to be applied to new genomics problems.
The notebook (`analysis_notebooks/SPLASH_splicing_analysis_notebook.Rmd`) provides detailed step-by-step instructions on how SPLASH results can be interpreted for an alternative splicing analysis. The reference genome in this example is human, but it could be replaced with any other organism with any quality of transcriptome annotation.
By default, SPLASH will store the 2 most frequent targets per anchor in the resulting TSV files. This should be sufficient for splicing, but for RNA editing/mismatches 4 may be a better choice. It may be set with the `--n_most_freq_targets` switch. If the number of targets for a given anchor is lower than the specified value, there will be a single `-` for each missing target.
SPLASH stores intermediate and optional output files in SATC format (Sample Anchor Target Count).
A unique id is assigned to each sample. The ids are consecutive numbers starting from 0: the first sample from the input file gets id 0, the second one gets 1, and so on. By default, SPLASH will store the mapping in the `sample_name_to_id.mapping.txt` file, but this may be redefined with the `--sample_name_to_id` parameter. This mapping file may be useful for accessing data stored in SATC format.
When the `--dump_sample_anchor_target_count_txt` switch is used, there will be an additional output directory (named `result_dump` by default; the `result` part may be redefined with the `--outname_prefix` switch). This directory will contain a number of files equal to the number of bins (default: 128; may be redefined with the `--n_bins` switch). The extension of these files is `.satc.dump`. Each line of these files is a tab-separated list of <sample_name> <anchor> <target> <count>. This is the easiest way to reproduce the contingency tables used during computation. Each file contains some number of anchors, but it is assured that a specific anchor is present in only a single file. Since these text files can be large, it may be preferable to use binary (SATC) files instead.
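For illustration, a few lines of a `.satc.dump` file might look like this (tab-separated in real files; all values here are made up):

```
sample_1  AAACCCGGGTTTAAACCCGGGTTTAAA  CCCAAATTTGGGCCCAAATTTGGGCCC  5
sample_1  AAACCCGGGTTTAAACCCGGGTTTAAA  CCCAAATTTGGGCCCAAATTTGGGAAA  2
sample_2  AAACCCGGGTTTAAACCCGGGTTTAAA  CCCAAATTTGGGCCCAAATTTGGGCCC  7
```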
When the `--dump_sample_anchor_target_count_binary` switch is used, there will be an additional output directory (named `result_satc` by default; the `result` part may be redefined with the `--outname_prefix` switch). This directory will contain a number of files equal to the number of bins (default: 128; may be redefined with the `--n_bins` switch). The extension of these files is `.satc`. These are binary files in SATC format. Their content may be converted to a textual representation with the `satc_dump` program (part of the SPLASH package). Each file contains some number of anchors, but it is assured that a specific anchor is present in only a single file. Since the text files can be large, it may be preferable to use binary (SATC) files, especially if one wants to investigate only some of the anchors.
To convert SATC files into a textual representation, one may use the `satc_dump` program. The simplest usage is:
satc_dump input.satc output.satc.dump
There are also additional parameters that may be useful, namely:
- `--anchor_list` — path to a text file containing anchors separated by whitespace; only anchors from this file will be dumped
- `--sample_names` — path to a file for decoding sample ids; each line should contain <sample_name> <sample_id>
- `--n_bins <int>` — if set to a value different than 0, the input is interpreted as a list of bins (each bin on a separate line; the first line is bin_0, the second line is bin_1, etc.); in case of ill-formed input, the results will be incorrect
- `--separately` — if set with n_bins != 0, the output parameter will be treated as a suffix name and there will be a separate output for each bin
If `--sample_names` is not used, the output will contain sample ids instead of names. SPLASH by default writes the mapping to the `sample_name_to_id.mapping.txt` file (this may be redefined with the `--sample_name_to_id` switch of SPLASH).
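For illustration, the mapping file simply lists <sample_name> <sample_id> pairs, one per line (names here are made up):

```
sample_1 0
sample_2 1
```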
`plotGeneration.py` provides basic functionality to visualize contingency tables. To use it, run SPLASH with the `--dump_sample_anchor_target_count_binary` flag. Then, `plotGeneration.py` can be run with the following inputs:
- `--dsName` — dataset name to be used in the title of plots
- `--outFolder` — path to save files to
- `--metadataPath` — path to a .tsv file containing two columns titled `sampleName` and `metadata`
- `--satcFolder` — path to the folder containing .satc files
- `--pvDfPath` — path to the p-value .tsv file (result.after_correction.scores.tsv)
- `--sampleMappingTxt` — path to the sample_name_to_id.mapping.txt file
- `--anchorFile` — file of anchors to be plotted, one anchor per line
- `--satc_dump_file` — path to the satc_dump utility, within the SPLASH /bin folder
- `--skipSATC` — optional flag, for when .satc files have already been dumped for this set of anchors into the output folder and the desire is to regenerate plots
The generated output is 1 file per anchor, each file containing 4 subplots. At the bottom left is the raw I x J contingency table, where each column is a sample and each row represents a target (low-abundance targets are not shown). Each sample's target counts are normalized by that sample's total count for the anchor.
The script `c_analysis.ipynb` shows how the saved c vectors can be loaded for further analysis. `--dump_Cjs` must be enabled for this.
To facilitate downstream analysis of anchors, we provide a postprocessing script, `SPLASH_extendor_classification.R`, that can be run on the anchors file generated by a SPLASH run to classify anchors into biologically meaningful events such as alternative splicing and base-pair changes. `SPLASH_extendor_classification.R` needs the following inputs:
- `directory` — the output directory used for the SPLASH run
- `which_anchors_file` — flag deciding which anchor file (after correction or all anchors) to use; can be "after_correction" or "all"
- `effect_size_cutoff` — the effect size cutoff for significant anchors (default 0.2)
- `num_samples_cutoff` — the minimum number of samples for an anchor to be called (default 20)
- `STAR_executable` — path to the STAR executable file
- `samtools_executable` — path to the samtools executable file
- `bedtools_executable` — path to the bedtools executable file
- `bowtie2_executable` — path to the bowtie2 executable file
- `STAR_reference` — path to the STAR index files for the reference genome
- `bowtie2_reference` — path to the bowtie2 index for the reference genome
- `bowtie2_univec_index` — path to the bowtie2 index for UniVec
- `annotated_splice_juncs` — path to the file containing annotated splice junctions from the reference transcriptome (can be either downloaded or generated from `SPLASH_build.R`)
- `annotated_exon_boundaries` — path to the file containing annotated exon boundaries from the reference transcriptome (can be either downloaded or generated from `SPLASH_build.R`)
- `gene_coords_file` — path to the file containing gene coordinates from the reference transcriptome (can be either downloaded or generated from `SPLASH_build.R`)
- `paralogs_file` — (optional) path to the file containing the list of paralogous genes from the reference genome
The script will generate a file `classified_anchors.tsv` in the same directory used for the SPLASH run, containing significant anchors along with their biological classification and alignment information.
To run the classification script for a given reference genome/transcriptome, you first need to obtain a FASTA file for the reference genome and a GTF file for the transcriptome annotation. You then need to do the following 3 steps (note that all index/annotation files from these 3 steps should be generated from the same FASTA and GTF file):
1. STAR index: you need a STAR index for the reference genome. You can use default parameters to build the index:

   STAR --runThreadN 4 --runMode genomeGenerate --genomeDir STAR_index_files --genomeFastaFiles $fasta_file$ --sjdbGTFfile $gtf_file$

2. Bowtie2 index: you also need a Bowtie2 index, which can be built using this command:

   bowtie2-build $fasta_file$ Bowtie_index_files/$Bowtie_index_name$
3. Three annotation files: files with annotated exon boundaries, annotated splice sites, and gene coordinates are needed; they should be built by running the script `SPLASH_build.R`. The `SPLASH_build.R` script needs three inputs: `$gtf_file$` (absolute path to the GTF file), `$hisat2_directory$` (directory containing the HISAT2 code downloaded from the HISAT2 repository; the script assumes that there are two python scripts at `$hisat2_directory$/extract_exons.py` and `$hisat2_directory$/extract_splice_sites.py`), and `$outfile_name$` (the name used for the annotation files that the script will generate). The script can be run using the following command:
Rscript SPLASH_build.R $gtf_file$ $hisat2_directory$ $outfile_name$
If the script runs successfully, it will generate 3 output annotation files in the same directory as the script: `$outfile_name$_known_splice_sites.txt` (annotated splice sites), `$outfile_name$_exon_coordinates.bed` (annotated exon boundaries), and `$outfile_name$_genes.bed` (annotated gene coordinates).
The human files were built for both the T2T and GRCh38 assemblies. The mouse files were built based on the mm39 assembly. The annotation files can be downloaded using the following links:
- Human (T2T):
  - Annotated exon coordinates: https://drive.google.com/file/d/1R-4-ICDAzmIBgQmlOF22nNrCWoSgrmHi/view?usp=share_link
  - Annotated splice junctions: https://drive.google.com/file/d/1owlOQyP1z4cyFvYcAAA-qQmc-K6jGbs9/view?usp=share_link
  - Gene coordinates: https://drive.google.com/file/d/1L0A7iGXEYiOsPQ0QiJayKPybJ79ZDi2F/view?usp=sharing
  - Paralogous genes: https://drive.google.com/file/d/1mqGft4tPlx8X3cRYoqQnPeXaonLfSbGa/view?usp=share_link
- Human (GRCh38):
  - Annotated exon coordinates: https://drive.google.com/file/d/1oK6OgQnFFVvybBo0EZ5aIyeoZLAtMyZF/view?usp=sharing
  - Annotated splice junctions: https://drive.google.com/file/d/1izVHy1m-ddlNgJtFKfWcHdtkc_Y5bHHP/view?usp=sharing
  - Gene coordinates: https://drive.google.com/file/d/1REfnl9ZNYcsb-1jSurDHcsL7QFJ00JEp/view?usp=sharing
- Mouse (mm39):
  - Annotated exon coordinates: https://drive.google.com/file/d/1npE0rkxhsDtJk3FeMdfuZwc5Elfuk4bq/view?usp=sharing
  - Annotated splice junctions: https://drive.google.com/file/d/1iJhf421nMRDC0uCo_0jh7Nkns8NAieTE/view?usp=sharing
  - Gene coordinates: https://drive.google.com/file/d/1V8By-yq7AmgXY-XDhipgjjsamL0ghhJa/view?usp=sharing
  - Paralogous genes: https://drive.google.com/file/d/1Uf28bE2XF9Y2w57ARlUGfO5agiFFXx2S/view?usp=sharing
Marek Kokot, Roozbeh Dehghannasiri, Tavor Baharav, Julia Salzman, Sebastian Deorowicz. SPLASH2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads. bioRxiv (2023).
Kaitlin Chaung, Tavor Baharav, Ivan Zheludev, Julia Salzman. A statistical, reference-free algorithm subsumes myriad problems in genome science and enables novel discovery. bioRxiv (2022).
Tavor Baharav, David Tse, Julia Salzman. An Interpretable, Finite Sample Valid Alternative to Pearson's X² for Scientific Discovery. bioRxiv (2023).