A program for accurately identifying splice junctions using Oxford Nanopore sequencing data.
Oxford Nanopore sequencing, Transcriptomics, Splice junctions
NanoSplicer utilises the raw ouput from nanopore sequencing (measures of electrical current commonly known as squiggles) to improve the identification of splice junctions. Instead of identifying splice junctions by mapping basecalled reads, nanosplicer compares the squiggle from a read with the predicted squiggles of potential splice junctions to identify the best match and likely junction. Mapped nanopore reads are still required in order to identify the presence of splice junctions for downstream NanoSplicer analysis. See You et al. (2021) for more description of the method.
The program contains 3 modules JWR_checker.py
, JWR_subset.py
and NanoSplicer.py
, which need to be run in order to get the final result. The first and third modules are required. JWR_subset.py
is optional but recommended and will significantly decrease the run time of NanoSplicer.py
.
Some example input files can be found at example/
to run all of the modules below. Example code is also available at example/script.sh
.
JWR_checker.py
: Find junctions within reads (JWRs) from a spliced mapping result (BAM).
JWR_subset.py
: Subset the result from JWR_checker. Based on our performance assessment, JWRs with high junction alignment quality (JAQ) are usually accurately mapped to their respective splice junctions. This module filters JWRs based on their JAQ, allowing selection of JWRs with lower JAQs that will most benefit from anlaysis with the NanoSplicer module. By default, JWR_subset.py
selects JWRs with a junction alignment quality of less than 0.95.
NanoSplicer.py
: Perform splice junction identification on the JWR_checker.py
(or JWR_subset.py
if applied) output.
NanoSplicer has been tested on python 3.6 and 3.7. Everything should work for python3.X.
For JWR_checker.py
and JWR_subset.py
:
pandas
pysam
numpy
tqdm
h5py
Additional requirements for NanoSplicer
:
tombo
ont_fast5_api
matplotlib
fcntl
intervaltree
scipy
skimage
If there are any problems installing the dependencies above, an alternative way of setting up the environment is available via a container. The dependencies
required for running NanoSplicer have been packaged into a container and can be accessed using singularity
, which is supported by most high-performance computing environments:
singularity pull NanoSplicer_container.sif docker://youyupei/nanosplicer:v1
For people not familiar with containers: You can run linux commands within the container by using singularity shell
or singularity exec
. These commands automatically bind your home directory to the home directory inside the container, which means everything under ~/
(including the sub-directories) will be accessible without extra steps. If your data are saved in a different directory, you'll need to bind the directory with -B <local path>:<path in container>
when running singularity shell
or singularity exec
. For example, if your data are in /data
, you need to add -B /data:/folder_in_container
. Everything in /data
is then accessible in /folder_in_container
. You may use the same name for convenience (e.g. -B /data:/data
). A more formal introduction to singularity can be found at https://sylabs.io/singularity/.
git clone https://github.com/shimlab/NanoSplicer.git
All the following scripts can be found at bin/
Find junctions within reads (JWRs) from a spliced mapping result (BAM). For each JWR, JWR_checker
reports the junction alignment quality for the initial mapping.
Note: Currently JWR_checker
outputs a table in HDF5 format, which allows easier read-in in the following steps. A CSV format table can also be output with --output_csv
, which is easier for users to read. The downstream scripts JWR_subset.py
and NanoSplicer.py
only accept HDF5 as input.
Finding junctions within reads (JWRs) from a spliced mapping result (BAM).
Usage: python3 JWR_checker.py [OPTIONS] <BAM file> <output hdf5 file>
Options:
-h/--help Print this help text
-w/--window Candidate search window size (nt) <default: 25>
--chrID Target a specific chromosome, chrID should match
the chromosome name in the BAM. All chromosomes
in the BAM will be searched if not specified
--genome-loc Target a specific genomic region, e.g. --genome-loc=0-10000
Use in conjunction with --chrID option. Entire
chromosome will be searched if location not specified
--threads Number of threads used <default: 32>.
--output_csv With this option, a csv file will be output along with the hdf5 file
python3 JWR_checker.py –-chrID=chr1 –-genome-loc=5296679-5297165 --output_csv input.bam example.h5
Subset the result from the JWR_checker. Based on our performance assessment, JWRs with high junction alignment quality (JAQ) are usually accurately mapped to their respective splice junctions. A subset of JWRs can be obtained by only selecting JWRs with a JAQ at or below a user-defined threshold. By default, JWR_subset.py selects all JWRs with a JAQ of 0.95 or less. Note: Currently JWR_subset
only takes the HDF5 file from JWR_checker
as input and returns the filtered subset back in HDF5 format. A table in CSV format can also be output with --output_csv
.
Usage: python3 JWR_subset.py [OPTIONS] <input file: hdf5 output from JWR_checker> <output file: hdf5>
Options:
-h/--help Print this help text
--JAQ_thres A value from 0-1, only JWRs with a junction alignment quality (JAQ)
at or below the specified value will be retained <default: 0.95>
--chrID Target a specific chromosome, chrID should match
the chromosome name in the BAM
--genome-loc Target a specific genomic region, e.g. --genome-loc=0-10000
--chrID should be also specified.
--output_csv With this option, a csv file will be output along
with the hdf5 file
python3 JWR_subset.py –-chrID=chr1 –-genome-loc=5296679-5297165 --JAQ_thres=0.8 --output_csv input.h5 output.h5
Perform splice junction identification on the JWR_checker.py (or JWR_subset.py if applied) HDF5 output.
Requires path to reads in fast5 (squiggle) format and a matched mapped .bam/.sam file for these reads. This is the same .bam file required for input into
JWR_checker
.
Usage: python3 {} [OPTIONS] <JWR_checker/JWR_subset hdf5 file>
Options:
-i .bam/.sam file (required)
-f path to fast5s (required)
-r Genome reference file (required)
-o Prefix for output files <default: './output'>.
--threads
Number of threads used <default: # of available cpus - 1>.
More option on the candidate splice junctions (optional):
User-provided splice junction (from annotation or short-read mapping):
--junction_BED
User-provided BED file containing known splice junctions. The BED
file can convert from genome annotation or short read mapping result.
If provided, NanoSplicer will include all splice junctions in the BED file
as candidates with highest preference level. e.g. --junction_BED='xx.bed'
--non_canonical_junctions
User-provided BED file containing known splice junctions. The BED
file be converted from genome annotation (GTF/GFF file) or short read mapping
result. If provided, NanoSplicer will include non-canonical (i.e., non GT-AT)
splice junctions in the BED file as candidates with highest preference level.
e.g. --non_canonical_junctions='xx.bed'
Long-read mapping:
--supported_junc_as_candidate
Add nearby long read supported splice junctions as candidates (if they have
not been included yet).
--min_support
Minimum support, i.e. number of JWRs (with minimum JAQ) mapped ,for
an additional splice junction to be added to candidate list <default: 3>
--min_JAQ_support
Minimum JAQ (range: 0-1) for JWRs that counted as supports. <default: 0>
python3 NanoSplicer.py -i example.bam -f /data/fast5s/ -r ref.fa input.h5
Note: The BAM file needs to be indexed (i.e. a .bai
file needs to be present in the same folder as the BAM file). A .bai
file can be generated with samtools
using the command: samtools index <BAM filename>
.
There are multiple options about how to select candidate splice junctions:
By default, NanoSplicer.py
selects:
- The splice junction supported by the JWR (mapped splice junction).
- Nearby canonical splice junctions. We define these as introns that start with GT and end with AG a motif present in over ∼99% of mammalian splice junctions.
Optional candidates in NanoSplicer.py
:
-
Annotated splice junctions: Genome annotation are usually in GTF/GFF3 format, but can be converted to BED. E.g., use gffread (https://github.com/gpertea/gffread). The resulting BED file can be passed onto
NanoSplicer.py
by specifying--junction_BED='xx.bed'
-
User defined list of candidate splice junctions (e.g., from short-readsequencing). Some short read aligners (e.g. STAR) output a BED file containing the splice junction information. The resulting BED file can be passed onto
NanoSplicer.py
by specifying--junction_BED='xx.bed'
-
Nearby splice junctions supported by other mapped reads (above a user-specified read count threshold). To do this, users need to add
--supported_junc_as_candidate
in the command line.
'NanoSplicer.py' output:
- Probability table (TSV file: <prefix>_prob_table.tsv), where <prefix> is specified by users with
-o <prefix>
. - BED file (<prefix>_jwr.bed) with corrected splice junctions for each JWR.
The probability table contains 10 columns:
- read_id: the Nanopore id of the read that the JWR comes from
- reference_name: the mapped reference name of the read the JWR comes from
- inital_junction: initial mapped splice junction
- JAQ: Junction alignment quality
- candidates: all candidate splice junctions
- candidate_sequence_motif_preference: Based on known preferential use of specific intronic sequence patterns near splice junctions (0: non GT-AG, 1: GT[C/T]-[A/G]AG, 2: GT[A/G]-[A/G]AG or GT[C/T]-AG[C/T], 3: GT[A/G]-[C/T]AG or user-provided junction)
- SIQ: Squiggle information quality. The squiggle is usually of poor quality if the SIQ is lower than -0.8.
- prob_uniform_prior: probability for each candidate using a uniform prior
- prob_seq_pattern_prior: probability for each candidate using the sequence pattern prior (based on the candidate sequence motif preference in column 6)
- best_prob: value of the best probability in column 9
The output BED file can be visulised in genome browsers, see the following figure for the visulisation on IGV (https://software.broadinstitute.org/software/igv/). In the output, NanoSplicer presents a splice junction using two blocks (25 bases before and after the identified splice junction)). The end of the first block and the begining of the second block represent the identified splice juncitons. The blocks are colored based on the assignment probability, black: assignment probability > 0.8; grey: assignment probability<=0.8. The identifications with assignment probability <= 0.8 means there evidence in the squiggle in not strong enough, so they are usually less accurate and recommended to be discarded.
Details
- chorm: The name of the chromosome
- start: The genome coordinate of the begining of the first block.
- end: The genome coordinate of the end of second block.
- name: read id of the read that contain the JWR
- score: Assignment probability of the splice junction.
- strand: the transcript strand
- thickStart: Not in use (NanoSplicer output 0 in this columns for every JWR)
- thickEnd: Not in use (NanoSplicer output 0 in this columns for every JWR)
- itemRgb: The JWRs will be drawn in black if the assignment probability > 0.8, and drawn in gery otherwise.
- blockCount: NanoSplicer output 0 in this columns for every JWR (NanoSplicer draw 25 bases before and after the splice junction, so there are always 2 blocks).
- blockSizes: NanoSplicer output 25,25 in this columns for every JWR (NanoSplicer draw 25 bases before and after the splice junction)
- blockStarts: start position of each block.
Note:
- The JWRs with SIQ<-0.8 are not included in the BED output, because the squiggle is usually of poor quality.
- The start and end columns (columns 2 & 3) DO NOT show the genome coordinate of the splice junctions, the exact coordinates of the splice juncitons (i.e., exon-intron boundaries) are start+25 and end-25.
- Visulisation module: ploting the alignment between the junction squiggle and the candidate squiggle(s).
- There will be a performance warning when running
JWR_checker
andJWR_subset
. The warning can be ignored. The warning is triggered when data is saved into the HDF5 file. It relates to python objects that can not be directly converted to c-type, causing non-optimal performance of the HDF5 file. - The progress bar in
NanoSplicer.py
is based on the number of fast5 files that have been processed. If there is only 1 fast5 file being processed (e.g. inexample/fast5/
), the progress bar will be at 0% (0/1 completed) until the run has finished.
If you find NanoSplcier useful for your work, please cite our paper:
Yupei You, Michael B. Clark, Heejung Shim (2021). NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing. You et al. (2021).