/jasmine

Predict 5mC in PacBio HiFi reads

BSD 3-Clause Clear LicenseBSD-3-Clause-Clear

jasmine logo

Jasmine

Predict 5mC in PacBio HiFi reads


Jasmine predicts 5-Methylcytosine (5mC) of each CpG site in PacBio HiFi reads, using a Convolutional Neural Network. The jasmine model supports the Sequel II and Revio systems. Methylation is assumed to be symmetric between strands. The output is reported in the forward direction with respect to the HiFi read sequence.

Availability

Latest version can be installed via bioconda package pbjasmine.

Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.

Latest Version

Version 2.0.0: Full changelog here

Input Data

Input for jasmine are PacBio HiFi reads with kinetics. You can generate HiFi with kinetics on the command-line, more info on ccs.how:

ccs movie.subreads.bam movie.hifi_reads.bam --hifi-kinetics

Alternatively, you can use SMRT Link on your HPC or define it directly in Run Design for SQIIe instruments.

Jasmine supports ccs --by-strand single-strand HiFi reads with kinetics.

Execution

Running jasmine is as simple as:

jasmine movie.hifi_reads.bam movie.5mc.hifi_reads.bam

Output Data

The output methylation prediction for each annotated HiFi read is encoded in the MM and ML tags, defined in the SAM tag specification. The MM tag specifies the modification (5mC from jasmine) and to which base it applies (every CpG for jasmine). The ML tag specifies the probability of methylation at each base.

The output is also described in the PacBio BAM file format documentation as

Tag Type Description
MM Z Base modifications / methylation
ML B,C Base modification probabilities

Notes for ML: The continuous probability range of 0.0 to 1.0 is remapped to the discrete integers 0 to 255 inclusively. The probability range corresponding to an integer N is N/256 to (N + 1)/256.

Example

Read  AGTCTAGACTCCGTAATTACTCGCCTAG...
C        1    2 34       5 6 78
CpG              *         *

MM:Z:C+m,3,1,...   # CpG sites are at C #4 (1+3) and #6 (1+3+1+1)
ML:B:C,249,4,...   # probability of methylation at the first CpG is in [249/256,250/256); second CpG is in [4/256,5/256).

Run Time

jasmine scales nearly linear in the number of threads, achieving ~2 GBases HiFi per minute on 16 cores. Memory footprint is very low with <100 MB per thread.

$ jasmine movie.hifi_reads.bam out.bam -j 16 --log-level INFO
Reads      : 685700
Yield      : 12.5 GBases
Throughput : 1.8 GBases/min
Run Time   : 6m 46s
CPU Time   : 1h 58m
Peak RSS   : 1.096 GB

Training datasets

HiFi reads and subreads for true negative and true positive CpG methylation sites are available at https://downloads.pacbcloud.com/public/Sequel-II-CpG-training/.

The true negatives are from HG002 Whole Genome Amplification (WGA). The true positives are from HG002 WGA + CpG Methyltransferase (M.Sssl).

Changelog