nanoporetech/dorado

Question Regarding mask1_front and Barcode Demultiplexing in Direct RNA Seq

Seongmin-Jang-1165 opened this issue · 8 comments

Issue Report

Please describe the issue:

Hello, I am currently performing target-specific custom barcode demultiplexing using Direct RNA seq data.

In the options, there is a setting for mask1_front, and the explanation states:

(Required) The leading flank for the front barcode (applies to single and double ended barcodes). Can be an empty string.

From my understanding, this option is for specifying the flank sequence for the front-attached barcode.

In my Direct RNA seq library, adapters are only attached to the rear end of the read, and I have inserted barcodes into this adapter sequence.

In this case, how should I adjust this option? I am thinking of setting it as follows:

mask1_front = ""
mask1_rear = ""
mask2_front = ""
mask2_rear = "GGCC"

What do you think of this approach?

For reference, although this is target-specific, there are multiple targets, so it’s difficult to define a single flank sequence for the front barcode. However, the rear part is clear since I can identify the custom-specific adapter sequence from the Direct RNA seq manual.

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

  • Dorado version: 0.8.0

  • Dorado command:
    /home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/dorado-0.8.0-linux-x64/bin/dorado basecaller sup --no-trim --barcode-arrangement barcode_arra.toml --barcode-sequences barcode_sequence.fastq /home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/rawdata/data/240923histone/HepG2/20240923_1720_P2S-01504-A_PAW01284_59f98b81/pod5_total/ > DORADO_Barcode_basecall_3.bam

  • Operating system:

  • Hardware (CPUs, Memory, GPUs) : A100

  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): POD5

  • Source data location (on device or networked drive - NFS, etc.):

  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):

  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)

Hi @Seongmin-Jang-1165,

You should specify your flanks in mask1_* and additionally set rear_only_barcodes = true. You may want a longer flank sequence to ensure the mask is found correctly given you have only one side specified, and you may need to tweak the scoring parameters.
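
As a rough illustration of the advice above, the Python sketch below renders a rear-only arrangement file. The keys mask1_front, mask1_rear and rear_only_barcodes come from this thread; name, barcode1_pattern, first_index and last_index reflect my reading of Dorado's custom barcode documentation and should be checked against the version in use. The helper render_arrangement and the flank sequences in the usage lines are hypothetical placeholders, not sequences from the RNA004 kit.

```python
def render_arrangement(name, front_flank, rear_flank, first_index, last_index):
    """Return the text of a rear-only custom barcode arrangement file.

    front_flank and rear_flank are the sequences on either side of the
    barcode; both go into mask1_*, as suggested in the reply above.
    """
    lines = [
        "[arrangement]",
        f'name = "{name}"',
        f'mask1_front = "{front_flank}"',
        f'mask1_rear = "{rear_flank}"',
        # Assumed naming pattern: barcode names in the sequences file
        # would need to look like BC01, BC02, ...
        'barcode1_pattern = "BC%02i"',
        f"first_index = {first_index}",
        f"last_index = {last_index}",
        # Only look for the barcode at the rear of the read.
        "rear_only_barcodes = true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder flanks for illustration only; substitute the real
    # sequences surrounding the barcode in the custom adapter.
    text = render_arrangement("custom_rna", "ACGTACGTAC", "TGCATGCATG", 1, 4)
    with open("barcode_arra.toml", "w") as handle:
        handle.write(text)
```

Longer flanks than the four-base "GGCC" from the original question are used here because, with only one side of the read carrying a barcode, a short mask is more likely to match by chance.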

Hello @malton-ont, thank you for the advice!

Although this is not directly related to the previous topic, I have a question I would like to ask.

My current plan is to perform basecalling, then demultiplex using custom barcode analysis, and subsequently categorize the raw signal (POD5) according to the demultiplexing results.

When looking at the basecalled data, each read has a unique read_id, and I am thinking of using this to match it with the raw data for classification.

Would it be possible to do this? If so, how can it be done? Is there an already established method for this?

I would appreciate your advice on this matter. Thank you!

Yes, this should be possible. Note that any reads that have been split will have new read-ids, so you'll need to look at the pi tag to get the id of the corresponding parent read that would be present in the pod5 file.

You'll probably want to take a look at https://pypi.org/project/pod5/, particularly the filter and subset commands, but that discussion may be better placed on the community forums as that isn't a dorado issue.
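
A minimal sketch of the read-id matching described above, assuming the basecalled BAM has been converted to SAM text (for example with samtools view -h) and that the barcode classification is stored in the BC tag. The pi tag handling follows the note about split reads; the function names and the "unclassified" label are hypothetical.

```python
import csv
from collections import defaultdict


def parse_sam_tags(fields):
    """Turn optional SAM fields such as 'pi:Z:abc' into a {tag: value} dict."""
    tags = {}
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return tags


def pod5_ids_by_barcode(sam_lines):
    """Group POD5 read ids by barcode.

    Split reads carry the parent read id in the pi tag, and that parent id
    is the one present in the POD5 file, so it is used when available.
    """
    groups = defaultdict(set)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        cols = line.rstrip("\n").split("\t")
        read_id = cols[0]
        tags = parse_sam_tags(cols[11:])  # optional fields start at column 12
        parent_id = tags.get("pi", read_id)
        barcode = tags.get("BC", "unclassified")  # assumed barcode tag
        groups[barcode].add(parent_id)
    return {barcode: sorted(ids) for barcode, ids in groups.items()}


def write_table(groups, handle):
    """Write a tab-separated read_id/barcode table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["read_id", "barcode"])
    for barcode in sorted(groups):
        for read_id in groups[barcode]:
            writer.writerow([read_id, barcode])
```

The resulting table can then be handed to the pod5 tools mentioned above; the exact options for pod5 filter and pod5 subset should be taken from the pod5 documentation. Note that a parent read whose split children received different barcodes will appear under more than one barcode and would need to be resolved before subsetting.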

Just jumping into this thread with a Q: do the barcodes in RNA004 have to be RNA, or can they be DNA? It's unclear to me what the basecaller would do for read trimming as I'm guessing it removes a DNA-associated signal.

Eg. If I have ADAPTER-BARCODE-AAAAA-RNA, does the barcode have to be RNA or can it be DNA?

@malton-ont Thank you for the reply! I'll try it.


Hello @billytcl,

According to the SQK-RNA004 Direct RNA Sequencing Kit protocol, the kit provides an adapter for poly(A)+ RNA (the RTA) and also describes a target-specific custom adapter. I prepared my library with a custom adapter.

The custom adapter is made from two DNA primers with partially complementary sequences, so an annealing step is needed when preparing the library.

I don't know the details of the RTA, but it also appears to be composed of DNA strands.

Checking the library protocol should be helpful. (https://nanoporetech.com/document/direct-rna-sequencing-sequence-specific-sqk-rna004)

There are also a few articles about demultiplexing direct RNA seq data. They report that the raw signal differs greatly between the RNA region of the read and the adapter region, because the adapter is DNA and the read is RNA. (https://genome.cshlp.org/content/30/9/1345)

In addition, the Dorado manual says that it detects the DNA adapter sequence, so I assume it trims the DNA adapter automatically. It also has a --no-trim option that disables adapter trimming.

So I think it is fine to make the barcode from DNA.

I had a similar question, and this is what I found when I searched for an answer.
Please let me know if any of it is wrong or out of date.

Dorado attempts to remove any DNA signal from RNA reads - in 0.8.0 this occurs regardless of the --no-trim flag since the RNA basecall model is very unlikely to give accurate results for DNA.

@malton-ont
Then how can DNA barcodes be classified? I designed the library like this:

original : RNA - [RTA] - RLA

mine : RNA - [target-specific-sequence - barcode - rear region of RTA] - RLA

and the barcode is annealed DNA.

If I try to detect this barcode with the command below:

dorado basecaller sup --barcode-arrangement [barcode_arra.toml] --barcode-sequences [barcode_sequence.fastq] [POD5] > [output]

does that mean it cannot detect the barcode sequence?

I also have two thoughts regarding this situation:

Could it be possible that the barcode sequence is not being trimmed because it’s different from the existing RTA sequence? Although I feel like this might not be the case, as the signal itself would likely still be classified as DNA.

If that's not the case, would it be possible to correctly read only the DNA barcode if I performed basecalling using an option that specifically basecalls DNA signals? If that’s possible, I’ve heard that the POD5 file contains information about the library kit. Is it possible to ignore this and still run the Dorado basecaller?

I would expect the barcode to also be removed if it looks like DNA - this is done on signal, not sequence. It may be plausible to basecall the reads first with a DNA model and then subset by barcode and rebasecall with the RNA model - I haven't tried this though, and you may get better answers about this kind of process on the Nanopore community forums.