nanoporetech/dorado

estimate polyA doesn't find tails well with custom primer

Opened this issue · 3 comments

Dear developers,

Recently I am trying some customized methods to complete the full-length measurement of the polya tail on nanopore.
However, I don’t know whether it is because the same primer is used at both ends of the cDNA or for other reasons. Many polya that can be recognized by basecaller cannot be counted by estimate-poly-a function.

Thank you for your time,
shaohui

My Library Diagram

5' ---- ADAPTER ---- AAGCAGTGGTATCAACGCAGAGTAC----ATGGG ---- cDNA ---- poly(A) ---- GTACTCTGCGTTGATACCACTGCTT ---- 3'

As sample

read’s signal

屏幕截图 2024-10-29 193444

bam output

fd4fc9af-6ca3-49d0-9699-2af7174cd85f 4 * 0 0 * * 0 0 TGTGTACGTACTTCGTTCAGCGTATTGCTGCGCACGCACTACAGAAAGCAGTGGTATCAACGCAGAGTACATGGGCTTGTTCTGGGGACATTTCGCGATTGCGGACGTTGAGAGGCCGCTGCCAAAATGCCAGAACGAGATAGTGAGCCCTTCTCTAACCCTTTGGCTCCAGATGGCCACGATGTGGATGATCCTCATTCCTTCCACCAATCAAAACTTACCAATGAAGACTTCAGGAAACTTCTTATGACCCCAAGAGCTGCACCTACTTCTGCGCCACCTTCTAAGTCACGTCACCATGAGATGCCAAGGGAGTACAATGAGGATGAAGACCCAGCTGCACGAAGGAGGAAAAAGAAAAGTTATTATGCCAAGCTTCGCCAGCAAGAAATTGAGAGAGAGAGAACTCGCAGAGAAATACCGGGACCGTGCCAAGGAACGGAGAGATGGTGTGAACAAAGACTATGAGGAAACTGAGCTGATAAGTACCACAGCCAACTACAGGGCTGTGGGCCCCACTGCTGAGGCGGACAAATCAGCAGCAGAGAAGAGAAGACAGTTGATTCAGGAGTCCAAATTCTTGGGTGGTGATATGGAACACGCCCATTTGGTGAAAGGTTTGGATTTTGCGTTGCTTCAAAAGGTGCGCGCTGAGATTGCCAGCAAAGAGAAGGAGGAAGAGGAACTCATGGAAAAGCCCCAAAAGGAAACCAAGAAAGATGAGGATCCTGAGAACAAAATTGAATTTAAAACACGCCTTGGCCGGAATGTGTATCGGATGCTTTTCAAGAGTAAATCATATGAGCGAAATGAGCTGTTCTTACCAGGACGTATGGCCTATGTAGTAGACCTGGATGATGAGTACGCAGACACAGATATCCCCACCACTCTCATACGCAGCAAAGCTGATTGCCCCACTATGGAGGCCCAGACTACACTGACTACAAATGACATTGTTATTAGCAAGCTCACCCAGATTTTGTCATACCTGAGGCAGGGGACCCGAAACAAGAAGCTCAAGAAGAAGGATAAAGGAAAACTGGAAGAGAAGAAACCTCCTGAGGCTGACATGAACATTTTTGAAGACATTGGGGATTACGTTCCTTCTACAACCAAGACACCTCGGGACAAGGAACGTGAGAGATACCGGGAACGTGAACGTGATCGGGAACGGGACAGAGACAGGGAGCGAGACAGGGAGCGAGACCGTGAGAGAGAGGGAGAGAGAGCGAGACCGGGAACGGGAACGAGGAGGAAAAGAAAAGGCACAGCTACTTTGAGAAGCCAAAAGTGGATGATGAGCCCATGGATGTTGACAAAGGACCTGGATCTGCAAAAGAGTTGATCAAGTCCATCAATGAAAAATTCGCTGGGTCTGCTGGCTGGGAAGGCACTGAATCGTTGAAGAAGCCAGAAGATAAGAAGCAGCTGGGCGATTTCTTTGGCATGTCCAACAGTTACGCAGAATGCTATCCAGCCACGATGGATGACATGGCTGTAGATAGTGATGAAGAGGTAGATTATAGCAAAATGGACCAGGGTAACAAGAAGGGTCCCTTAGGCCGCTGGGACTTCGATACTCAGGAGGAAACAGCGAGTACATGAACAACAAGGAGGCTCTGCCCAAGGCTGCATTCCAGTATGGCATCAAGATGTCTGAAGGACGGAAAACCAGACGATTCAAAGAAACCAATGATAAGGCAGAGCTTGATCGACAGTGGAAGAAAATAAGTGCAATCATTGAGAAGAGGAAGAGGATGGAAGCAGATGGGGTCGAAGTGAAAAGACCAAAGTACTAATCTCTAGTTCCAGCTGTCACCACGTGGCTGTTCTTAGTTGCTTGCTTCTACAATTCCTCAGACGGTTGCAAACTGTTGTTGTTTGTGAAAGTTTATAAATGTTTATTGTATAACTCTTTATAGATCTGTGTCCCACATGCTAAGATTAATGGCAATGCAACACCATGTCCAGCATGTTCCATTAAATGTAGTTAAACCTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTTTCTGTAGTGCGTGCGCAGCAATA %&&&'&&%%&'''''(*)++&&&'7888HSSNSQSSSSSSIHSSLKPSSSKSMSSLSSNSSNS::::S:::::NOMLRC@;.-,,,FKLSSMSRSSSKSSSSNSSSKSSSSSSSSLISLSKD@FLMKJMLISSLSSLOSSSPLISRSSSSSSSSSSOSSJNSSSSOIISSGHIISMJSSSLKLPIIJIMMSSSLHKKRSDHMGHHHIIPSSSSSSSSSSPIS::::96=;<7/.(()3DMLSSB@@@@JLOMNSSMMSSSG@@>>10001ABCEFDDCFDJI9333DEJRNSSSOSLSSSMSSSSSSSSSSSSSSMPP===MSSSNKSSSSMSSSMSSSNSSS<;713:;C===FHILQSSSSSSMJSSRLKRSSSSSQSSSSSPJODSSSSSG>=<?3B8HSNSSSSQSSLSJFSSSSMQHGGKNSSPSSSSSSSSSSSOSSSSSSSSSSQSQSSSOSSSSSSSSOSSSSSSSSSSSSSSSSSSSSNSSSSSSSSNSSSSSQSSSSSSSRRLOJLFAAA@?LJLOSSNPNKSSSSMSMKNSSSKGGNJOIHSSLOOJLSSJEFF;;HIHSSOJKLSSSOSSSSSSSSSSQSSSPSSSSLSSSQSNSSSSSSSSSSSSSSSKONJSSSSSSSSSQMMSSSSSSSSNSMSSQNLSGSMMKLSKSPOSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSJSSSSSSSSSSPLRSSSSSSKJJSSSOSSKSSSSSSSSSSSKQSMJKKNSSSSSIOSSSQSSSLRSSSSSLSSSSMNSSNMNRMEDIFFSSSOSSSSSSSSSSMISSSSSSSSSSSSSSSSQSSSSSSSSSKSSSSSSSSOSQSSRSSSSSNSSSSSSSSSSQSSSOSNJSSSSLHKSSSSSSSSSSSQSPSSSSSSOSSSSSIMSNSSSSSSSSSSSJISSSSSSSSSGGGHSSSSSSSSSSSHKSSQSSSLIKLHQMJSSSNSSSSSSSSSSSSSSSSNSSSSKSHSSMLJGSSHSHJHJJILSSSMSSSSSSSRSSSSSSPSJHHIFFFSSSFFGFIGGFHHFSLILSSSSOSJSSNSSSSSSKHQSHOKSKKMPSSSSLLJGJ6334CDB>:43100./3:;<:96666GKOSSFEA@ABASKSHHGGHKNKHOKLHSSSPFOJSPOMIOMSMIK>JSMMHGFJFNSGSIISFCCECHHAIEHJPKSILSF>HDGSSIFFDD7333463429;CA?:9?HE@@@?9998<=IIPSSNSHSKIBB@;@GFOMHKMSSSSSSSSSSMSSSPNMMIOSSSSSSSOMSSSSNSSSSSSSSSPOMJSSSSSPSSQSSLMMSSSSSSSSSSQSMSSSLHIISDBCDDJPSSSQSQSSSSSSSSSSSSSJMSSSSSSRSQSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSRSSSSSSSSSSSSSRSSSSSSSNSSSSSSSHGHSQSQSLSSSSNSLISSSSSPSSSSSSSSSSOSSSOSSSSSSSSSSSPSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSILSSSSSSSSSPMMKSSSSQNSSSSSSSSSOSSSSSSSSSCCCDDSSSMJSSSSSSLMMLSSSSSSSSKSSJFGFHSNSSSNKSSSSMISSSSSSSSSSSSSSSSSSSSNSSSSIHD42436888999:==><>=GSSSSSMSSSSSSSRSSSSSQMSHHKSSSSSSPSSSSSSSSSSSSSJMSSMJISEIE1000007@?CCIMKNSSSSNSSSSPMPSSPMSSSSSSSSSSSSNSSSSSSSSSNQSSSNJHHGIILSSSJSKLSSSSSRSSSSSNSSSSSNLSSOSSSQSSSSSSSLSSSSSNLISSSSSSSOSSSMPSSSSOSRSSSOSSSSSSSNSSSNSRSSSSSSSSSSSSSSSSSSSSSSSPSSSSSSIMSSOSSSSOSSSSSSSSSSSSSSSSSSSSSSSSNSSSSSSSSSSSSJMSSSSLSSSSSSSSSSSSSSSSSLIQLAHSLHMSSSSSSSSQSMJSSSSSS6122374555556667789;=?@BDCGGNKISSPHIHIIHHJJCEDEFSSSFGFHGSSSSSSSSSKGFGHGFJIHIFDCDCBFGIG::92 qs:f:29.0782 du:f:5.587 ns:i:27935 ts:i:10 mx:i:1 ch:i:174 st:Z:2024-08-20T13:54:18.740+00:00 rn:i:22003 fn:Z:PAY63012_pass_2d2e6b65_9300ab27_1974.pod5 sm:f:720.842 sd:f:125.658 sv:Z:pa dx:i:0 RG:Z:9300ab2755c6080067ea07aae5b884a17ba2bb21_dna_r10.4.1_e8.2_400bps_sup@v5.0.0

poly-a-config

[anchors]
front_primer = "ATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

[threshold]
flank_threshold = 0.6

[tail]
tail_interrupt_length = 10

Run environment:

  • Dorado version: 0.8.2+6b413c9
  • Dorado command: dorado basecaller --barcode-both-ends --device cuda:1 --no-trim --estimate-poly-a --poly-a-config ./conf/polya_conf.tmol --verbose ./dna_r10.4.1_e8.2_400bps_sup@v5.0.0 ./pod5s > ./call.bam
  • Operating system: Linux
  • Hardware (CPUs, Memory, GPUs): A100 80G
  • Source data type: pod5

Hi @shaohuishi,

I believe the issue could be that your front_primer definition is incomplete.

Dorado expects cDNA reads to take the form:

 5' --- ADAPTER --- FRONT_PRIMER
... --- cDNA
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

But you have described this format

 5' --- ADAPTER --- ????? --- FRONT_PRIMER 
... --- cDNA 
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

I believe the solution is to extend the definition of your front primer to include the additional part.

[anchors]
front_primer = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

Kind regards,
Rich

Hi @shaohuishi,

I believe the issue could be that your front_primer definition is incomplete.

Dorado expects cDNA reads to take the form:

 5' --- ADAPTER --- FRONT_PRIMER
... --- cDNA
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

But you have described this format

 5' --- ADAPTER --- ????? --- FRONT_PRIMER 
... --- cDNA 
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

I believe the solution is to extend the definition of your front primer to include the additional part.

[anchors]
front_primer = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

Kind regards, Rich

Hi Rich,

Thank you for your quick reply,

However, we tried providing the complete FRONT_PRIMER and it still did not work, we review the script dna_poly_tail_calculator.cpp and found

const bool proceed = flank_score >= threshold && std::abs(dist_v1 - dist_v2) > kMinSeparation

in L59, The default threshold kMinSeparation=10 would cause normal reads to be filtered out due to the presence of the same 25bp primer. As an example dist_v1 = 0 & dist_v2 = 5.

So, we changed this threshold kMinSeparation=2 and recompiled the program. This operation gave us a seemingly normal number of reads with polyA tails, however, it caused some poor quality reads, and even some reads that lacked an adapter at one end, to have unexpectedly long polyA tails.

To eliminate these reads, we set

flank_threshold=0.8 
tail_interrupt_length=2

to match the estimation of nanopore error rate at this primer length, and some biological assumptions.

After this change, we found that in extreme cases, based on the calculation method of flank_score, a primer mismatch on one end will benefit from a perfect match on the other end and make flank_score unable to filter the read.

Therefore, in order to use this function more flexibly and effectively, we recommend changing the calculation logic of flank_score to this:

    float flank_score = 0;
    const float top_flank_score_v1 = 1.f - static_cast<float>(top_v1.editDistance) / (front_primer.length())
    const float bottom_flank_score_v1 = 1.f - static_cast<float>(bottom_v1.editDistance) / (rear_primer.length())
    const float top_flank_score_v2 = 1.f - static_cast<float>(top_v2.editDistance) / (rear_primer.length())
    const float bottom_flank_score_v2 = 1.f - static_cast<float>(bottom_v2.editDistance) / (front_primer.length())
    if (fwd) {
        flank_score = std::min(top_flank_score_v1, bottom_flank_score_v1);
    } else {
        flank_score = std::min(top_flank_score_v2, bottom_flank_score_v2);
    }

We also hope that kMinSeparation can be an additional features that can be customized through a configuration file.

Thank you for your time and suggestions,
shaohui

@shaohuishi,

Thank you for your detailed examination! It appears that the cDNA polyA detection does not anticipate that the front and rear primers are so similar, and this arrangement makes distinguishing between a forward and a reverse strand much more difficult. We make ongoing efforts to improve the polyA detection, and we will investigate your suggestions as we move forward.