XueyiDong/LongReadBenchmark

questions about precision and recall

defendant602 opened this issue · 2 comments

Hi Xueyi,

Nice research work. I have a small question about the precision and recall of isoform detection. How did you calculate the precision and recall of the different tools? If an isoform differs from the ground-truth transcript by only a few base pairs at the exon-intron boundaries, is it still counted as a true positive?

Hi @defendant602 ,

Thanks for reading our paper and thanks for your question!

Most isoform detection tools label known genes and isoforms with their original names. For every isoform detection tool we tested except Cupcake, we counted the detected isoforms that carry the name of a known isoform as true positives and all remaining ones as false positives. For Cupcake, we used the SQANTI classification output. You can look at this script for details:

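# excerpt: str_detect() comes from stringr; s and the s_<tool>_ont tables are created earlier in the script
# sequins discovered (out of 160)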
s_known_bambu_ont <- s_bambu_ont[str_detect(s_bambu_ont$V9, paste(c(s$NAME),collapse="|")),]
s_known_flair_ont <- s_flair_ont[str_detect(s_flair_ont$V9, paste(c(s$NAME),collapse="|")),]
s_known_flames_ont <- s_flames_ont[str_detect(s_flames_ont$V9, paste0(c(s$NAME),";",collapse="|")),]
s_known_sqanti_ont <- unique(s_sqanti_ont[str_detect(s_sqanti_ont$associated_transcript, paste(c(s$NAME),collapse="|")),"associated_transcript"])
s_known_stringtie_ont <- s_stringtie_ont[str_detect(s_stringtie_ont$V9, paste(c(s$NAME),collapse="|")),]
s_known_talon_ont <- s_talon_ont[s_talon_ont$V2=="Sequin",]
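# summary table: one row per tool
summary <- data.frame(
  tool = c("bambu", "Cupcake", "FLAIR", "FLAMES", "StringTie2", "TALON"),
  # sequin-related transcripts reported by each tool
  seq_related_annot = c(nrow(s_bambu_ont), nrow(s_sqanti_ont), nrow(s_flair_ont), nrow(s_flames_ont), nrow(s_stringtie_ont), nrow(s_talon_ont)),
  # reported transcripts matched to a known sequin isoform (true positives)
  known_seq_160 = c(nrow(s_known_bambu_ont), length(s_known_sqanti_ont), nrow(s_known_flair_ont), nrow(s_known_flames_ont), nrow(s_known_stringtie_ont), nrow(s_known_talon_ont))
)
# false positives: reported transcripts that do not match a known isoform
summary$kept_fd <- summary$seq_related_annot - summary$known_seq_160
# precision = TP / (TP + FP), which equals known_seq_160 / seq_related_annot
summary$precision <- summary$known_seq_160 / (summary$known_seq_160 + summary$kept_fd)
# recall = TP / number of known sequin isoforms (160)
summary$recall <- summary$known_seq_160 / 160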
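
To make the arithmetic above concrete, here is a minimal sketch on toy data. The transcript names and attribute strings are hypothetical (they are not sequin data), and base R `grepl` stands in for `str_detect` so the block runs without extra packages. It illustrates the name-matching idea only and is not a copy of the full script.

```r
# Hypothetical names of known isoforms (stand-in for s$NAME in the script above)
known_names <- c("tx_alpha", "tx_beta", "tx_gamma", "tx_delta")

# Hypothetical GTF attribute strings (stand-in for column V9 of a tool's output)
detected_attr <- c(
  'gene_id "g1"; transcript_id "tx_alpha";',
  'gene_id "g1"; transcript_id "tx_beta";',
  'gene_id "g2"; transcript_id "tx_gamma";',
  'gene_id "g2"; transcript_id "novel_1";',
  'gene_id "g3"; transcript_id "novel_2";'
)

# One regular expression that matches any known name
pattern <- paste(known_names, collapse = "|")

# A detected transcript counts as a true positive if its attributes
# contain the name of a known isoform; otherwise it is a false positive
is_known <- grepl(pattern, detected_attr)
tp <- sum(is_known)
fp <- sum(!is_known)

precision <- tp / (tp + fp)           # 3 / 5
recall <- tp / length(known_names)    # 3 / 4

print(data.frame(tp = tp, fp = fp, precision = precision, recall = recall))
```

Because the match is made on names, a transcript whose boundaries differ slightly from the ground truth would count as a true positive in this scheme only if the tool still reported it under the known isoform's name. Note also that a plain `|`-joined pattern matches substrings, so a name that is a prefix of another name can over-match. That is presumably why the FLAMES line above appends `";"` to each name.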

Best,
Xueyi

Thanks for your quick reply, Xueyi. I understand now. Closing this issue.