questions about precision and recall

Hi Xueyi,

Nice research work. I have a little question about the precision and recall of isoform detection. How did you calculate the precision and recall of the different tools? If an isoform differs from the ground-truth transcript by only a few base pairs at the exon-intron boundaries, is it still counted as a true positive?

Hi @defendant602 ,

Thanks for reading our paper and thanks for your question!

Most isoform detection tools label known genes and isoforms with their original names. For all the isoform detection tools we tested except Cupcake, we counted the isoforms that carry a known isoform name as true positives and the remaining ones as false positives. For Cupcake, we used the SQANTI classification output. You can look at this script for details:

LongReadBenchmark/ONT/isoform_detection/analysis/sequins_analysis.R

Lines 62 to 77 in 834623b

    
    # sequins discovered (out of 160)
    s_known_bambu_ont <- s_bambu_ont[str_detect(s_bambu_ont$V9, paste(c(s$NAME),collapse="|")),]
    s_known_flair_ont <- s_flair_ont[str_detect(s_flair_ont$V9, paste(c(s$NAME),collapse="|")),]
    s_known_flames_ont <- s_flames_ont[str_detect(s_flames_ont$V9, paste0(c(s$NAME),";",collapse="|")),]
    s_known_sqanti_ont <- unique(s_sqanti_ont[str_detect(s_sqanti_ont$associated_transcript, paste(c(s$NAME),collapse="|")),"associated_transcript"])
    s_known_stringtie_ont <- s_stringtie_ont[str_detect(s_stringtie_ont$V9, paste(c(s$NAME),collapse="|")),]
    s_known_talon_ont <- s_talon_ont[s_talon_ont$V2=="Sequin",]

    # summary table
    summary <- data.frame(tool = c("bambu", "Cupcake","FLAIR", "FLAMES","StringTie2","TALON"),
                          seq_related_annot = c(nrow(s_bambu_ont), nrow(s_sqanti_ont),nrow(s_flair_ont), nrow(s_flames_ont),nrow(s_stringtie_ont), nrow(s_talon_ont)),
                          known_seq_160 = c(nrow(s_known_bambu_ont),length(s_known_sqanti_ont), nrow(s_known_flair_ont), nrow(s_known_flames_ont), nrow(s_known_stringtie_ont), nrow(s_known_talon_ont)))

    summary$kept_fd <- summary$seq_related_annot - summary$known_seq_160
    summary$precision <- summary$known_seq_160 / (summary$known_seq_160 + summary$kept_fd)
    summary$recall <- summary$known_seq_160 / 160
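
To make the arithmetic concrete, here is a minimal, self-contained sketch of the same name-based counting on toy data. It is only an illustration: the isoform names, the attribute strings, and the variables `known_names` and `reported` are made up for this example, and base R `grepl()` stands in for `str_detect()` so the block runs without extra packages. Note that the match is on names only; no exon-intron coordinates are compared here.

```r
# Toy illustration only: names and counts are invented, not benchmark data.
# Hypothetical ground-truth isoform names (the real script uses s$NAME, 160 entries)
known_names <- c("R1_11_1", "R1_11_2", "R1_12_1", "R1_12_2")

# Hypothetical attribute column (V9 of a GTF) for the isoforms a tool reported
reported <- data.frame(
  V9 = c(
    'gene_id "R1_11"; transcript_id "R1_11_1";',
    'gene_id "R1_11"; transcript_id "novel.1";',
    'gene_id "R1_12"; transcript_id "R1_12_1";',
    'gene_id "R1_12"; transcript_id "R1_12_2";',
    'gene_id "R1_12"; transcript_id "novel.2";'
  ),
  stringsAsFactors = FALSE
)

# Name-based matching: a reported isoform counts as a true positive if its
# attribute string contains any known isoform name.
pattern <- paste(known_names, collapse = "|")
is_known <- grepl(pattern, reported$V9)

tp <- sum(is_known)            # reported isoforms carrying a known name
fp <- nrow(reported) - tp      # every other reported isoform
precision <- tp / (tp + fp)    # share of reported isoforms that are known
recall <- tp / length(known_names)  # share of known isoforms that were reported

print(c(precision = precision, recall = recall))
```

In this toy case three of the five reported isoforms carry a known name, so precision is 3/5 and recall is 3/4. One thing to keep in mind with this kind of pattern: `"A|B"` matches substrings, so a name that is a prefix of another name can match both.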

Best,
Xueyi

Thanks for your quick reply, Xueyi. I understand now. Closing this issue.
