circRNA_benchmarking

This repository contains all data and scripts used to generate the numbers and figures for the circRNA detection tool benchmarking paper published in Nature Methods.

The data folder contains

  • the Supplementary_Table_2_all_circRNAs.txt.gz file, which contains all predicted circRNAs in the untreated sample in this study with their annotation.
  • the Supplementary_Table_3_selected_circRNAs.txt file, which contains all 1560 circRNAs selected for validation, with their initial detection information (tool, BSJ count), their primer information (including FWD and REV primer sequence), results from three validation methods (Cq value with and without RNase R, Cq difference, and amplicon sequencing percent on-target amplification), validation metrics, and annotation information. This file was generated by the 01_calculate_val_rates.R script.
  • the Supplementary_Table_4_all_circRNAs_treated.txt.gz file, which contains all predicted circRNAs in the RNase R treated sample in this study with their annotation.
  • the Supplementary_Table_5_RNase_R_enrichment_seq.txt.gz file, which contains the RNase R enrichment factor calculated based on RNA sequencing data for each circRNA. This file was generated by the 01_calculate_val_rates.R script.
  • the Supplementary_Table_6A_precision_values.txt file, which contains the validation metrics (per-method precision, compound precision, theoretical number of TP circRNAs) for each tool. This file was generated by the 01_calculate_val_rates.R script.
  • the Supplementary_Table_6B_sensitivity_values.txt file, which contains the validation metrics (per-method precision, compound precision, theoretical number of TP circRNAs, estimated sensitivity) for each tool. This file was generated by the 01_calculate_val_rates.R script.
  • the Supplementary_Table_6_tool_ranking.txt file, which is a summary of Supplementary_Table_6A_precision_values.txt and Supplementary_Table_6B_sensitivity_values.txt. This file was generated using the 01_calculate_val_rates.R script.
  • the Supplementary_Table_7_combo_2tools.txt file, which contains the number of circRNAs in the intersection and union of each combination of two tools, per cell line sample (only for circRNAs with BSJ count ≥ 5). This file was generated by the 03_combination_tools.R script.
  • the Supplementary_Table_8_combo_3tools.txt file, which contains the number of circRNAs in the intersection and union of each combination of three tools, per cell line sample (only for circRNAs with BSJ count ≥ 5). This file was generated by the 03_combination_tools.R script.
  • the Supplementary_Table_9_top_tool_combinations.txt file, which contains a list of the top-performing combinations of two tools. The list was composed by selecting, for each cell line, the top 5 combinations in terms of the total number of detected circRNAs (union between both tools) and the weighted compound precision. This file was generated by the 03_combination_tools.R script.
  • the details folder, which contains files that the scripts need to generate some of the Supplementary Figures and Tables. circ_db_hg38.txt is a table with all circRNAs in all circRNA databases from a previous publication.

The data_analysis folder contains

  • the 01_calculate_val_rates.R file, which contains the calculations of the validation metrics (per-method precision, compound precision, theoretical number of TP circRNAs, estimated sensitivity) and generates Supplementary_Table_3_selected_circRNAs.txt, Supplementary_Table_6A_precision_values.txt, Supplementary_Table_6B_sensitivity_values.txt, Supplementary_Table_6_tool_ranking.txt, and Supplementary_Table_5_RNase_R_enrichment_seq.txt.
  • the 02_calculations_paper.R file, which contains all calculations reported in the manuscript.
  • the 03_combination_tools.R file, which contains all calculations for the union and intersection of two or three tools and generates Supplementary_Table_7_combo_2tools.txt, Supplementary_Table_8_combo_3tools.txt, and Supplementary_Table_9_top_tool_combinations.txt.
  • the 04_annotation_and_validation.R file, which contains all calculations described in the paragraph "Comparing precision values in function of circRNA annotation".
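The union and intersection counts described for the tool combinations can be illustrated with a small sketch. This is not the code from 03_combination_tools.R (which is written in R); it is a minimal Python illustration for a single sample, and the function name, the tool names, and the toy BSJ counts are all hypothetical. Only the BSJ count ≥ 5 filter is taken from the description above.

```python
from itertools import combinations

MIN_BSJ = 5  # BSJ count threshold stated in the table descriptions


def combo_overlap(detections, k=2, min_bsj=MIN_BSJ):
    """Count circRNAs in the intersection and union of every k-tool combination.

    detections: dict mapping tool name -> dict of circRNA id -> BSJ count
    (hypothetical input layout, for one sample).
    """
    # Keep only circRNAs that reach the BSJ count threshold for each tool.
    kept = {
        tool: {circ for circ, count in counts.items() if count >= min_bsj}
        for tool, counts in detections.items()
    }
    result = {}
    for combo in combinations(sorted(kept), k):
        sets = [kept[tool] for tool in combo]
        result[combo] = {
            "intersection": len(set.intersection(*sets)),
            "union": len(set.union(*sets)),
        }
    return result


# Toy data: tool names and counts are made up for illustration.
toy = {
    "tool_A": {"chr1:100-200": 7, "chr1:300-400": 5, "chr2:50-90": 2},
    "tool_B": {"chr1:100-200": 12, "chr2:50-90": 9},
    "tool_C": {"chr1:300-400": 6, "chr3:10-80": 4},
}

pairs = combo_overlap(toy, k=2)
triples = combo_overlap(toy, k=3)
for combo, counts in pairs.items():
    print(combo, counts)
```

Running the same function once per cell line sample, with k=2 and k=3, would give tables of the same shape as the two- and three-tool combination files, assuming the per-tool detections are available in this layout.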

The figure_generating folder contains the R scripts and R markdowns to generate all Figures and Supplementary Figures in the manuscript.

correction of coordinates bug

One of the collaborators noticed a mistake in the published data and figures. This mistake has now been rectified in the GitHub repo. The corrections have been submitted to Nature Methods and we are currently waiting for the online publication to be updated. In summary, an accidental one-nucleotide shift changed the BSJ position of about 5% of the circRNAs. All the main conclusions and the majority of the figures stay the same.

In detail: 55,238 out of 1,137,055 (~5%) circRNAs identified in the paper were accidentally shifted by one nucleotide in both the start (-1) and end position (+1), and were therefore wrongly annotated. For example: circRNA chr18:8718424-8720495 in the original data became chr18:8718423-8720496. This set of wrongly annotated circRNAs came from three tools: KNIFE, NCLscan, and NCLcomparator. The shift was introduced by an incorrectly applied ‘correction’ from 1-based to 0-based annotation. This error has now been fixed.
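The effect of the shift can be reproduced in a few lines. The sketch below is only an illustration in Python, not the code used in this repository; the function names are hypothetical, and the circRNA id format (chrom:start-end) is taken from the example above.

```python
def shift_bsj(circ_id, start_delta, end_delta):
    """Return a circRNA id with its start and end positions moved by the given offsets."""
    chrom, positions = circ_id.rsplit(":", 1)
    start, end = positions.split("-")
    return f"{chrom}:{int(start) + start_delta}-{int(end) + end_delta}"


def apply_bug(circ_id):
    """Reproduce the accidental shift: start -1, end +1."""
    return shift_bsj(circ_id, -1, +1)


def undo_bug(circ_id):
    """Reverse the accidental shift: start +1, end -1."""
    return shift_bsj(circ_id, +1, -1)


correct = "chr18:8718424-8720495"
shifted = apply_bug(correct)
print(shifted)

# A shifted id no longer matches the same circRNA reported by another tool,
# so exact-match overlaps between tools are underestimated.
other_tool = {correct}
overlap_with_bug = other_tool & {shifted}
overlap_fixed = other_tool & {undo_bug(shifted)}
```

Because tools are compared by exact BSJ coordinates, a one-nucleotide shift on both ends is enough to make the same circRNA look like two different ones, which is consistent with the overlap and sensitivity changes listed below.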

This mistake had an impact on:

  • the annotation of a subset of circRNAs
  • the overlap among tools
  • the amplicon sequencing precision (subgroup BSJ count ≥ 5), which is slightly higher for KNIFE and NCLcomparator. Their compound precision is therefore also slightly higher.
  • the sensitivity, which has changed because there is more overlap among the tools than initially measured. The set of true positive circRNAs is thus 949 unique circRNAs (instead of 957) (Sup Table 6B). This also slightly changes the tool ranking (Sup Table 6).
  • all tables and sup tables
  • the following figures (most of the changes are small):
    • main panels: 2C, 2D, 3A, 3B, 4A, 5B
    • sup figures: 4, 5, 6, 14, 21, 22, 23, 24, 25, 27, 29, 30, 33, 36, 37, 38, 40

citation

Vromman, M., Anckaert, J., Bortoluzzi, S. et al. Large-scale benchmarking of circRNA detection tools reveals large differences in sensitivity but not in precision. Nat Methods 20, 1159–1169 (2023). https://doi.org/10.1038/s41592-023-01944-6