CRAFT

CRAFT is a computational pipeline that predicts circRNA sequences and their molecular interactions with miRNAs and RBPs, along with their coding potential. CRAFT provides comprehensive graphical visualization of the results, links to several knowledge databases, extensive functional enrichment analysis, and the combination of predictions for different circRNAs. It helps the user explore the potential regulatory networks involving the circRNAs of interest and generate hypotheses about the cooperation of circRNAs in the regulation of biological processes.

Installation

Installation from the Docker image

The Docker image saves you from the installation burden. A Docker image of CRAFT is available from DockerHub at https://hub.docker.com/r/annadalmolin/craft; just pull it with the command:

docker pull annadalmolin/craft:v1.0

Usage

Input data

Prepare your project directory with the following files:

list_backsplice.txt: file with circRNA coordinates. The file format is a tab-separated text file, with circRNA backsplice coordinates in the first column and circRNA strand in the second. An example of list_backsplice.txt is:
```
  4:143543509-143543972	+
  11:33286413-33287511	+
  15:64499292-64500166	+
```
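Before launching a run, it can help to check that every row of list_backsplice.txt follows the format above. The following is a minimal sketch, not part of CRAFT: the function names and the regular expression are assumptions made for illustration, and the check assumes that the start coordinate is always smaller than the end coordinate.

```python
import re
from typing import List, NamedTuple

# Assumed pattern for the first column: chromosome:start-end
COORD_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


class Backsplice(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_backsplice_line(line: str) -> Backsplice:
    """Parse one row of list_backsplice.txt (coordinates <TAB> strand)."""
    fields = line.strip().split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields: {line!r}")
    coords, strand = fields
    match = COORD_RE.match(coords)
    if match is None:
        raise ValueError(f"malformed coordinates: {coords!r}")
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-': {strand!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start >= end:
        raise ValueError(f"start must be smaller than end: {coords!r}")
    return Backsplice(chrom, start, end, strand)


def parse_backsplice_file(path: str) -> List[Backsplice]:
    """Parse every non-empty row of a list_backsplice.txt file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_backsplice_line(row) for row in handle if row.strip()]
```

Calling parse_backsplice_file("list_backsplice.txt") raises a ValueError on the first malformed row, for example when a space was typed instead of a tab.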
path_files.txt: file with the relative paths for Ensembl annotation and genome files. The file format is a text file with a path written in each row, in the following order:
1. path to annotation file
2. path to genome file
An example of path_files.txt is:
```
  /data/input/Homo_sapiens.GRCh38.104.gtf
  /data/input/Homo_sapiens.GRCh38.dna.primary_assembly.fa
```
The gene annotation (in GTF format) and the genome sequence (in FASTA format) files must be downloaded by the user from the Ensembl database and placed in the input/ directory contained in the project directory. Annotation and genome files for Homo sapiens (GRCh38) can be downloaded from http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/ and http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/, respectively.
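Since the paths in path_files.txt refer to locations inside the container, a quick way to check them from the host is to translate them back to the project directory. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the project directory is mounted at /data, as in the example paths above.

```python
from pathlib import Path, PurePosixPath
from typing import Tuple


def read_path_files(path_file: str) -> Tuple[str, str]:
    """Return (annotation_path, genome_path) from a path_files.txt file."""
    text = Path(path_file).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 paths (annotation, genome), found {len(lines)}")
    return lines[0], lines[1]


def container_to_host(container_path: str, project_dir: str, mount: str = "/data") -> Path:
    """Translate a path inside the container into the matching host path.

    Raises ValueError if the path is not located under the mount point.
    """
    relative = PurePosixPath(container_path).relative_to(mount)
    return Path(project_dir).joinpath(*relative.parts)
```

For example, container_to_host("/data/input/Homo_sapiens.GRCh38.104.gtf", ".") points to input/Homo_sapiens.GRCh38.104.gtf in the current directory, and calling .is_file() on the result tells whether the annotation file is in place.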
params.txt: file with the parameters to be set in CRAFT. The file format is a text file with one or more parameters written in each row, in the following order:
1. kind of prediction; it can be "M" for miRNA prediction, "R" for RBP prediction, "O" for ORF prediction, "MR", "MO", "RO" or "MRO" for a combination of the previous.
2. investigated species; it can be one of the species in miRBase database: hsa for Homo sapiens, mmu for Mus musculus, etc.
3. parameters for miRanda tool (optional); in a single row, they must be the miRanda_score and the miRanda_energy, in this order, separated by a tab. The user must set either both parameters or neither; default values are 80 (score) and -15 (energy).
4. parameters for beRBP tool (optional); in a single row, in order and separated by a tab, they must be the PWM/s and the RBP/s investigated. The syntax is: PWM RBP; multiple PWMs (separated by ", ") and the associated RBPs (separated by ", ") are also allowed. The default is all all, searching for all PWMs and RBPs included in the beRBP database. The user must set either both parameters or neither.
5. prefix of the genome and indexes downloaded from the UCSC website; e.g. hg38 for Homo sapiens. The human genome file can be downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ . Genome and indexes must be included in the input/ directory.
6. parameters for ORFfinder tool (optional); in order, separated by tab, the user must specify: the genetic code to use, the start codon to use, the minimal ORF length, whether to ignore nested ORFs and the strand in which putative ORFs are searched. The user must set all parameters or none of them. The allowed options for each parameter are:
  1. genetic code: 1-31, see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details; default: 1
  2. start codon: 0 = "ATG" only, 1 = "ATG" and alternative initiation codons, 2 = any sense codon; default: 0
  3. minimal ORF length (nt): allowed values are 30, 75, or 150; default: 30
  4. ignore nested ORFs (ORFs completely placed within another): allowed values are "TRUE" or "FALSE"; default: "FALSE"
  5. strand (output ORFs on specified strand only): allowed values are "both", "plus" or "minus"; default: "plus"
7. parameters for the graphical output for a single circRNA investigated (optional, but advised); the default parameters are: l=50000, QUANTILE1="FALSE", thr1=0.95, score_miRNA=120, energy_miRNA=-22, QUANTILE2="FALSE", thr2=0.95, dGduplex_miRNA=-20, dGopen_miRNA=-11, QUANTILE3="FALSE", thr3=0.9, voteFrac_RBP=0.15, orgdb="org.Hs.eg.db", meshdb="MeSH.Hsa.eg.db", symbol2eg="org.Hs.egSYMBOL2EG", eg2uniprot="org.Hs.egUNIPROT", org="hsapiens". The user must specify only the parameters to be changed with respect to the default, in a comma-separated list format; the parameter order does not matter. Available parameters:
  1. l: maximum length of circRNAs analyzed
  2. QUANTILE: whether to filter predictions based on a quantile threshold (thr); QUANTILE1 and thr1 are set for miRanda predictions, QUANTILE2 and thr2 for PITA predictions, QUANTILE3 and thr3 for beRBP predictions
  3. score_miRNA and energy_miRNA: respectively, score and energy values of miRanda tool. Best predictions are obtained with higher score and lower energy
  4. dGduplex_miRNA and dGopen_miRNA: respectively, dGduplex and dGopen values of PITA tool. Best predictions are obtained with lower dGduplex and higher dGopen
  5. voteFrac_RBP: voteFrac value of beRBP tool. Best predictions are obtained with higher voteFrac
  6. orgdb and meshdb: databases for miRNA enrichment analysis; the default values are "org.Hs.eg.db" and "MeSH.Hsa.eg.db", respectively (Homo sapiens)
  7. symbol2eg and eg2uniprot: databases for RBP enrichment analysis; the default values are "org.Hs.egSYMBOL2EG" and "org.Hs.egUNIPROT", respectively (Homo sapiens)
  8. org: organism, in the form: human - "hsapiens", mouse - "mmusculus"; the default value is for Homo sapiens
8. parameters for the summary graphical output for all circRNAs investigated (optional, but advised); the default parameters are the same as the previous point. The user must specify only the parameters to be changed with respect to the default, in a comma-separated list format; the parameter order does not matter. Available parameters: the same as before, except for meshdb and org. It is advised to set point 7 and point 8 parameters with the same values.
An example of params.txt file is:
```
  M
  hsa


  hg38
  
  score_miRNA=125, energy_miRNA=-25, dGduplex_miRNA=-22, dGopen_miRNA=-10
  score_miRNA=125, energy_miRNA=-25, dGduplex_miRNA=-22, dGopen_miRNA=-10, voteFrac_RBP=0.3
```
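Because the meaning of each row in params.txt depends on its position, generating the file programmatically can reduce the risk of misplaced rows. The following sketch is not part of CRAFT: the function and argument names are hypothetical, optional rows are left empty to keep the defaults, and it assumes that the leading indentation shown in the example is not required.

```python
from typing import Mapping, Optional, Sequence

VALID_PREDICTIONS = {"M", "R", "O", "MR", "MO", "RO", "MRO"}


def _tabbed(values: Optional[Sequence[object]], expected: int, label: str) -> str:
    """Join an optional group of values with tabs; an empty row keeps the defaults."""
    if values is None:
        return ""
    if len(values) != expected:
        raise ValueError(f"{label}: set all {expected} values or none")
    return "\t".join(str(value) for value in values)


def _key_values(options: Optional[Mapping[str, object]]) -> str:
    """Format graphical parameters as a comma-separated key=value list."""
    return ", ".join(f"{key}={value}" for key, value in (options or {}).items())


def build_params(prediction: str, species: str, genome_prefix: str,
                 miranda=None, berbp=None, orffinder=None,
                 single_plot=None, summary_plot=None) -> str:
    """Return the text of a params.txt file with its 8 rows in order."""
    if prediction not in VALID_PREDICTIONS:
        raise ValueError(f"unknown kind of prediction: {prediction!r}")
    rows = [
        prediction,                          # 1. kind of prediction
        species,                             # 2. miRBase species code
        _tabbed(miranda, 2, "miRanda"),      # 3. score and energy
        _tabbed(berbp, 2, "beRBP"),          # 4. PWM and RBP
        genome_prefix,                       # 5. UCSC genome prefix
        _tabbed(orffinder, 5, "ORFfinder"),  # 6. five ORFfinder options
        _key_values(single_plot),            # 7. single-circRNA graphical output
        _key_values(summary_plot),           # 8. summary graphical output
    ]
    return "\n".join(rows) + "\n"
```

Calling build_params("M", "hsa", "hg38", single_plot={"score_miRNA": 125, "energy_miRNA": -25}) returns eight rows, with rows 3, 4, 6 and 8 left empty; the returned text can then be written to params.txt.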

The project directory must also contain the following directory:

input/: directory containing the following files:
- genome and annotation files from Ensembl database, and genome and indexes files from UCSC databases (see above)
- backsplice_gene_name.txt: file with circRNA gene names. It must be created by the user. The file format is a tab-separated text file, with circRNA backsplice in the first column and circRNA host gene name in the second; the official gene name has to be used. The header line is needed. An example of backsplice_gene_name.txt is:
```
  circ_id	gene_names
  4:143543509-143543972	SMARCA5
  11:33286413-33287511	HIPK3
  15:64499292-64500166	ZNF609
```
- AGO2_binding_sites.bed (optional): file with validated AGO2 binding sites. The file, in BED6 format, must have the following fields: chromosome, start genomic position (0-based), end genomic position, the string "AGO2_binding_site", a dot, the strand. Make sure to use the same genome reference version as that included in the input/ directory. An example of AGO2_binding_sites.bed is:
```
  4    143543521    143543542    AGO2_binding_site    .    +
  4    143543530    143543559    AGO2_binding_site    .    +
  4    143543562    143543607    AGO2_binding_site    .    +
```
  The number of miRNA binding sites overlapping AGO2 binding sites is written to the standard output. Check it to decide whether to keep the AGO2 overlap filter or to re-run the analysis without this information (e.g. when very few sites overlap).
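A malformed AGO2_binding_sites.bed or a mismatch in coordinate conventions is a plausible reason for finding very few overlaps, so a pre-run check can be useful. The sketch below is an illustration under stated assumptions, not CRAFT's own logic: the names are hypothetical, fields are split on any whitespace, backsplice coordinates are assumed to be 1-based and inclusive, and a site is only considered when it lies on the same strand as the circRNA.

```python
from typing import NamedTuple


class BedSite(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    strand: str


def parse_ago2_line(line: str) -> BedSite:
    """Parse one BED6 row of AGO2_binding_sites.bed."""
    fields = line.split()  # assumption: any whitespace separates the fields
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, start, end, name, score, strand = fields
    if name != "AGO2_binding_site" or score != ".":
        raise ValueError(f"unexpected name or score field: {line!r}")
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-': {line!r}")
    site = BedSite(chrom, int(start), int(end), strand)
    if not 0 <= site.start < site.end:
        raise ValueError(f"invalid interval: {line!r}")
    return site


def within_backsplice(site: BedSite, backsplice: str, strand: str) -> bool:
    """True if the site lies inside the genomic span of a 'chr:start-end' backsplice."""
    chrom, span = backsplice.split(":")
    start_1based, end = (int(value) for value in span.split("-"))
    return (
        site.chrom == chrom
        and site.strand == strand
        and site.start >= start_1based - 1  # convert the 1-based start to 0-based
        and site.end <= end
    )
```

Note that within_backsplice only tests the genomic span between the two backsplice ends; it does not know which exons or introns are actually part of the circRNA.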

Running the analysis

To run CRAFT from the Docker container use:

sudo docker run -it -v $(pwd):/data annadalmolin/craft:v1.0

All paths in path_files.txt must be relative to the directory in the container where the volumes were mounted (e.g. /data/input/file_name, as detailed above). If you want the container to write files with your user's permissions, set the user ID with -u `id -u`:

sudo docker run -u `id -u` -it -v $(pwd):/data annadalmolin/craft:v1.0
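When CRAFT is launched from a script, for instance once per project directory, the same command can be assembled in Python. This is only a sketch: build_docker_command is a hypothetical helper, and the flags are exactly those of the commands above.

```python
import os
from typing import List, Optional

IMAGE = "annadalmolin/craft:v1.0"


def build_docker_command(project_dir: str, uid: Optional[int] = None) -> List[str]:
    """Assemble the docker invocation that mounts project_dir at /data."""
    command = ["sudo", "docker", "run"]
    if uid is not None:
        # Same effect as -u `id -u` when uid is the current user's ID
        command += ["-u", str(uid)]
    command += ["-it", "-v", f"{os.path.abspath(project_dir)}:/data", IMAGE]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) from an interactive terminal; on Linux and macOS, os.getuid() provides the value for uid.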

Output data

After CRAFT completes successfully, you will find the following new directories in your project directory:

sequence_extraction/: contains intermediary files for the sequence reconstruction step
functional_predictions/: contains final files of sequence reconstruction step and the three directories for miRNA, RBP and ORF predictions, respectively
graphical_output/: contains the directory general/ with the summary predictions of all circRNAs analyzed, and a directory for each single circRNA with the specific investigation
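A simple way to confirm that a run produced the three directories listed above is to look for them in the project directory. The helper below is a hypothetical sketch and only checks that the directories exist, not their content.

```python
from pathlib import Path
from typing import List

EXPECTED_DIRS = ("sequence_extraction", "functional_predictions", "graphical_output")


def missing_output_dirs(project_dir: str) -> List[str]:
    """Return the expected output directories that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list means that all three directories are present.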

sequence_extraction/

The output files for the sequence reconstruction step are:
- backsplice_sequence_1.fa: file with the retrieved genomic sequence for each circRNA in FASTA format
- backsplice_sequence_1.txt: tab-separated file with the retrieved genomic sequence for each circRNA in TXT format; the file has the circRNA backsplice coordinates in the first column and the sequence in the second
- backsplice_circRNA_length_1.txt: tab-separated file with circRNA sequence length, with circRNA backsplice in the first column and circRNA length in the second
All these files are found in the functional_predictions/ directory.

functional_predictions/

The output files of the functional prediction step are listed below (the final output of each tool is labeled as such in its description):
- miRNA_detection/:
  - backsplice_sequence_per_miRNA.fa: the sequence used for miRNA prediction, obtained by repeating the first 20 nt of the sequence at the end of each circRNA
  - miRanda/:
    - output_miRanda.txt: original output of miRanda
    - output_miRanda_c_per_R.txt: output of miRanda (list of miRNA binding sites), not overlapping with AGO2 binding sites, if AGO2_binding_sites.bed is provided, otherwise this file is missing
    - output_miRanda_per_R.txt: final output of miRanda (list of miRNA binding sites), overlapping with AGO2 binding sites if AGO2_binding_sites.bed is provided, otherwise it contains the list of miRNA binding sites not overlapping with AGO2 binding sites
  - PITA/:
    - pred_pita_results.tab, pred_pita_results_targets.tab, pita.err, pita.log, pred_pita_results.gxp: original output of PITA
    - pred_pita_results_targets_b.txt: output for multiple sites
    - pred_pita_results_c.txt: output of PITA (list of miRNA binding sites), not overlapping with AGO2 binding sites, if AGO2_binding_sites.bed is provided, otherwise this file is missing
    - pred_pita_results_per_R.txt: final output of PITA (list of miRNA binding sites), overlapping with AGO2 binding sites if AGO2_binding_sites.bed is provided, otherwise it contains the list of miRNA binding sites not overlapping with AGO2 binding sites
- RBP_detection/:
  - backsplice_sequence_per_RBP.fa: the sequence used for RBP prediction, obtained by repeating the first 20 nt of the sequence at the end of each circRNA
  - beRBP/:
    - analysis_RBP/:
      - resultMatrix.tsv: original output of beRBP
      - resultMatrix_b.tsv: final output of beRBP in TSV (list of RBP binding sites)
      - resultMatrix_b.txt: final output of beRBP in TXT (list of RBP binding sites)
- ORF_detection/:
  - backsplice_sequence_per_ORF_MIN_LENGTH.fa: the sequence used for ORF prediction (with minimal length of the ORF = MIN_LENGTH), obtained by doubling the circRNA sequence twice
  - ORFfinder/:
    - result_list_ORF_MIN_LENGTH.txt, result_list_CDS_MIN_LENGTH.txt, result_text_ORF_MIN_LENGTH.txt, result_table_ORF_MIN_LENGTH.txt, ORF0_MIN_LENGTH.log, ORF1_MIN_LENGTH.log, ORF2_MIN_LENGTH.log, ORF3_MIN_LENGTH.log, ORF0_MIN_LENGTH.perf, ORF1_MIN_LENGTH.perf, ORF2_MIN_LENGTH.perf, ORF3_MIN_LENGTH.perf: original output of ORFfinder (with minimal length of the ORF = MIN_LENGTH)
    - ORF_backsplice.txt and ORF_backsplice0.txt: final output of ORFfinder (list of detected ORFs), respectively with ORF start position in 1-based and in 0-based format
    - ORF_backsplice_open.txt and ORF_backsplice_open0.txt: final output of ORFfinder (list of detected rolling ORFs), respectively with ORF start position in 1-based and in 0-based format
    - result_list_CDS.fa and result_list_CDS.txt: nucleotide ORF sequence, respectively in FASTA and TXT format
    - result_list_ORF.fa and result_list_ORF.txt: amino acid ORF sequence, respectively in FASTA and TXT format
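The sequences used for miRNA and RBP prediction repeat the first 20 nt at the end of each circRNA, so that binding sites spanning the backsplice junction become visible in a linear sequence. The sketch below illustrates the idea with a toy sequence; extend_for_backsplice is a hypothetical name and this is not CRAFT's own code.

```python
def extend_for_backsplice(sequence: str, overhang: int = 20) -> str:
    """Append the first `overhang` nucleotides to the end of a circRNA sequence."""
    return sequence + sequence[:overhang]


# Toy example with a 3-nt overhang: the motif "GAAC" spans the junction
# between the last nucleotides ("GA") and the first ones ("AC").
linear = "ACGTTTGA"
extended = extend_for_backsplice(linear, overhang=3)
print(extended)
```

In the toy example the motif GAAC is absent from the linear sequence but present in the extended one, which is the reason for the repeated stretch.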
graphical_output/

The output files for the graphical output step are:
- general/: directory with the summary predictions of all circRNAs analyzed:
  - functional_predictions_all_circRNAs.html: output HTML file summarizing all predictions for all circRNAs tested (see CRAFT paper for more details)
  - single figures pulled out from the HTML file
  - All_validated_TGs.csv: table pulled out from the HTML file; it can be loaded into Cytoscape for network analysis
- a directory for each single circRNA with its own predictions:
  - functional_predictions_CIRC_ID.html: output HTML file with the predictions related to CIRC_ID (see CRAFT paper for more details)
  - single figures and tables pulled out from the HTML file

CircRNA sequence provided by the user

If the circRNA sequences are already available to the user, CRAFT can skip the sequence reconstruction step. To let CRAFT use the provided circRNA sequences, follow these steps:

1. create the sequence_extraction/ directory in the project directory
2. add the backsplice_sequence_1.fa, backsplice_sequence_1.txt and backsplice_circRNA_length_1.txt files, in the format described above, to sequence_extraction/
3. add the backsplice_gene_name.txt file, in the format described above, to sequence_extraction/
4. to filter for miRNA binding sites overlapping AGO2 binding sites, also add the file region_to_extract_1.bed to sequence_extraction/. The file, in BED6 format, must have six tab-separated columns: circRNA chromosome, 0-based start position, 1-based end position, backsplice coordinates, score, strand. Each row represents a single region from which the circRNA is assembled (exon, intron, part of an exon/intron or intergenic region). An example of region_to_extract_1.bed is:
```
 11	33286412	33287511	11:33286412-33287511	.	+
 15	64499291	64500166	15:64499291-64500166	.	+
 4	143543508	143543657	4:143543508-143543972	.	+
 4	143543852	143543972	4:143543508-143543972	.	+
```
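The three sequence files required in sequence_extraction/ can be written from a set of known circRNA sequences. The following is a minimal sketch under stated assumptions: write_sequence_files is a hypothetical helper, the FASTA header is assumed to be the backsplice coordinates, and the tab-separated files are assumed to have no header line.

```python
from pathlib import Path
from typing import Mapping


def write_sequence_files(sequences: Mapping[str, str], out_dir: str) -> None:
    """Write the FASTA, TXT and length files for user-provided circRNA sequences.

    `sequences` maps backsplice coordinates (e.g. "4:143543509-143543972")
    to the circRNA sequence.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fasta, table, lengths = [], [], []
    for circ_id, seq in sequences.items():
        fasta.append(f">{circ_id}\n{seq}\n")        # assumed header format
        table.append(f"{circ_id}\t{seq}\n")         # coordinates <TAB> sequence
        lengths.append(f"{circ_id}\t{len(seq)}\n")  # coordinates <TAB> length
    (out / "backsplice_sequence_1.fa").write_text("".join(fasta), encoding="utf-8")
    (out / "backsplice_sequence_1.txt").write_text("".join(table), encoding="utf-8")
    (out / "backsplice_circRNA_length_1.txt").write_text("".join(lengths), encoding="utf-8")
```

The backsplice_gene_name.txt and region_to_extract_1.bed files are not covered by this sketch and still need to be prepared separately.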

Additional notes

Functional enrichments on validated target genes of miRNAs with predicted binding sites in circRNA sequences can be performed only for Homo sapiens (hsa), Mus musculus (mmu) and Rattus norvegicus (rno) species.
The clarity and readability of the output improve as the filtering stringency increases; e.g., if a figure is hard to read or CRAFT crashes because of too many predictions, simply re-run the graphical part of the analysis with more stringent thresholds.
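To give an idea of how stricter thresholds reduce the number of predictions, the sketch below applies score and energy cut-offs to a few made-up miRanda-like hits. It is an illustration only: the hit names and values are invented, and whether CRAFT treats the thresholds as inclusive is an assumption.

```python
from typing import List, NamedTuple


class MirandaHit(NamedTuple):
    mirna: str
    score: float
    energy: float


def filter_hits(hits: List[MirandaHit], score_min: float = 120.0,
                energy_max: float = -22.0) -> List[MirandaHit]:
    """Keep hits with a high enough score and a low enough energy."""
    return [hit for hit in hits if hit.score >= score_min and hit.energy <= energy_max]


# Made-up hits, for illustration only
hits = [
    MirandaHit("miR-A", 150.0, -30.0),
    MirandaHit("miR-B", 122.0, -23.0),
    MirandaHit("miR-C", 130.0, -18.0),
]
default_kept = filter_hits(hits)                                  # default thresholds
strict_kept = filter_hits(hits, score_min=125.0, energy_max=-25.0)  # stricter thresholds
print(len(default_kept), len(strict_kept))
```

With the default values (score_miRNA=120, energy_miRNA=-22) two of the three hits pass, while the stricter values from the params.txt example (125 and -25) keep only one.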

How to cite

If you use CRAFT for your analysis, please add the following citation to your references:

Dal Molin A, Gaffo E, Difilippo V, Buratin A, Tretti Parenzan C, Bresolin S, Bortoluzzi S, CRAFT: a bioinformatics software for custom prediction of circular RNA functions, Brief Bioinform. 2022 Mar 10;23(2):bbab601.
