Analysis of GSE118959 circRNAs in LNCaP cell lines
Run DE_circRNA.Rmd
script to output results for DE circRNAs in Clone1 vs control & Clone9 vs control.
To get the overlapping circRNAs present in upregulated/downregulated sets, use the following bash command:
awk -F'\t' 'NR==FNR{c[$2$2]++;next};c[$2$2] > 0' clone1_upreg.txt clone9_upreg.txt > up_regulated_overlap.txt
awk -F'\t' 'NR==FNR{c[$2$2]++;next};c[$2$2] > 0' clone1_downreg.txt clone9_downreg.txt > down_regulated_overlap.txt
The above code returns matching lines using column 2 as the key. As this analysis will be an unweighted graph, there is no need to worry about which Fold Change value to take, I am simply interested in the circRNA ID and its genomic coordiantes.
Firstly, I will use targetscan (7.0) to perform this analysis on a test "UTR" file. Note that targetscan requires RNAfold and 2 perl libraries in order to work Statistics::Lite
and Bio::TreeIO
. Bio::TreeIO
was a nasty install, I had to install cpanminus
(cpanm
) which will install missing dependencies for the library you are trying to install. It failed, so the --force
option had to be used along with sudo
:
curl -L https://cpanmin.us | perl - --sudo App::cpanminus
sudo cpanm Bio::TreeIO -f -n
sudo cpanm Statistics::Lite -f -n ##-n no test
For fresh install of perl (useful for lugh)
delete /home/perl5 dir
follow instructions here: https://gist.github.com/ckandoth/1f01d8f3692bb8de7f2929f259a4035f
after install, run curl -L https://cpanmin.us | perl - App::cpanminus
installs automatically to /home/perl5
then install final 2 libraries via cpanm.
have not tested install without cpanm part.
I am still testing the software. Once it is running, I will have to develop a pipeline to extract DNA sequences, convert to RNA and then format them for targetscan input. (use samtools faidx here, also sed 's/T/U/g'
for DNA to RNA conversion).
For your benefit and others, detailed below is how to execute the TargetScan pipeline using perl scripts downloadable from http://www.targetscan.org/cgi-bin/targetscan/data_download.vert72.cgi (bottom of page has executable scripts).
Step 1: Run targetscan_70.pl
The script takes two input files:
- A tab-delimited file that lists the miRNA seed sequences and the species in which they are present.
- A tab-delimited multiple sequence alignment of the 3' UTRs of genes from the desired species.
To generate file 1, download the miR Family info file from TargetScan, and subset it for human miRNA and remove duplicate miRNA entries:
wget http://www.targetscan.org/vert_72/vert_72_data_download/miR_Family_Info.txt.zip && unzip miR_Family_Info.txt.zip && rm miR_Family_Info.txt.zip
grep "hsa" miR_Family_Info.txt | awk '{print $1, $2, $3}' | sort -u -k1 | tr ' ' '\t' > hsa_miR_Family_Info.txt
For now, the UTR file has been manually created as a tab separated file:
hsa_circ_0022393 9606 GGACCUGGCCUGGGCCGUCAGCUACUACAUCCGGUUCUUCAUCACCUACAUCCCUUUCUACGGCAUCCUGGGAGCCCUCCUUUUCCUCAACUUCAUCAGGUGCCUGGGCUUUGCUGAUUCAUCCCUGGGCCCUCCACUGGCACUGAUGAUGGCUUUGGCAUGAGAAGAGGCUCGGACGGCUCCACUUUGCCUGGGGACCUCCCAUCCUGGCCCCUUGGCAGGGGCUGUCCUGGGCUGGUUGGGAGAGUGGAUCUGUUCCCACUGUGGCUGGGCCCCCGGAGGCCCCUGCGCUGAGCUGUGUUCUCUUGCAGGUUCCUGGAGAGCCACUGGUUUGUGUGGGUCACACAGAUGAAUCACAUCGUCAUGGAGAUUGACCAGGAGGCCUACCGUGACUGGUUCAGUAGCCAGGUAGGGAAGUCAGGGCCGGUCACCAGAGCCUGGUCCCAAUGUCCAUGUCCUGGCCCCAGAUGAUCUGCACCUGCUAACACAGGGCUGUGCUGGGCCUCCCGGGGCUGCCUGUGUCCUUUUGCUGGGAGCUUCAGGGCUGAGCAAGAAAGGGCAGCAGGAAUCCCCCAGAGGGACAGGUGGGGCAUCUGGGUGGGCUGAGGCCAUCAGGCAGGACGGUAUGAUGUGGACGGGUGGCCUGGAGACCCUGGGUAAGGCUGGGCCCCCUGGGAAUGAGGCCGGGCCCUUGGGCCUUCCUUUGUUCCCUGACACUCAUCCCCUCCACUGACAGCUGACAGCCACCUGCAACGUGGAGCAGUCCUUCUUCAACGACUGGUUCAGUGGACACCUUAACUUCCAGAUUGAGCACCA
Run the command:
../../scripts/targetscan_exe/targetscan_70.pl ../../data/TargetScan/hsa_miR_Family_Info.txt ../../data/TargetScan/hsa_circ_0022392.txt targetscan_70_output.txt
Step 2: Run targetscan_70_BL_bins.pl & targetscan_70_BL_PCT.pl
Two Perl scripts calculates conserved branch length (BL) and probability of conserved targeting (PCT).
Input files for targetscan_70_BL_bins.pl
:
- a UTR file = a tab-delimited multiple sequence alignment of the 3' UTRs of genes from the desired species
which is the same as the input file for
targetscan_70.pl
. - The script also requires a directory called "PCT_parameters" containing the following file:
Tree.generic.txt
. This is a tree file describing the phylogeny of all TargetScanHuman 7 species as defined by the conservation of their UTRs.
The script must be run in the same directory as PCT_parameters
, you can write the output files to a results dir.
../../scripts/targetscan_exe/targetscan_70_BL_bins.pl hsa_circ_0022392.txt > ../../results/circRNA_TargetScan/UTRs_median_BLs_bins.my_output.txt
Input files for targetscan_70_BL_PCT.pl
:
- miRNA file, the same miRNA file used by
targetscan_70.pl
. - Predicted targets, the output file generated by
targetscan_70.pl
(targetscan_70_output.txt
) - UTR bin info, the output file generated by
targetscan_70_BL_bins.pl
(UTRs_median_BLs_bins.my_output.txt
)
The script must be run in the same directory as PCT_parameters
, you can write the output files to a results dir.
../../scripts/targetscan_exe/targetscan_70_BL_PCT.pl hsa_miR_Family_Info.txt ../../results/circRNA_TargetScan/targetscan_70_output.txt ../../results/circRNA_TargetScan/UTRs_median_BLs_bins.my_output.txt > ../../results/circRNA_TargetScan/targetscan_70_output.BL_PCT.my_output.txt
Step 3: run targetscan_70_contect_scores.pl
Perl script targetscan_70_context_scores.pl
calculates context++ scores for a set of miRNA targets predicted by targetscan_70.pl
and further processed by targetscan_70_BL_PCT.pl
. Both of these steps must precede context++ score calculation.
The script takes several input files: specified in command:
- a miRNA file: a tab-delimited text file of mature miRNA information. This is different from the file required by
targetscan_70.pl
, where we collapsed replicate miRNA entries in field 1. This contains full miRNA sequences which have variation so do not collapse the information. Generate the file usingmiR_Family_Info.txt
that we downloaded in step 1:
grep "hsa" miR_Family_Info.txt | cut -f1,3,4,5 | tr ' ' '\t' > miR_for_context_scores.txt
- a UTR file: a tab-delimited multiple sequence alignment of the 3' UTRs of genes from the desired species which is the same as the input file for
targetscan_70.pl
. - a predicted targets file with BLSs and PCTs: output from
targetscan_70_BL_PCT.pl
(targetscan_70_output.BL_PCT.my_output.txt
) - ORF lengths file (for ORFs matching 3' UTRs in UTR_file): contains the length of each ORF corresponding to aligned 3' UTRs. Is outputted in the step below.
hsa_circ_0022393 9606 816
- ORF 8mer counts file: contains the number of 8mer sites in ORFs of file (4). You generate this file using a perl script
targetscan_count_8mers.pl
:
./targetscan_count_8mers.pl hsa_miR_Family_Info.txt hsa_circ_0022392.txt > hsa_circ_0022392_8mer_counts.txt
Gives 8mer counts and sequence length file.
with names hard-coded in the analysis script: 6. "TA_SPS_by_seed_region.txt": contains TA and SPS parameters for each seed region 7. "Agarwal_2015_parameters.txt": with model parameters to calculate context++ score contributions 8. "All_cell_lines.AIRs.txt": AIRs for each region of each 3' UTR. Run the following code to download the full file and grab the first 4 columns:
wget http://www.targetscan.org/vert_70/vert_70_data_download/3Pseq_tags_AIRs.zip && unzip 3Pseq_tags_AIRs.zip && awk '{print $1,$2,$3,$4}' All_cell_lines.AIRs.txt | tr ' ' '\t' > All_cell_lines.AIRs_tmp.txt && rm All_cell_lines.AIRs.txt && mv All_cell_lines.AIRs_tmp.txt All_cell_lines.AIRs.txt
Run the command
../../scripts/targetscan_exe/targetscan_70_context_scores.pl miR_for_context_scores.txt hsa_circ_0022392.txt ../../results/circRNA_TargetScan/targetscan_70_output.BL_PCT.my_output.txt hsa_circ_0022392.lengths.txt hsa_circ_0022392_8mer_counts.txt ../../results/circRNA_TargetScan/targetscan_70_context_scores_output.txt