Manual of TSscan

1. System Requirements

The TSscan pipeline runs on a 64-bit Linux operating system (e.g., Bio-Linux 6; see http://nebc.nerc.ac.uk/ for more information). The BLAT and BFAST aligners can be downloaded from http://genome.ucsc.edu/ (the UCSC Genome Browser) and http://sourceforge.net/apps/mediawiki/bfast/, respectively. All source code can be compiled with g++, and a makefile that builds all executable programs automatically is provided. Note that the system must support OpenMP to compile the source code. The compiled TSscan programs are also available from our website at http://idv.sinica.edu.tw/trees/TSscan/TSscan.html

2. Preparation

The initial input data include the reference sequences, the long read data, and the short read data.

2.1 Reference sequences

The following three data sets are retrieved from the reference sequences (e.g., hg19 or GRCh37).

(1) Data set 1: the whole reference genomic sequences.
The whole reference genomic sequences should be downloaded in full from the UCSC Genome Browser. They include the chromosome sequences, the mitochondrial genome, and the unplaced/unlocalized sequences (i.e., chr*_random and chrUn_*).

(2) Data set 2: the processed mitochondrion genomic sequences.
Mitochondrial genomes are circular. To comprehensively detect possible fusion sequences in a mitochondrial genome, we make a copy of the genome and join the two copies end to end. The resulting sequences are designated "processed mitochondrion genomic sequences". They can be generated from the mitochondrial genome with the following UNIX commands.

    head -1 chrM.fa > chrM.title
    cat chrM.fa | grep -v "^>" > chrM.seq
    cat chrM.title chrM.seq chrM.seq > RepChrM.fa

(3) Data set 3: the annotated RNA sequences.
The annotated RNAs are downloaded from the UCSC Genome Browser and the Ensembl Genome Browser (http://www.ensembl.org/). To minimize mapping errors caused by unsequenced gaps, it is preferable to detect trans-splicing candidates in a model species with high-quality genomic sequences and annotations.

2.2 Long read data

The polyA tails of the 454 reads should be removed, and the raw sequencing data of the long 454 reads should be converted into fasta format.

2.3 Short read data

The raw sequencing data of the short reads should be converted into fastq format.

Then place all data sets and the TSscan files in the same folder. While TSscan is running, do not move or rename any file.

3. The Pipeline of TSscan

The TSscan pipeline consists of the following steps (see Fig. 1).

Step 1: Identifying chimeric RNA candidates by BLAT-aligning long reads against the reference genome.

1.1 Mapping the long reads to Data set 1 (the whole reference genomic sequences) by BLAT
    Example: blat RefGenome.fa longreads.fa out_step1_1.psl
    Note: If the BLAT alignments are run chromosome by chromosome, all results should be merged into a single psl-formatted file and sorted by long read ID (i.e., "query ID", the 10th column of the psl-formatted file).

1.2 TSscan1of4 out_step1_1.psl longreads.fa out_step1_2.fa
    Usage: TSscan1of4 [psl] [fasta] [output]
    [psl]    the result of the BLAT alignment between the long reads and the reference genome.
    [fasta]  the long reads in fasta format.
    [output] name of the output file.

1.3 Mapping the output file of Step 1.2 to Data set 2 and Data set 3 (the processed mitochondrion genomic sequences and the annotated RNA sequences) by BLAT
    Example: blat out_step1_2.fa longreads.fa out_step1_3.psl

1.4 TSscan2of4 out_step1_3.psl longreads.fa out_step1_4.fa
    Usage: TSscan2of4 [psl] [fasta] [output]
    [psl]    the output file of Step 1.3.
    [fasta]  the long reads in fasta format.
    [output] name of the output file.
1.5 Mapping the output file of Step 1.4 to the unplaced/unlocalized sequences (i.e., chr*_random and chrUn_*) by BLAT
    Example: blat out_step1_4.fa longreads.fa out_step1_5.psl

1.6 TSscan3of4 out_step1_5.psl longreads.fa out_step1_6.fa
    Usage: TSscan3of4 [psl] [fasta] [output]
    [psl]    the output file of Step 1.5.
    [fasta]  the long reads in fasta format.
    [output] name of the output file.

Step 2: Excluding candidates without the support of short RNA-Seq reads.

2.1 Mapping the short reads to the output file of Step 1.6 by BFAST
    Note: Please see the BFAST page at http://sourceforge.net/projects/bfast/files/ for details.

2.2 For Illumina RNA-Seq reads:
    cat out_step2_1.sam | ./TSscanSamParser.NT out_step1_6.fa > out_step2_2.sam
    For color space reads (SOLiD reads):
    cat out_step2_1.sam | ./TSscanSamParser.CS50 out_step1_6.fa > out_step2_2.sam
    Note: This step parses the SAM output of Step 2.1 using the output file of Step 1.6. In the current version, the length of Illumina RNA-Seq reads is limited to 50 bases, and the length of color space reads must be exactly 50 bases.

2.3 cat shortreads.fastq | ./FastqOut out_step2_2.sam 1 > out_step2_3.fastq
    Note: This step extracts the short reads that remain in the output SAM file of Step 2.2.

2.4 Mapping the output file of Step 2.3 to Data sets 1-3 by BFAST. All resulting SAM files are then merged into a single SAM file.
    Note: Please see the BFAST page at http://sourceforge.net/projects/bfast/files/ for details.

2.5 TSscan4of4 out_step2_2.sam out_step2_4.sam longreads.fa out_step2_5.out
    Usage: TSscan4of4 [sam1] [sam2] [fasta] [output]
    [sam1]   result file of mapping short reads to junction sequences (in SAM format).
    [sam2]   result file of mapping short reads to the reference genomic sequences (in SAM format).
    [fasta]  the long reads in fasta format.
    [output] name of the output file.

After that, users can manually filter out potential experimental artifacts (Step 3 of Fig. 1) and potential genetic rearrangement events (Step 4 of Fig. 1) using the criteria stated in the text and Figure 1.
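
The note in Step 1.1 asks that per-chromosome BLAT results be merged into one psl-formatted file sorted by query ID (the 10th column). The following is a minimal sketch of one way to do this, not part of TSscan itself. The function name merge_psl and the per-chromosome file names are hypothetical. The sketch assumes the inputs are tab-separated BLAT outputs whose alignment rows begin with a digit. It also assumes that keeping a single copy of the five-line psl header is acceptable to TSscan1of4, which this manual does not state.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TSscan): merge per-chromosome
# BLAT outputs into one PSL file sorted by query ID (column 10).
# Usage: merge_psl OUTPUT INPUT1.psl [INPUT2.psl ...]
merge_psl() {
    out="$1"
    shift
    tab="$(printf '\t')"
    {
        # Assumption: if the first input carries the standard five-line
        # PSL header, copy it once to the top of the merged file.
        if head -n 1 "$1" | grep -q '^psLayout'; then
            head -n 5 "$1"
        fi
        for f in "$@"; do
            # Alignment rows start with a digit (the "match" count);
            # header lines do not. An input with no rows is not an error.
            grep '^[0-9]' "$f" || true
        done | LC_ALL=C sort -t "$tab" -k10,10
    } > "$out"
}
```

For example, with hypothetical inputs named chr1.psl, chr2.psl, and so on, `merge_psl out_step1_1.psl chr*.psl` would produce the file used in Step 1.2. Sorting under LC_ALL=C gives a byte-wise order that does not depend on the locale. If TSscan1of4 turns out to require a headerless file, the `if` block can simply be removed.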