Corset-tools: A Python repository from patrick-douglas

#Corset-tools Companion scripts for working with transcript clusters generated by Corset.

##Fetch Cluster Sequences Given a list of clusters of interest, a transcript fasta, and a Corset mapping file, fetchClusterSeqs.py will return a fasta file of transcripts belonging to those clusters (with cluster names appended).

This script is intended for use, following differential expression analysis, to extract and annotate transcript belonging to DE clusters.

###Usage fetchClusterSeqs.py -i transcripts.fa -t myfavClusters.csv -o transcriptsWithclusterTags.fa -c clusters.txt

###Options -i, --inFasta Multi fasta to extract subset from.

-t, --targetClust Comma delimited file with target cluster names in column one.

-c, --clustMap Corset transcript-to-cluster mapping file.

-h, --help Show help message and exit.

-o, --outFasta Directory for new sequence file to be written to. i.e. output/newfile.fa

-l --longest If set only report the longest transcript per cluster

Cross Cluster Counter

When using Corset to cluster transcriptomes from two species/isolates/cultivars/etc., sequence variation is likely to exist between transcriptomes and it is important to check that your Corset clustering has not excessively split clusters based on transcriptome set.

The crossClustCount.py script will take a corset cluster mapping (transcript --> cluster), as well as fasta files for those transcriptomes, and calculate the number of members from each source transcriptome in each cluster.

Beware of clusters that only contain transcripts from one transcript set when running differential expression analyses - these may create a false differential expression signal.

If you have many clusters in which reads from only one transcriptome set are present, you might consider adjusting your --score-min threshold in Bowtie2 to be more tolerant of variation when mapping reads between transcriptomes.

###Usage crossClustCount.py -X transcripts_X.fa -Y transcripts_Y.fa -x SetX_Label -y SetY_Label -o outfile.tab -c clusters.txt

###Arguments -h, --help Show help message and exit.

-X, --transFastaX Fasta file for transcriptome X.

-Y, --transFastaY Fasta file for transcriptome Y.

-x, --transNameX Unique lable for transcriptome X.

-y, --transNameY Unique lable for transcriptome Y.

-o, --outFile Location for summary file to be written to.

-c, --clustMap Corset transcript-to-cluster mapping file.

Transcriptome Distance Calculator

When using Corset to cluster transcriptomes from two different isolates/individuals/cultivars/species natural sequence variation is likely to occur between input datasets. For equivalent transcripts to cluster correctly it is essential that reads are able to cross-map to their eqivalent location in the non-self transcriptome.

In Bowtie2, the minimum score for a 'valid' read alignment is set via the --score-min option. The format is [type,intercept,slope], with the default setting -L,-0.6,-0.6.

The equation for a 100 bp read would be -0.6 + (-0.6 * 100) = -60.6

The default Bowtie 2 penalty for a mismatched base is -6. An alignment gap of length 2 receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension).

Therefore a 100bp read alignment is still considered valid with 5 mismatches (-5 * 5 = -30), or one gap + one mismatch (-6 + (-5 + -3) = -14).

Failing to set a minimum score threshold that is below (more negative than) the expected distance between homologous transcripts will likely result in transcript clusters that split by source transcriptome; or if homologous clusters are predicted, lower mapping efficiency between sets may create a false differential expression signal in later analyses (genes that are more variable between isolate two isolates may appear to be differentially expressed when exposed to the same treatment condition.)

Conversely, setting --score-min to high will likely result in excessive collapse of paralogues.

The Transcriptome Distance Calculator compares equivalent transcript pairs from two transcriptomes and determines the distance (gaps and mismatches) for each pair, and whether mapping would be successful for a given read length and --score-min setting.

Transcript pairs may be provided as a tab delimited list; alternatively, the script can perform a reciprocal best blast analysis if provided with tab formated blast results for SetA vs SetB, and SetB vs SetA.

Pending feature: Advise user of the minimum alignment score that must be tolerated in order to allow cross-mapping for 95% of the transcript population.

###Usage Pairs provided: transDistCalc.py -r 100 -i -0.6 -s -0.6 -n pairnames.txt -a FASTA_A.fa -b FASTA_B.fa -g -5 -e -3 -m -6 -o alignmentreports.txt

Get pairs from reciprocal best blast: transDistCalc.py -r 100 -i -0.6 -s -0.6 -x BLAST_AvsB.tab -y BLAST_BvsA.tab -a FASTA_A.fa -b FASTA_B.fa -g -5 -e -3 -m -6 -o alignmentreports.txt -w

###Options -h, --help Show help message and exit.

-r, --readLength Length of reads intended for mapping to transcriptomes. (after quality trimming).

-i, --scoreMinIntercept Intercept value for Bowtie2 --score-min function.

-s, --scoreMinSlope Slope value for Bowtie2 --score-min function.

-n, --pairNames Two column tab delimited file containing names for sequences from fastaA and fastaB to be aligned.

-x, --blastAvB Blast tab result file for fastaA query against fastaB subject.

-y, --blastBvA Blast tab result file for fastaB query against fastaA subject.

-a, --fastaA Multifasta set A.

-b, --fastaB Multifasta set B.

-g, --gapOpen Penalty for opening a gap.

-e, --gapExtend Penalty for extending a gap.

-m, --mismatch Pentaly for a mismatched position.

-o, --outFile Write output table to file.

-w, --recipFile Write reciprocal blast pairs to file. True if flag is set.

patrick-douglas/Corset-tools

Cross Cluster Counter

Transcriptome Distance Calculator