module load java
module load samtools/1.10
module load jellyfish/2.3.0
module load bowtie2/2.4.4
module load python/3.8.5
module load salmon/0.13.1
# Run the following command through SLURM's srun
~/Trinity/trinityrnaseq-v2.11.0/Trinity --seqType fq --left 39869_TTAGGC_C8W3EANXX_3_20160523B_20160523.1.fastq,39870_AGTTCC_C8W3EANXX_3_20160523B_20160523.1.fastq --right 39869_TTAGGC_C8W3EANXX_3_20160523B_20160523.2.fastq,39870_AGTTCC_C8W3EANXX_3_20160523B_20160523.2.fastq --CPU 20 --max_memory 99G
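With more than a couple of libraries, typing the comma-separated --left and --right lists by hand becomes error-prone. The sketch below builds them from the file names instead. It assumes that forward reads end in .1.fastq and reverse reads in .2.fastq, as in the command above, and that file names contain no spaces or newlines. The helper name join_fastq is hypothetical and not part of Trinity.

```shell
# join_fastq (hypothetical helper): print every file in the current directory
# whose name ends in the given suffix as one comma-separated list.
# The shell expands the glob in sorted order, so the left and right lists
# stay paired as long as mates share the same file-name prefix.
# Assumption: at least one file matches; otherwise the literal pattern is printed.
join_fastq() {
    printf '%s\n' *"$1" | paste -sd, -
}

# Usage, from the directory that holds the reads:
#   ~/Trinity/trinityrnaseq-v2.11.0/Trinity --seqType fq \
#       --left "$(join_fastq .1.fastq)" --right "$(join_fastq .2.fastq)" \
#       --CPU 20 --max_memory 99G
```

It is worth echoing both lists once before launching the assembly to confirm that the mates appear in the same order.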
The script below keeps the longest isoform per gene. It works only on Trinity assemblies because it relies on Trinity's transcript naming scheme. It ships with the Trinity package (under util/misc/) and can also be obtained separately.
perl ~/Trinity/trinityrnaseq-v2.11.0/util/misc/get_longest_isoform_seq_per_trinity_gene.pl Trinity.fasta > Trinity_longest.fasta
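To illustrate why the transcript names matter: Trinity IDs such as TRINITY_DN1000_c115_g5_i1 end in an isoform suffix (_i1, _i2, ...), and everything before that suffix identifies the gene. The awk sketch below mimics the idea under that assumption. It is not the Trinity script itself; the function name longest_isoform is hypothetical, ties keep the first isoform seen, and each sequence is written on a single line.

```shell
# longest_isoform (hypothetical helper): for each Trinity gene, print the
# header and sequence of its longest isoform. $1 = Trinity FASTA file.
longest_isoform() {
    awk '
        # Store the record just read if it is the longest so far for its gene.
        function keep_if_longest() {
            if (id != "" && length(seq) > best_len[gene]) {
                best_len[gene] = length(seq)
                best_hdr[gene] = hdr
                best_seq[gene] = seq
            }
        }
        /^>/ {
            keep_if_longest()             # finish the previous record
            hdr = $0
            id = substr($1, 2)            # first word without ">"
            gene = id
            sub(/_i[0-9]+$/, "", gene)    # drop the isoform suffix
            if (!(gene in seen)) { seen[gene] = 1; order[++n] = gene }
            seq = ""
            next
        }
        { seq = seq $0 }                  # join wrapped sequence lines
        END {
            keep_if_longest()
            # Print genes in the order they first appeared in the file.
            for (k = 1; k <= n; k++) {
                g = order[k]
                if (g in best_hdr) { print best_hdr[g]; print best_seq[g] }
            }
        }
    ' "$1"
}

# Usage: longest_isoform Trinity.fasta > Trinity_longest_sketch.fasta
```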
A full description of EvidentialGene is available on its website. The main thing it does is captured in this direct quote from the same page: "EvidentialGene tr2aacds.pl is my new, "easy to use" pipeline script for processing large piles of transcript assemblies, from several methods such as Velvet/O, Trinity, Soap, .., into the most biologically useful "best" set of mRNA, classified into primary and alternate transcripts." You can run it on several assemblies at once or on a single one. The output files carry an .okay.xxx extension (.okay.cds, for instance).
module load java
module load blast
module load exonerate
module load cdhit
cat Assembly_1.fasta Assembly_2.fasta > assemblies.fasta
perl /nfs/scistore18/vicosgrp/melkrewi/genome_assembly_december_2021/774.genome_guided_transcriptome_assembly/evigene/scripts/prot/tr2aacds.pl -cdnaseq assemblies.fasta
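When the assemblies being concatenated come from the same assembler, their transcript IDs can collide (two Trinity runs may both contain TRINITY_DN0_c0_g1_i1, for example), which makes downstream results ambiguous. One way to avoid this, assuming renaming the sequences is acceptable, is to prefix every header with a per-assembly tag before the cat step. This is only a sketch: the function name tag_fasta and the tags asm1/asm2 are hypothetical, and tags are assumed to be plain alphanumeric strings.

```shell
# tag_fasta (hypothetical helper): rewrite each ">id ..." header as
# ">tag_id ..." and leave sequence lines untouched.
# $1 = tag (alphanumeric), $2 = FASTA file
tag_fasta() {
    awk -v tag="$1" '/^>/ { sub(/^>/, ">" tag "_") } { print }' "$2"
}

# Usage:
#   tag_fasta asm1 Assembly_1.fasta >  assemblies.fasta
#   tag_fasta asm2 Assembly_2.fasta >> assemblies.fasta
```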
You can use faFilter to filter transcripts by length (we keep those of at least 500 bp):
module load faFilter
faFilter -minSize=500 assemblies.okay.cds assemblies_500bp.cds
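If faFilter is not available on your cluster, the same minimum-length filter can be approximated with awk. This is a sketch rather than a replacement for faFilter: the function name min_len_filter is hypothetical, and each kept sequence is written on a single line instead of being wrapped.

```shell
# min_len_filter (hypothetical helper): keep only sequences whose length is
# at least the given minimum. $1 = minimum length, $2 = FASTA file
min_len_filter() {
    awk -v min="$1" '
        # Print the record just read if it is long enough.
        function emit() {
            if (hdr != "" && length(seq) >= min) { print hdr; print seq }
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }                  # join wrapped sequence lines
        END { emit() }
    ' "$2"
}

# Usage: min_len_filter 500 assemblies.okay.cds > assemblies_500bp.cds
```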
If you are interested in SuperTranscripts, Trinity includes scripts that build them from the output of the normal assembly pipeline.
The quality of a genome-guided assembly is normally limited by the quality of the genome, so this approach is worth trying when a good genome is available. The first step is to index the genome with Bowtie2 and then align the reads to it with TopHat:
module load tophat
module load bowtie2/2.4.4
srun bowtie2-build Artemia_sinica_genome_29_12_2021.fasta genome_w
tophat -p 40 \
-o tophat_all \
genome_w \
398_1.fastq \
398_2.fastq
The BAM file produced by TopHat is sorted with samtools and then used as input to Trinity:
module load java
module load samtools/1.10
module load jellyfish/2.3.0
module load bowtie2/2.4.4
module load python/3.8.5
module load salmon/0.13.1
samtools sort ~/tophat_all/accepted_hits.bam -o rnaseq.coordSorted.bam
~/Trinity/trinityrnaseq-v2.11.0/Trinity --genome_guided_bam rnaseq.coordSorted.bam --genome_guided_max_intron 10000 --max_memory 99G --CPU 20