Transcriptome_assembly_suggestions

De-novo Transcriptome Assembly:

module load java
module load samtools/1.10
module load jellyfish/2.3.0
module load bowtie2/2.4.4
module load python/3.8.5
module load salmon/0.13.1

# Run the commands through SLURM's srun

~/Trinity/trinityrnaseq-v2.11.0/Trinity --seqType fq \
    --left 39869_TTAGGC_C8W3EANXX_3_20160523B_20160523.1.fastq,39870_AGTTCC_C8W3EANXX_3_20160523B_20160523.1.fastq \
    --right 39869_TTAGGC_C8W3EANXX_3_20160523B_20160523.2.fastq,39870_AGTTCC_C8W3EANXX_3_20160523B_20160523.2.fastq \
    --CPU 20 --max_memory 99G
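
With many libraries, typing the comma-separated --left and --right lists by hand is error-prone. A minimal sketch of building them from file names, assuming the mates follow the *.1.fastq / *.2.fastq naming used above (join_by_comma is a hypothetical helper, not part of Trinity):

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    local IFS=,          # "$*" joins arguments with the first character of IFS
    printf '%s\n' "$*"
}

# Usage sketch (not run here; assumes the fastq files are in the current directory):
#   left=$(join_by_comma *.1.fastq)
#   right=$(join_by_comma *.2.fastq)
#   ~/Trinity/trinityrnaseq-v2.11.0/Trinity --seqType fq \
#       --left "$left" --right "$right" --CPU 20 --max_memory 99G
```

Shell globs expand in sorted order, so the two lists should stay paired as long as each library's mates share the same prefix. It is still worth printing both variables once to check.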

Optional: keep the longest isoform using the Trinity script

This keeps only the longest isoform per gene. It works only for Trinity assemblies, because it relies on Trinity's transcript naming scheme. The script ships with the Trinity package, but you can also find it here

perl ~/Trinity/trinityrnaseq-v2.11.0/util/misc/get_longest_isoform_seq_per_trinity_gene.pl Trinity.fasta > Trinity_longest.fasta
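
To make the "longest isoform per gene" idea concrete, here is an awk sketch of the same selection. It is an illustration only, not the Trinity script's implementation. It assumes Trinity-style headers such as ">TRINITY_DN1000_c115_g5_i1 len=247", where the gene ID is the transcript ID without its trailing "_i<N>" suffix. The function name longest_isoform_per_gene is hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: keep the longest isoform per Trinity gene.
# Usage: longest_isoform_per_gene Trinity.fasta > longest.fasta
longest_isoform_per_gene() {
    awk '
        # Compare the record just read against the best one seen for its gene.
        function flush_record() {
            if (header == "") return
            gene = id
            sub(/_i[0-9]+$/, "", gene)          # strip the isoform suffix
            if (!(gene in best_len)) order[++n] = gene
            if (!(gene in best_len) || length(seq) > best_len[gene]) {
                best_len[gene] = length(seq)
                best_hdr[gene] = header
                best_seq[gene] = seq
            }
        }
        /^>/ {
            flush_record()
            header = $0
            id = substr($1, 2)                  # first word without ">"
            seq = ""
            next
        }
        { seq = seq $0 }                        # join wrapped sequence lines
        END {
            flush_record()
            # Print genes in order of first appearance, sequence on one line.
            for (i = 1; i <= n; i++) {
                g = order[i]
                print best_hdr[g]
                print best_seq[g]
            }
        }
    ' "$1"
}
```

When two isoforms of a gene have the same length, this sketch keeps the first one it sees. For real analyses the Trinity script above remains the safer choice.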

Evigene:

A description is available here. The main thing it does is summarized in this direct quote from the same page: "EvidentialGene tr2aacds.pl is my new, "easy to use" pipeline script for processing large piles of transcript assemblies, from several methods such as Velvet/O, Trinity, Soap, .., into the most biologically useful "best" set of mRNA, classified into primary and alternate transcripts." You can run it on multiple assemblies or on just one. The output files carry a .okay.xxx extension (for instance, .okay.cds).

module load java
module load blast
module load exonerate
module load cdhit

cat Assembly_1.fasta Assembly_2.fasta > assemblies.fasta
perl /nfs/scistore18/vicosgrp/melkrewi/genome_assembly_december_2021/774.genome_guided_transcriptome_assembly/evigene/scripts/prot/tr2aacds.pl -cdnaseq assemblies.fasta
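
If the assemblies being concatenated could share sequence IDs (two separate Trinity runs, for example, can produce the same transcript names), one option is to tag the headers of each assembly before merging. A minimal sketch, assuming plain FASTA input; tag_fasta_ids and the asm1/asm2 tags are hypothetical names chosen for illustration, and whether this step is needed depends on how the assemblies were produced:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend TAG_ to every FASTA header in a file.
# The tag should contain only letters, digits, or underscores, because it is
# pasted directly into a sed expression.
# Usage: tag_fasta_ids TAG input.fasta > tagged.fasta
tag_fasta_ids() {
    sed "s/^>/>${1}_/" "$2"
}

# Usage sketch (not run here):
#   tag_fasta_ids asm1 Assembly_1.fasta >  assemblies.fasta
#   tag_fasta_ids asm2 Assembly_2.fasta >> assemblies.fasta
```

Only the header lines change; the sequences themselves are passed through untouched.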

Filtering based on length using faFilter:

You can use faFilter to filter transcripts based on length (we use 500bp):

module load faFilter
faFilter -minSize=500 assemblies.okay.cds assemblies_500bp.cds
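
If faFilter is not available on your system, the same kind of length filter can be sketched with awk. This is a rough stand-in, not a reimplementation of faFilter: it keeps records whose sequence is at least MIN bases long and writes each sequence on a single line. The function name filter_min_length is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep FASTA records whose sequence is at least MIN bases long.
# Usage: filter_min_length MIN input.fasta > output.fasta
filter_min_length() {
    awk -v min="$1" '
        # Print the record just read if it is long enough.
        function flush_record() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }
        { seq = seq $0 }                 # join wrapped sequence lines
        END { flush_record() }
    ' "$2"
}

# Usage sketch (not run here):
#   filter_min_length 500 assemblies.okay.cds > assemblies_500bp.cds
```

Comparing the number of ">" lines before and after filtering (grep -c '>' on each file) is a quick way to see how many transcripts were dropped.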

Extras:

Supertranscripts:

If you are interested in the idea of supertranscripts, Trinity includes scripts that build them from the output of the normal assembly pipeline; see here

Genome-Guided Transcriptome Assembly:

Genome-guided assemblies are normally limited by the quality of the genome, so this approach is worth trying when a good genome is available. The first step is to index the genome with Bowtie2 and then align the reads to it with TopHat:

module load tophat
module load bowtie2/2.4.4

srun bowtie2-build Artemia_sinica_genome_29_12_2021.fasta genome_w

tophat -p 40 \
    -o tophat_all \
    genome_w \
    398_1.fastq \
    398_2.fastq

The BAM file produced by TopHat is sorted with samtools and then used as input to Trinity:

module load java
module load samtools/1.10
module load jellyfish/2.3.0
module load bowtie2/2.4.4
module load python/3.8.5
module load salmon/0.13.1

samtools sort ~/tophat_all/accepted_hits.bam -o rnaseq.coordSorted.bam

~/Trinity/trinityrnaseq-v2.11.0/Trinity --genome_guided_bam rnaseq.coordSorted.bam --genome_guided_max_intron 10000 --max_memory 99G --CPU 20
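
Because the genome-guided run expects a coordinate-sorted BAM, it may be worth confirming the sort order before launching Trinity. A small sketch that inspects the BAM header as printed by samtools view -H; the helper name header_is_coordinate_sorted is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a SAM/BAM header on stdin and succeed only if
# the @HD line declares coordinate sorting (SO:coordinate).
header_is_coordinate_sorted() {
    awk '/^@HD/ && /SO:coordinate/ { found = 1 } END { exit !found }'
}

# Usage sketch (not run here; requires samtools):
#   if samtools view -H rnaseq.coordSorted.bam | header_is_coordinate_sorted; then
#       echo "BAM is coordinate-sorted"
#   else
#       echo "BAM is not marked as coordinate-sorted" >&2
#   fi
```

The check only reads the header tag; it does not verify the order of the alignment records themselves.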
