SystemsGenetics/pynome

Changes for Salmon and Kallisto

spficklin opened this issue · 1 comment

It seems Pynome is using the entire genome assembly for Salmon and Kallisto indexing rather than the cDNA sequences. The following should be used instead.

Ensembl

For both Salmon and Kallisto we need to use the cDNA file that Ensembl provides. You can find the cDNA files in an FTP directory similar to the following:

ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/oryza_sativa/cdna/

We need to retrieve the file with the suffix .cdna.all.fa.gz. The file just needs to be uncompressed and used instead of the whole genome FASTA file.
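The Ensembl step above could be sketched roughly as follows. This is only an illustration, not Pynome's actual code: the helper names `pick_cdna_file` and `decompress_gzip` are hypothetical, and the sketch assumes the directory listing has already been fetched and the chosen file already downloaded.

```python
import gzip
import shutil
from pathlib import Path

# Suffix named in the issue for the Ensembl cDNA file.
CDNA_SUFFIX = ".cdna.all.fa.gz"


def pick_cdna_file(filenames):
    """Return the one file name ending in CDNA_SUFFIX (hypothetical helper)."""
    matches = [name for name in filenames if name.endswith(CDNA_SUFFIX)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one cDNA file, found {len(matches)}")
    return matches[0]


def decompress_gzip(src, dest):
    """Decompress the gzip file at src into dest and return dest as a Path."""
    src, dest = Path(src), Path(dest)
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

The decompressed file would then be passed to the indexers in place of the whole genome FASTA file.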

NCBI

If a GTF file exists we need to download it and run the following command to create a cDNA FASTA file:

gffread -w transcripts.fa -g genome.fa transcripts.gtf

Where:

  • transcripts.fa is the name of the output FASTA file that will contain the cDNA entries.
  • genome.fa is the name of the whole genome FASTA file.
  • transcripts.gtf is the name of the GTF file.

We want to name the output file transcripts.fa to follow our naming convention for all assembly files.
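If Pynome were to call gffread from Python, a minimal wrapper might look like the sketch below. The function names are hypothetical and the output path is left to the caller. The only gffread options used are the `-w` and `-g` flags from the command above.

```python
import subprocess


def gffread_cdna_command(genome_fasta, gtf, out_fasta):
    """Build the argument list for the gffread command shown in the issue."""
    return ["gffread", "-w", str(out_fasta), "-g", str(genome_fasta), str(gtf)]


def write_cdna_fasta(genome_fasta, gtf, out_fasta):
    """Run gffread (assumed to be on PATH) and return the output path.

    Raises subprocess.CalledProcessError if gffread exits with an error.
    """
    subprocess.run(gffread_cdna_command(genome_fasta, gtf, out_fasta), check=True)
    return out_fasta
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.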

If a GTF file is not available but a GFF is, then I believe Pynome already has code to convert it to GTF. However, I think this currently happens after genome indexing. Instead, the conversion of a GFF to GTF should happen prior to indexing, and then that GTF file can be used to create a cDNA FASTA file just as described above.
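The ordering proposed here for NCBI assemblies can be written down as a small planning function. This is a sketch under assumptions: the step names are hypothetical labels, not Pynome's real task names, and the function only decides the order without running anything.

```python
def plan_ncbi_steps(has_gtf, has_gff):
    """Return hypothetical step names in the order the issue proposes."""
    steps = []
    if not has_gtf:
        if not has_gff:
            raise ValueError("need a GTF or GFF annotation to build a cDNA FASTA")
        # The GFF to GTF conversion must come before any indexing.
        steps.append("convert_gff_to_gtf")
    # The GTF (downloaded or converted) is used to create the cDNA FASTA.
    steps.append("create_cdna_fasta")
    # Indexing runs last, against the cDNA FASTA rather than the genome.
    steps.append("index_salmon")
    steps.append("index_kallisto")
    return steps
```

For example, `plan_ncbi_steps(has_gtf=False, has_gff=True)` places the conversion first, so the cDNA FASTA exists before either index is built.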

Fixed. Thanks @4ctrl-alt-del