Admin_EnsemblReferenceFileGenerator.pl not producing any result for a list of transcripts

Question

Closed this issue 4 years ago · 7 comments

Hello,
I am trying to generate a vagrent.cache file for a pre-specified list of Ensembl transcripts. The list is a plain list of IDs, one per line:

ENST00000389048
ENST00000343823
ENST00000447712
ENST00000409261
...

Then I try running this script:
perl VAGrENT/bin/Admin_EnsemblReferenceFileGenerator.pl --species human --assembly GRCh38 --database homo_sapiens_core_99_38 --ccds CCDS2Sequence.current.txt --features Homo_sapiens.GRCh38.99.gtf --cdna_fa Homo_sapiens.GRCh38.cdna.all.fa --ncrna_fa Homo_sapiens.GRCh38.ncrna.fa -output vagrent_cache/ --trans_list List_of_transcripts.tsv --fai GRCh38.fa.fai
But it produces empty files, even though the transcripts in the list are grep-able in the cDNA and GTF files. Am I doing something wrong? The GTF, cDNA and ncRNA files come from Ensembl v.99.

Answer 1 · 2020-06-12T13:38:54.000Z

Hi,

Using the dynamic download reference generation works for Ensembl 99 Human:

Admin_EnsemblReferenceFileGenerator.pl -sp Human -as GRCh38 -d homo_sapiens_core_99_38 -f ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/ -o /noddy/output/location

If I add a transcript list with your 4 transcripts, this also works and generates a cache file with the specified transcripts:

Admin_EnsemblReferenceFileGenerator.pl -sp Human -as GRCh38 -d homo_sapiens_core_99_38 -f ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/ -tl trans.list -o /noddy/output/location

Running it this way will use the gff3 annotation file on Ensembl’s FTP site in preference to the gtf file, and that might be the difference. Could you download the gff3 file and try it with that instead of the gtf file?

Andy

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

Answer 2 · 2020-06-12T14:19:45.000Z

Hi Andy,
Unfortunately, using Homo_sapiens.GRCh38.99.gff3 did not help. The script still reports that everything worked fine:

Downloading Files -------- Skipped, files locally supplied
Obtaining Filtered Transcript List ----- Skipped, files locally supplied
Building Cache Files ----- Done

But it returns a set of empty files:

-rwxrwxrwx 28 Jun 12 14:55 vagrent.human.GRCh38.homo_sapiens_core_99_38.cache.gz
-rwxrwxrwx 75 Jun 12 14:55 vagrent.human.GRCh38.homo_sapiens_core_99_38.cache.gz.tbi
-rwxrwxrwx 0 Jun 12 14:55 vagrent.human.GRCh38.homo_sapiens_core_99_38.fa

Connecting via FTP times out, which is why I tried using local files.
I am running it using the cgpwgs Docker image. Could that be the issue? Does it need the full set of references specified in the cgpwgs docker wiki for the full pipeline run in order to create the cache for VAGrENT?
Update: after installing all dependencies and using a local copy, I keep getting the same empty cache with local Ensembl files.
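
A note on the listing above: 28 bytes is the size of an empty BGZF (bgzip) end-of-file block, so the cache.gz file most likely contains no records at all. Since the script reports "Done" either way, a small check after the run can catch this. The following is only a sketch, assuming the cache is a bgzip-compressed text file with one record per line and no header lines (the .tbi index suggests this, but the thread does not confirm it). The function names and the demo files are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Number of text lines inside a gzip- or bgzip-compressed file.
count_records() {
  gzip -dc "$1" | wc -l | tr -d ' '
}

# Print the record count, or warn and return non-zero for an empty cache.
check_cache() {
  local n
  n=$(count_records "$1")
  if [ "$n" -eq 0 ]; then
    echo "WARNING: $1 decompresses to zero records" >&2
    return 1
  fi
  echo "$1: $n records"
}

# Demo on throwaway files (made-up content, not real VAGrENT output).
workdir=$(mktemp -d)
: | gzip -c > "$workdir/empty.cache.gz"
printf 'record one\nrecord two\n' | gzip -c > "$workdir/full.cache.gz"

check_cache "$workdir/full.cache.gz"
check_cache "$workdir/empty.cache.gz" || echo "empty cache detected"
```

On a real run, check_cache would be pointed at the generated vagrent.*.cache.gz file instead of the demo files.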

Answer 3 · 2020-06-15T10:26:55.000Z

Hi,

I hate to say this, but it works for me. If I get the external reference files:

wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gff3/homo_sapiens/Homo_sapiens.GRCh38.99.gff3.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS2Sequence.current.txt

And then run:

Admin_EnsemblReferenceFileGenerator.pl --species Human --assembly GRCh38 --database homo_sapiens_core_99_38 --features Homo_sapiens.GRCh38.99.gff3.gz --cdna_fa Homo_sapiens.GRCh38.cdna.all.fa.gz --ncrna_fa Homo_sapiens.GRCh38.ncrna.fa.gz --trans_list trans.list -fai genome.fa.fai -ccds CCDS2Sequence.current.txt -o out_dir/

Where my genome.fa.fai file is of the format:

chr1 248956422 112 80 81
chr2 242193529 252068602 80 81
chr3 198295559 497289663 80 81
chr4 190214555 698064029 80 81
….

And my trans.list looks like this:

ENST00000389048
ENST00000343823
ENST00000447712
ENST00000409261

I get a cache file and fasta file with content.

Question I should probably have asked at the start: which version of Vagrent are you using?

Andy

Answer 4 · 2020-06-15T15:26:48.000Z

Hi Andy,
This sounds like a mystery!

> I get a cache file and fasta file with content. Question I should probably have asked at the start, which version of Vagrent are you using?

I tried with both version 3.6.1 (cloned from this GitHub repository) and version 3.5.0 (latest release). All my files come from the same links as yours or look the same as yours.

This update somehow disappeared from my previous comment, but I wanted to add that I am able to generate a full cache (without the list of transcripts)! Do you think it is some silent Perl library misbehavior on my side?

Answer 5 · 2020-06-15T15:38:29.000Z

Possibly. Can you try running it via the docker container? That should eliminate dependency shenanigans.

https://quay.io/repository/wtsicgp/vagrent

I used the 3.6.1 container (released on May 6th) to generate my results.

Answer 6 · 2020-06-19T12:34:51.000Z

Hello Andy,
I have solved the issue! It turned out that the list of transcripts had been saved with the wrong end-of-line character, which prevented the parsing script from finding the relevant transcripts in the database. Would you consider adding a message that the cache builder would print when it does not find a transcript in the reference?
Thanks for your help and sorry for the fuss!
All the best,
Nadezda
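
For anyone hitting the same symptom, the line endings of the transcript list can be checked and normalised before running the generator. The thread does not say which end-of-line character was involved, so this sketch assumes Windows-style CRLF endings. The helper names and the demo file are hypothetical and not part of VAGrENT.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count lines that end in a carriage return (CRLF line endings).
count_crlf_lines() {
  # grep -c exits with status 1 when the count is 0, hence the "|| true".
  grep -c $'\r$' "$1" || true
}

# Write a copy of the list with all carriage returns removed.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Demo on a throwaway list saved with Windows line endings.
workdir=$(mktemp -d)
printf 'ENST00000389048\r\nENST00000343823\r\n' > "$workdir/trans.list"

echo "CRLF lines before: $(count_crlf_lines "$workdir/trans.list")"
strip_cr "$workdir/trans.list" "$workdir/trans.unix.list"
echo "CRLF lines after:  $(count_crlf_lines "$workdir/trans.unix.list")"
```

The cleaned copy is what would then be passed to --trans_list. A list using bare carriage returns as line separators (old Mac style) would need the carriage returns translated to newlines rather than deleted.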

Answer 7 · 2020-06-19T15:54:23.000Z

No problem, glad you solved it. I've been bitten by that in the past, but not in this context. I'll add a note to the wiki page.
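
Until such a message exists in the tool, the warning requested above can be approximated outside VAGrENT by comparing the transcript list with the IDs in the cDNA FASTA headers. This is only a sketch, assuming headers of the form ">ENST00000389048.8 cdna ..." with a version suffix after the ID. The function name, the demo files and the version numbers in them are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C  # byte-wise ordering so sort and comm agree

# Print every ID from the list that has no matching header in the FASTA.
list_missing() {
  local list=$1 fasta=$2
  comm -23 \
    <(sort -u "$list") \
    <(grep '^>' "$fasta" | sed -e 's/^>//' -e 's/[ .].*$//' | sort -u)
}

# Demo on throwaway files: two transcripts present, one absent.
workdir=$(mktemp -d)
printf '>ENST00000389048.8 cdna\nACGT\n>ENST00000343823.10 cdna\nACGT\n' \
  > "$workdir/cdna.fa"
printf 'ENST00000389048\nENST00000343823\nENST00000447712\n' \
  > "$workdir/trans.list"

list_missing "$workdir/trans.list" "$workdir/cdna.fa" |
  while read -r id; do
    echo "WARNING: transcript $id not found in reference" >&2
  done
```

Because the comparison is exact, a list saved with carriage returns at the end of each line would report every ID as missing, which makes this kind of line-ending problem visible immediately.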