Genome Fetcher
bkatiemills opened this issue · 18 comments
Once a match is found between some genetic code of interest and a phage via the process described in #1, it's interesting to learn about the surrounding genetic code in the phage; not just the matched string that the BLAST query returned, but what genes are nearby that sequence, as well. The BLAST query doesn't give us this; we need to find that information elsewhere.
NCBI provides a command-line tool, the SRA Toolkit, to download extensive genomic data from their open database, genbank. This is done via accession numbers, essentially unique identifiers for genomic data in their database. Furthermore, phagesdb.org kindly provides a master metadata table, found here in PhagesDB_Data.txt
that contains the accession numbers and genbank status of each of their phages.
From these resources, two python functions are needed:
- function
getAccession(name)
that takes the name of a phage, checks if it's in genbank, and if so, returns the corresponding accession number. The phage name is contained as a string in the first column ofPhagesDB_data.txt
; the genbank status is in column 4 as 'True' (if present in genbank) or 'False' (if not), and the accession number is a string in column 5. So,getAccession('Arbiter B2')
should return the string 'JN618996', whilegetAccession('Ariel J')
doesn't have a genbank entry, so should return perhaps 0. - function
fetchAccession(accession)
, which takes a string argumentaccession
and uses the SRA toolkit mentioned above to download the corresponding dataset. Once installed and working, the SRA toolkit can be used as:
prefetch <accession number>
to download the dataset corresponding toaccession number
.
Using the SRA toolkit didn't quite work out at Codefest; trying to prefetch
some of the accession numbers listed as present in PhagesDB_data.txt
were not found by the tool, so perhaps something has been misunderstood here.
Luckily, it looks like someone else has a solution for us: https://www.biostars.org/p/66921/
This solution takes advantage of Entrez searches and the biopython package, and is specifically designed for fetching stuff from genbank - getting this example up and running should allow the completion of this issue.
I have pushed a refined version of the code which depends on db pulled from phagesdb.org
I checked at ncbi search but it is returning too many results sometimes with different accession values. Does the data at phagesdb.org keeps changing which might be the reason we are looking for adaptive solution?
Can you describe fetchAccession function on what dataset we are looking to fetch and how it needs to be parsed?
It was my understanding that accession numbers should correspond to exactly one dataset in genbank - @goyalsid, please correct me if this is wrong! That being the case, for this issue we just want to download the dataset - no processing required. Issue #3 discusses the data processing we'd like to perform afterwards.
I think that biostars thread I link above has a reasonable solution - but we need to try it out.
Yep, that worked without too much trouble - a bit different than the original issue since this takes a file with accession numbers in it, but that'll do. Still need getAccession(name)
as described in the original post above.
@goyalsid, give this functionality a try and tell me if it makes sense; for usage, make a file with a single accession number in it called test.accession
, for example (I used the first one in the example file, DQ398041
). Then just do
cat test.accession | python acc2gb.py somename@somehost.com > out.gb
Where that email is whatever email you use for your NCBI account, and the data should be downloaded into the file out.gb
(may need to install Biopython first if you haven't, pip install Biopython
).
A couple questions for Sid:
- in the script I grabbed from biostars, a database name
nuccore
is hard-coded into the script. Not sure what this means or if it's appropriate. - the data type for download is hard-coded as
rettype="gb"
- again, I just took this from the biostars post, not sure if it's what we want. Presumably the entrez utility has other options, hopefully it's a simple matter to just pick the right one.
Hi Bill, I'm a Master's student in Sid's group working on bacteria-phage interactions.
"Nuccore", to the best of my knowledge, refers to the NCBI Nucleotide Database, which itself includes a few other databases. It seems like a fine choice of database, since it looks like it contains the broadest data. It's good to know that it could be changed though.
I'm not sure what rettype="gb" means though.
Cool, thanks @mbonsma for clearing that up - actually, I've been meaning to come back to this, since I was chatting with some bioinformatics people the other day, and they pointed out that it's really important to query NCBI properly (via biopython), lest we run afoul of their access policies - but as I mentioned above, biopython appears to be a dependency here, so I think we're ok.
Anyway - I'd love to chat with you sometime more broadly about this project before the Toronto event; I was guiding things day-to-day for this project last year, but am really excited to hand off to you, hear what you have planned and support you in any way possible. Get ahold of me at bill@mozillafoundation.org if you'd like to discuss!
The example @BillMills gave (cat test.accession | python acc2gb.py somename@somehost.com > out.gb
) works great. Incidentally, it seems to work regardless of what type of accession number you give it. For example, I tried NC_015138
, which is the format that the bacteria and archaea are in, and it also works.
The only thing is that it seems to be returning an abbreviated version of the data - on the website, there's an option to save just a 'Genbank' record or a 'Genbank (full)' record. What it's returning now is the Genbank record. Maybe this is what that parameter rettype=gb
is referring to (gb = Genbank?)? I'll look into it. There are some situations in which we'd want the full record (for example, to see all the coding sequences and the DNA sequence itself).
Got it - setting rettype
to rettype = "gbwithparts"
will fetch the entire record. I changed this in acc2gb.py
.
Now all that remains in this issue is the function getAccession(name)
.
Let me see if I got this straight. Can getAccession
be an independent script? So can it be done in Perl?
Yup. So far everything is modular, so each step is more or less independent. It could be done in perl. And just to clarify, this would be using the file PhagesDB_data.txt
.
Great, thanks!
2015-06-05 17:09 GMT+02:00 mbonsma notifications@github.com:
Yup. So far everything is modular, so each step is more or less
independent. It could be done in perl. And just to clarify, this would be
using the file PhagesDB_data.txt.—
Reply to this email directly or view it on GitHub
#2 (comment).
JJ
getAccession.py
works well, but it would be good to rig it so that it reads in a list of names and outputs a file - similar to the functioning of acc2gb.py
.
My most recent comment - no, that hasn't been done yet.
I think getAccession
is good as it exists now, and if we happen to need to modify it to work with a future workflow we can do that (and make a new issue). So, I'm closing this as finished. Thanks @shirishgoyal, @JJ, and @BillMills for all the awesome work!
Thanks, Madeleine.
2016-06-02 15:58 GMT+02:00 Madeleine Bonsma notifications@github.com:
I think getAccession is good as it exists now, and if we happen to need
to modify it to work with a future workflow we can do that (and make a new
issue). So, I'm closing this as finished. Thanks @shirishgoyal
https://github.com/shirishgoyal, @JJ https://github.com/JJ, and
@BillMills https://github.com/BillMills for all the awesome work!—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#2 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAAB9IZf-sZVgQfF6GDjgqI2FZ8sP4n3ks5qHuGWgaJpZM4CZdDz
.
JJ