pachterlab/ffq

False metadata fetched when using BioSample accession number

trickovicmatija opened this issue · 8 comments

Hey,

Nice and useful tool! I am trying to use it to fetch metadata from a BioSample accession (unfortunately, the information I am interested in can only be found in the BioSample entry, not under ERS, ERR, ...), and it returns a completely different sample. When I use ERS and ERR numbers, it works normally!
Some of the SAMEA numbers I am interested in: SAMEA3449213, SAMEA3449357, SAMEA3449368 ...
Maybe I am doing something wrong. I installed it with pip (as in the README) and just use ffq SAMEA

Thanks!

Hello trickovicmatija, thanks for the kind words and for raising this issue. You are not doing anything wrong. We have looked into this and this seems to be due to an inconsistency between the NCBI website and the NCBI API, and not with how ffq fetches metadata.

For example, the NCBI website for SAMEA3449213 displays the correct metadata: https://www.ncbi.nlm.nih.gov/biosample/?term=SAMEA3449213

However, NCBI's API returns incorrect metadata for that accession (which is what you are seeing in ffq): https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=biosample&id=SAMEA3449213&rettype=fasta&retmode=xml

Additionally, the ENA API, which we use for some accessions, returns the correct metadata (https://www.ebi.ac.uk/ena/browser/api/xml/SAMEA3449213).

So this appears to be an undocumented problem with the NCBI API and not with how ffq retrieves metadata. We are working out which API to rely on when situations like this arise (which in our experience has been quite rare). We are also in the process of getting in touch with NCBI to alert them to this inconsistency. We have added this to ffq's enhancements issue (#21) and will close this issue for now since this is an NCBI issue, but we will keep you posted on their response. Thanks again!

Hi there,

I just hit this and the behaviour was pretty confusing - I was just harvesting the URLs returned and downloading them, and when in subsequent analysis nothing mapped to my target genome I thought it must be an issue in my mapping and spent considerable time troubleshooting that. It looks like it applies to all SAMEA samples. You could consider rejecting these until the fix is in at NCBI, or throwing an error if the NCBI API returns an accession that does not match.
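The check suggested above (fail loudly when the response does not match the requested accession) could be sketched roughly as follows. This is not ffq's implementation: `AccessionMismatchError` and `assert_accession_present` are hypothetical names, and the check deliberately avoids assuming any particular XML schema by looking for the requested accession anywhere in the attribute values or element text.

```python
import xml.etree.ElementTree as ET


class AccessionMismatchError(ValueError):
    """Raised when a response never mentions the requested accession."""


def assert_accession_present(requested: str, xml_text: str) -> None:
    """Raise if `requested` is absent from every attribute value and
    element text in the XML document. Schema-agnostic by design."""
    root = ET.fromstring(xml_text)
    seen = set()
    for elem in root.iter():
        # Collect attribute values such as accession="..."
        seen.update(value.strip() for value in elem.attrib.values())
        # Collect element text such as <id>...</id>
        if elem.text and elem.text.strip():
            seen.add(elem.text.strip())
    if requested not in seen:
        raise AccessionMismatchError(
            f"requested {requested!r} but the response does not mention it"
        )
```

Calling this on the raw response before harvesting any URLs would turn a silent mix-up into an immediate error.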

Thanks for this tool, which in general will be really useful

Hello theosanderson. We are very sorry this happened to you. We agree with your comment, and in the devel branch we have switched the API used to retrieve BioSamples from NCBI to ENA. Would you mind posting the accessions that gave you trouble? We want to test this further before merging these changes to master. Thanks a lot and our apologies again!

Thanks @agalvezm -- no worries at all, just wanted to highlight that it was having downstream effects.

Sure, here is an arbitrary set:

SAMEA14281402
SAMEA14281720
SAMEA14207017
SAMEA14281736
SAMEA11855741
SAMEA14281742
SAMEA14207018
SAMEA14207029
SAMEA14206896

It looks like that NCBI endpoint is designed to return only NCBI SAMN IDs and that it is parsing the integer bit of SAMEA ids and returning the SAMNxxxx equivalent. ENA appears to cope with both SAMEA and SAMN correctly. Thanks for making this change!
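To illustrate the suspected behaviour (this is only the hypothesis stated above, not confirmed NCBI internals), here is a small sketch showing how two accessions with different prefixes collapse to the same value once only the integer part is kept. The function names are hypothetical, and the zero-padded `SAMN` accession is a made-up example for illustration.

```python
import re


def split_accession(accession: str) -> tuple[str, int]:
    """Split an accession such as 'SAMEA3449213' into its letter
    prefix and its integer part."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a prefix+digits accession: {accession!r}")
    return match.group(1), int(match.group(2))


def same_integer_part(a: str, b: str) -> bool:
    """True when two accessions differ in prefix but share the integer part,
    which is the collision a digits-only lookup cannot distinguish."""
    prefix_a, number_a = split_accession(a)
    prefix_b, number_b = split_accession(b)
    return prefix_a != prefix_b and number_a == number_b


# Hypothetical pair: once the prefix is dropped, both reduce to 3449213.
print(same_integer_part("SAMEA3449213", "SAMN03449213"))
```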

For additional clarity, I have included parts of the correspondence between us and the NCBI team about this issue:

To: sra@ncbi.nlm.nih.gov

Hello,

Some of our users have found that E-utils returns the wrong metadata for BioSample SAMEA accessions. For example, for accession SAMEA3449213, there is a mismatch between the metadata available on the website (https://www.ncbi.nlm.nih.gov/biosample/?term=SAMEA3449213) and that available from E-utils (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=biosample&id=SAMEA3449213&rettype=fasta&retmode=xml); the latter is incorrect. This problem seems to apply to all SAMEA accessions, and it looks like E-utils is parsing the integer bit of SAMEA ids and returning the SAMNxxxx equivalent instead.

We wanted to bring this problem to your attention in case you could take a look at it. Thank you so much for your help!

From: sra@ncbi.nlm.nih.gov

Dear Colleague,

Unfortunately this is a known bug with no fix. SAMEA records come from ENA (our European partners), and because of indexing differences between NCBI records and ENA/EBI records, our API cannot search them properly. The web works because it can search the raw text and therefore finds a match.

Your option is to obtain the GIs, which will work with efetch, e.g. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=29004985&db=biosample which is another SAMEA record. But obtaining the GIs cannot be done programmatically if your starting point is SAMEA identifiers. Your better option is to use EBI’s api: https://www.ebi.ac.uk/ena/browser/api/
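For readers who want to follow that recommendation directly, a minimal sketch of fetching a record from the ENA browser API is shown below. It uses the `/xml/<accession>` endpoint quoted earlier in this thread. The function names are hypothetical, the sketch is not how ffq does it, and it makes no assumptions about the structure of the returned XML.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint taken from the ENA URL quoted earlier in this thread.
ENA_XML_ENDPOINT = "https://www.ebi.ac.uk/ena/browser/api/xml/"


def ena_xml_url(accession: str) -> str:
    """Build the ENA browser API URL for one accession."""
    return ENA_XML_ENDPOINT + quote(accession.strip(), safe="")


def fetch_ena_xml(accession: str, timeout: float = 30.0) -> str:
    """Download the XML record for `accession` as text (needs network access).
    HTTP errors are raised by urlopen as urllib.error.HTTPError."""
    with urlopen(ena_xml_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_ena_xml("SAMEA3449213")` would request the same URL linked above.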

If I understand correctly, this is still a major issue, isn't it?
It is so annoying that NCBI cannot provide a stable API.

It's a major issue with the NCBI API, but it is not an issue at all for ffq, since we switched to using the ENA API for stability (see the devel branch):

pip install git+https://github.com/pachterlab/ffq.git@devel

Can I ask again: is this feature implemented in a released version?