Establish a BioThings API for access to PFOCR data

Question

Closed this issue 5 years ago · 31 comments

Create an API compliant with the SmartAPI standard using the BioThings SDK.

Answer 1 · 2020-01-22T19:54:00.000Z

The first step here would be to create a JSON file with some canonical gene ID (Entrez Gene or Ensembl) as the top-level key, and then an array of objects corresponding to the pathways that gene is a member of. Any reasonable object structure is probably fine, but it would be good to confirm that with @kevinxin90 before you lock it in.
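
For illustration, here is a minimal sketch of the gene-keyed layout described above. The input records and the `figure_id` field name are hypothetical placeholders, not the actual PFOCR export format.

```python
import json
from collections import defaultdict

# Hypothetical input: one record per pathway figure with its Entrez Gene IDs.
figures = [
    {"figure_id": "fig_a", "genes": ["1000", "207"]},
    {"figure_id": "fig_b", "genes": ["207", "208"]},
]


def group_by_gene(records):
    """Return {gene_id: [pathway objects]} with the gene ID as the top-level key."""
    by_gene = defaultdict(list)
    for record in records:
        for gene in record["genes"]:
            by_gene[gene].append({"figure_id": record["figure_id"]})
    return dict(by_gene)


gene_index = group_by_gene(figures)
print(json.dumps(gene_index, indent=2))
```

Each gene maps to an array of objects, so more pathway attributes can be added to those objects later without changing the top-level keys.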

@kevinxin90: if we want two endpoints in the API -- one to query by gene ID and one to query by pathway ID -- do we need two separate JSON files? Or can that be handled as part of the BioThings API creation?

Answer 2 · 2020-01-28T03:10:49.000Z

Here is a sample JSON file:
{"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384], "additionalData": "..."}}

For all BioThings APIs, we need a “_id” for each document serving as the primary key. In your case, it might be the PMC ID related to the gene-linked pathway figures.

That “_id” has to be unique. In case you have multiple records for the same PMC ID, you could structure your output JSON as below:

{"_id": "PMC5395363", "associatedWith": [{"genes": [1000, 207, 208, 51384], "additionalData": "..."}, {"genes": [1033, 84], "additionalData": "..."}]}

The file you send us could be a list (each element being a JSON document) or a text file with one JSON document per line.
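
As a rough sketch of the one-document-per-line option, the snippet below writes and reads newline-delimited JSON and checks that every `_id` is unique. The helper names are hypothetical, and the document values are illustrative ones reused from this thread, not real PFOCR records.

```python
import io
import json

# Illustrative documents only; the pairing of IDs and genes is not real data.
docs = [
    {"_id": "PMC5395363", "associatedWith": [
        {"genes": [1000, 207, 208, 51384]},
        {"genes": [1033, 84]},
    ]},
    {"_id": "PMC2735650", "associatedWith": [{"genes": [1000, 207]}]},
]


def dump_ndjson(documents, handle):
    """Write one JSON document per line."""
    for doc in documents:
        handle.write(json.dumps(doc) + "\n")


def load_ndjson(handle):
    """Read one JSON document per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def find_duplicate_ids(documents):
    """Return the sorted list of _id values that occur more than once."""
    seen, duplicates = set(), set()
    for doc in documents:
        if doc["_id"] in seen:
            duplicates.add(doc["_id"])
        seen.add(doc["_id"])
    return sorted(duplicates)


buffer = io.StringIO()  # stands in for a file opened in text mode
dump_ndjson(docs, buffer)
buffer.seek(0)
loaded = load_ndjson(buffer)
print(loaded == docs, find_duplicate_ids(loaded))
```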

Answer 3 · 2020-01-28T04:33:58.000Z

In an email back in October 2019, I also asked

would it be possible to capture the link directly to the image? e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5729535/figure/fig1/

to which Alex replied

Yes, we can generate links directly to the image using this URL pattern: https://www.ncbi.nlm.nih.gov/pmc/articles/<PMCID>/bin/<FIG_FILENAME>

I think whatever reasonable way you propose of adding that information to the JSON document would be fine.
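
The URL pattern quoted above could be filled in with a small helper like the following. The function name is hypothetical, the example values come from elsewhere in this thread, and the sketch does not check whether the resulting URL resolves.

```python
def figure_image_url(pmcid, fig_filename):
    """Fill in the <PMCID>/bin/<FIG_FILENAME> pattern quoted in this thread."""
    return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/bin/{fig_filename}"


url = figure_image_url("PMC3955956", "nihms531061f6.jpg")
print(url)
```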

Answer 4 · 2020-01-28T18:06:32.000Z

@kevinxin90 and @andrewsu, how would you prefer that we include information on which figure the genes are mentioned in? For example, PMC3955956 has a Figure 6 with filename nihms531061f6.jpg.

Answer 5 · 2020-01-28T18:07:56.000Z

I think whatever reasonable way you propose of adding that information to the JSON document would be fine.

Do we still want to group the results by PMC ID?

Answer 6 · 2020-01-28T18:13:13.000Z

I assume one PMC ID corresponds to one figure, right?
If that's the case, we could just structure the JSON output as below:
{"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384], "figure": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/figure/nihm531061f6.jpg/"}}

Answer 7 · 2020-01-28T18:21:52.000Z

One PMC ID corresponds to one paper, so there can be multiple ~~PMC IDs per figure~~ figures per PMC ID. I exported an initial draft of the results as this table. You can preview a sample in the last table in this notebook.

Answer 8 · 2020-01-29T17:49:38.000Z

@ariutta I'm a little confused here. Could you clarify why a figure would correspond to multiple papers?
If a figure might be related to multiple PMC IDs, and each PMC ID is related to all the genes in that figure, we could then structure the JSON output as below:
{"_id": "pcbi.1000512.g004", "associatedWith": {"genes": [1000, 207, 208, 51384], "pmc": ["PMC2735650", "PMC2735651"], "figure_url": ["https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735650/bin/pcbi.1000512.g004.jpg", "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735651/bin/pcbi.1000512.g004.jpg"]}}

Answer 9 · 2020-02-05T22:11:09.000Z

@ariutta Hi Anders, I wanted to follow up: are there any updates on this issue? Thanks!

Answer 10 · 2020-02-07T20:52:26.000Z

You're right, I reversed them! It should be multiple figures per paper.

Answer 11 · 2020-02-07T21:03:11.000Z

How about this format?

{"_id": "PMC5395363__nihm531061f6",
 "associatedWith": {
  "genes": ["1000", "207", "208", "51384"],
  "pmc": "PMC5395363",
  "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/bin/nihm531061f6.jpg" }
}
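
A minimal sketch of how one document in this format might be assembled from a PMC ID, a figure filename, and a gene list. The function name and the `keep_extension` flag are hypothetical; the flag only controls whether the file extension stays in `_id`.

```python
import os


def make_document(pmcid, fig_filename, genes, keep_extension=False):
    """Build one per-figure document keyed by '<PMCID>__<figure name>'."""
    figure_key = fig_filename if keep_extension else os.path.splitext(fig_filename)[0]
    return {
        "_id": f"{pmcid}__{figure_key}",
        "associatedWith": {
            # Gene IDs are stored as strings, as in the example above.
            "genes": [str(gene) for gene in genes],
            "pmc": pmcid,
            "figureUrl": f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/bin/{fig_filename}",
        },
    }


doc = make_document("PMC5395363", "nihm531061f6.jpg", [1000, 207, 208, 51384])
print(doc["_id"])
```

Combining the PMC ID with the figure name keeps `_id` unique even when one paper contributes several figures.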

Answer 12 · 2020-02-10T18:26:00.000Z

@ariutta Hi Anders, thanks for the clarification. That structure looks great to me! We can proceed with setting up an API for it once the JSON file is ready. Thank you!

Answer 13 · 2020-02-15T00:47:08.000Z

Add "pmid" as well.

Answer 14 · 2020-02-15T00:49:24.000Z

Note @ariutta: the example figureUrl is one of the ones that does not resolve. I'm assuming it's just a bad example?

Answer 15 · 2020-02-15T00:59:30.000Z

@kevinxin90 We can give you a smaller JSON on Monday and then provide a much larger one a week later. Or we could just wait and provide the larger one later. Do you have a preference? Would it just be "busy work" to process our JSON twice, or would you prefer to solidify the path on an early version of the file and then re-process again later?

Answer 16 · 2020-02-19T04:42:55.000Z

@AlexanderPico Hi Alex, sorry I just saw this thread. It's fine to provide the final JSON when everything is ready. The parser should be very straightforward.

Answer 17 · 2020-02-21T22:45:42.000Z

Here's a newline delimited JSON file with the format from my earlier comment:
https://www.dropbox.com/s/cbtamwk9u0xdhgo/pfocr_biothings.ndjson?dl=0

These are from figures that our system has identified as pathways and that had at least three recognized genes in our OCR process.

Answer 18 · 2020-02-21T22:49:17.000Z

@AlexanderPico I'll work on adding the PMID soon, but I wasn't able to get it into this file.

@kevinxin90 we will have additional results coming, probably next week.

Answer 19 · 2020-02-21T23:02:00.000Z

@kevinxin90 for _id, I left the .jpg extension on. Also, I made genes strings instead of numbers. If you want either of these changed, just let me know.

Answer 20 · 2020-02-28T23:58:37.000Z

@andrewsu and @kevinxin90, I'm going to mark this as done. Here are summary stats for our exported file pfocr_biothings.ndjson (the same file I mentioned above):

figure source: pfocr20191102_93k. These were the figures we collected on 2019-11-02, limited to the top 93,000 (as sorted by PMC relevance score) from our figure query.
OCR: Google Cloud Vision (GCV), performed in January
Image classification: GCV AutoML model trained, validated and tested on a set of 10k figures manually labeled as pathway or other. Performed in February. Yielded a pathway score between 0 (not a pathway) and 1 (is a pathway).

The results were further limited to the 33,179 figures that both:

had a pathway score greater than 0.5
mentioned three or more recognized human genes

From these figures, we recognized:

736,260 total genes
12,201 unique genes
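
The two inclusion criteria and the summary counts above can be sketched on toy data as follows. The records, field names, and function names are hypothetical, and counting distinct genes per figure toward the three-gene threshold is an assumption; this is not the actual PFOCR pipeline.

```python
# Hypothetical toy records; real scores came from the GCV AutoML classifier.
figures = [
    {"id": "a", "pathway_score": 0.9, "genes": ["1000", "207", "208"]},
    {"id": "b", "pathway_score": 0.4, "genes": ["1000", "207", "208", "84"]},
    {"id": "c", "pathway_score": 0.8, "genes": ["1000", "207"]},
    {"id": "d", "pathway_score": 0.7, "genes": ["207", "208", "1033", "84"]},
]


def select_pathway_figures(records, min_score=0.5, min_genes=3):
    """Keep figures scoring above min_score with at least min_genes distinct genes."""
    return [
        record for record in records
        if record["pathway_score"] > min_score and len(set(record["genes"])) >= min_genes
    ]


def summarize(records):
    """Count figures, total gene mentions, and unique genes."""
    all_genes = [gene for record in records for gene in record["genes"]]
    return {
        "figures": len(records),
        "total_genes": len(all_genes),
        "unique_genes": len(set(all_genes)),
    }


selected = select_pathway_figures(figures)
summary = summarize(selected)
print(summary)
```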

If you'd like, we can provide you with additional hits we got when we removed the limitation of "top 93,000 (as sorted by PMC relevance score)".

Answer 21 · 2020-02-29T01:15:09.000Z

I think the ball is now in Kevin's court, but I don't think this issue should be closed until we actually complete the stated milestone, i.e., the creation of the BioThings API to serve PFOCR data. I know Kevin is working on this, and it should be done in the next week or so.

(Minor issue, but does the latest version have PMIDs? Would be a nice-to-have, but obviously the PMCIDs will suffice too...)

Answer 22 · 2020-02-29T04:20:04.000Z

Roger that! Anders has PMIDs in the pipeline. They will be a part of all future depositions. We wanted to give you the file ASAP to meet the milestone. We can update it later next week if you think PMIDs will be helpful for the Segment 1 demo. Otherwise, we'll save it for the next update, which will include a ton more content early in Segment 2.

Answer 23 · 2020-02-29T04:54:27.000Z

I think PMIDs can wait until segment 2. Once Kevin creates the API based on the dropbox file linked above, we'll ask you to check it out. After we're all happy with it, we can close this ticket...

Answer 24 · 2020-03-02T17:51:06.000Z

@andrewsu @AlexanderPico @ariutta
The PFOCR API is live now. Please check it out; here are some query examples:

Query for a specific figure: https://pending.biothings.io/pfocr/geneset/PMC100008__mb2411709009.jpg
Query for a specific gene:
https://pending.biothings.io/pfocr/query?q=associatedWith.genes:107
Query for a specific PMC ID:
https://pending.biothings.io/pfocr/query?q=associatedWith.pmc:PMC2494582
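
The three query shapes above can be generated programmatically. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and `fetch_json` is defined but not called because the pending endpoint may not stay available at this address.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://pending.biothings.io/pfocr"


def figure_url(figure_id):
    """URL for a single figure document, looked up by its _id."""
    return f"{BASE_URL}/geneset/{quote(figure_id, safe='')}"


def field_query_url(field, value):
    """URL for a 'field:value' query such as associatedWith.genes:107."""
    return f"{BASE_URL}/query?" + urlencode({"q": f"{field}:{value}"}, safe=":")


def fetch_json(url, timeout=10):
    """Fetch and decode a JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(figure_url("PMC100008__mb2411709009.jpg"))
print(field_query_url("associatedWith.genes", "107"))
print(field_query_url("associatedWith.pmc", "PMC2494582"))
```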

Answer 25 · 2020-03-02T18:52:15.000Z

Cool! I verified a handful of queries. It looks good to me.

Answer 26 · 2020-03-02T19:37:35.000Z

Looks good!

Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?

Answer 27 · 2020-03-02T20:27:48.000Z

Nice! I'll also note that multi-gene queries work, e.g., https://pending.biothings.io/pfocr/query?q=associatedWith.genes:27115%20AND%20associatedWith.genes:55811. I think this will work out perfectly for how we envisioned a second layer of BTE prioritization. For example, suppose that for a given query, BTE comes back with ~100 reasoning chains (expressed as paths of biomedical entities). We could query PFOCR to look for pathway figures that contain several of those entities. Right now the API is limited to genes, but later we will expand to other entity types as well...

Before the demo, we should try to flesh out an example like this in one of our example notebooks (PREDICT_demo, EXPLAIN_demo, or tidbit 2).
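
As a starting point for such an example, here is a minimal sketch of the multi-gene idea. It builds the `AND` query URL shown above and, separately, ranks locally held documents by how many query genes each figure contains. The function names, the toy documents, and the overlap-count ranking are assumptions for illustration, not how BTE actually prioritizes results.

```python
from urllib.parse import quote

BASE_URL = "https://pending.biothings.io/pfocr"


def multi_gene_query_url(gene_ids):
    """Build a query URL requiring every listed gene to appear in a figure."""
    expression = " AND ".join(f"associatedWith.genes:{gene}" for gene in gene_ids)
    return f"{BASE_URL}/query?q={quote(expression, safe=':')}"


def rank_by_overlap(documents, entity_ids, min_overlap=2):
    """Rank figures by the number of query genes they contain (hypothetical scoring)."""
    wanted = {str(entity) for entity in entity_ids}
    scored = []
    for doc in documents:
        overlap = len(wanted & {str(gene) for gene in doc["associatedWith"]["genes"]})
        if overlap >= min_overlap:
            scored.append((doc["_id"], overlap))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy documents in the thread's format; IDs and gene lists are made up.
documents = [
    {"_id": "fig_a", "associatedWith": {"genes": ["27115", "55811", "107"]}},
    {"_id": "fig_b", "associatedWith": {"genes": ["27115", "84"]}},
    {"_id": "fig_c", "associatedWith": {"genes": ["55811", "27115"]}},
]

url = multi_gene_query_url(["27115", "55811"])
ranking = rank_by_overlap(documents, ["27115", "55811"])
print(url)
print(ranking)
```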

Answer 28 · 2020-03-06T18:06:37.000Z

@AlexanderPico @kevinxin90 I assume the answer to this question is "no, we'll use pmc and pmid":

Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?

Answer 29 · 2020-03-06T18:16:15.000Z

@ariutta Hi Anders, my bad. I missed this thread. I think "pmc" is fine. Normally, we label "pmid" as "pubmed". Thanks!

Answer 30 · 2020-03-10T17:57:12.000Z

@andrewsu APIs confirmed. Is this one ready to close? Do you need anything on this one for the demo?

Answer 31 · 2020-03-10T18:08:37.000Z

I think we are good, please close it out! (We are trying to put together a notebook that demonstrates the use of the PFOCR API with BTE for the demo next week. If anyone on your end has bandwidth to work on that, let me know!)