Establish a BioThings API for access to PFOCR data
Closed this issue · 31 comments
Create an API compliant with the SmartAPI standard using the BioThings SDK.
The first step here would be to create a JSON file with some canonical gene ID (Entrez Gene or Ensembl) as the top-level key, and then an array of objects corresponding to the pathways that gene is a member of. Any reasonable object structure is probably fine, but it would be good to confirm that with @kevinxin90 before you lock it in.
@kevinxin90: if we want two endpoints in the API -- one to query by gene ID and one to query by pathway ID -- do we need two separate JSON files? Or can that be handled as part of the BioThings API creation?
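Independent of how BioThings ends up handling the two endpoints, the gene-keyed layout described above already contains the pathway-keyed view, so a single source file can be enough. A minimal sketch of the inversion, assuming each pathway object carries a `pathway_id` field (a hypothetical name, not something agreed in this thread):

```python
from collections import defaultdict


def invert(gene_to_pathways):
    """Turn {gene_id: [pathway objects]} into {pathway_id: [gene_ids]}.

    Assumes every pathway object has a "pathway_id" key (hypothetical).
    """
    pathway_to_genes = defaultdict(list)
    for gene_id, pathways in gene_to_pathways.items():
        for pathway in pathways:
            pathway_to_genes[pathway["pathway_id"]].append(gene_id)
    return dict(pathway_to_genes)


# Toy input; the pathway IDs are placeholders, not real PFOCR data.
by_gene = {
    "1000": [{"pathway_id": "P1"}, {"pathway_id": "P2"}],
    "207": [{"pathway_id": "P1"}],
}
by_pathway = invert(by_gene)
```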
Here is a sample JSON file:
{"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384], "additionalData": "..."}}
For all BioThings APIs, we need a “_id” for each document serving as the primary key. In your case, it might be the PMC ID related to the gene-linked pathway figures.
That “_id” has to be unique. In case you have multiple records for the same PMC ID, you could structure your output JSON as below:
{"_id": "PMC5395363", "associatedWith": [{"genes": [1000, 207, 208, 51384], "additionalData": "..."}, {"genes": [1033, 84], "additionalData": "..."}]}
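One way to get from several records per PMC ID to a single document with a unique `_id` is to group on `_id` and collect the entries into a list. This is only a sketch: `group_by_id` is a hypothetical helper, and it uses the `associatedWith` spelling that appears later in the thread.

```python
from collections import defaultdict


def group_by_id(records):
    """Merge records that share an _id into one document per _id."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["_id"]].append(record["associatedWith"])
    # One document per unique _id; associatedWith becomes a list of entries.
    return [
        {"_id": _id, "associatedWith": entries}
        for _id, entries in grouped.items()
    ]


records = [
    {"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384]}},
    {"_id": "PMC5395363", "associatedWith": {"genes": [1033, 84]}},
]
docs = group_by_id(records)
```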
The file you sent us could be a list (each element is a JSON document) or a text file with each line representing a JSON document.
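Both accepted layouts can be read with the standard library. The sketch below is an illustration rather than the actual BioThings parser; `load_documents` is a hypothetical name, and it decides between the two layouts by checking whether the text starts with `[`.

```python
import json


def load_documents(text):
    """Parse either a JSON array of documents or newline-delimited JSON."""
    stripped = text.strip()
    if stripped.startswith("["):
        # Whole file is one JSON list of documents.
        return json.loads(stripped)
    # Otherwise treat each non-empty line as its own JSON document.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


as_list = '[{"_id": "a"}, {"_id": "b"}]'
as_lines = '{"_id": "a"}\n{"_id": "b"}\n'
same = load_documents(as_list) == load_documents(as_lines)
```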
In an email back in October 2019, I also asked
would it be possible to capture the link directly to the image? e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5729535/figure/fig1/
to which Alex replied
Yes, we can generate links directly to the image using this URL pattern: https://www.ncbi.nlm.nih.gov/pmc/articles/<PMCID>/bin/<FIG_FILENAME>
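The URL pattern Alex describes is a plain string template. A small sketch, with `figure_url` as a hypothetical helper name; it only formats the string and does not check that the resulting link resolves.

```python
def figure_url(pmcid, fig_filename):
    """Fill in the <PMCID>/bin/<FIG_FILENAME> pattern quoted above."""
    return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/bin/{fig_filename}"


# PMC ID and filename taken from the Figure 6 example mentioned in this thread.
url = figure_url("PMC3955956", "nihms531061f6.jpg")
```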
I think whatever reasonable way you propose of adding that information to the JSON document would be fine.
@kevinxin90 and @andrewsu, how would you prefer that we include information on which figure the genes are mentioned in? For example, PMC3955956 has a Figure 6 with filename nihms531061f6.jpg.
Do we still want to group the results by PMC ID?
I assume one PMC ID corresponds to one figure, right?
If that's the case, we could just structure the JSON output as below:
{"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384], "figure": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/figure/nihm531061f6.jpg/"}}
One PMC ID corresponds to one paper, so there can be multiple figures per PMC ID (edited: I originally wrote "multiple PMC IDs per figure"). I exported an initial draft of the results as this table. You can preview a sample in the last table in this notebook.
@ariutta I'm a little bit confused here. Could you clarify why a figure would correspond to multiple papers?
If a figure might be related to multiple PMC IDs, and each PMC ID is related to all the genes in that figure, we could then structure the JSON output as below:
{"_id": "pcbi.1000512.g004", "associatedWith": {"genes": [1000, 207, 208, 51384], "pmc": ["PMC2735650", "PMC2735651"], "figure_url": ["https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735650/bin/pcbi.1000512.g004.jpg", "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735651/bin/pcbi.1000512.g004.jpg"]}}
@ariutta Hi Anders, I wanted to follow up and see if there are any updates on this issue. Thanks!
You're right, I reversed them! It should be multiple figures per paper.
How about this format?
{"_id": "PMC5395363__nihm531061f6",
"associatedWith": {
"genes": ["1000", "207", "208", "51384"],
"pmc": "PMC5395363",
"figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/bin/nihm531061f6.jpg" }
}
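For illustration, a document in this shape could be assembled and serialized as one NDJSON line roughly as follows. `make_document` is a hypothetical helper, and dropping the file extension from `_id` simply mirrors the example above; it is not a settled rule.

```python
import json
import os


def make_document(pmcid, fig_filename, genes):
    """Build one per-figure document in the format proposed above."""
    stem, _ext = os.path.splitext(fig_filename)  # drop ".jpg" for the _id
    return {
        "_id": f"{pmcid}__{stem}",
        "associatedWith": {
            "genes": [str(g) for g in genes],  # gene IDs stored as strings
            "pmc": pmcid,
            "figureUrl": (
                "https://www.ncbi.nlm.nih.gov/pmc/articles/"
                f"{pmcid}/bin/{fig_filename}"
            ),
        },
    }


doc = make_document("PMC5395363", "nihm531061f6.jpg", [1000, 207, 208, 51384])
ndjson_line = json.dumps(doc)  # one document per line in the output file
```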
@ariutta Hi Anders, thanks for the clarification. That structure looks great to me! We can proceed setting up an API for it once the JSON file is ready. Thank you!
Add "pmid" as well.
Note @ariutta: the example figureUrl is one of the ones that does not resolve. I'm assuming it's just a bad example?
@kevinxin90 We can give you a smaller JSON on Monday and then provide a much larger one a week later. Or we could just wait and provide the larger one later. Do you have a preference? Would it just be "busy work" to process our JSON twice, or would you prefer to solidify the path on an early version of the file and then re-process again later?
@AlexanderPico Hi Alex, sorry I just saw this thread. It's fine to provide the final JSON when everything is ready. The parser should be very straightforward.
Here's a newline delimited JSON file with the format from my earlier comment:
https://www.dropbox.com/s/cbtamwk9u0xdhgo/pfocr_biothings.ndjson?dl=0
These are from figures that our system has identified as pathways and that had at least three recognized genes in our OCR process.
@AlexanderPico I'll work on adding the PMID soon, but I wasn't able to get it into this file.
@kevinxin90 we will have additional results coming, probably next week.
@kevinxin90 for _id, I left the .jpg extension on. Also, I made genes strings instead of numbers. If you want either of these changed, just let me know.
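If either choice did need changing, it could also be handled on the parser side instead of re-exporting the file. A sketch under that assumption; `normalize` is a hypothetical helper and the sample document below uses a placeholder gene list, not real data for that figure.

```python
def normalize(doc):
    """Return a copy with ".jpg" stripped from _id and gene IDs as integers."""
    _id = doc["_id"]
    if _id.endswith(".jpg"):
        _id = _id[: -len(".jpg")]
    assoc = dict(doc["associatedWith"])  # shallow copy; input is left untouched
    assoc["genes"] = [int(g) for g in assoc["genes"]]
    return {"_id": _id, "associatedWith": assoc}


# Illustrative input only: the gene list is a placeholder.
sample = {
    "_id": "PMC100008__mb2411709009.jpg",
    "associatedWith": {"genes": ["1000", "207"], "pmc": "PMC100008"},
}
cleaned = normalize(sample)
```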
@andrewsu and @kevinxin90, I'm going to mark this as done. Here are summary stats for our exported file pfocr_biothings.ndjson (the same file I mentioned above):
- figure source: pfocr20191102_93k. These were the figures we collected on 2019-11-02, limited to the top 93,000 (as sorted by PMC relevance score) from our figure query.
- OCR: Google Cloud Vision (GCV), performed in January.
- Image classification: GCV AutoML model trained, validated and tested on a set of 10k figures manually labeled as pathway or other. Performed in February. Yielded a pathway score between 0 (not a pathway) and 1 (is a pathway).
The results were further limited to the 33,179 figures that both:
- had a pathway score greater than 0.5
- mentioned three or more recognized human genes
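The two criteria above amount to a simple predicate over per-figure records. The sketch below is not the actual PFOCR pipeline code: the field names `pathway_score` and `genes` are hypothetical, and counting distinct genes (rather than every mention) is an assumption about what "three or more" means.

```python
def passes_filters(figure, min_score=0.5, min_genes=3):
    """Keep figures scored above min_score with at least min_genes distinct genes."""
    distinct_genes = set(figure["genes"])
    return figure["pathway_score"] > min_score and len(distinct_genes) >= min_genes


# Toy records; scores and gene IDs are made up for illustration.
figures = [
    {"id": "a", "pathway_score": 0.9, "genes": ["1", "2", "3"]},
    {"id": "b", "pathway_score": 0.5, "genes": ["1", "2", "3"]},  # score not > 0.5
    {"id": "c", "pathway_score": 0.8, "genes": ["1", "2"]},       # too few genes
    {"id": "d", "pathway_score": 0.7, "genes": ["1", "1", "2"]},  # only 2 distinct
]
kept = [f["id"] for f in figures if passes_filters(f)]
```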
From these figures, we recognized:
- 736,260 total genes
- 12,201 unique genes
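Totals like these can be recomputed from the exported documents. A minimal sketch, assuming each document has an `associatedWith.genes` list as in the agreed format and that "total" means the sum of per-figure gene list lengths; `gene_counts` is a hypothetical helper and the input is toy data.

```python
def gene_counts(docs):
    """Return (total gene entries across figures, number of distinct gene IDs)."""
    total = 0
    unique = set()
    for doc in docs:
        genes = doc["associatedWith"]["genes"]
        total += len(genes)
        unique.update(genes)
    return total, len(unique)


# Toy documents, not real PFOCR records.
toy_docs = [
    {"_id": "fig1", "associatedWith": {"genes": ["1000", "207", "208"]}},
    {"_id": "fig2", "associatedWith": {"genes": ["207", "84", "1033"]}},
]
total, unique = gene_counts(toy_docs)
```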
If you'd like, we can provide you with additional hits we got when we removed the limitation of "top 93,000 (as sorted by PMC relevance score)".
I think the ball is now in Kevin's court, but I don't think this issue should be closed until we actually complete the stated milestone -- ie, the creation of the BioThings API to serve PFOCR data. Kevin I know is working on this -- should be done in the next week or so.
(Minor issue, but does the latest version have PMIDs? Would be a nice-to-have, but obviously the PMCIDs will suffice too...)
Roger that! Anders has PMIDs in the pipeline. They will be a part of all future depositions. We wanted to give you the file ASAP to meet the milestone. We can update it later next week if you think PMIDs will be helpful for the Segment 1 demo. Otherwise, we'll save it for the next update, which will include a ton more content early in Segment 2.
I think PMIDs can wait until segment 2. Once Kevin creates the API based on the dropbox file linked above, we'll ask you to check it out. After we're all happy with it, we can close this ticket...
@andrewsu @AlexanderPico @ariutta
The PFOCR API is live now. Please check it out; here are some query examples:
- Query for a specific figure: https://pending.biothings.io/pfocr/geneset/PMC100008__mb2411709009.jpg
- Query for a specific gene: https://pending.biothings.io/pfocr/query?q=associatedWith.genes:107
- Query for a specific PMC ID: https://pending.biothings.io/pfocr/query?q=associatedWith.pmc:PMC2494582
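The three example URLs follow two patterns: a per-figure lookup under `/geneset/` and a `field:value` search under `/query`. A small sketch that only builds the URLs (it makes no network request); the helper names are hypothetical, and the base URL is the one quoted in this thread, which may change once the API leaves the pending stage.

```python
from urllib.parse import quote

BASE = "https://pending.biothings.io/pfocr"  # host as quoted in this thread


def figure_endpoint(figure_id):
    """URL for fetching one figure document by its _id."""
    return f"{BASE}/geneset/{figure_id}"


def query_endpoint(field, value):
    """URL for a field:value search, e.g. associatedWith.genes:107."""
    # Keep ":" unescaped so the URL matches the examples above.
    return f"{BASE}/query?q=" + quote(f"{field}:{value}", safe=":")


by_figure = figure_endpoint("PMC100008__mb2411709009.jpg")
by_gene = query_endpoint("associatedWith.genes", 107)
by_pmc = query_endpoint("associatedWith.pmc", "PMC2494582")
```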
Cool! I verified a handful of queries. It looks good to me.
Looks good!
Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?
Nice! I'll also note that multi-gene queries work, e.g., https://pending.biothings.io/pfocr/query?q=associatedWith.genes:27115%20AND%20associatedWith.genes:55811. I think this will work out perfectly for how we envisioned a second layer of BTE prioritization. For example, suppose for a given query, BTE comes back with ~100 reasoning chains (expressed as paths of biomedical entities). We could query PFOCR to look for pathway figures that contain several of those entities. Right now the API is limited to genes, but later we will expand to other entity types as well...
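To make the prioritization idea concrete, here is a rough sketch of its two halves: building the AND query for a set of genes, and ranking figure documents by how many of the query genes they contain. Both helpers are hypothetical, the ranking runs on documents already in hand rather than calling the API, and the `min_overlap` threshold is an arbitrary choice.

```python
from urllib.parse import quote


def multi_gene_query_url(gene_ids):
    """Build a query URL requiring every listed gene to be in the figure."""
    q = " AND ".join(f"associatedWith.genes:{g}" for g in gene_ids)
    return "https://pending.biothings.io/pfocr/query?q=" + quote(q, safe=":")


def rank_figures(docs, entity_genes, min_overlap=2):
    """Rank figures by the number of query genes they mention (most first)."""
    wanted = {str(g) for g in entity_genes}
    scored = []
    for doc in docs:
        overlap = wanted & {str(g) for g in doc["associatedWith"]["genes"]}
        if len(overlap) >= min_overlap:
            scored.append((doc["_id"], sorted(overlap)))
    scored.sort(key=lambda item: len(item[1]), reverse=True)
    return scored


url = multi_gene_query_url([27115, 55811])

# Toy documents; figure IDs and gene memberships are made up.
toy_docs = [
    {"_id": "fig_a", "associatedWith": {"genes": ["27115", "55811", "107"]}},
    {"_id": "fig_b", "associatedWith": {"genes": ["27115"]}},
    {"_id": "fig_c", "associatedWith": {"genes": ["27115", "55811"]}},
]
ranked = rank_figures(toy_docs, [27115, 55811, 107])
```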
Before the demo, we should try to flesh out an example like this in one of our example notebooks (PREDICT_demo, EXPLAIN_demo, or tidbit 2).
@AlexanderPico @kevinxin90 I assume the answer to this question is "no, we'll use pmc and pmid":
Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?
@ariutta Hi Anders, my bad. I missed this thread. I think "pmc" is fine. Normally, we label "pmid" as "pubmed". Thanks!
@andrewsu APIs confirmed. Is this one ready to close? Do you need anything on this one for the demo?
I think we are good, please close it out! (We are trying to put together a notebook that demonstrates the use of the PFOCR API with BTE for the demo next week. If anyone on your end has bandwidth to work on that, let me know!)