rwst/yaccl

[challenge] Aspergillus terpenoids

Adafede opened this issue · 16 comments

Hi again!

Small question in the form of a challenge:

Would yaccl be able to perform a query that allows reproducing the listed compounds in https://doi.org/10.1016/j.phytochem.2021.113011 (without the need of npclassifier or classyfire)?

As a starting point, these two existing queries might help:
https://w.wiki/4ShY
https://w.wiki/3HMD

Best,

rwst commented

In general, yaccl is taxon-agnostic. What we do is take a list of compounds and make sure they are classified correctly. In this case we don't have access to the paper, and the paper is also not in Wikidata. But the following query gets all InChI strings of compounds from Aspergillus species:

SELECT DISTINCT ?inchi
WHERE 
{
  ?item wdt:P703 ?tax.
  ?tax wdt:P171 wd:Q335130.
  ?item wdt:P234 ?inchi.
}
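For anyone who wants to script this step, here is a minimal sketch of sending the query above to the public Wikidata Query Service and collecting the InChI strings. The function names, the User-Agent string and the output handling are my own choices for illustration and not part of yaccl; only the query itself comes from this thread.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the comment above, unchanged.
QUERY = """
SELECT DISTINCT ?inchi
WHERE
{
  ?item wdt:P703 ?tax.
  ?tax wdt:P171 wd:Q335130.
  ?item wdt:P234 ?inchi.
}
"""


def build_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return WDQS_ENDPOINT + "?" + params


def extract_inchis(result):
    """Pull the ?inchi values out of a SPARQL JSON result, deduplicated and sorted."""
    bindings = result["results"]["bindings"]
    return sorted({row["inchi"]["value"] for row in bindings})


def fetch_inchis(query=QUERY):
    """Run the query over the network (not called on import)."""
    request = urllib.request.Request(
        build_url(query),
        # Hypothetical User-Agent; use one that identifies your own tool.
        headers={"User-Agent": "aspergillus-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_inchis(json.load(response))


def save_lines(lines, path):
    """Write one InChI per line, e.g. to aspergillus.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling `save_lines(fetch_inchis(), "aspergillus.txt")` should then produce the one-InChI-per-line file used in the next step.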

Saving the list of 4,349 compounds in a file aspergillus.txt, we want to check whether any are not recognized as natural products. There is no ready-made option for this, but the bash loop
while IFS= read -r i; do python3 classify.py -d ./ -j -m "$i"; done < aspergillus.txt
gives a first impression: about half of the compounds are classified with the current version. The list contains all Aspergillus metabolites, so if your question is specifically which of the compounds in that paper are recognized as terpenoids, we would first need that list. Can you help?
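The same loop can be sketched in Python, which makes it easier to keep the results together. The flags `-d ./ -j -m` are the ones used in the bash loop above; everything else is an assumption on my side, in particular that `classify.py` sits in the current directory and that `-j` prints a single JSON document on stdout. No particular keys are assumed in that JSON.

```python
import json
import subprocess


def read_inchis(path):
    """Read one InChI per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_command(inchi, script="classify.py", data_dir="./"):
    """Build the argument list used in the bash loop (no shell quoting needed)."""
    return ["python3", script, "-d", data_dir, "-j", "-m", inchi]


def classify(inchi):
    """Run the script once; return parsed JSON, or None if the run or parse fails.

    Assumption: with -j the script writes one JSON document to stdout.
    """
    completed = subprocess.run(
        build_command(inchi), capture_output=True, text=True, check=False
    )
    if completed.returncode != 0:
        return None
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError:
        return None


def classify_file(path):
    """Map every InChI in the file to its (possibly missing) result."""
    return {inchi: classify(inchi) for inchi in read_inchis(path)}
```

Passing the InChI as a list element rather than through a shell avoids problems with characters such as `?` or `*` that can occur in InChI strings.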

rwst commented

I may have misunderstood. Did you mean to extract all terpenoids from the list of 4,349 compounds? That should be doable.

Sorry, my question was badly formulated. I did not really want to reproduce what is in the article, but rather to generate the WD+yaccl equivalent. Rather... yes... you were faster.

Here is a slightly adapted query:

https://w.wiki/4T7k

rwst commented

I have pushed a new version of the classify script such that JSON output also includes the molecule. This would allow more comfortable processing of the output of my small bash script given above.

rwst commented

For the sake of speed there should be better handling when both -j and -t are given. Noted.

Beautiful. Thanks!

rwst commented

Alternatively, if you are satisfied with what is already in WD, going without yaccl should work too:
https://w.wiki/4T9V
But it times out...

rwst commented

Pushed the addition of InChI key too...

https://w.wiki/4T9n
no time out :)

Wooops... forgot to filter:

https://w.wiki/4T9t

rwst commented

These are only the subclasses, you need to include P31/P279* in order to get all. Either with UNION or using the pipe symbol.
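To make the pipe variant concrete, here is a small sketch that builds such a query as a string. This is one plausible reading of the suggestion, not the query behind the short links in this thread. The helper name `build_class_query` is hypothetical, and the class QID is left as a parameter because the thread does not spell it out. P703, P171 and Q335130 are taken from the query earlier in this issue; P235 is the InChIKey property.

```python
import re


def build_class_query(class_qid, genus_qid="Q335130"):
    """Return a SPARQL query for compounds found in species of a genus
    that are an instance (P31) or subclass (P279) of the given class,
    following further P279 links transitively.
    """
    for qid in (class_qid, genus_qid):
        # Guard against accidental injection of arbitrary text into the query.
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")
    return f"""SELECT DISTINCT ?item ?inchikey
WHERE
{{
  ?item wdt:P703 ?tax.
  ?tax wdt:P171 wd:{genus_qid}.
  ?item wdt:P235 ?inchikey.
  ?item (wdt:P31|wdt:P279)/wdt:P279* wd:{class_qid}.
}}"""


# Placeholder QID for illustration only; substitute the class you are after.
print(build_class_query("Q12345"))
```

The path `(wdt:P31|wdt:P279)/wdt:P279*` accepts either property for the first step and then any number of subclass steps, which is what the UNION form would express with two separate branches.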

Oh, indeed nice catch
https://w.wiki/4T9y

rwst commented

I knew because the yaccl run found 360, as well. Now, for the interpretation, the subclasses might contain duplicates where the stereochemistry is unspecified.
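One rough way to spot such candidates, assuming the query results carry standard InChIKeys: the first 14-character block of an InChIKey hashes the connectivity, and a second block of `UHFFFAOYSA` indicates that no stereo or isotope layers are present. The sketch below groups keys by the first block and reports groups that contain at least one stereo-free key next to another entry. The function names are mine, and the keys in the usage example are made-up placeholders, not real compounds.

```python
from collections import defaultdict

# Second InChIKey block of a standard InChI without stereo/isotope layers.
NO_STEREO_BLOCK = "UHFFFAOYSA"


def skeleton(inchikey):
    """First block of the key: hash of the connectivity layers."""
    return inchikey.split("-")[0]


def lacks_stereo(inchikey):
    """True if the second block signals that no stereo layer is present."""
    return inchikey.split("-")[1] == NO_STEREO_BLOCK


def candidate_duplicates(inchikeys):
    """Group keys by skeleton; keep groups where a stereo-free key
    coexists with at least one other distinct key."""
    groups = defaultdict(set)
    for key in inchikeys:
        groups[skeleton(key)].add(key)
    return {
        skel: sorted(members)
        for skel, members in groups.items()
        if len(members) > 1 and any(lacks_stereo(k) for k in members)
    }


# Made-up placeholder keys, for illustration only.
keys = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
]
print(candidate_duplicates(keys))
```

Such groups are only candidates: they still need a manual look to decide whether the stereo-free entry really duplicates a specified stereoisomer.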

rwst commented

This one is interesting https://www.wikidata.org/wiki/Q77573987
a cyclic farnesan that was misrecognized as a macrolide (not trivial).

Well...this one is in the end a real challenge! x)

I am not sure a lot of humans would do better

rwst commented

So, you see, I nearly always add P31/P279 to WD compounds at the same time I add SMARTS to classes. Exceptions: I still need to add P31/P279 for unspecified alkaloids and macrolides in WD.

Having it all in WD simplifies searches such as this one. The downside is that, as a follow-up, the WD entries need to be maintained, e.g. by frequent scanning.

This issue also demands improvements in yaccl/WD integration. I'll leave it open until I think it is resolved.