petermr/CEVOpen

📕 Documentation: Dictionary.xml and DictionaryDescription.md of: eoActivityAgents

EmanuelFaria opened this issue · 0 comments

I've assembled a list of 8850 "activity agents" (I don't know what else to call them) that will need to be normalized against either Wikidata or perhaps ChEBI.

I created this list by doing a GREP query on the almost 250,000 articles I pulled down with GetPapers last year. The query included up to four words before the term "agent" or "agents". Then with a LOT of cleaning, I trimmed the leading words and got this list down from about 50,000 to its present state (there were a lot of duplicates).
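For anyone who wants to reproduce the extraction step, here is a minimal sketch of that kind of query in Python rather than GREP. The pattern, the function names, and the choice to collapse "agents" to "agent" are assumptions for illustration, not the original query.

```python
import re
from collections import Counter

# Assumed pattern: one to four words (letters, digits, hyphens), each followed
# by whitespace, directly before "agent" or "agents". Punctuation ends a run.
AGENT_PATTERN = re.compile(
    r"\b((?:[A-Za-z0-9][A-Za-z0-9-]*\s+){1,4})agents?\b",
    re.IGNORECASE,
)


def extract_agent_phrases(text):
    """Return lower-cased '<leading words> agent' phrases found in text."""
    phrases = []
    for match in AGENT_PATTERN.finditer(text):
        # Normalize internal whitespace in the leading words.
        leading = " ".join(match.group(1).lower().split())
        phrases.append(f"{leading} agent")
    return phrases


def count_phrases(texts):
    """Count phrases across many article texts; duplicates collapse here."""
    counts = Counter()
    for text in texts:
        counts.update(extract_agent_phrases(text))
    return counts
```

On the sentence "The oil acts as a potent anti-oxidative agent in vitro." this returns `["as a potent anti-oxidative agent"]`, which shows why the leading words still need trimming afterwards.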

All the articles I pulled had to do with two main themes: Plant Extracts (or essential oils, etc.) AND Activities (medicinal, pharmacological, phyto-medicinal, etc.) NOT (petrol, shale, "oil", ... nothing "animal feed-related", etc.).

I ran the cleanest getpapers queries I could. Overall, there are very few terms that are out of the ballpark. Some of them have to do with what I consider "formulation" terms (excipients, adhesives, abrasives, etc.), but for the most part, these would be useful for any biomedical project, including Covid.

I did most of the work months ago, as a way to see what the literature had in it and to flex my growing GREP skills. But I pulled it out a couple of days ago and decided to do a bunch of find-and-replace passes on junk/stop words, and it came out really nice. I wish I could have kept the discarded words separately (turns out scientists use a lot of puffery in their descriptions, just like marketers do!), but I couldn't think of a way of doing that that would have been practical or efficient.
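One possible way to keep the discarded words while cleaning is sketched below. It is not how the list was actually produced, and the `JUNK_WORDS` set is a hypothetical stand-in for the real junk/stop-word list.

```python
from collections import Counter

# Hypothetical junk/stop words; the real list would be much longer.
JUNK_WORDS = {"a", "as", "the", "potent", "novel", "promising", "effective"}


def strip_junk(phrase, junk=JUNK_WORDS):
    """Split a phrase into (cleaned term, list of discarded words)."""
    kept, discarded = [], []
    for word in phrase.split():
        if word.lower() in junk:
            discarded.append(word)
        else:
            kept.append(word)
    return " ".join(kept), discarded


def clean_terms(phrases, junk=JUNK_WORDS):
    """Return the sorted unique cleaned terms plus a tally of what was dropped."""
    cleaned = set()
    discard_counts = Counter()
    for phrase in phrases:
        term, dropped = strip_junk(phrase, junk)
        if term:
            cleaned.add(term)
        discard_counts.update(word.lower() for word in dropped)
    return sorted(cleaned), discard_counts
```

Calling `clean_terms(["as a potent anti-oxidative agent", "novel anti-oxidative agent"])` yields the single term `anti-oxidative agent` and a counter of the dropped words, so the puffery can be inspected later.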

Anyhow, it's still useful even without further disambiguation or descriptions, but adding those would definitely make it more useful, especially if we could split the terms up into different dictionaries (for example, by pathway). But that's a different kettle of fish.

EDIT: Also, I ran some random tests by pulling out multi-word terms I'd never heard of and putting them, in quotes, into EUPMC searches, and all of them had a decent number of hits.
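The same spot check could be scripted against the Europe PMC REST search service. This is a rough sketch assuming the public `search` endpoint with `query`, `format`, and `pageSize` parameters and a `hitCount` field in the JSON response; the function names are made up for illustration, and the checks above were done by hand on the website.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term):
    """Build a search URL for an exact-phrase (quoted) query."""
    params = {"query": f'"{term}"', "format": "json", "pageSize": 1}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def hit_count(term, timeout=30):
    """Fetch the number of hits for a quoted term (needs network access)."""
    with urlopen(build_search_url(term), timeout=timeout) as response:
        return json.load(response)["hitCount"]
```

For example, `hit_count("anti-oxygenic agent")` would return the number of matching records at the time of the call.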

EDIT 2: Plus, I never would otherwise have found so many different ways (synonym terms) to find things I'm actually interested in. For example:

  • anti-oxidant agent
  • anti-oxidants agent
  • anti-oxidation agent
  • anti-oxidative agent
  • anti-oxidative protecting agent
  • anti-oxidative stress agent
  • anti-oxidizing agent
  • anti-oxygenic agent

Who knew? 😀🎉