Remove redundant calls to normalize cached objects
Following comments from @leokim-l on Slack, and my own suspicion that the oaklib normalization functions are called more often than necessary, I tried to find out how much impact connection speed/availability has on entity grounding performance.
```
$ echo "nucleus" > nucleus.txt
$ poetry shell
$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
...
real    0m7.334s
user    0m4.230s
sys     0m1.488s
```
So that's uncached; running the same command again should hit the cache:
```
$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
...
real    0m4.085s
user    0m3.859s
sys     0m1.307s
```
OK, so it's faster, but I'm on a pretty decent wired connection here. After manually limiting my Ethernet speed to 10 Mbps:
```
real    0m5.963s
user    0m5.502s
sys     0m1.617s
```
That could be coincidental, but after restoring full speed it's back in the 4s range, so it looks like a network call is still being made somewhere.

But if I have no connection at all, no errors are raised: the grounding process just doesn't happen (so if I wanted to map to some other namespace, that mapping wouldn't work).
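Ideally that failure would be loud rather than silent. A minimal sketch of the kind of guard I mean, wrapping an oaklib adapter's `basic_search`; the wrapper name and exception list are illustrative, not OntoGPT's actual internals:

```python
import logging
from urllib.error import URLError

logger = logging.getLogger(__name__)

def ground_with_warning(adapter, text: str) -> list:
    """Hypothetical wrapper: try to ground `text` via an oaklib adapter,
    but warn loudly instead of silently skipping when offline."""
    try:
        return list(adapter.basic_search(text))
    except (URLError, ConnectionError, OSError) as exc:
        logger.warning("Grounding skipped for %r (connection issue?): %s", text, exc)
        return []
```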
The full debug output for a cached run looks like this:
```
$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.
WARNING:ontogpt.engines.knowledge_engine:GPT4All client not available. GPT4All support will be disabled.
INFO:root:Logger root set to level 10
INFO:root:Found model: MODEL_GPT_4o, provided by OpenAI.
INFO:root:Input file: nucleus.txt
INFO:root:Input text: nucleus
INFO:ontogpt.io.template_loader:Loading schema for go_simple
INFO:root:Importing core as core from source /home/harry/ontogpt/src/ontogpt/templates/go_simple.yaml; base_dir=/home/harry/ontogpt/src/ontogpt/templates
INFO:root:Importing linkml:types as /home/harry/ontogpt/.venv/lib/python3.10/site-packages/linkml_runtime/linkml_model/model/schema/types from source /home/harry/ontogpt/src/ontogpt/templates/go_simple.yaml; base_dir=None
INFO:ontogpt.io.template_loader:Getting class for template go_simple
INFO:root:Using template OntologyTermSet
INFO:root:Using mappers (currently hardcoded)
INFO:root:Model source is openai
INFO:root:API KEY path = /home/harry/.config/ontology-access-kit/openai-apikey.txt
DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False
DEBUG:httpx:load_verify_locations cafile='/home/harry/ontogpt/.venv/lib/python3.10/site-packages/certifi/cacert.pem'
INFO:root:Setting up OpenAI client API Key
INFO:root:API KEY path = /home/harry/.config/ontology-access-kit/openai-apikey.txt
INFO:root:Will use this client: <class 'ontogpt.clients.openai_client.OpenAIClient'>
INFO:ontogpt.clients.openai_client:Complete: engine=gpt-4o, prompt[210]=From the text below, extract the following entities in the following format:
terms: <A semicolon-se...
INFO:ontogpt.clients.openai_client:Caching OpenAI responses to /home/harry/ontogpt/.openai_cache.db
INFO:ontogpt.clients.openai_client:Using cached payload for prompt: From the text below, extract the following entities in the following format:\n\nte...
INFO:root:RAW TEXT: terms: GO:0005634
label: nucleus
INFO:root:PARSING LINE: terms: GO:0005634
DEBUG:root: FIELD: terms
DEBUG:root:SLOT: terms INL: False VALS: ['GO:0005634']
INFO:root:PARSING LINE: label: nucleus
DEBUG:root: FIELD: label
DEBUG:root:SLOT: label INL: None VALS: ['nucleus']
DEBUG:root:RAW: {'terms': ['GO:0005634'], 'label': 'nucleus'}
DEBUG:root:Grounding annotation object {'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
DEBUG:root:FIELD: terms SLOT: terms
DEBUG:root: VAL: GO:0005634
INFO:ontogpt.engines.knowledge_engine:GROUNDING GO:0005634 using OntologyTerm
INFO:ontogpt.engines.knowledge_engine:Grounding GO:0005634 to GO:0005634; next step is to normalize
INFO:ontogpt.engines.knowledge_engine:Normalized GO:0005634 with GO:0005634 to GO:0005634
DEBUG:root:FIELD: label SLOT: label
DEBUG:root: VAL: nucleus
DEBUG:root:FIELD: id SLOT: id
DEBUG:root: VAL: c0a98f49-5cb6-4d9c-8535-85b8918ac042
DEBUG:root:Creating object from dict {'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
INFO:root:{'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
---
input_text: |
  nucleus
raw_completion_output: |-
  terms: GO:0005634
  label: nucleus
prompt: |+
  From the text below, extract the following entities in the following format:
  terms: <A semicolon-separated list of any Gene Ontology terms.>
  label: <The label (name) of the named thing>
  Text:
  nucleus
  ===
extracted_object:
  id: c0a98f49-5cb6-4d9c-8535-85b8918ac042
  label: nucleus
  terms:
    - GO:0005634
named_entities:
  - id: GO:0005634
    label: GO:0005634
```
The issue is cases like this:
```
INFO:ontogpt.engines.knowledge_engine:GROUNDING GO:0005634 using OntologyTerm
INFO:ontogpt.engines.knowledge_engine:Grounding GO:0005634 to GO:0005634; next step is to normalize
INFO:ontogpt.engines.knowledge_engine:Normalized GO:0005634 with GO:0005634 to GO:0005634
```
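The term arrives already grounded, gets "grounded" to itself, and is then normalized to itself, and each of those steps can involve an oaklib lookup. A minimal sketch of the short-circuit plus memoization I have in mind; `normalize_curie` and `_oaklib_normalize` are hypothetical stand-ins, not OntoGPT's actual function names:

```python
from functools import lru_cache

def _oaklib_normalize(curie: str) -> str:
    """Stand-in for the real oaklib-backed normalization lookup
    (the part that can hit the network)."""
    return curie  # placeholder

@lru_cache(maxsize=4096)
def normalize_curie(curie: str) -> str:
    """Memoized wrapper: each distinct CURIE is normalized at most
    once per process, however many annotations mention it."""
    return _oaklib_normalize(curie) or curie

def ground_and_normalize(curie: str, target_prefix: str = "GO") -> str:
    # Hypothetical short-circuit: if the grounded ID already carries
    # the target prefix, skip the normalization call entirely.
    if curie.startswith(target_prefix + ":"):
        return curie
    return normalize_curie(curie)
```

Even without the prefix check, the `lru_cache` alone would collapse repeated normalizations of the same CURIE within a run.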
I think this should be fixed by #373, as that incorporates litellm's disk cache, though that doesn't do anything for normalization.
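For the record, enabling litellm's disk cache on the completion side should look roughly like this; a sketch based on litellm's documented caching API, with the cache directory name as just an example:

```python
import litellm
from litellm.caching import Cache

# Cache completions on disk so repeated identical prompts
# (like the "nucleus" test above) skip the network entirely.
litellm.cache = Cache(type="disk", disk_cache_dir=".litellm_cache")

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "From the text below, ..."}],
    caching=True,
)
```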
On my system, a cached hit takes this long:
```
$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
...
real    0m3.771s
user    0m3.583s
sys     0m1.304s
```