Support huge JSON schemas

Question

wjn0 opened this issue 4 months ago · 3 comments

Is your feature request related to a problem? Please describe.
Huge JSON schemas fail when generating the template.

Describe the solution you'd like
Ideally, the whole schema should not be processed at once. Instead, process it on the fly as generation proceeds.

Describe alternatives you've considered
Provide some kind of progress indicator, so I can tell whether the current implementation is in fact hopeless or I'm just impatient/under-resourced hardware-wise :)

Additional context

Minimal reproducible example (note: this downloads a 3.5MB public JSON Schema from a well-known org, so cache it if you plan to run this repeatedly):

import requests

from transformers import AutoModelForCausalLM, AutoTokenizer
import guidance


print("guidance version: ", guidance.__version__)


model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

# download latest FHIR schema build
schema = requests.get("https://build.fhir.org/fhir.schema.json").json()


messages = [
    {"role": "system",
     "content": "Construct a FHIR Bundle resource from the provided patient information. Provide your output in valid FHIR JSON."},
    {"role": "user",
     "content": "John Smith was diagnosed with diabetes on 2021-01-01. He is prescribed metformin 500mg daily. He is allergic to penicillin."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

lm += prompt

print(lm)

# lm += guidance.gen(max_tokens=100)
lm += guidance.json(schema=schema)

print(lm)

I ran it against wjn0/guidance@improve-json-schema-support to pick up hackish fixes for #887 and #888 (which hopefully are not the reason for the hang...)
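The repro's note about caching the schema could be handled with a small on-disk cache. This is a minimal sketch using only the standard library; the helper names `fetch_json` and `load_schema` and the cache file name are hypothetical, not part of guidance or the original repro.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def fetch_json(url):
    """Download and parse a JSON document (hypothetical helper)."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def load_schema(url, cache_path, fetch=fetch_json):
    """Return the schema from cache_path if present, else fetch and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    schema = fetch(url)
    cache_path.write_text(json.dumps(schema), encoding="utf-8")
    return schema


# Assumed usage in the repro above:
# schema = load_schema("https://build.fhir.org/fhir.schema.json", "fhir.schema.json")
```

The `fetch` parameter exists only so the download step can be swapped out, for example when running offline.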

Answer 1 · 2024-06-06T16:30:54.000Z

Where do you observe the hang?

Note that calling guidance.json(schema=schema) essentially "compiles" the schema down to a guidance grammar. Adding that grammar to the lm then kicks off the generation process. So my question is really: does it hang at "compile time" or at generation time?
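One way to answer the compile-time versus generation-time question is to time the two steps separately. The helper below is a rough sketch; `time_phases` is a hypothetical name, and the commented usage assumes the two steps can be split as `guidance.json(schema=schema)` followed by `lm + grammar`, mirroring the repro.

```python
import time


def time_phases(build_grammar, run_generation):
    """Time grammar construction and generation as two separate phases."""
    t0 = time.perf_counter()
    grammar = build_grammar()
    t1 = time.perf_counter()
    output = run_generation(grammar)
    t2 = time.perf_counter()
    return {"compile_s": t1 - t0, "generate_s": t2 - t1, "output": output}


# Assumed usage with the objects from the repro:
# report = time_phases(
#     lambda: guidance.json(schema=schema),  # "compile" step
#     lambda grammar: lm + grammar,          # generation step
# )
# print(report["compile_s"], report["generate_s"])
```

If the first phase never returns, the hang is in building the grammar rather than in generation.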

Either way, your proposed solution is interesting:

"Ideally, the whole schema should not be processed at once. Instead, process it on the fly as generation proceeds."

This seems to imply a stateful version of the grammar that only compiles sub-grammars as needed. I could see this potentially being a lot more efficient for massive schemas like this... Let's first track down the source of the issue you're seeing and keep this idea in our back pocket 😉
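To make the "only as needed" idea concrete, here is a toy sketch of on-demand `$ref` resolution with a cache. It is not how guidance builds grammars; `LazyRefResolver` and `walk_instance` are hypothetical names, only local `#/...` references are handled, and JSON Pointer escaping is ignored.

```python
class LazyRefResolver:
    """Resolve local "$ref" pointers on demand and remember which were touched."""

    def __init__(self, root_schema):
        self.root = root_schema
        self.cache = {}  # ref string -> resolved sub-schema

    def resolve(self, ref):
        if not ref.startswith("#/"):
            raise ValueError(f"only local refs are handled in this sketch: {ref!r}")
        if ref not in self.cache:
            node = self.root
            for key in ref[2:].split("/"):
                node = node[key]
            self.cache[ref] = node
        return self.cache[ref]

    def deref(self, node):
        # Follow chained refs until reaching a concrete sub-schema.
        while isinstance(node, dict) and "$ref" in node:
            node = self.resolve(node["$ref"])
        return node


def walk_instance(resolver, schema_node, instance):
    """Visit only the sub-schemas that a concrete instance actually reaches."""
    schema_node = resolver.deref(schema_node)
    if isinstance(instance, dict):
        props = schema_node.get("properties", {})
        for name, value in instance.items():
            if name in props:
                walk_instance(resolver, props[name], value)
    elif isinstance(instance, list) and "items" in schema_node:
        for value in instance:
            walk_instance(resolver, schema_node["items"], value)
```

After a walk, `resolver.cache` holds only the definitions that were reached, which is the property a stateful, lazily compiled grammar would rely on for a schema with many unused or recursive definitions.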

Tagging @riedgar-ms.

Answer 2 · 2024-06-06T16:41:49.000Z

In addition to the schema, is there a sample JSON file which conforms to it? It could be interesting to turn that into a test with the Mock, to see (as @hudson-ai asks) whether it's a problem related to building the grammar or to generating the output.

Answer 3 · 2024-06-06T17:34:18.000Z

I think it's hanging at compilation time, i.e. while building the grammar. When I interrupt it after several minutes, it's in replace_grammar_node, but build_definitions is definitely complete. I can leave it running over the weekend on a bigger machine if that's helpful, or if you'd like me to pop a breakpoint in somewhere, I can try that too.
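Rather than interrupting by hand, the standard library's `faulthandler` can dump the current stack at a fixed interval, which shows whether the process stays in the same function. A minimal sketch follows; the wrapper name `run_with_stack_dumps` is hypothetical, and the commented usage assumes the `guidance.json` call from the repro.

```python
import faulthandler


def run_with_stack_dumps(fn, interval_s, file):
    """Run fn() while dumping all thread stacks to `file` every interval_s seconds."""
    # `file` must be a real file object with a file descriptor, e.g. sys.stderr.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)
    try:
        return fn()
    finally:
        # Always stop the watchdog, even if fn() raises or is interrupted.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage with the repro:
# import sys
# grammar = run_with_stack_dumps(lambda: guidance.json(schema=schema), 60, sys.stderr)
```

If successive dumps keep showing the same frames, that points at a single slow call rather than steady progress through the schema.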

Re: test JSON, absolutely. Here's an example: https://build.fhir.org/bundle-response-medsallergies.json.html

I think my mental model for JSON generation from schema is definitely more stateful templating, less grammar, in no small part because I started from working with these massive loopy/recursive schemas and had tool use in mind from the jump.

I don't necessarily expect guidance.json to support it, although it would be very cool if it did :) Other guidance templating features are more than enough for me to roll my own, provided I can track down the source of #876 :) I do think there's a difference in spirit between two cases. For a small, lightly nested schema (a common use case being, say, generating tool arguments as the OpenAI API nominally intends), a grammar-based approach is likely the most performant, and its simplicity is a boon. For schemas this large, templating will likely always provide some kind of boost, at the expense of simplicity.