mit-ll-responsible-ai/hydra-zen

YAML roundtrip not working correctly

petosa opened this issue · 5 comments

Hello, I am using hydra-zen to save config yaml files, ship them over to another system, and then load them back in and instantiate them. However, load_from_yaml is not working as expected - it loads back in a DictConfig rather than a hydrazen type and thus none of the type validation + coercion works when instantiating the loaded yaml. This is particularly an issue for enum types.

Repro:

import omegaconf
from omegaconf import MISSING, OmegaConf
import dataclasses
from enum import Enum

from hydra_zen import instantiate, builds, save_as_yaml, load_from_yaml, to_yaml

class Color(Enum):
    red = "RED"
    blue = "BLUE"

@dataclasses.dataclass
class Example:
    color: Color =  MISSING
    number: int = MISSING
    
    
built = builds(Example, color="blue", number="3")


print(type(built))
print(built)
print(instantiate(built))

save_as_yaml(built, "saved.yaml")
loaded = load_from_yaml("saved.yaml")

print()
print(type(loaded))
print(loaded)
print(instantiate(loaded))

The first 3 prints make sense:

<class 'type'>
<class 'types.Builds_Example'>
Example(color=<Color.blue: 'BLUE'>, number=3)

Proper type validation and coercion is done. When the yaml is saved and loaded back in however, type validation + coercion does not happen, since load_from_yaml returns a DictConfig rather than a type. The results of instantiate do not match. From the last 3 prints:

<class 'omegaconf.dictconfig.DictConfig'>
{'_target_': '__main__.Example', 'color': 'blue', 'number': 3}
Example(color='blue', number=3)

As you can see, Example is instantiated with the string blue rather than the enum <Color.blue: 'BLUE'>.

rsokl commented

Thanks for taking the time to write this up, and for the nice repro.

The issue that you are hitting here is unfortunate... While I think this is more a Hydra/OmegaConf issue, I would certainly like to help hydra-zen users avoid this issue.

I think this quick solution may suit your needs and may work quite generally:

def load_object_from_yaml(x: str):
    loaded = load_from_yaml("saved.yaml")
    return OmegaConf.to_object(OmegaConf.create(instantiate(loaded)))

Here it is in action for your example:

from omegaconf import MISSING, OmegaConf
import dataclasses
from enum import Enum

from hydra_zen import instantiate, builds, save_as_yaml, load_from_yaml, to_yaml, just

def load_object_from_yaml(x: str):
    loaded = load_from_yaml("saved.yaml")
    return OmegaConf.to_object(OmegaConf.create(instantiate(loaded)))

def pp(x):
    print(x, "::", type(x))

class Color(Enum):
    red = "RED"
    blue = "BLUE"

@dataclasses.dataclass
class Example:
    color: Color =  MISSING
    number: int = MISSING
    
    
built = builds(Example, color="blue", number="3")


pp(instantiate(built))

save_as_yaml(built, "saved.yaml")
loaded = load_object_from_yaml("saved.yaml")

print()
pp(loaded)

it prints:

Example(color=<Color.blue: 'BLUE'>, number=3) :: <class '__main__.Example'>

Example(color=<Color.blue: 'BLUE'>, number=3) :: <class '__main__.Example'>
  • instantiate(loaded)) returns Example('blue', number=3). This is where Hydra might change its behavior and see that the returned dataclass has an enum-type field that had not been converted.
  • OmegaConf.create(...) returns the omegaconf DictConfig-type {'color': <Color.blue: 'BLUE'>, 'number': 3}. Here, omegaconf leverages the annotation in Example to produce the appropriate enum value.
  • OmegaConf.to_object(...) converts the DictConfig into the original backing dataclass.

load_from_yaml is currently just an alias for OmegaConf.load, which does not perform any instantiation. That being said, I think it would be reasonable for hydra-zen to add a load_from_yaml(..., to_object: bool) option, which would effectively do these extra steps for the user.

@petosa let me know if the above solution works for you in general and if the proposed enhancement to load_from_yaml seems sufficient.

@Jasha10 perhaps you have a better solution.

@rsokl thank you for your fast reply and suggestion. After playing around with your solution for a while, I noticed some shortcomings. In general, these are due to differences in functionality between omegaconf and hydrazen (the latter is more robust than the former).

Here are 3 specific situations where the proposed solution does not work (can attach some code snippets later if that'd help).

  • enum type coercion does not work if Example is a function instead of a dataclass.
  • Omegaconf runs into a validation error if Example is a normal object (not a dataclass) with an __init__ function.
  • Omegaconf runs into a MissingMandatoryValue error if it loads a yaml with missing fields (hydrazen handles missing values gracefully).

I think the right thing is to have load_from_yaml parse the yaml file into a hydrazen build type. That way the call to instantiate will benefit from robust hydrazen instantiation without having to fall back on Omegaconf.

rsokl commented

Thanks so much for looking further into this, @petosa . I will investigate those failcases and will try to devise a more robust solution.

One workaround, which does not rely on OmegaConf's implicit conversion of str to Enum, is to call builds(Color, value="BLUE") so as to create an object whose _target_ points to Color:

import dataclasses
from enum import Enum
from hydra_zen import builds, instantiate, load_from_yaml, save_as_yaml
from omegaconf import MISSING

def pp(x):
    print(x, "::", type(x))

class Color(Enum):
    red = "RED"
    blue = "BLUE"

@dataclasses.dataclass
class Example:
    color: Color = MISSING
    number: int = MISSING

built = builds(Example, color=builds(Color, value="BLUE"), number="3")

pp(instantiate(built))

save_as_yaml(built, "saved.yaml")
loaded = load_from_yaml("saved.yaml")

pp(instantiate(loaded))

print()
pp(loaded)
$ python script.py
Example(color=<Color.blue: 'BLUE'>, number=3) :: <class '__main__.Example'>
Example(color=<Color.blue: 'BLUE'>, number=3) :: <class '__main__.Example'>

{'_target_': '__main__.Example', 'color': {'_target_': '__main__.Color', 'value': 'BLUE'}, 'number': 3} :: <class 'omegaconf.dictconfig.DictConfig'>

Here are 3 specific situations where the proposed solution does not work (can attach some code snippets later if that'd help).

I believe my proposal above should work around the 3 situations you mentioned, @petosa.

@Jasha10 thank you for this suggestion - it does seem to address my cases. I guess the only inconvenience is that I need to have special handling via builds for enums. I will try it out in my project and re-open if I run into more errors. Appreciate all the help!