nextstrain/seasonal-flu

Shared auspice config JSON is not appropriate for all builds

huddlej opened this issue · 9 comments

The current auspice configuration file (config/auspice_config.json) is not always appropriate for all builds in the Snakefile. The most obvious example of this is that the auspice config exports a color-by for “RBS mutations” even though A/H3N2 HA builds are the only ones that define RBS mutations.

This feels like an issue that should be resolved at the level of the Snakefile and not at the augur or auspice level (it's not necessary for those tools to know that this is how we are using them). Possible solutions at the Snakefile-level include:

  • define one separate config JSON per build (duplicating any shared content) and reference these files by wildcards (config/auspice_config_{lineage}_{segment}_{resolution}.json)
  • define one default config JSON shared by most builds and allow build-specific copies of this default to be defined through a Snakemake input function (e.g., config/auspice_config_h3n2_ha.json)
  • define a single templated config JSON (as with jinja, etc.) that contains all shared and build-specific content and render one config JSON per build from this template using conditional logic

The first solution seems the least ideal in that most of these configs will be identical and any update to the shared content between them will require the same updates to many files.

The second solution still duplicates shared content but benefits from defining build-specific conditional logic in Python in the Snakefile. For seasonal flu, it seems likely that the custom config will be the exception and not the rule, so a little duplicated content may not be the worst outcome.

The third solution is guaranteed to avoid content duplication, but it does move the build-specific conditional logic into a templating language. This might be the best solution in the long run if each build ends up having its own slight modifications to the shared settings. In the short term, though, it feels like an over-engineered approach.

A fourth option is using overlays where a default base config is defined in e.g. config/auspice_config.json and optional partial configs are defined in config/auspice_config_h3n2_ha.json. The partial configs are overlaid on top of, or merged into, the default config. Using your example, the config/auspice_config_h3n2_ha.json file would contain just:

{
  "color_options": {
    "rb": {
      "menuItem": "RBS-adjacent mutations",
      "legendTitle": "RBS-adjacent mutations",
      "type": "continuous",
      "key": "rb"
    }
  }
}

All the other color by entries would be picked up from the default base config. There could even be config/auspice_config_h3n2.json and config/auspice_config_h3n2_ha.json, and config/auspice_config_h3n2_ha_2y.json overlaid in order, if necessary, although hopefully it won't be.

This is a flexible system and a standard approach used for configuration files. It is the same approach Snakemake takes with its own configfile: … directives and the config global, and if we wanted we could even re-use that same machinery very easily.

(Besides re-using machinery that exists, I do think it's appealing for other reasons to put the contents of auspice_config.json under some "auspice": { … } key in a general config.json file read into the Snakefile using the configfile: … directives.)

Thanks, @tsibley! I forgot the format that was staring right at me. Overlays seem like the best solution.

I'm also in favor of using config.json to store most parameters. I tend to think of that file as a way to configure Snakefile rules while the auspice config is more of a way to configure the auspice interface. If you weren't buying into the Snakemake ecosystem, for whatever reason, it could still be helpful to have auspice config JSON(s) that stand apart from the Snakemake config.

I also don't feel strongly about this one way or the other though. Maybe we can discuss at Wednesday's meeting.

Nod, I can definitely see times it would be useful to have the auspice config be a separate file. Although I also see the auspice config file as inherently tied to the augur commands run, thus the Snakefile rules (in our case), since the rules generate node attributes that need to be mentioned in the auspice config.

If we don't re-use Snakemake's config file system, then doing overlays ourselves looks something like:

import json
import sys
from pathlib import Path
from snakemake.utils import update_config

def read_json(path): 
    print(f"Loading {path}", file = sys.stderr)
    return json.loads(path.read_text(encoding = "UTF-8"))

def read_config_set(base_filename, stem_suffixes = []):
    base_path = Path(base_filename)
    
    config = read_json(base_path)

    for stem_suffix in stem_suffixes:
        new_stem = "_".join([base_path.stem, *stem_suffix])
        overlay_path = base_path.with_name(new_stem + base_path.suffix)

        if overlay_path.exists():
            update_config(config, read_json(overlay_path))

    return config


auspice_config = read_config_set("config/auspice_config.json", [("h3n2"), ("h3n2", "ha"), ("h3n2", "ha", "2y")])
trvrb commented

@tsibley, @huddlej ---

I'm in favor of this overlay option: #5 (comment), but want to be sure I'm clear on the proposal:

I imagine that this should be built into the augur export command directly. Augur export can then be passed in --auspice-config config/auspice_config.json config/auspice_config_h3n2_ha.json. If config/auspice_config_h3n2_ha.json has an attribute already present in config/auspice_config.json it overrides.

@trvrb I admit, I was originally thinking augur would be none-the-wiser about the multiple config files and just receive a single JSON structure that had been composed together by the Snakefile rules.

Your proposal of having augur do the composition seems reasonable to me and does make the config composition more readily-available to other builds using augur. The Snakefile may still have to do an existence check before passing the set of files to --auspice-config, but I don't think that'll be too onerous, and I don't see any unintended consequences with the approach.

is this something we are planning to follow up?

trvrb commented

I think we should move this to be an augur issue. I do think that composing multiple auspice config files in augur export will be a generally useful feature and will streamline seasonal-flu config files.

trvrb commented

I'm closing this issue in favor of nextstrain/augur#298. Once augur functionality is in place, we can revisit auspice config format for seasonal-flu repo.