nextstrain/auspice

Remove hardcoding of dataset descriptions on nextstrain.org


This text:

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions.

is added to any dataset on https://nextstrain.org without a description as a result of these lines:

const preambleContent = "This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions.";
const genericPreamble = (<div>{preambleContent}</div>);
if (window.location.hostname === 'nextstrain.org') {
  return hardCodedFooters(dispatch, genericPreamble);
}

I don't think it applies to all datasets served by https://nextstrain.org, especially not for groups/community/fetch datasets which are not maintained by us.

Possible solutions

  1. ⛔️ Include only a hardcoded list of prefixes (e.g. /ncov, /mpox, all core datasets)
  2. ⛔️ Update the condition to be exclusionary based on path prefixes (e.g. /groups, /community, /fetch)
  3. Ensure all of our own datasets (coreBuildPaths) have a description field in their metadata and remove the hardcoding entirely (ref)

I strongly prefer not hardcoding nextstrain.org-specific bits into Auspice. This rules out both possible solutions (1) and (2) above IMO. We've been moving away from this practice of special-casing over time, and I don't think we should slide backwards.

I suggest possible solution (3): ensuring all of our own datasets have a description field in their metadata so this codepath is used:

/**
 * If the metadata contains a description key, then it will take precedence over the hard-coded
 * acknowledgements. Expects the text in the description to be in Markdown format.
 * Jover. December 2019.
 */
if (metadata.description) {
  return (
    <Suspense fallback={<div />}>
      <MarkdownDisplay className="acknowledgments" mdstring={metadata.description} />
    </Suspense>
  );
}

and then doing away with the hardcoding for nextstrain.org in Auspice.
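
To make the difference concrete, here is a minimal sketch of the branch order before and after the proposed change. It is a simplified model, not the Auspice source: the function names `currentFooterText` and `proposedFooterText` are hypothetical, plain strings stand in for the React elements, and returning `null` when nothing applies is an assumption.

```javascript
// Generic text quoted at the top of this issue.
const GENERIC_PREAMBLE =
  "This work is made possible by the open sharing of genetic data by research groups " +
  "from all over the world. We gratefully acknowledge their contributions.";

// Simplified model of the current behaviour (hypothetical helper):
// a dataset-provided description wins, otherwise the hostname check decides.
function currentFooterText(metadata, hostname) {
  if (metadata.description) return metadata.description;
  if (hostname === "nextstrain.org") return GENERIC_PREAMBLE;
  return null; // assumption: nothing generic is shown on other hosts
}

// Simplified model of solution (3) (hypothetical helper):
// no hostname special-casing, only the dataset's own description is used.
function proposedFooterText(metadata) {
  return metadata.description || null;
}

// A community dataset without a description, viewed on nextstrain.org:
console.log(currentFooterText({}, "nextstrain.org") === GENERIC_PREAMBLE); // true
console.log(proposedFooterText({})); // null
```

Under this model, datasets that already carry a description render identically in both versions; only datasets without one lose the generic text.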

Many (most?) of our datasets already do have a description, often with acknowledgements and links to download data. Those are displayed in the footer, but they're not displayed in the "Download data" modal because getAcknowledgments() is called without the dataset metadata:

{getAcknowledgments({}, {preamble: {fontWeight: 300}, acknowledgments: {fontWeight: 300}})}

Given that the expected content of description is acknowledgements (and maybe data download info), it seems to me we'd want to include that specific info in the "Download data" modal. That might be somewhat of a breaking change to the expected UI though?

Related to #1800

Originally I said:

This text [...] is added to all datasets on https://nextstrain.org/ as a result of these lines:

At dev chat I was corrected – it's not added to all datasets. It's only added if the dataset doesn't provide its own description. The proper solution is what @tsibley proposed in #1809 (comment), and this would also apply to #1800. I've updated the issue description to reflect this.

> Given that the expected content of description is acknowledgements (and maybe data download info), it seems to me we'd want to include that specific info in the "Download data" modal. That might be somewhat of a breaking change to the expected UI though?

I added this to the dev chat agenda for next time; if we can assemble consensus here before that, we can skip it, but hopefully it's a backstop to getting a decision made. (Adding it seems like the way to go, to me?)

> That might be somewhat of a breaking change to the expected UI though?

We could display it in both places? Similar reasoning to #1715 (comment)

Issue description updated with a list of all core datasets. I've checked off the ones that have a description already – there are only 4 that need updating. I'll make issues in the individual pathogen repos to figure out what to include in the custom description.
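
A check like the one described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `datasetsMissingDescription` and the example paths are hypothetical, and the input is assumed to be already-parsed dataset JSON objects keyed by dataset path, with the description living under `meta.description`.

```javascript
// Returns the paths of datasets whose JSON lacks a non-empty meta.description.
// `datasetsByPath` maps a dataset path to its parsed dataset JSON (hypothetical input shape).
function datasetsMissingDescription(datasetsByPath) {
  return Object.entries(datasetsByPath)
    .filter(([, json]) => {
      const description = json && json.meta && json.meta.description;
      return typeof description !== "string" || description.trim() === "";
    })
    .map(([path]) => path);
}

// Made-up example data; real dataset JSONs carry many more fields.
const exampleDatasets = {
  "example/with-description": { meta: { description: "Custom acknowledgements." } },
  "example/without-description": { meta: { title: "No description here" } },
};

// Logs the one path that still needs a description.
console.log(datasetsMissingDescription(exampleDatasets));
```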

> > That might be somewhat of a breaking change to the expected UI though?

> We could display it in both places? Similar reasoning to #1715 (comment)

Ah, I didn't mean "move it", I meant "also include it". So, yeah, both places. By breaking UI change, I meant that the dataset-provided description can currently get quite long and may 1) break the UI of the download modal and/or 2) be written with the expectation of only being in the footer and not make sense in the context of the download modal (e.g. relative references like "use filters above").

Issues have been created. A few options going forwards:

  1. Wait for the custom config to be added to the live dataset.
    • This would require not only adding the description and hooking it up to the workflow, but a re-run of the workflow itself to generate a new dataset file that includes the description.
  2. Manually inject the current description into the current dataset file.
    • This would require editing the JSON file on S3 and manually adding:

      "meta": {
          "description": "This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions."
      }
  3. Continue to remove hardcoding without ensuring live builds have been updated.

(1) seems like the proper approach. However, some of the datasets that need updating aren't actively maintained. This means progress on removing hardcoding is likely to be blocked for a long time. Maybe this is fine though as it's not a pressing issue.
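
For option (2), the edit to the dataset JSON could look roughly like this. It is a sketch only: the helper name `withDescription` is hypothetical, it operates on an already-parsed JSON object, and downloading from and uploading to S3 is deliberately left out.

```javascript
// Generic text currently hardcoded in Auspice, as quoted in this issue.
const GENERIC_DESCRIPTION =
  "This work is made possible by the open sharing of genetic data by research groups " +
  "from all over the world. We gratefully acknowledge their contributions.";

// Returns a copy of the dataset JSON with meta.description set.
// An existing description is left untouched so custom text is never overwritten.
function withDescription(dataset, description = GENERIC_DESCRIPTION) {
  const meta = dataset.meta || {};
  if (meta.description) return dataset;
  return { ...dataset, meta: { ...meta, description } };
}

// Made-up minimal dataset; a real file would be read with JSON.parse first.
const patchedDataset = withDescription({ meta: { title: "Example dataset" } });
console.log(JSON.stringify(patchedDataset, null, 2));
```

Returning a copy instead of mutating the input keeps the original object available for a before/after comparison prior to re-uploading.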

Generating the description from a re-run of the build is the best way forward, where possible. I loaded the default dataset URL for each of the checked pathogens in your original message to confirm the (current) live build does have the custom description in it. They all do with the exception of:

  • seasonal-flu (this is a bug I think - issue)
  • WNV/NA. @j23414 it looks like the live dataset wasn't rebuilt after nextstrain/WNV#10. Do you have plans to keep this dataset continually updated (if so, let's rebuild now with the new footer), or will you focus your efforts on /global (then let's just modify the existing JSON on S3, which I can do)?

That leaves the three pathogens without an associated description markdown file:

  • d68. Dataset last updated in August, presumably by @emmahodcroft. Since it's been rebuilt recently we should add a description markdown & rebuild this.
  • MERS. The mers dataset isn't from the mers repo. We should probably sort this out at some point, but if the generic description is no longer injected into this dataset that's just fine for the short term.
  • TB. Dataset last updated 2018. I'd suggest adding this manually to the JSON on S3.

Thanks @jameshadfield! re: WNV/NA, I plan to focus my efforts on /global. I recommend modifying the existing JSON on S3 when you have a moment.

the longer (and probably too detailed) answer
I don't plan to touch WNV/NA since that is a historical document linked to a paper and a Nextstrain narrative. My current plan for WNV is to first create a WNV/global build, then a WNV/north-america which won't clash with the existing WNV/NA. Additionally, I'm hitting a bunch of lowercase/uppercase errors, so the final names might be closer to wnv/global and wnv/north-america.