nextstrain/nextclade_data

Version missing from pathogen.json for datasets released on or after Jan 29


I was working with the RSV datasets, and noticed that the "version" key is missing entirely from pathogen.json on new datasets.

I first saw this with the 1/29 versions of both RSV datasets. While looking into it, I found the key is also missing from nextstrain/flu/yam/ha/JN993010, but not from anything older.

I don't see anything in the changelog (either here or for the CLI) about the "version" key being removed from pathogen.json. With the key missing, you can't identify the version of a downloaded dataset (you can only specify a particular version to download), even though the documentation still says you can.

While trying to figure this out, I did see the commit that removes the version from the files in data/ so that it is auto-generated in data_output/. However, the version is also missing from the datasets actually downloaded with nextclade dataset get (as well as from data_output/ in the repo).
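
For reference, identifying the version of a downloaded dataset comes down to reading the "version" object from its pathogen.json. Below is a minimal Python sketch of that lookup. It assumes the object has the shape shown in the output later in this thread (`updatedAt` and `tag`). The function name `read_dataset_version` is hypothetical and not part of Nextclade.

```python
import json
from pathlib import Path
from typing import Optional


def read_dataset_version(dataset_dir: str) -> Optional[dict]:
    """Return the "version" object from <dataset_dir>/pathogen.json.

    Returns None when the key is missing or null, which is the
    symptom described in this issue. (Hypothetical helper.)
    """
    pathogen_json = Path(dataset_dir) / "pathogen.json"
    with pathogen_json.open(encoding="utf-8") as f:
        data = json.load(f)
    version = data.get("version")
    # Treat a missing key, an explicit null, and a non-object value the same way.
    return version if isinstance(version, dict) else None
```

For example, `read_dataset_version("rsv_a")` would be called on the directory passed to `--output-dir`. A `None` result means the downloaded dataset cannot be tied back to a version.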

Hi @Valiec,

Thanks for the report!

This was a bug. It should now be fixed in fca1b15 and 369219e.

Nextclade CLI 3.0.1 should now download a correct pathogen.json that includes the version field. Here is how I tested:

$ docker run -it --rm nextstrain/nextclade:3.0.1 bash -c 'apt-get update -qq >/dev/null && apt-get install -yqq jq >/dev/null && (for v in flu_yam_ha rsv_a rsv_b; do nextclade dataset get -v --name=${v} --output-dir=${v} && jq ".version" ${v}/pathogen.json; done)'
debconf: delaying package configuration, since apt-utils is not installed
2024-01-31 12:17:09.597 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/index.json'
2024-01-31 12:17:10.100 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/nextstrain/flu/yam/ha/JN993010/2024-01-30--16-34-55Z/dataset.zip'
{
  "updatedAt": "2024-01-30T16:34:55Z",
  "tag": "2024-01-30--16-34-55Z"
}
2024-01-31 12:17:10.149 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/index.json'
2024-01-31 12:17:10.199 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/nextstrain/rsv/a/EPI_ISL_412866/2024-01-29--10-29-43Z/dataset.zip'
{
  "updatedAt": "2024-01-29T10:29:43Z",
  "tag": "2024-01-29--10-29-43Z"
}
2024-01-31 12:17:10.258 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/index.json'
2024-01-31 12:17:10.298 [I] http_client.rs:101: HTTP 'GET' request to 'https://data.clades.nextstrain.org/v3/nextstrain/rsv/b/EPI_ISL_1653999/2024-01-29--10-29-43Z/dataset.zip'
{
  "updatedAt": "2024-01-29T10:29:43Z",
  "tag": "2024-01-29--10-29-43Z"
}

And to check that all output datasets in the repo have version fields:

$ jq -s 'map({ attributes, version }) | map(select(.version == null))' data_output/**/pathogen.json
[]
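
If jq is not available, or the shell does not expand `**` recursively, roughly the same check can be sketched in Python. This is only an illustration, assuming each pathogen.json holds a single JSON object. The function name `find_missing_versions` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def find_missing_versions(root: str) -> List[str]:
    """List every pathogen.json under `root` whose "version" is missing or null.

    Mirrors the jq filter above: select(.version == null).
    (Hypothetical helper, not part of the repository's scripts.)
    """
    missing = []
    # rglob walks the tree recursively, independent of shell globbing.
    for path in sorted(Path(root).rglob("pathogen.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if data.get("version") is None:
            missing.append(str(path))
    return missing
```

An empty list from `find_missing_versions("data_output")` corresponds to the `[]` printed by jq above.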

Comment or open another issue if it does not work for you.

Thanks @Valiec for reporting this so quickly, much appreciated!

I downloaded RSV-A again to test and I'm seeing the version field now, thanks!