nextstrain/nextclade_data

Error downloading sars-cov-2 dataset after release 2024-02-16--04-00-32Z

Closed this issue · 3 comments

Hi Nextclade team! I'm encountering some issues after the release of tag 2024-02-16--04-00-32Z. I see errors with both the v3.0.0 and v3.2.0 nextclade CLI versions.

  1. Downloading with the explicit tag 2024-02-16--04-00-32Z succeeds, as does omitting the tag, but the version recorded in the output is "unreleased":

    nextclade dataset get --name sars-cov-2 --tag "2024-02-16--04-00-32Z" --output-dir "sars-cov-2_2024-02-16--04-00-32Z"
    nextclade dataset get --name sars-cov-2 --output-dir "sars-cov-2_no-tag"
    "version": {
    	"tag": "unreleased"
    },
  2. Any other tag, including "latest", raises an error:

    nextclade dataset get --name sars-cov-2 --tag "2024-01-16--20-31-02Z" --output-dir "sars-cov-2_2024-01-16--20-31-02Z"
    nextclade dataset get --name sars-cov-2 --tag "latest" --output-dir "sars-cov-2_latest"
    Error:
       0: Dataset not found: 'sars-cov-2'.
    
          Did you mean:
          - nextstrain/sars-cov-2/XBB
          - nextstrain/sars-cov-2/BA.2
          - nextstrain/sars-cov-2/BA.2.86
          - nextstrain/sars-cov-2/wuhan-hu-1/orfs
          - nextstrain/sars-cov-2/wuhan-hu-1/proteins
          - community/isuvdl/mazeller/prrsv2/orf5/yimim2023
          - nextstrain/mpox/all-clades
          - nextstrain/rsv/a/EPI_ISL_412866
          - nextstrain/rsv/b/EPI_ISL_1653999
          ?
    
          Type `nextclade dataset list` to show available datasets.
    
    Location:
       packages/nextclade-cli/src/cli/nextclade_dataset_get.rs:79
    
    Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
    Run with RUST_BACKTRACE=full to include source snippets.
    

Hi Katherine @ktmeaton

Thanks for reporting! You've hit 2 bugs simultaneously - one in software and another in data! That's a bingo!

The "tag": "unreleased" problem was caused by incorrect files being put into the zip archives that Nextclade CLI relies on. I fixed the dataset build logic for future dataset releases in #177, and in #178 I retroactively re-uploaded corrected dataset zips for the already released datasets to our servers, so the tags should now show up correctly in pathogen.json once you re-download the datasets. There is no new dataset release: the tags are the same, and the files inside the datasets are the same, except for the tag in pathogen.json.

The "Dataset not found" problem was caused by a bug in the Nextclade software itself. I fixed it in nextstrain/nextclade#1420 and will release a new version shortly. You will need to update Nextclade CLI to get the fix.

In the meantime you can manually download dataset.zip files from one of the subdirectories here: https://github.com/nextstrain/nextclade_data/tree/master/data_output/nextstrain/sars-cov-2/wuhan-hu-1/orfs. These are the exact same zips which nextclade dataset get downloads (and extracts) for you.
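For the manual route, a minimal sketch of unpacking a downloaded zip and reading back its tag could look like the following. The function names `dataset_tag` and `unpack_dataset` and the paths passed to them are hypothetical and not part of Nextclade. The sketch also assumes the `unzip` tool is installed and that the tag sits on its own line in pathogen.json, as in the excerpt quoted in this thread.

```shell
# Print the "tag" value recorded in a dataset directory's pathogen.json.
# Assumption: the first line containing "tag": "..." is the dataset version tag.
dataset_tag() {
  sed -n 's/.*"tag"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1/pathogen.json" | head -n 1
}

# Extract a manually downloaded dataset zip into a directory and print its tag.
# $1 = path to the zip (placeholder), $2 = output directory (placeholder).
unpack_dataset() {
  zip_path="$1"
  out_dir="$2"
  mkdir -p "$out_dir" &&
    unzip -o -q "$zip_path" -d "$out_dir" &&
    dataset_tag "$out_dir"
}
```

Calling `unpack_dataset dataset.zip out` should then print the tag of the extracted dataset; a value of `unreleased` would suggest the zip predates the re-upload.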

Nextclade CLI 3.2.1 is now released with the mentioned bugfix.

Here is how I tested:

$ docker run -it --rm nextstrain/nextclade:3.2.1 bash -c "nextclade dataset get --name=sars-cov-2 --tag='2024-01-16--20-31-02Z' --output-dir=out && grep 'tag' 'out/pathogen.json'"
    "tag": "2024-01-16--20-31-02Z"

$ docker run -it --rm nextstrain/nextclade:3.2.1 bash -c "nextclade dataset get --name=sars-cov-2 --tag='2024-02-16--04-00-32Z' --output-dir=out && grep 'tag' 'out/pathogen.json'"
    "tag": "2024-02-16--04-00-32Z"

$ docker run -it --rm nextstrain/nextclade:3.2.1 bash -c "nextclade dataset get --name=sars-cov-2 --tag='latest' --output-dir=out && grep 'tag' 'out/pathogen.json'"
    "tag": "2024-02-16--04-00-32Z",

$ docker run -it --rm nextstrain/nextclade:3.2.1 bash -c "nextclade dataset get --name=sars-cov-2 --output-dir=out && grep 'tag' 'out/pathogen.json'"
    "tag": "2024-02-16--04-00-32Z"
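The checks above can also be scripted without Docker, which may be handy as a regression check. This is only a sketch: `verify_dataset_tag` is a hypothetical helper, it assumes a fixed `nextclade` is on the PATH, and it assumes pathogen.json formats the tag as `"tag": "<value>"`, as in the grep output above.

```shell
# Download one dataset tag and confirm that pathogen.json records that same tag.
# Returns non-zero on a download failure or on a mismatch (e.g. "unreleased").
# $1 = dataset name, $2 = explicit tag (not "latest", which resolves to a concrete tag).
verify_dataset_tag() {
  name="$1"
  tag="$2"
  out_dir="$(mktemp -d)" || return 1
  nextclade dataset get --name "$name" --tag "$tag" --output-dir "$out_dir" || return 1
  if grep -qF "\"tag\": \"$tag\"" "$out_dir/pathogen.json"; then
    echo "ok: $tag"
  else
    echo "mismatch: expected $tag" >&2
    return 1
  fi
}
```

For example, `for t in 2024-01-16--20-31-02Z 2024-02-16--04-00-32Z; do verify_dataset_tag sars-cov-2 "$t"; done` would cover both explicit tags tested above.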

I will close the issue. Please comment or open a new issue if there are still problems.

Thank you, it works great now with no errors!