owid/etl

Allowing public ETL steps to depend on private steps


We have a single case where a public dataset (data://garden/covid/latest/combined, and hence our full COVID dataset) depends on a private dataset, data-private://garden/covid/latest/sequence.

data://garden/covid/latest/combined:
    - data://garden/covid/latest/testing
    - data://garden/covid/latest/cases_deaths
    - data-private://garden/covid/latest/sequence
    - data://garden/demography/2024-07-15/population

An error is raised when you try to run the ETL without the --private flag, so a full etl run fails with:

ValueError: Public step data://garden/covid/latest/combined depends on private step data-private://garden/covid/latest/sequence. Use --private flag.
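For readers unfamiliar with the check, here is a minimal sketch of the kind of validation that could produce this error. It is not the actual owid/etl implementation: the function names (is_private, check_private_dependencies), the dict-based DAG, and the rule that a step is private when its URI contains "-private://" are all assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical in-memory DAG: step URI -> set of dependency URIs.
# It mirrors the DAG entry quoted in this issue.
DAG: Dict[str, Set[str]] = {
    "data://garden/covid/latest/combined": {
        "data://garden/covid/latest/testing",
        "data://garden/covid/latest/cases_deaths",
        "data-private://garden/covid/latest/sequence",
        "data://garden/demography/2024-07-15/population",
    },
}


def is_private(step: str) -> bool:
    """Assumption: private steps are marked by a '-private://' scheme suffix."""
    return "-private://" in step


def check_private_dependencies(dag: Dict[str, Set[str]], private: bool = False) -> None:
    """Raise if any public step depends on a private step and private mode is off."""
    if private:
        # With the private flag set, private dependencies are allowed.
        return
    for step, deps in dag.items():
        if is_private(step):
            continue
        for dep in sorted(deps):
            if is_private(dep):
                raise ValueError(
                    f"Public step {step} depends on private step {dep}. "
                    "Use --private flag."
                )
```

Calling check_private_dependencies(DAG) raises the ValueError for the combined step, while check_private_dependencies(DAG, private=True) returns without error.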

This is a bit annoying, as we have to exclude the COVID dataset from nightly builds. It would also be confusing for anyone trying to build it.

Should we exclude steps depending on private steps by default and raise a warning instead of failing?
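To make the proposal concrete, here is a rough sketch of what "exclude and warn" could look like. It is only an illustration under assumptions: the function name exclude_private_dependents, the dict-based DAG, and the "-private://" test are hypothetical and not taken from the ETL codebase. The sketch also drops steps that depend on a private step transitively, which is one possible reading of the proposal.

```python
import warnings
from typing import Dict, Set


def is_private(step: str) -> bool:
    """Assumption: private steps are marked by a '-private://' scheme suffix."""
    return "-private://" in step


def exclude_private_dependents(dag: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return a copy of the DAG without private steps and without public
    steps that depend on them, directly or transitively. Each excluded
    public step triggers a warning instead of an error."""
    excluded = {step for step in dag if is_private(step)}
    changed = True
    # Repeat until no new step is excluded, so transitive dependents are caught.
    while changed:
        changed = False
        for step, deps in dag.items():
            if step in excluded:
                continue
            if any(is_private(dep) or dep in excluded for dep in deps):
                excluded.add(step)
                changed = True
                warnings.warn(
                    f"Skipping public step {step}: it depends on a private step. "
                    "Use --private flag to include it."
                )
    return {step: deps for step, deps in dag.items() if step not in excluded}
```

With this approach a full run would build everything that is buildable from public data and report the skipped steps, rather than failing on the first private dependency.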

@lucasrodes why is data-private://garden/covid/latest/sequence private? Maybe the solution would be to make it public (given that it's used by a public step).

hi @pabloarosado

why is data-private://garden/covid/latest/sequence private?

It must be private, as requested by the data provider (GISAID), whose license is very restrictive.

Maybe the solution would be to make it public (given that it's used by a public step).

That's not possible; we cannot share this data publicly. The data://garden/covid/latest/combined step processes and aggregates a private indicator to compute a ratio, and that ratio is fine to make public.