Code review

Can't do it in the usual UI as there's no pull request to reference, so here's the points I observed.

monkeypox-report/src/test_build.py

Line 105 in b09edfd

assert all(url.endswith("csv") for url in build.get_archives_list("csv"))

consider using monkeypatch to substitute requests.get here, or any problem accessing GitHub will make the test runner exit.
monkeypox-report/src/build.py

Line 79 in b09edfd

logging.error("Failed to get archives list, aborting")

(and elsewhere this pattern appears) raise an exception and log/exit in if __name__ == '__main__' so that you can also test unhappy paths.
monkeypox-report/src/test_build.py

Line 132 in b09edfd

def test_counts():

there are a lot of numbers the same in this comparison: how do I know that n_countries_suspected_only is 1 because it correctly counted those, not because it actually used the same logic for n_countries_discarded? My preference would be to introduce functions that calculate the summary statistics separately and test that they work correctly, which might help with:
monkeypox-report/src/build.py

Line 166 in b09edfd

countries_suspected_or_confirmed = set(

there's a bit of unnecessary complexity making this function hard to read: sometimes sets are calculated as unions/intersections of other sets; sometimes not; sometimes the same set is calculated multiple times; many of them are assigned values in L166-175 but then others are inlined in the return value or even walrus-assigned in the return value and used again on a subsequent line. Maybe refactoring toward helper functions would make it easier to understand what's going on: this is probably the place where errors like including the wrong status in the calculation of a statistic will occur.
monkeypox-report/src/build.py

Line 225 in b09edfd

def mid_bucket_age(age_interval: str) -> float:

I don't understand why the mid-bucket-age of "50" isn't 50. I'm happy for that to be correct behaviour, but if so could you write a docstring explaining it please? :)
monkeypox-report/src/build.py

Line 243 in b09edfd

51 - 60, 61 - 70, 71 - 80, 81 -

in G.h we have an upper bound of 120 on age, which is not quite the oldest a human ever gets but acts as a QC check against typos. Worth implementing here?
monkeypox-report/src/test_build.py

Line 169 in b09edfd

[

the negative (raise ValueError) case isn't tested.
monkeypox-report/src/build.py

Line 268 in b09edfd

return {

nit: there's a function to calculate occurrences as percentages from data frames that can be extracted here.
there is a whole lot of setup code in the first 101 lines of test_build.py that seems to mostly generate dictionaries with random content that are used in a two-line test file. Could that be replaced with a constant dictionary, maybe loaded from json or yaml?
monkeypox-report/src/figures/genomics.py

Line 30 in b09edfd

genome_data = genome_data[genome_data.date >= "2022-05"]

this seems like a big inconsistency: if we don't know which cases are associated with the outbreak then we will consider all cases after May 2022, which includes endemic cases (though maybe not all of them; I don't know if the genomic data is only the West African clade).

Thanks @iamleeg! Great suggestion on refactoring counts(). I will add the age upper limit, not sure how to handle the genome data, restricting it to a clade maybe a good approximation.

@abhidg also consider stating which methodology was used for genomic data in the reports containing the figures, so people know whether they can do like-for-like comparisons.

Really insightful comments. I will add the upper bound age to the figure script as well.

It would be worth discussing how to handle the genome data on tomorrow's call.