Inflated daily vaccines 28-06-2021 to 05-07-2021
davipt opened this issue · 4 comments
The official daily vaccine counts between 28-06-2021 and 05-07-2021 are inflated, IMHO because the historic counts for the islands were added in over those days.
The daily vaccination counts have been reported as including only the continent, excluding islands.
The daily values for 28-06 are similar to the weekly values for the continent.
The daily values for 05-07 are similar to the weekly values for national (continent+islands)
The daily values in between show a daily growth that could not have been reached: "281 thousand doses in a day" is not plausible when the news reported a record of just 141K on the 6th.
This means the vacinas.csv values for 28-06-2021 to 05-07-2021 for doses / doses1 / doses2 are still exactly as reported by the official authorities, for consistency, but may have changed scope from "continent only" to "national", and so may be much higher than they should be. Apparently.
This also means the values for pessoas_ + total / parcial and vacinas, which are adjusted with the weekly report, are currently overinflated for those days, but are corrected back to real values on the 5th (and thus will show negative values for the daily diff _novas).
This is being sorted out. Any additional information is welcome. See also #966
The way I see it, the problem is not that. Since the 'relatório da vacinação nº20' the main 2 counters changed labels and now they don't say 1st and 2nd dose but instead people that started vaccination and people that ended it.
That means that, for example, someone who gets a Janssen vaccine is immediately added to both counters.
Although the daily counters still have the old labels, it's easy to match them against the weekly "Relatório de Vacinação" and see that their meaning also changed around 28-06. So the daily numbers published on the covid dashboard, which you consume via the API, started to mean a different thing.
It is now, effectively, impossible to know how many doses are administered per day because of the different meaning of the counter. The counter no longer counts doses but people.
The only way to know that is on the weekly report where the weekly doses are shown.
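To make the relabelling concrete, here is a minimal sketch of how the two "people" counters relate to doses. The function name and the day's numbers are made up for illustration and do not come from any report; the only assumption is the one described above, that a single-dose vaccine adds the person to both counters.

```python
def counters(first_of_two: int, second_of_two: int, single_dose: int) -> dict:
    """Hypothetical helper: derive doses and the two 'people' counters for one day."""
    # Every injection is one dose, whatever the vaccine.
    doses = first_of_two + second_of_two + single_dose
    # "Started vaccination": first shots of two-dose vaccines plus single-dose vaccines.
    started = first_of_two + single_dose
    # "Completed vaccination": second shots plus single-dose vaccines.
    completed = second_of_two + single_dose
    return {"doses": doses, "started": started, "completed": completed}


# Illustrative day (made-up numbers).
day = counters(first_of_two=1000, second_of_two=800, single_dose=200)
overcount = day["started"] + day["completed"] - day["doses"]
print(day, overcount)
```

Under that assumption the sum of the two counters exceeds the real number of doses by exactly the number of single-dose vaccinations, so the two only agree on days with no single-dose vaccinations.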
2 facts prove that the data is wrong when compared to the Relatório da Vacinação Nº21:
- on the report, sum of doses (use page 2) is 808042 but on vacinas.csv it is around 1.1M.
- on 01-07, vacinas.csv reports 281428 doses administered. That is simply impossible as you said.
About those days with very high numbers I have 2 speculative theories:
- Before, the Janssen vaccines were all wrongly reported only in the '1 dose' column. When they changed the counters' meaning, they also did a bulk correction of that, because their SUM started working properly :) That could explain about 265562 more being added to the second counter, "people totally vaccinated".
- Also, everyone else that only needs 1 dose is now correctly accounted for. People that already got covid for example, are only given 1 dose and considered 'done'.
Maybe they also took that opportunity to fix some bugs on their side. Still, this is very speculative because there are at least 4 days with 'impossible' numbers according to the task force coordinator (anything above 140000)
The bigger issue is not only historical, but the daily counters. I believe those are incorrect because they don't mean doses anymore. If it's a day with 0 Janssen vaccination, they will match, otherwise it's an error.
I also don't have solutions. As of now, it's impossible to know daily number of doses. Just a daily snapshot of total people with vaccination done and started.
Here's my reasoning so far. (or more of a brain dump to see myself if it makes sense 😁)
The reports 20 and 21 got new columns that do match perfectly with the calculations we've been doing for a while to calculate people out of doses.
Weekly CUMUL + VAC_1 + VAC_2 refer to doses, likewise daily doses + doses1 + doses2, with CUMUL and doses being the sum of the other two, always, on every single day and in every weekly combination (including regions and ages).
Update: to be clear, weekly report up to 19th uses their CSV columns CUMUL_VAC_1 and *_2 whilst their 20 and 21 uses the new CUMUL_VAC_LEAST and *_COMPLETE. The data underneath did not change AFAICS. VAC_2 and COMPLETE are the same. VAC_1 and LEAST does indeed adjust for Janssen.
The difference between weekly and daily is that, IMHO, unidoses are added correctly to VAC_2 on the weekly report, but to doses1 on the daily report. This miscategorization checks out when comparing the difference between the weekly and daily values and matching it with the known number of Janssen doses administered.
Until this week I hadn't seen any evidence contrary to this. Daily total always matched weekly cumul for the continent, and doses1 and doses2 do match VAC_1 and VAC_2 by taking the Janssen doses out of the former and into the latter.
Further proof is to just compare the weekly dataset 21 (and 20) for CUMUL_VAC_LEAST minus CUMUL_VAC_1, and the result matches perfectly with Janssen numbers: (looks like a perfect match, doesn't it?)
47 48 1317 13628 31366 55206 104404 160788 188719 223259 265500 …
68 70 1346 13661 31411 55247 104457 160742 188658 223391 265718 288357
Another way is to see weekly values have COMPLETE and VAC_2 the same, and LEAST is VAC_1 plus the known Janssen values. 5452515 + 288357 = 5740872, with LEAST=5740878
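A quick sketch of that comparison, using only the numbers quoted in this thread. It assumes the first row above is CUMUL_VAC_LEAST minus CUMUL_VAC_1 and the second row is the cumulative Janssen series; the variable names are mine, and the last value of the first row is elided in the original so it is left out.

```python
# Assumed: weekly CUMUL_VAC_LEAST - CUMUL_VAC_1 (last value is elided in the thread).
least_minus_vac1 = [47, 48, 1317, 13628, 31366, 55206, 104404,
                    160788, 188719, 223259, 265500]
# Cumulative Janssen doses per week, as listed in this thread.
janssen = [68, 70, 1346, 13661, 31411, 55247, 104457,
           160742, 188658, 223391, 265718, 288357]

# zip() stops at the shorter list, so the elided last week is skipped.
gaps = [d - j for d, j in zip(least_minus_vac1, janssen)]
largest_gap = max(abs(g) for g in gaps)

# Second check: VAC_1 plus Janssen versus LEAST for the latest week.
vac_1, least = 5452515, 5740878
residual = least - (vac_1 + janssen[-1])

print(gaps, largest_gap, residual)
```

The gaps stay in the tens to low hundreds against cumulative totals in the hundreds of thousands, which is what makes the two series look like the same thing.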
So let's focus first solely on total vaccines, the one that matched almost perfectly between daily.doses and weekly.doses_continente up to 28-06, and then matches weekly.doses (national with islands) on the last comparable date, 05-07. (doses1 and doses2 also match perfectly like this, considering the adjustments for Janssen above)
> on the report, sum of doses (use page 2) is 808042 but on vacinas.csv it is around 1.1M.
That is correct. That's a difference of 400k, which is quite different from 265k, but quite similar to the cumulated vaccines of the islands.
> I believe those are incorrect because they don't mean doses anymore.
Daily numbers for 05-07 are: ("vacinas administradas" / doses | "1ª dose" / doses1 | "vacinação completa" / doses2 )
9138620 | 5702799 | 3435821
Weekly numbers for 05-07 for national including islands are: (CUMUL, CUMUL_VAC1, CUMUL_VAC2)
9173720 | 5452515 | 3720680
The Janssen number for 03-07 is 288357
Adjusting daily values to move Janssen from doses1 to doses2 yields:
… | 5414442 | 3724178
Doesn't this look the same?
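The adjustment can be reproduced with the figures quoted above. This is only a sketch: the dictionary keys mirror the column names, and the helper name is made up for illustration.

```python
# Daily values for 05-07, as published (doses | doses1 | doses2).
daily = {"doses": 9138620, "doses1": 5702799, "doses2": 3435821}
# Weekly national values for 05-07 (CUMUL | CUMUL_VAC1 | CUMUL_VAC2).
weekly_national = {"doses": 9173720, "doses1": 5452515, "doses2": 3720680}
# Cumulative Janssen doses as of 03-07.
janssen = 288357


def move_single_doses(values: dict, single: int) -> dict:
    """Hypothetical helper: move single-dose vaccines from doses1 to doses2."""
    adjusted = dict(values)  # copy, leave the input untouched
    adjusted["doses1"] -= single
    adjusted["doses2"] += single
    return adjusted


adjusted = move_single_doses(daily, janssen)
# Residual between the adjusted daily values and the weekly national values.
residual = {k: adjusted[k] - weekly_national[k] for k in adjusted}
print(adjusted, residual)
```

The residuals are tens of thousands at most, against totals in the millions, and part of that may simply be the two-day gap between the Janssen figure (03-07) and the other values (05-07).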
The exact same algorithm yields the same results for any date up to 28-06, as long as we use the weekly values for the continent (total minus islands, or sum of all ARS)
See how it feels it can't really be about Janssen, but looks quite a lot like the islands being dumped bit by bit over that week, inflating the daily values to those 281k and 218k daily "impossible" numbers?
Please help me verify if this is incorrect.
Here are the numbers for Janssen:
2021-04-17 68
2021-04-24 70
2021-05-01 1346
2021-05-08 13661
2021-05-15 31411
2021-05-22 55247
2021-05-29 104457
2021-06-05 160742
2021-06-12 188658
2021-06-19 223391
2021-06-26 265718
2021-07-03 288357
And here's the code to check the difference between the weekly and daily numbers:
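import pandas as pd
from pathlib import Path

# Daily (vacinas.csv) and weekly (vacinas_detalhe.csv) datasets, two levels above the working directory
data_vacinas = pd.read_csv(Path.cwd() / '..' / '..' / 'vacinas.csv')
data_vacinas_detalhes = pd.read_csv(Path.cwd() / '..' / '..' / 'vacinas_detalhe.csv')
# Right join: keep only the dates present in the weekly dataset
df = pd.merge(data_vacinas, data_vacinas_detalhes, how='right', on='data', suffixes=("_diario", "_semanal"))
# Convert DD-MM-YYYY to YYYY-MM-DD so the index sorts chronologically
df['data'] = df['data'].apply(lambda x: f"{x[6:]}-{x[3:5]}-{x[0:2]}")
df.set_index("data", inplace=True)
for k in ['', '1', '2']:
    # Daily cumulative minus weekly cumulative for the continent
    df[f'doses{k}_diff'] = df[f'doses{k}_diario'] - df[f'doses{k}_continente']
    # Alternative: compare against the weekly national values instead
    #df[f'doses{k}_diff'] = df[f'doses{k}_diario'] - df[f'doses{k}_semanal']
df['doses12_diff'] = df['doses1_diff'] + df['doses2_diff']
df[[c for c in df.columns if '_diff' in c]]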
doses_diff doses1_diff doses2_diff doses12_diff
data
2021-01-11 -384.0 -369.0 -15.0 -384.0
2021-01-18 298.0 -353.0 651.0 298.0
2021-01-25 NaN NaN NaN NaN
2021-02-01 -17312.0 -15122.0 -2182.0 -17304.0
2021-02-08 -14803.0 -12642.0 -2152.0 -14794.0
2021-02-15 -16238.0 -9351.0 -6866.0 -16217.0
2021-02-22 -12519.0 -6804.0 -5685.0 -12489.0
2021-03-01 -13100.0 -8202.0 -4866.0 -13068.0
2021-03-08 -14066.0 -8133.0 -5899.0 -14032.0
2021-03-15 -13001.0 -7300.0 -5665.0 -12965.0
2021-03-22 -10800.0 -6035.0 -4716.0 -10751.0
2021-03-29 -10693.0 -7756.0 -2885.0 -10641.0
2021-04-05 -10112.0 -7364.0 -2688.0 -10052.0
2021-04-12 -10880.0 -8799.0 -2015.0 -10814.0
2021-04-19 -10811.0 -9850.0 -891.0 -10741.0
2021-04-26 -10780.0 -9902.0 -795.0 -10697.0
2021-05-03 -12204.0 -10110.0 -2004.0 -12114.0
2021-05-10 -13895.0 1331.0 -15125.0 -13794.0
2021-05-17 -12748.0 20065.0 -32693.0 -12628.0
2021-05-24 -12682.0 43756.0 -56298.0 -12542.0
2021-05-31 -13409.0 93569.0 -106752.0 -13183.0
2021-06-07 -10852.0 151619.0 -162191.0 -10572.0
2021-06-14 -9796.0 178464.0 -187959.0 -9495.0
2021-06-21 57399.0 248854.0 -191121.0 57733.0
2021-06-28 -1238.0 255530.0 -256395.0 -865.0
2021-07-05 413577.0 504037.0 -90050.0 413987.0
Notice some additional adjustment on 21-06 and then 413k when the cumulative of the islands is 232172 + 214100 😕
Aligned with OWID I've removed the calculated values for 30-06-2021 to 04-07-2021 inclusive until there's a way to calculate them properly, avoiding the inflated numbers, but also preventing the negative daily diffs.
On the README it's clear that missing values will be present if the data is unknown (or invalid in this case).
It is confirmed that the systems of the autonomous regions were integrated with the VACINAS platform, which inflated the daily values. All the values are correct, except when they count different things and on different days.
The daily data for the 12th and 13th are back to normal: continent only, with the unidoses in the "com uma dose" ("with one dose") field.
Accordingly, the doses/doses1/doses2 data stay aligned with the official values, whatever they were, but the calculated pessoas/vacinas values no longer exist from 30-06 to 04-07 (to avoid the 280 thousand/day), and on 08-07 (182k/day).
Days 9 to 11 are empty because no daily data was published.
We all have to adapt our code to handle missing data and not show the daily difference when there is no data for the previous day, and also to be careful with average calculations (7 days, for example) when they fall on missing days.
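One possible way to handle this, as a plain-Python sketch: the function names are made up, the sample series is illustrative, and missing days are represented as None. A pandas version would use NaN, `diff()` and `rolling(7)`, whose default `min_periods` gives similar behaviour.

```python
from typing import List, Optional, Sequence


def daily_diff(cumulative: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Daily increments; None when today's or yesterday's value is missing."""
    out: List[Optional[int]] = [None]  # the first day has no previous day
    for prev, cur in zip(cumulative, cumulative[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def rolling_mean(values: Sequence[Optional[float]], window: int = 7) -> List[Optional[float]]:
    """Trailing mean; None unless the whole window is present."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


# Illustrative cumulative series with one missing day.
cumulative = [100, 150, None, 260, 300]
diffs = daily_diff(cumulative)
means = rolling_mean(diffs, window=2)
print(diffs, means)
```

Note that a single missing cumulative value hides two daily diffs (that day and the next), and every average whose window touches them stays empty rather than being silently computed over fewer days.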
Thank you all.