Inconsistencies in the database
danielfsbarreto opened this issue · 13 comments
There was a recurring problem running the project's goodtables action, which was fixed in #178. Now all the inconsistencies that accumulated in the meantime need to be resolved.
Job: https://github.com/turicas/covid19-br/runs/814979053?check_suite_focus=true#step:3:911
2020-06-28T01:10:34.8665625Z DATASET
2020-06-28T01:10:34.8667294Z =======
2020-06-28T01:10:34.8668860Z {'error-count': 35,
2020-06-28T01:10:34.8670649Z 'preset': 'nested',
2020-06-28T01:10:34.8671296Z 'table-count': 10,
2020-06-28T01:10:34.8671754Z 'time': 54.346,
2020-06-28T01:10:34.8672417Z 'valid': False}
2020-06-28T01:10:34.8672565Z
2020-06-28T01:10:34.8672771Z TABLE [1]
2020-06-28T01:10:34.8672978Z =========
2020-06-28T01:10:34.8673469Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8673924Z 'encoding': 'no',
2020-06-28T01:10:34.8674388Z 'error-count': 3,
2020-06-28T01:10:34.8674832Z 'format': 'inline',
2020-06-28T01:10:34.8675342Z 'headers': ['date', 'notes', 'state', 'url'],
2020-06-28T01:10:34.8675807Z 'resource-name': 'boletim',
2020-06-28T01:10:34.8676259Z 'row-count': 3310,
2020-06-28T01:10:34.8676716Z 'schema': 'table-schema',
2020-06-28T01:10:34.8677166Z 'scheme': 'inline',
2020-06-28T01:10:34.8677658Z 'source': '/app/data/output/boletim.csv',
2020-06-28T01:10:34.8678134Z 'time': 2.043,
2020-06-28T01:10:34.8678569Z 'valid': False}
2020-06-28T01:10:34.8678974Z ---------
2020-06-28T01:10:34.8679588Z [-,2] [non-matching-header] Header in column 2 doesn't match field name "state" in the schema
2020-06-28T01:10:34.8680249Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "url" in the schema
2020-06-28T01:10:34.8680894Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "notes" in the schema
2020-06-28T01:10:34.8681058Z
2020-06-28T01:10:34.8681247Z TABLE [2]
2020-06-28T01:10:34.8681715Z =========
2020-06-28T01:10:34.8682221Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8684061Z 'encoding': 'no',
2020-06-28T01:10:34.8684569Z 'error-count': 0,
2020-06-28T01:10:34.8685012Z 'format': 'inline',
2020-06-28T01:10:34.8685457Z 'headers': ['date',
2020-06-28T01:10:34.8685887Z 'state',
2020-06-28T01:10:34.8686329Z 'city',
2020-06-28T01:10:34.8686785Z 'place_type',
2020-06-28T01:10:34.8687478Z 'confirmed',
2020-06-28T01:10:34.8687928Z 'deaths',
2020-06-28T01:10:34.8688404Z 'order_for_place',
2020-06-28T01:10:34.8688861Z 'is_last',
2020-06-28T01:10:34.8689345Z 'estimated_population_2019',
2020-06-28T01:10:34.8689826Z 'city_ibge_code',
2020-06-28T01:10:34.8690342Z 'confirmed_per_100k_inhabitants',
2020-06-28T01:10:34.8690812Z 'death_rate'],
2020-06-28T01:10:34.8691271Z 'resource-name': 'caso',
2020-06-28T01:10:34.8691732Z 'row-count': 263945,
2020-06-28T01:10:34.8692198Z 'schema': 'table-schema',
2020-06-28T01:10:34.8692627Z 'scheme': 'inline',
2020-06-28T01:10:34.8693119Z 'source': '/app/data/output/caso.csv',
2020-06-28T01:10:34.8693577Z 'time': 54.154,
2020-06-28T01:10:34.8694012Z 'valid': True}
2020-06-28T01:10:34.8694133Z
2020-06-28T01:10:34.8694335Z TABLE [3]
2020-06-28T01:10:34.8694537Z =========
2020-06-28T01:10:34.8695006Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8695470Z 'encoding': 'no',
2020-06-28T01:10:34.8695908Z 'error-count': 0,
2020-06-28T01:10:34.8696350Z 'format': 'inline',
2020-06-28T01:10:34.8696779Z 'headers': ['state',
2020-06-28T01:10:34.8697247Z 'state_ibge_code',
2020-06-28T01:10:34.8697721Z 'city_ibge_code',
2020-06-28T01:10:34.8698169Z 'city',
2020-06-28T01:10:34.8698647Z 'estimated_population'],
2020-06-28T01:10:34.8699143Z 'resource-name': 'populacao-estimada',
2020-06-28T01:10:34.8699600Z 'row-count': 5571,
2020-06-28T01:10:34.8700050Z 'schema': 'table-schema',
2020-06-28T01:10:34.8700496Z 'scheme': 'inline',
2020-06-28T01:10:34.8701313Z 'source': '/app/data/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8701799Z 'time': 0.907,
2020-06-28T01:10:34.8702229Z 'valid': True}
2020-06-28T01:10:34.8702344Z
2020-06-28T01:10:34.8702547Z TABLE [4]
2020-06-28T01:10:34.8702750Z =========
2020-06-28T01:10:34.8703220Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8703681Z 'encoding': 'no',
2020-06-28T01:10:34.8704125Z 'error-count': 0,
2020-06-28T01:10:34.8704550Z 'format': 'inline',
2020-06-28T01:10:34.8705041Z 'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8705528Z 'resource-name': 'schema-boletim',
2020-06-28T01:10:34.8706201Z 'row-count': 5,
2020-06-28T01:10:34.8706674Z 'schema': 'table-schema',
2020-06-28T01:10:34.8707120Z 'scheme': 'inline',
2020-06-28T01:10:34.8707606Z 'source': '/app/schema/boletim.csv',
2020-06-28T01:10:34.8708043Z 'time': 0.013,
2020-06-28T01:10:34.8708479Z 'valid': True}
2020-06-28T01:10:34.8708618Z
2020-06-28T01:10:34.8708820Z TABLE [5]
2020-06-28T01:10:34.8709008Z =========
2020-06-28T01:10:34.8709479Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8709927Z 'encoding': 'no',
2020-06-28T01:10:34.8710364Z 'error-count': 0,
2020-06-28T01:10:34.8710806Z 'format': 'inline',
2020-06-28T01:10:34.8711296Z 'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8711836Z 'resource-name': 'schema-caso',
2020-06-28T01:10:34.8712292Z 'row-count': 13,
2020-06-28T01:10:34.8712747Z 'schema': 'table-schema',
2020-06-28T01:10:34.8713192Z 'scheme': 'inline',
2020-06-28T01:10:34.8713667Z 'source': '/app/schema/caso.csv',
2020-06-28T01:10:34.8714116Z 'time': 0.091,
2020-06-28T01:10:34.8714551Z 'valid': True}
2020-06-28T01:10:34.8714663Z
2020-06-28T01:10:34.8714862Z TABLE [6]
2020-06-28T01:10:34.8715063Z =========
2020-06-28T01:10:34.8715465Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8715672Z 'encoding': 'no',
2020-06-28T01:10:34.8715863Z 'error-count': 0,
2020-06-28T01:10:34.8716162Z 'format': 'inline',
2020-06-28T01:10:34.8716405Z 'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8716645Z 'resource-name': 'schema-populacao-estimada',
2020-06-28T01:10:34.8716856Z 'row-count': 6,
2020-06-28T01:10:34.8717069Z 'schema': 'table-schema',
2020-06-28T01:10:34.8717278Z 'scheme': 'inline',
2020-06-28T01:10:34.8717510Z 'source': '/app/schema/populacao-estimada-2019.csv',
2020-06-28T01:10:34.8717796Z 'time': 0.075,
2020-06-28T01:10:34.8718000Z 'valid': True}
2020-06-28T01:10:34.8718064Z
2020-06-28T01:10:34.8718145Z TABLE [7]
2020-06-28T01:10:34.8718242Z =========
2020-06-28T01:10:34.8718461Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8718671Z 'encoding': 'no',
2020-06-28T01:10:34.8718875Z 'error-count': 0,
2020-06-28T01:10:34.8719080Z 'format': 'inline',
2020-06-28T01:10:34.8719436Z 'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8719633Z 'resource-name': 'schema-epidemiological-week',
2020-06-28T01:10:34.8719815Z 'row-count': 4,
2020-06-28T01:10:34.8720001Z 'schema': 'table-schema',
2020-06-28T01:10:34.8720184Z 'scheme': 'inline',
2020-06-28T01:10:34.8720388Z 'source': '/app/schema/epidemiological-week.csv',
2020-06-28T01:10:34.8720572Z 'time': 0.138,
2020-06-28T01:10:34.8720745Z 'valid': True}
2020-06-28T01:10:34.8720790Z
2020-06-28T01:10:34.8720872Z TABLE [8]
2020-06-28T01:10:34.8720959Z =========
2020-06-28T01:10:34.8721148Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8721319Z 'encoding': 'no',
2020-06-28T01:10:34.8721497Z 'error-count': 0,
2020-06-28T01:10:34.8721676Z 'format': 'inline',
2020-06-28T01:10:34.8721872Z 'headers': ['field_name', 'field_type'],
2020-06-28T01:10:34.8722073Z 'resource-name': 'schema-obito_cartorio',
2020-06-28T01:10:34.8722254Z 'row-count': 35,
2020-06-28T01:10:34.8722624Z 'schema': 'table-schema',
2020-06-28T01:10:34.8722819Z 'scheme': 'inline',
2020-06-28T01:10:34.8723051Z 'source': '/app/schema/obito_cartorio.csv',
2020-06-28T01:10:34.8723261Z 'time': 0.03,
2020-06-28T01:10:34.8723468Z 'valid': True}
2020-06-28T01:10:34.8723520Z
2020-06-28T01:10:34.8723615Z TABLE [9]
2020-06-28T01:10:34.8723712Z =========
2020-06-28T01:10:34.8723931Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8724136Z 'encoding': 'no',
2020-06-28T01:10:34.8724340Z 'error-count': 0,
2020-06-28T01:10:34.8724544Z 'format': 'inline',
2020-06-28T01:10:34.8724797Z 'headers': ['date', 'epidemiological_year', 'epidemiological_week'],
2020-06-28T01:10:34.8725048Z 'resource-name': 'epidemiological-week',
2020-06-28T01:10:34.8725261Z 'row-count': 3289,
2020-06-28T01:10:34.8725475Z 'schema': 'table-schema',
2020-06-28T01:10:34.8725681Z 'scheme': 'inline',
2020-06-28T01:10:34.8725918Z 'source': '/app/data/epidemiological-week.csv',
2020-06-28T01:10:34.8726129Z 'time': 2.168,
2020-06-28T01:10:34.8726318Z 'valid': True}
2020-06-28T01:10:34.8726385Z
2020-06-28T01:10:34.8726482Z TABLE [10]
2020-06-28T01:10:34.8726582Z =========
2020-06-28T01:10:34.8726787Z {'datapackage': 'datapackage.json',
2020-06-28T01:10:34.8727572Z 'encoding': 'no',
2020-06-28T01:10:34.8727833Z 'error-count': 32,
2020-06-28T01:10:34.8728359Z 'format': 'inline',
2020-06-28T01:10:34.8728625Z 'headers': ['date',
2020-06-28T01:10:34.8728880Z 'state',
2020-06-28T01:10:34.8729220Z 'epidemiological_week_2019',
2020-06-28T01:10:34.8729509Z 'epidemiological_week_2020',
2020-06-28T01:10:34.8729789Z 'new_deaths_sars_2019',
2020-06-28T01:10:34.8730065Z 'new_deaths_pneumonia_2019',
2020-06-28T01:10:34.8730363Z 'new_deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8730657Z 'new_deaths_septicemia_2019',
2020-06-28T01:10:34.8730947Z 'new_deaths_indeterminate_2019',
2020-06-28T01:10:34.8731229Z 'new_deaths_others_2019',
2020-06-28T01:10:34.8731511Z 'new_deaths_sars_2020',
2020-06-28T01:10:34.8731796Z 'new_deaths_pneumonia_2020',
2020-06-28T01:10:34.8732205Z 'new_deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8732492Z 'new_deaths_septicemia_2020',
2020-06-28T01:10:34.8732782Z 'new_deaths_indeterminate_2020',
2020-06-28T01:10:34.8733062Z 'new_deaths_others_2020',
2020-06-28T01:10:34.8733340Z 'new_deaths_covid19',
2020-06-28T01:10:34.8733613Z 'deaths_sars_2019',
2020-06-28T01:10:34.8733892Z 'deaths_pneumonia_2019',
2020-06-28T01:10:34.8734260Z 'deaths_respiratory_failure_2019',
2020-06-28T01:10:34.8734530Z 'deaths_septicemia_2019',
2020-06-28T01:10:34.8734814Z 'deaths_indeterminate_2019',
2020-06-28T01:10:34.8735091Z 'deaths_others_2019',
2020-06-28T01:10:34.8735363Z 'deaths_sars_2020',
2020-06-28T01:10:34.8735641Z 'deaths_pneumonia_2020',
2020-06-28T01:10:34.8735932Z 'deaths_respiratory_failure_2020',
2020-06-28T01:10:34.8736216Z 'deaths_septicemia_2020',
2020-06-28T01:10:34.8736505Z 'deaths_indeterminate_2020',
2020-06-28T01:10:34.8736767Z 'deaths_others_2020',
2020-06-28T01:10:34.8737038Z 'deaths_covid19',
2020-06-28T01:10:34.8737315Z 'new_deaths_total_2019',
2020-06-28T01:10:34.8737595Z 'new_deaths_total_2020',
2020-06-28T01:10:34.8737866Z 'deaths_total_2019',
2020-06-28T01:10:34.8738138Z 'deaths_total_2020'],
2020-06-28T01:10:34.8738418Z 'resource-name': 'obito_cartorio',
2020-06-28T01:10:34.8738667Z 'row-count': 9882,
2020-06-28T01:10:34.8739041Z 'schema': 'table-schema',
2020-06-28T01:10:34.8739279Z 'scheme': 'inline',
2020-06-28T01:10:34.8739647Z 'source': '/app/data/output/obito_cartorio.csv',
2020-06-28T01:10:34.8739856Z 'time': 4.42,
2020-06-28T01:10:34.8740058Z 'valid': False}
2020-06-28T01:10:34.8740251Z ---------
2020-06-28T01:10:34.8740547Z [-,3] [non-matching-header] Header in column 3 doesn't match field name "new_deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8740884Z [-,4] [non-matching-header] Header in column 4 doesn't match field name "new_deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8741371Z [-,5] [non-matching-header] Header in column 5 doesn't match field name "new_deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8741714Z [-,6] [non-matching-header] Header in column 6 doesn't match field name "new_deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8742042Z [-,7] [non-matching-header] Header in column 7 doesn't match field name "new_deaths_covid19" in the schema
2020-06-28T01:10:34.8742369Z [-,8] [non-matching-header] Header in column 8 doesn't match field name "epidemiological_week_2019" in the schema
2020-06-28T01:10:34.8742691Z [-,9] [non-matching-header] Header in column 9 doesn't match field name "epidemiological_week_2020" in the schema
2020-06-28T01:10:34.8743001Z [-,10] [non-matching-header] Header in column 10 doesn't match field name "deaths_covid19" in the schema
2020-06-28T01:10:34.8743341Z [-,11] [non-matching-header] Header in column 11 doesn't match field name "deaths_respiratory_failure_2019" in the schema
2020-06-28T01:10:34.8743680Z [-,12] [non-matching-header] Header in column 12 doesn't match field name "deaths_respiratory_failure_2020" in the schema
2020-06-28T01:10:34.8744004Z [-,13] [non-matching-header] Header in column 13 doesn't match field name "deaths_pneumonia_2019" in the schema
2020-06-28T01:10:34.8744321Z [-,14] [non-matching-header] Header in column 14 doesn't match field name "deaths_pneumonia_2020" in the schema
2020-06-28T01:10:34.8744592Z [-,15] [extra-header] There is an extra header in column 15
2020-06-28T01:10:34.8744836Z [-,16] [extra-header] There is an extra header in column 16
2020-06-28T01:10:34.8745094Z [-,17] [extra-header] There is an extra header in column 17
2020-06-28T01:10:34.8745349Z [-,18] [extra-header] There is an extra header in column 18
2020-06-28T01:10:34.8745603Z [-,19] [extra-header] There is an extra header in column 19
2020-06-28T01:10:34.8745926Z [-,20] [extra-header] There is an extra header in column 20
2020-06-28T01:10:34.8746190Z [-,21] [extra-header] There is an extra header in column 21
2020-06-28T01:10:34.8746566Z [-,22] [extra-header] There is an extra header in column 22
2020-06-28T01:10:34.8746786Z [-,23] [extra-header] There is an extra header in column 23
2020-06-28T01:10:34.8747006Z [-,24] [extra-header] There is an extra header in column 24
2020-06-28T01:10:34.8747212Z [-,25] [extra-header] There is an extra header in column 25
2020-06-28T01:10:34.8747487Z [-,26] [extra-header] There is an extra header in column 26
2020-06-28T01:10:34.8747705Z [-,27] [extra-header] There is an extra header in column 27
2020-06-28T01:10:34.8747923Z [-,28] [extra-header] There is an extra header in column 28
2020-06-28T01:10:34.8748139Z [-,29] [extra-header] There is an extra header in column 29
2020-06-28T01:10:34.8748356Z [-,30] [extra-header] There is an extra header in column 30
2020-06-28T01:10:34.8748573Z [-,31] [extra-header] There is an extra header in column 31
2020-06-28T01:10:34.8748796Z [-,32] [extra-header] There is an extra header in column 32
2020-06-28T01:10:34.8749004Z [-,33] [extra-header] There is an extra header in column 33
2020-06-28T01:10:34.8749221Z [-,34] [extra-header] There is an extra header in column 34
The errors are in the following tables:
- boletim
- obito_cartorio
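As an illustration of what the `non-matching-header` and `extra-header` messages mean, here is a minimal sketch that compares a CSV header with the field names of a resource, position by position. It is not goodtables' implementation. The inline `datapackage` dict and `boletim_csv` string are stand-ins for the real files, the field order is inferred from the log above, and the data row is made up.

```python
import csv
import io

# Stand-in for the relevant part of datapackage.json. The field order is
# inferred from the error messages in the log; everything else is trimmed.
datapackage = {
    "resources": [
        {
            "name": "boletim",
            "schema": {
                "fields": [
                    {"name": "date"},
                    {"name": "state"},
                    {"name": "url"},
                    {"name": "notes"},
                ]
            },
        }
    ]
}

# Stand-in for data/output/boletim.csv: header order as reported in the log,
# followed by one made-up data row.
boletim_csv = "date,notes,state,url\n2020-06-27,,SP,https://example.com/boletim\n"


def schema_field_names(package, resource_name):
    """Return the field names declared for one resource, in schema order."""
    for resource in package["resources"]:
        if resource["name"] == resource_name:
            return [field["name"] for field in resource["schema"]["fields"]]
    raise KeyError(resource_name)


def header_mismatches(header, expected):
    """Compare a CSV header with the schema field names by position."""
    problems = []
    for position, name in enumerate(header, start=1):
        if position > len(expected):
            problems.append(f"column {position}: extra header {name!r}")
        elif name != expected[position - 1]:
            problems.append(
                f"column {position}: found {name!r}, "
                f"schema expects {expected[position - 1]!r}"
            )
    return problems


header = next(csv.reader(io.StringIO(boletim_csv)))
problems = header_mismatches(header, schema_field_names(datapackage, "boletim"))
for problem in problems:
    print(problem)
```

With the `boletim` header from the log, columns 2, 3 and 4 are flagged, matching the three errors reported for TABLE [1].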
The schemas used in the project are kept here, which ends up creating duplication: whenever they are changed, datapackage.json has to be updated as well.
Right, ideally these schemas would be generated from datapackage.json, and not the other way around.
That deserves an issue or a PR: identify where the code references schemas/*.csv. Since datapackage is already among the project's dependencies, this could certainly be automated.
That issue is yet another path, different from what we are suggesting here.
Here:
- datapackage.json -> schemas in the custom csv format in the schema/*.csv folder
- datapackage.json -> API documentation
There:
- database schemas -> API documentation
The whole question comes down to the development process. Today the developer is @turicas, and he seems to prefer to start by defining the schema in the database. As long as that remains the case, the database would have to be the starting point.
If datapackage.json meets the needs we have today (explained below), then I think the ideal would be to keep only datapackage.json in the repository, so that Brasil.IO could consume that file and the schema/*.csv files could be generated automatically from datapackage.json (or, once rows supports pgimport and csv2sqlite with data package, they could be deleted).
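A minimal sketch of that generation step, assuming the schema/*.csv files have just the two columns `field_name` and `field_type` (as the headers in the log above suggest). The field types in the example resource are assumptions, and the Table Schema type names are passed through unchanged; if rows expects different type names, a mapping would be needed.

```python
import csv
import io


def schema_csv_from_resource(resource):
    """Render one resource's Table Schema as field_name,field_type CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field_name", "field_type"])
    for field in resource["schema"]["fields"]:
        # Table Schema treats a field without an explicit type as "string".
        writer.writerow([field["name"], field.get("type", "string")])
    return buffer.getvalue()


# Illustrative resource: the field names come from the log, the types are
# assumptions made for this example only.
boletim_resource = {
    "name": "boletim",
    "schema": {
        "fields": [
            {"name": "date", "type": "date"},
            {"name": "state", "type": "string"},
            {"name": "url", "type": "string"},
            {"name": "notes"},
        ]
    },
}

schema_csv = schema_csv_from_resource(boletim_resource)
print(schema_csv)
```

Writing the returned text to a file such as schema/boletim.csv for each resource would make datapackage.json the single place where the schema is edited.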
The current needs are:
- Specification of column names and types (as already exists in schema/*.csv)
- General metadata, such as: column name (slug), column title (with accents, spaces etc.), column description
- Brasil.IO-specific metadata, such as: which columns will appear as filters in the interface, which columns will be shown on the frontend, which columns will be used to build the full-text search index etc. (I choose these manually when I add a dataset to the platform)
I don't know much about the datapackage specification, but if there is a way to embed custom metadata (the Brasil.IO-specific ones), then we can start a migration process (it will be much better if everything is standardized this way :).
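To make the idea concrete, here is a sketch of field descriptors that carry such metadata. `name`, `title`, `description` and `type` are standard Table Schema field properties; the `brasilio` key and its flags (`filtering`, `show_on_frontend`, `searchable`) are hypothetical names invented for this example, and whether the specification accepts extra properties like these is exactly what would need to be confirmed. The titles and descriptions are illustrative.

```python
# Field descriptors as they might appear under resource["schema"]["fields"].
# The "brasilio" key and its flags are hypothetical, not part of any spec.
caso_fields = [
    {
        "name": "date",
        "title": "Date",
        "description": "Date the data was collected",
        "type": "date",
        "brasilio": {"filtering": True, "show_on_frontend": True, "searchable": False},
    },
    {
        "name": "state",
        "title": "State",
        "description": "Abbreviation of the federative unit",
        "type": "string",
        "brasilio": {"filtering": True, "show_on_frontend": True, "searchable": False},
    },
    {
        "name": "city",
        "title": "City",
        "description": "Name of the municipality",
        "type": "string",
        "brasilio": {"filtering": False, "show_on_frontend": True, "searchable": True},
    },
]


def columns_with_flag(fields, flag):
    """List the names of the fields whose custom metadata enables a flag."""
    return [
        field["name"]
        for field in fields
        if field.get("brasilio", {}).get(flag, False)
    ]


filter_columns = columns_with_flag(caso_fields, "filtering")
search_columns = columns_with_flag(caso_fields, "searchable")
print(filter_columns, search_columns)
```

Grouping the platform-specific flags under a single key keeps them apart from the standard properties, so tools that only read the standard ones are not affected by them.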
@augusto-herrmann, since you know the data package specification better, do you think it meets the needs above? If so, shall we create an issue in the Brasil.IO repository to handle this?
About generating the API documentation: since the metadata needs to be stored in the Brasil.IO database (and will not be exactly the same as the datapackage.json I proposed, because the dataset will not always be fully up to date with the repository), it makes sense for the API documentation to be generated automatically from the Brasil.IO database and not from the (future) datapackage.json.
I think we should be discussing this over in issue turicas/brasil.io#204
Agreed. I pasted these comments of mine there.
The tests are failing again. Should we reopen this issue or create a new one?
Create a new one.
Fields were probably added or their order was changed.
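One way to tell those two causes apart is to compare the CSV header with the schema field names as sets first and as sequences second. The sketch below assumes both lists have already been read; `classify_header_drift` is a hypothetical helper name and `new_column` is a made-up field.

```python
def classify_header_drift(header, expected):
    """Separate added or removed fields from a pure change of order."""
    added = [name for name in header if name not in expected]
    removed = [name for name in expected if name not in header]
    # Look only at the names present on both sides to detect reordering.
    common_in_header = [name for name in header if name in expected]
    common_in_schema = [name for name in expected if name in header]
    return {
        "added": added,
        "removed": removed,
        "reordered": common_in_header != common_in_schema,
    }


# Header and schema order of "boletim" as reported in the log of this issue.
boletim_drift = classify_header_drift(
    ["date", "notes", "state", "url"],
    ["date", "state", "url", "notes"],
)

# Made-up case in which a column was appended to the CSV.
appended_drift = classify_header_drift(
    ["date", "state", "new_column"],
    ["date", "state"],
)
print(boletim_drift, appended_drift)
```

For `boletim` this reports no added or removed fields and `reordered` set to true, which is consistent with a change of column order only.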
Created #193.