/Covid_19

COVID 19 death and new cases comparison between selected countries and PT

Primary LanguageR

Updated: 2020-06-07

Synopsis

This is a data analysis report of the public available data about Covid19 in Portugal. The report demonstrates how to download and process data, ranks Portugal by cases and deaths and demonstrates the pandemic evolution in this country. Some final remarks are made about the limited availability of the data provided.

This report is also an example of reproducible research in data analysis making it possible to anyone to reproduce or adapt for any country.

Raw Data

World Data Source

The primarily data source used for this work was available by the European Centre for Disease Prevention and Control. By analyzing this data we can get a gist of the evolution of covid19 in the world. Some information were added trough time to this data, such as new formats. I choose the JSON format.

library(jsonlite)
library(dplyr)
url <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/json"
dataRaw <- read_json(url, simplifyVector = TRUE)
data <- as_tibble(dataRaw$records)
str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    21965 obs. of  11 variables:
##  $ dateRep                : chr  "07/06/2020" "06/06/2020" "05/06/2020" "04/06/2020" ...
##  $ day                    : chr  "7" "6" "5" "4" ...
##  $ month                  : chr  "6" "6" "6" "6" ...
##  $ year                   : chr  "2020" "2020" "2020" "2020" ...
##  $ cases                  : chr  "582" "915" "787" "758" ...
##  $ deaths                 : chr  "18" "9" "6" "24" ...
##  $ countriesAndTerritories: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ geoId                  : chr  "AF" "AF" "AF" "AF" ...
##  $ countryterritoryCode   : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ popData2018            : chr  "37172386" "37172386" "37172386" "37172386" ...
##  $ continentExp           : chr  "Asia" "Asia" "Asia" "Asia" ...

World Data Pre-Processing - getting Tiddy

Since all available variables are in character format I transformed the numeric values and date values into their respective formats.

library(lubridate)
data$dateRep <- dmy(data$dateRep)
data$day <- as.numeric(data$day)
data$month <- as.numeric(data$month)
data$year <- as.numeric(data$year)
data$cases <- as.numeric(data$cases)
data$deaths <- as.numeric(data$deaths)
data$popData2018 <- as.numeric(data$popData2018)
str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    21965 obs. of  11 variables:
##  $ dateRep                : Date, format: "2020-06-07" "2020-06-06" ...
##  $ day                    : num  7 6 5 4 3 2 1 31 30 29 ...
##  $ month                  : num  6 6 6 6 6 6 6 5 5 5 ...
##  $ year                   : num  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ cases                  : num  582 915 787 758 759 545 680 866 623 580 ...
##  $ deaths                 : num  18 9 6 24 5 8 8 3 11 8 ...
##  $ countriesAndTerritories: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ geoId                  : chr  "AF" "AF" "AF" "AF" ...
##  $ countryterritoryCode   : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ popData2018            : num  37172386 37172386 37172386 37172386 37172386 ...
##  $ continentExp           : chr  "Asia" "Asia" "Asia" "Asia" ...

Now we can work with these numbers and make some exploratory analysis. Let’s check out a summary of our data:

summary(data)

##     dateRep                day            month             year     
##  Min.   :2019-12-31   Min.   : 1.00   Min.   : 1.000   Min.   :2019  
##  1st Qu.:2020-03-18   1st Qu.: 7.00   1st Qu.: 3.000   1st Qu.:2020  
##  Median :2020-04-16   Median :16.00   Median : 4.000   Median :2020  
##  Mean   :2020-04-08   Mean   :15.73   Mean   : 3.817   Mean   :2020  
##  3rd Qu.:2020-05-12   3rd Qu.:24.00   3rd Qu.: 5.000   3rd Qu.:2020  
##  Max.   :2020-06-07   Max.   :31.00   Max.   :12.000   Max.   :2020  
##                                                                      
##      cases             deaths         countriesAndTerritories
##  Min.   :-2461.0   Min.   :-1918.00   Length:21965           
##  1st Qu.:    0.0   1st Qu.:    0.00   Class :character       
##  Median :    3.0   Median :    0.00   Mode  :character       
##  Mean   :  311.2   Mean   :   18.15                          
##  3rd Qu.:   52.0   3rd Qu.:    1.00                          
##  Max.   :48529.0   Max.   : 4928.00                          
##                                                              
##     geoId           countryterritoryCode  popData2018        continentExp      
##  Length:21965       Length:21965         Min.   :1.000e+03   Length:21965      
##  Class :character   Class :character     1st Qu.:2.083e+06   Class :character  
##  Mode  :character   Mode  :character     Median :9.485e+06   Mode  :character  
##                                          Mean   :4.888e+07                     
##                                          3rd Qu.:3.199e+07                     
##                                          Max.   :1.393e+09                     
##                                          NA's   :326

Ok, something awkward is going on. The new cases and deaths variable have negative values, because, as one may notice, the minimum value is -2461 and -1918 for each. There shouldn’t be negative values, the only explanation available is mistakes in counts that are then corrected this way. This is possible the wrongest way to correct mistakes because one doesn’t know for sure the real values of the mistaken day nor the negative cases/deaths day.

One thing to always have in mind is the source of the data: our source must be reliable and trustworthy.

Now that I have a somewhat tidy data, I can think a bit about the variables available:

names(data)

##  [1] "dateRep"                 "day"                    
##  [3] "month"                   "year"                   
##  [5] "cases"                   "deaths"                 
##  [7] "countriesAndTerritories" "geoId"                  
##  [9] "countryterritoryCode"    "popData2018"            
## [11] "continentExp"

The most interesting data I will address will be the number of detected cases and deaths, by date reported.

Portugal in the World

How is Portugal rated in death and cases counts? There is some social media discussion about how to address this rates. Let us compare absolute numbers with percentages related to country population:

CasesRank <- data %>% 
        group_by(countriesAndTerritories) %>%
        summarise("TotalCases" = sum(cases)) %>%
        arrange(desc(TotalCases))

PTrankCases <- grep("Portugal", CasesRank$countriesAndTerritories)

numberofCases <- CasesRank %>% filter(countriesAndTerritories == "Portugal")
numberofCases <- as.integer(numberofCases[2])

knitr::kable(head(CasesRank, 10))
countriesAndTerritories TotalCases
United_States_of_America 1920061
Brazil 645771
Russia 458689
United_Kingdom 284868
India 246628
Spain 241310
Italy 234801
Peru 191758
Germany 183979
Iran 169425

The above table reflects the ranked 10 worst countries in COVID-19 diagnosed cases. Portugal is ranked 30, with 34351 total cases.

DeathsRank <- data %>% 
        group_by(countriesAndTerritories) %>%
        summarise("TotalDeaths" = sum(deaths)) %>%
        arrange(desc(TotalDeaths))

PTrankDeaths <- grep("Portugal", DeathsRank$countriesAndTerritories)

numberofDeaths <- DeathsRank %>% filter(countriesAndTerritories == "Portugal")
numberofDeaths <- as.integer(numberofDeaths[2])

knitr::kable(head(DeathsRank, 10))
countriesAndTerritories TotalDeaths
United_States_of_America 109802
United_Kingdom 40465
Brazil 35026
Italy 33846
France 29142
Spain 27135
Mexico 13511
Belgium 9580
Germany 8668
Iran 8209

The above table reflects the ranked 10 worst countries in COVID-19 deaths. Portugal is ranked 25, with 1474 total deaths.

Let us now check the same rates in percentage.

CasesPerRank <- data %>% 
        group_by(countriesAndTerritories) %>%
        summarise("TotalCases" = sum(cases), "Pop" = unique(popData2018)) %>%
        mutate("PercentageCases" = TotalCases/Pop * 100) %>%
        arrange(desc(PercentageCases))

DeathsPerRank <- data %>% 
        group_by(countriesAndTerritories) %>%
        summarise("TotalDeaths" = sum(deaths), "Pop" = unique(popData2018)) %>%
        mutate("PercentageDeaths" = TotalDeaths/Pop * 100) %>%
        arrange(desc(PercentageDeaths))


PTPerRankCases <- grep("Portugal", CasesPerRank$countriesAndTerritories)
PTPerRankDeaths <- grep("Portugal", CasesPerRank$countriesAndTerritories)

knitr::kable(head(CasesPerRank, 10))
countriesAndTerritories TotalCases Pop PercentageCases
Cases_on_an_international_conveyance_Japan 696 3000 23.2000000
Qatar 67195 2781677 2.4156291
San_Marino 695 33785 2.0571259
Holy_See 12 1000 1.2000000
Andorra 852 77006 1.1064073
Bahrain 14383 1569439 0.9164421
Kuwait 31131 4137309 0.7524456
Chile 127745 18729160 0.6820648
Singapore 37527 5638676 0.6655286
Luxembourg 4035 607728 0.6639483
knitr::kable(head(DeathsPerRank, 10))
countriesAndTerritories TotalDeaths Pop PercentageDeaths
Cases_on_an_international_conveyance_Japan 7 3000 0.2333333
San_Marino 42 33785 0.1243155
Belgium 9580 11422068 0.0838727
Andorra 51 77006 0.0662286
United_Kingdom 40465 66488991 0.0608597
Spain 27135 46723749 0.0580754
Italy 33846 60431283 0.0560074
Sweden 4656 10183175 0.0457225
France 29142 66987244 0.0435038
Sint_Maarten 15 41486 0.0361568

There are evident differences from percentage to absolute numbers in the extremes (but in fact Portugal doesn’t change much at the time of this report writing). In percentage Portugal is in 31 place for cases and 31 for deaths. Is this a fair comparison? There may be missing variables to understand our data: some index of number of urban centers per country for example, and also the predominance of respiratory diseases, atmospheric pollution and elderly people percentage. Also, the number of tests each country do highly influence the reported cases. All this summed together can cause a great impact at the political level for managing the emergency states of each countries and interpreting the results. With too many unknown factors the effectiveness of some policies may result only by chance.

Since there are just too many unknown factors in the percentage rate maybe the fairest way to compare is the absolute numbers (aware of not being ideal as well).

Evolution of the Disease in Portugal and Simulations

In the next lines I try to simulate new cases and deaths from COVID-19 in Portugal with a 15 day advance from the last data reported and to identify a peak, where we could consider a turning point for the pandemic in Portugal, meaning that the contingency politics are taking effect.

PTdata <- filter(data, geoId == "PT")
PTdataArranged <- arrange(PTdata, dateRep)

library(ggpmisc)

x <- 1:length(PTdataArranged$dateRep)
y <- log10(PTdataArranged$cases)
y <- gsub("[-InfNaN]", 0, y)

xsq <- x^2
xcub <- x^3

fit <- lm(y~x+xsq+xcub)

xv <- seq(min(x), 100, 1)
yv <- predict(fit, list(x = xv, xsq = xv^2, xcub = xv^3))

#Prediction <- tibble(Day = xv, logCases = yv)
PredictionCases <- tibble(Day = as.Date("2020-03-02")+xv, 
                          SimCases = as.integer(10^yv), 
                          RealCases = c(PTdataArranged$cases, rep(NA,
                                                        100-length(PTdataArranged$cases))))
PredictionCases$Day <- as.POSIXct(PredictionCases$Day)

CasesMaxDay <- as.Date(PredictionCases$Day[
        grep(max(PredictionCases$SimCases[1:length(PTdataArranged$cases)]), 
             PredictionCases$SimCases)])

curvePredict <- ggplot(PredictionCases, aes(Day, SimCases))
PTgSimCasesNEW <- curvePredict + geom_line() +
                geom_point(aes(Day, RealCases), col = "springgreen4", pch = 15) +
                geom_line(aes(Day, RealCases), col = "springgreen4") +
                labs(y = "Cases per Day", x = "Date", 
                     title = "Portugal new Cases Simulation") +
                stat_peaks(col = "tomato3", ignore_threshold = .9) +
                theme(plot.title = element_text(hjust = 0.5)) +
                coord_cartesian(xlim = c(PredictionCases$Day[1], as.POSIXct(Sys.Date() + 15)),
                                ylim = c(0, 2000))
PTgSimCasesNEW

z <- log10(PTdataArranged$deaths)
z <- sub("-Inf", "0", z)

fitD <- lm(z~x+xsq+xcub)

xv <- seq(min(x), 100, 1)
zv <- predict(fitD, list(x = xv, xsq = xv^2, xcub = xv^3))

PredictionDeaths <- tibble(Day = as.Date("2020-03-02")+xv, 
                          SimDeaths = as.integer(10^zv), 
                          RealDeaths = c(PTdataArranged$deaths, rep(NA,
                                                        100-length(PTdataArranged$deaths))))

PredictionDeaths$Day <- as.POSIXct(PredictionDeaths$Day)

DeathsMaxDay <- as.Date(PredictionDeaths$Day[
        grep(max(PredictionDeaths$SimDeaths[1:length(PTdataArranged$deaths)]),
             PredictionDeaths$SimDeaths)])

curvePredictDeaths <- ggplot(PredictionDeaths, aes(Day, SimDeaths))
PTgSimDeaths <- curvePredictDeaths + geom_line() +
                geom_point(aes(Day, RealDeaths), col = "springgreen4", pch = 15) +
                geom_line(aes(Day, RealDeaths), col = "springgreen4") +
                labs(y = "Deaths per Day", x = "Date", 
                     title = "Portugal Deaths Simulation") +
                stat_peaks(col = "tomato3", ignore_threshold = .7) +
                theme(plot.title = element_text(hjust = 0.5)) +
                coord_cartesian(xlim = c(PredictionDeaths$Day[1], as.POSIXct(Sys.Date() + 15)))
PTgSimDeaths

As we can observe from the graphs, the worst seems to have passed. The peak for new cases detected was in 2020-06-07 and the peak for deaths in Portugal was in 2020-04-18, 2020-04-19.

Conclusions

  • Cases: Portugal is ranked 30 in 207 countries, with 34351 total cases. Portugal had the peak of cases in 2020-06-07.
  • Deaths: Portugal is ranked 25 in 207 countries, with 1474 total deaths. Portugal had the peak of deaths in 2020-04-18, 2020-04-19.

The fact that the “peaks” have passed for the cases and deaths in Portugal doesn’t mean that the problem is over. For example, the peaks are always moving around because every day the simulations are iterated with the new information available, and new policies and the end of the confinement may produce more peaks for new cases and deaths.
From comparing our country with the others, we are quite better than most countries. This is due to several factors, not only policies but also with our peripheric geo-location. Also, it must be observed the predominance of the disease in the north hemisphere of the planet - so our rank my improve greatly for the next months (our rank in the end of April is about 18th to 19th place), if there is some influence with the flu-season.

Final Considerations

There is a lack of trustworthy data sources that would complement this data and give us detailed information about the how’s and why’s of SARS-CoV-2 behavior. Interesting variables that I would consider worth studying would be: Active Cases, Recovered Cases, ICU Cases, Non-ICU Hospitalized Cases, and also some information about patients, like: Known Diseases, a logical or more informative IfSmoker variable and Air Pollution Exposure.

Some of this information is retained by the “Sistema nacional de vigilância epidemiológica” SINAVE

I’ve applied for access in 2020-04-27 from an institutional e-mail but not a reply until today.