This project displays a set of visualizations built with the ggplot2 package. It uses different geoms and creates small multiple plots with facet_wrap, splitting the initial data by other variables. RStudio is integrated with GitHub so that all changes are committed from my IDE.
It aims to present different ways of representing data, using NHS data from England to show several ways of presenting time series data in R.
The data has been downloaded from https://www.england.nhs.uk/statistics and these statistics are publicly available.
The original files downloaded from the above website are Excel files (.xlsx). I loaded the following packages with the pacman package manager, pacman::p_load(readxl,here,dplyr,janitor), to conduct the initial data pre-processing and arrange the data so that it is easy to use when creating the plots.
https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/
The Weekly and Monthly A&E Attendances and Emergency Admissions collection collects the total number of attendances in the specified period for all A&E types, including Minor Injury Units and Walk-in Centres, and of these, the number discharged, admitted or transferred within four hours of arrival.
We can download the Unadjusted: Monthly A&E Time Series April 2019 (XLS 364K) file:
- Open your web browser.
- Go to the Applications menu in the top right corner.
- Select More Tools > Page Source.
- The HTML code will be displayed. Press CTRL + F to open the find option and search for the words “Time series” within the page source.
This allows us to locate the .xls file for the Unadjusted Time Series data and to download it from R.
Unadjusted: Monthly A&E Time series April 2019 (XLS, 364K)
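The manual search above could also be scripted. This is only a minimal sketch, assuming the page source has already been read into a character vector (for example with readLines()). The HTML fragment and the example.org addresses are made up for illustration and are not the real NHS file locations:

```r
# Hypothetical page source. In practice this might come from readLines()
# called on the A&E waiting times page; the links below are placeholders.
page_source <- c(
  '<a href="https://example.org/files/Monthly-AE-Time-Series.xls">Unadjusted: Monthly A&E Time series</a>',
  '<a href="https://example.org/files/notes.pdf">Guidance notes</a>'
)

# Keep only the lines mentioning "Time series", mirroring the CTRL + F search
ts_lines <- grep("Time series", page_source, ignore.case = TRUE, value = TRUE)

# Extract the href values that end in .xls or .xlsx
href_matches <- regmatches(ts_lines, regexpr('href="[^"]+\\.xlsx?"', ts_lines))
xls_urls <- gsub('^href="|"$', "", href_matches)
xls_urls

# Download step, commented out because the URL above is a placeholder:
# download.file(xls_urls[1], destfile = "AE_time_series.xls", mode = "wb")
```

Using mode = "wb" in download.file() matters on Windows, where a binary Excel file can otherwise be corrupted.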
This section contains information on Consultant-led Referral To Treatment (RTT) waiting times, which monitor the length of time from referral through to elective treatment.
Monthly RTT waiting times data has been published since March 2007. Initially data was only published for patients whose RTT pathways ended in admission for treatment (admitted pathways). Non-admitted pathway data (patients whose RTT pathways ended for reasons other than admission for treatment) has been published since August 2007.
https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/
This project is being built in RStudio, which commits all script changes directly to the ggplot2-visualizations GitHub repo.
Some of the plots produced include themes. The plots folder contains a collection of all plot iterations made with ggplot2.
The gridExtra package, loaded with library(gridExtra), arranges the plots below: grid.arrange(TypeI_theme_bw,TypeI_theme_light,TypeI_theme_classic,TypeI_theme_dark,ncol=4)
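As a rough, self-contained sketch of that arrangement: the data frame below holds made-up values, and only the four plot object names and the grid.arrange() call come from this project.

```r
library(ggplot2)
library(gridExtra)

# Made-up monthly values, for illustration only
ae <- data.frame(
  period = seq(as.Date("2018-01-01"), by = "month", length.out = 12),
  type1  = c(120, 115, 125, 122, 130, 128, 133, 127, 124, 129, 126, 135)
)

base_plot <- ggplot(ae, aes(period, type1)) + geom_line()

# Same plot with four built-in themes
TypeI_theme_bw      <- base_plot + theme_bw()      + ggtitle("theme_bw")
TypeI_theme_light   <- base_plot + theme_light()   + ggtitle("theme_light")
TypeI_theme_classic <- base_plot + theme_classic() + ggtitle("theme_classic")
TypeI_theme_dark    <- base_plot + theme_dark()    + ggtitle("theme_dark")

# grid.arrange() draws the layout and invisibly returns it as a gtable
themes_grid <- grid.arrange(TypeI_theme_bw, TypeI_theme_light,
                            TypeI_theme_classic, TypeI_theme_dark, ncol = 4)
```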
Example of a smooth line added to the AE Type I Attendances plot, with tailored y axis labels.
The geom_smooth layer is tailored with these parameters: se = TRUE or FALSE controls whether the standard error band is displayed, and span = 0.1 controls the "wiggliness" of the default loess smoother.
geom_smooth(span = 0.1,se = TRUE, size = 0.8)
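A minimal sketch of this layer on simulated data. The series is random, method = "loess" is spelled out rather than left as the default, the label formatter is just one possible way to tailor the y axis, and linewidth is used in place of size (this assumes ggplot2 3.4.0 or later):

```r
library(ggplot2)

# Simulated monthly series (not NHS data), 2011 to 2019
set.seed(123)
ae <- data.frame(period = seq(as.Date("2011-01-01"), by = "month", length.out = 108))
ae$type1 <- 1200000 + 50000 * sin(seq_along(ae$period) / 6) + rnorm(108, sd = 20000)

p <- ggplot(ae, aes(period, type1)) +
  geom_line(colour = "grey50") +
  # A small span follows the data closely; a larger span gives a smoother curve
  geom_smooth(method = "loess", formula = y ~ x, span = 0.1, se = TRUE, linewidth = 0.8) +
  # Tailored y axis labels with a thousands separator
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))
p
```

Trying span = 0.5 or span = 0.9 on the same plot is a quick way to see how the smoother changes.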
This plot is a small example of how to use facet_wrap to display specific plots by any categorical variable. In this instance I wanted to show AE Type I attendances in England by month and year.
It is important to remember to turn your months variable into a Factor so that the month labels are displayed chronologically in the plot. Otherwise they are sorted alphabetically.
Att_Full_year_f <- Att_Full_year %>% mutate(Monthf = factor(Month, levels = month.abb))
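To see why the factor matters, here is a small sketch with a made-up Att_Full_year data frame. The column names Year, Month and Type1 are assumptions for illustration:

```r
library(dplyr)
library(ggplot2)

# Made-up data: two years of monthly values
Att_Full_year <- data.frame(
  Year  = rep(2017:2018, each = 12),
  Month = rep(month.abb, times = 2),
  Type1 = 100 + seq_len(24)
)

# As plain text, months sort alphabetically (Apr, Aug, Dec, ...)
sort(unique(Att_Full_year$Month))

# As a factor with explicit levels, they keep calendar order
Att_Full_year_f <- Att_Full_year %>%
  mutate(Monthf = factor(Month, levels = month.abb))
levels(Att_Full_year_f$Monthf)

p <- ggplot(Att_Full_year_f, aes(Monthf, Type1, group = Year)) +
  geom_line() +
  facet_wrap(~Year)
p
```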
b) AE Attendances by metric (Type I Major Attendances, Type 2 Single esp Attendances, Type 3 Other Attendances)
This time we first re-shape our data into long format and then use facet_wrap with the newly created Metrics column.
We can also use the long-format data to map Metrics to colour: ggplot(aes(x = period, y = value,group = Metrics, colour = Metrics)). This allows us to plot a single line for each AE Attendances type (I, II, III and Total) on the same figure.
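A sketch of both approaches on a small made-up wide table. The column names Type1, Type2 and Type3 are placeholders, and pivot_longer() from tidyr is one way to do the re-shaping (the project scripts may use a different function):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up wide data: one column per attendance type
ae_wide <- data.frame(
  period = seq(as.Date("2019-01-01"), by = "month", length.out = 6),
  Type1  = c(50, 52, 51, 55, 54, 57),
  Type2  = c(10, 11, 10, 12, 11, 12),
  Type3  = c(20, 22, 21, 23, 24, 25)
)

# Long format: one row per period and metric
ae_long <- ae_wide %>%
  pivot_longer(cols = c(Type1, Type2, Type3),
               names_to = "Metrics", values_to = "value")

# One panel per metric
facet_plot <- ae_long %>%
  ggplot(aes(x = period, y = value)) +
  geom_line() +
  facet_wrap(~Metrics, scales = "free_y")

# One coloured line per metric on the same figure
colour_plot <- ae_long %>%
  ggplot(aes(x = period, y = value, group = Metrics, colour = Metrics)) +
  geom_line()
```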
We can also include a regression line in each of the individual faceted plots, as in "09 AE Attendances_by_year_geom_smooth.R", by using facet_wrap(~Year), group = Year and, in particular, adding this new layer to the ggplot: geom_smooth(se = TRUE, colour = "darkorchid1"), as shown below:
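A minimal sketch of that layering on simulated data. Note that method = "lm" is my assumption here, to draw a straight regression line. The call quoted above leaves the method at its default, which is a loess smoother for a series of this size:

```r
library(ggplot2)

# Simulated attendances by month and year (not NHS data)
set.seed(2024)
ae <- expand.grid(Month = 1:12, Year = 2016:2019)
ae$Attendances <- 1000 + 10 * ae$Month + rnorm(nrow(ae), sd = 30)

p <- ggplot(ae, aes(Month, Attendances, group = Year)) +
  geom_line() +
  # One fitted line with its standard error band per facet
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, colour = "darkorchid1") +
  facet_wrap(~Year)
p
```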
We can also combine two continuous measures, like AE Type 1 Major AE and Type 2 single specialty, into a scatterplot and add two density plots to it, one for each of the X and Y axes, describing the shape of the distribution of each metric. The side density geoms come from the ggside package. See the "11 Density plot Major Single AE Attendances.R" script for details.
Desity_plot02 <- Att_months %>% ggplot(aes(Major_att, Single_esp_att, color = Month)) + ggtitle("AE Major attendances (Type 2: Major A_E vs Single_specialty departments. 2011-2019") + geom_point(size = 2, alpha = 0.3) +
- Adding density plot for X axis (AE Type_2 major_a_e): geom_xsidedensity(aes(y = after_stat(density), fill = Month), alpha = 0.5, size = 1, position = "stack") +
- Adding density plot for Y axis (AE Type_2 single_specialty): geom_ysidedensity(aes(x = after_stat(density), fill = Month), alpha = 0.5, size = 1, position = "stack")
Desity_plot02
The plot below is the result of this combination:
A Raincloud chart allows us to combine different visualizations to explore the shape of a metric's distribution using the ggdist package. In this particular example, I plot the number of Major A&E Attendances by year for the 2010-2013 period, using stat_halfeye() and stat_dots() from the ggdist package together with geom_boxplot() from ggplot2.
The ggdist package provides many other functions to select the best geom for visualizing frequency distributions in our plots. See script "12 Raincloud chart AE Attendances.R" in this project for further details about the Raincloud chart below:
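The three layers can be sketched as follows on simulated attendances. The values are random and the width and justification settings are illustrative choices, not necessarily those used in the project script:

```r
library(ggplot2)
library(ggdist)

# Simulated monthly attendances for 2010-2013 (not NHS data)
set.seed(1)
ae <- data.frame(
  year        = factor(rep(2010:2013, each = 12)),
  attendances = round(rnorm(48, mean = 1100000, sd = 60000))
)

p <- ggplot(ae, aes(x = year, y = attendances)) +
  # The "cloud": half density, with the default interval and point hidden
  stat_halfeye(adjust = 0.5, width = 0.6, justification = -0.2,
               .width = 0, point_colour = NA) +
  # A narrow boxplot for the summary statistics
  geom_boxplot(width = 0.12, outlier.shape = NA) +
  # The "rain": one dot per observation on the opposite side
  stat_dots(side = "left", justification = 1.1) +
  coord_flip()
p
```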
A standard area plot can be quickly transformed into a Spaghetti plot.
This is an example of how to create a Spaghetti plot to highlight one series within a set of several time series indicators.
See script "13 Spaghetti plot OECD CPI 1974_2022.R" in this project for further details.
We can also include the latest value for each country, highlighted as a purple dot. See script "13 01 Spaghetti plot OECD CPI 1974_2022.R" for details.
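A minimal sketch of the idea, using random values and placeholder country codes rather than the OECD CPI data. All series are drawn in grey, one is drawn again in colour, and the latest value of each series gets a purple dot:

```r
library(ggplot2)

# Random series for four placeholder countries (not OECD data)
set.seed(42)
cpi <- expand.grid(time = 1974:2022,
                   country = c("AAA", "BBB", "CCC", "DDD"),
                   stringsAsFactors = FALSE)
cpi$value <- round(runif(nrow(cpi), 0, 15), 1)

highlight <- "BBB"                          # series to emphasise
focus <- subset(cpi, country == highlight)

# Latest observation for each country
latest <- do.call(rbind, lapply(split(cpi, cpi$country),
                                function(d) d[which.max(d$time), ]))

p <- ggplot(cpi, aes(time, value, group = country)) +
  geom_line(colour = "grey80") +                            # background "spaghetti"
  geom_line(data = focus, colour = "steelblue", linewidth = 1) +
  geom_point(data = latest, colour = "purple", size = 2)
p
```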
Furthermore, we can use these charts to identify the Min, Max and latest values in any time series, using facets to split the plots, in this instance by country. See script "14 Sprkline OECD CPI.R" in this project for specific details on these sparkline charts.
The building blocks of this chart are an ad hoc set of calculations displayed as dots on the main line chart. They act as visual references, and the approach can be extended to compute the five-number summary (Min, Q1, Q2 (median), Q3, Max) or any other ad hoc statistic, such as a measure of central tendency.
- Set of calculated values to be displayed in the line as dot geoms using ggplot2
minv <- group_by(OECD_subset, country) %>% slice(which.min(value)) (red dot)
maxv <- group_by(OECD_subset, country) %>% slice(which.max(value)) (blue dot)
endv <- group_by(OECD_subset, country) %>% filter(time == max(time)) (purple dot)
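Put together, the three helper tables can be layered on top of the line like this. OECD_subset here is a small random stand-in with placeholder country codes, not the real extract:

```r
library(dplyr)
library(ggplot2)

# Random stand-in for OECD_subset
set.seed(7)
OECD_subset <- data.frame(
  country = rep(c("AAA", "BBB"), each = 10),
  time    = rep(2013:2022, times = 2),
  value   = round(runif(20, 0, 10), 1)
)

minv <- group_by(OECD_subset, country) %>% slice(which.min(value))    # red dot
maxv <- group_by(OECD_subset, country) %>% slice(which.max(value))    # blue dot
endv <- group_by(OECD_subset, country) %>% filter(time == max(time))  # purple dot

p <- ggplot(OECD_subset, aes(time, value)) +
  geom_line(colour = "grey40") +
  geom_point(data = minv, colour = "red") +
  geom_point(data = maxv, colour = "blue") +
  geom_point(data = endv, colour = "purple") +
  facet_wrap(~country, ncol = 1)
p
```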
How to use the camcorder package to record an animated GIF from ggplot2 charts: https://github.com/thebioengineer/camcorder
This project includes an example of how to use camcorder to create a GIF from a ggplot chart, which is useful for exploring the design process of any chart in R.
The GIF can be included in any presentation where it might be useful to teach how to design ggplots in R. See details in this folder of the repo: https://github.com/Pablo-source/ggplot2-visualizations/tree/main/camcoder
The showtext package allows us to use different Google fonts in our charts. It can be useful when producing more elaborate charts in R, like maps, where we want to obtain a specific aesthetic effect using fonts. See: https://fonts.google.com/. Some examples of this can be seen in the "A Using Google fonts in plots.R" script.
See this script for details on how to use Google fonts: https://github.com/Pablo-source/ggplot2-visualizations/blob/main/A%20Using%20Google%20fonts%20in%20plots.R
This is an example of how to add annotations and reference lines to ggplot2 charts. In this example I plot Bank of England Official Bank Rates against the three lockdowns that we had in the UK during the COVID-19 pandemic. For details on the script used to produce this chart, please see the "16 BoE Interest rates from chart.R" script.
We can use the geom_curve() function to draw arrows pointing to the data points we want to highlight in our chart. The plot below is built from the "17 Annotations mtcars data set.R" script in this project. I have also used the geom_text_repel() function from the ggrepel package to avoid overlapping labels in the plot.
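A short sketch of both functions on mtcars. The choice of the car with the highest mpg as the highlighted point, the arrow offsets and the label text are my own illustrative picks and may differ from the script:

```r
library(ggplot2)
library(ggrepel)

cars <- mtcars
cars$model <- rownames(mtcars)

# Point to highlight: the car with the highest miles per gallon
target <- cars[which.max(cars$mpg), ]

p <- ggplot(cars, aes(wt, mpg)) +
  geom_point() +
  # Labels are nudged apart so they do not overlap
  geom_text_repel(aes(label = model), size = 3, max.overlaps = 20) +
  # Curved arrow ending just to the right of the highlighted point
  geom_curve(data = target,
             aes(x = wt + 1, y = mpg, xend = wt + 0.05, yend = mpg),
             curvature = 0.3, colour = "red",
             arrow = arrow(length = unit(0.02, "npc"))) +
  annotate("text", x = target$wt + 1.05, y = target$mpg,
           label = "Highest mpg", hjust = 0, colour = "red")
p
```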
The chart below is an example of how to apply Tufte's design principles to improve graph readability. He claimed that good graphical representations maximize data-ink and erase as much non-data-ink as possible. This is good design practice when creating ggplot2 charts in R. One key concept he developed was the data-ink ratio, which is calculated as 1 minus the proportion of the graph that can be erased without loss of data-information.
The five design principles he created can be an excellent guide to creating better charts in R:
- Above all else show data.
- Maximize the data-ink ratio.
- Erase non-data-ink.
- Erase redundant data-ink.
- Revise and edit
On top of these design principles, I have improved the previous BoE Interest rates chart design by applying this set of changes:
- Adding several parameters to the theme() function:
- Remove the default grey background of the chart area so that it appears white: panel.background = element_rect(fill = NA),
- Remove the X axis grid lines, as the Y axis grid lines provide sufficient guidance to read the (bank interest rate) value: panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(),
- Keep just the major Y axis grid lines, in black to match the axis and title font colour: panel.grid.major.y = element_line(colour = "black")
- Adding custom titles to the X and Y axes using the labs() function:
- labs(title = "BoE Interest rates reach 4.5% in May 2023",
- subtitle ="Twelfth interest rate increase since Jan 2022",
- y = "Interest rate %",
- x = "Year")
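The theme() and labs() settings listed above can be combined as in this sketch. The rates data frame contains placeholder values for illustration only, not the published Bank Rate series:

```r
library(ggplot2)

# Placeholder values for illustration, not the real Bank Rate series
rates <- data.frame(
  date = as.Date(c("2022-01-01", "2022-05-01", "2022-09-01",
                   "2023-01-01", "2023-05-01")),
  rate = c(0.5, 1, 2, 3, 4.5)
)

p <- ggplot(rates, aes(date, rate)) +
  geom_line() +
  labs(title = "BoE Interest rates reach 4.5% in May 2023",
       subtitle = "Twelfth interest rate increase since Jan 2022",
       y = "Interest rate %",
       x = "Year") +
  theme(
    panel.background   = element_rect(fill = NA),         # no grey panel
    panel.grid.major.x = element_blank(),                 # no vertical grid lines
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(colour = "black")   # keep major horizontal lines
  )
p
```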
See script "18 Tufte style charts in R.R" for details on the above changes applied to this chart:
The final ggplot2 chart output can be found here on this ggplot2 visualizations project: plots/25 Tufte style chart.png
Using the gridExtra package we can arrange several charts in one image, choosing the layout of the charts in rows and columns. In this instance I combine the three inflation measures (CPI, CPIH, OOH) into a single image made of three charts arranged in three columns and one row. The data comes from the ONS Consumer price inflation latest release: https://www.ons.gov.uk/economy/inflationandpriceindices/bulletins/consumerpriceinflation/april2023
See script "19 GridExtra combine charts.R" for details. For how to use the grid.arrange() function, see: https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html
Combining measures from different data sources, Inflation (ONS) and Interest rates (BoE), in a single chart:
This is an elegant way to create customizable maps by combining the {ggplot2} and {sf} packages. After performing a left join between an indicator, “Percent of new cases of cancer”, and an NHS CCG England shapefile, we can display a continuous variable in a choropleth map. In this example the map shows cancer cases for the years 2013 and 2019 by CCG in England.
Click on the image below and a map will be displayed in a new tab, with its true final colour palette.
See script “23 CCG OIS Indicators maps facet_wrap.R” for further details.
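The join-then-plot pattern can be sketched without the real shapefile. Here two unit squares stand in for CCG boundaries, and the ccg_code, year and pct_new_cases columns are hypothetical names with made-up values:

```r
library(sf)
library(dplyr)
library(ggplot2)

# Helper building a unit square whose lower-left corner is at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Stand-in for the CCG shapefile: two polygons with a code column
shapes <- st_sf(ccg_code = c("A", "B"),
                geometry = st_sfc(unit_square(0), unit_square(1)))

# Made-up indicator values for two years
indicator <- data.frame(
  ccg_code      = c("A", "B", "A", "B"),
  year          = c(2013, 2013, 2019, 2019),
  pct_new_cases = c(40, 55, 45, 60)
)

# Left join keeps every polygon and attaches its indicator rows
map_data <- left_join(shapes, indicator, by = "ccg_code")

p <- ggplot(map_data) +
  geom_sf(aes(fill = pct_new_cases)) +
  facet_wrap(~year)
p
```

Keeping the sf object on the left side of the join means the result is still an sf object that geom_sf() can draw.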
While trying to improve the look and feel of charts using Tufte's design principles, I found the example below on Twitter (X), which makes clever use of HTML tags to apply colours in a plot subtitle. See script "25 Exclude legelnds using sbutitles.R" in this ggplot2 visualizations project for script details.
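One way to get this effect is sketched below. It assumes the ggtext package and its element_markdown() theme element, which may or may not be what the script uses. The data values and colours are made up:

```r
library(ggplot2)
library(ggtext)  # assumption: ggtext renders the HTML tags in the subtitle

# Made-up values for two measures
df <- data.frame(
  year    = rep(2019:2023, times = 2),
  measure = rep(c("CPI", "CPIH"), each = 5),
  value   = c(1.8, 0.9, 2.6, 9.1, 7.3, 1.7, 1.0, 2.5, 7.9, 6.8)
)
cols <- c(CPI = "#1b9e77", CPIH = "#d95f02")

# Subtitle text coloured to match each line, replacing the legend
subtitle_html <- sprintf(
  "<span style='color:%s;'>CPI</span> and <span style='color:%s;'>CPIH</span>",
  cols[["CPI"]], cols[["CPIH"]]
)

p <- ggplot(df, aes(year, value, colour = measure)) +
  geom_line() +
  scale_colour_manual(values = cols) +
  labs(title = "Two measures, no legend", subtitle = subtitle_html) +
  theme(plot.subtitle = element_markdown(),  # interpret the HTML span tags
        legend.position = "none")
p
```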