MSDSpipeline

Automatically download, join, and clean the NHS Digital Maternity Services Monthly Statistics data (MSMS), which is derived from the Maternity Services Data Set (MSDS). When NHS Digital releases new data, you can easily download it and join it with the data you have already downloaded.

Project Description

  • Each month, NHS Digital releases Maternity Services Monthly Statistics, which are derived from the Maternity Services Data Set. Multiple CSV and XLSX files are released each month, addressing different parts of the available data.

  • Working with the raw data in this form is time-consuming, and involves downloading the raw files, handling file naming inconsistencies, and joining data from multiple months together to form a clean time-series dataset.

  • This package enables an automated data pipeline by:

    1. Navigating to every monthly "publications" URL on the main MSMS URL, and scraping a list of all .csv and .xlsx files from each monthly page (for example, 296 files as at Sept 2021).
    2. Downloading the .csv and .xlsx data to a local folder (780MB+ of data as at Nov 2021), and sorting the files into folders according to type (e.g. data, measures, CQIM, dq, meta, pa, rdt, qual, and "miscellaneous" files).
    3. Joining monthly files of the same type together, including cleaning and consolidating columns where formats have changed over the 6+ years that the datasets have been released.
    4. Implementing an example plotting function that quickly demonstrates the volume of data available.
  • Potential future work:

    1. Implement getter functions to get details of the available measures contained in each dataframe.
    2. Implement a shiny dashboard to give a basic window into the data. Several dashboards using the same source data are already available. One is the NHS Digital Maternity Services Dashboard.
    3. Generalise to enable downloading of other NHS Digital statistical datasets.
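
To make the first two pipeline steps above more concrete, here is a minimal base R sketch of extracting .csv and .xlsx links from a page and choosing a subfolder for each file. It is an illustration under stated assumptions, not the package's actual implementation: the HTML snippet, the example.org URLs, the file names, and the helper names `extract_data_links()` and `classify_file()` are all hypothetical.

```r
# Illustrative sketch only: the HTML, URLs, file names, and helper names
# below are hypothetical and are not taken from MSDSpipeline itself.

# Step 1: pull .csv and .xlsx links out of a page's HTML.
extract_data_links <- function(html) {
  # Capture every href attribute, then strip the surrounding href="...".
  hrefs <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  urls <- sub('^href="', "", sub('"$', "", hrefs))
  # Keep only links that end in .csv or .xlsx.
  urls[grepl("\\.(csv|xlsx)$", urls, ignore.case = TRUE)]
}

# Step 2: choose a destination subfolder from keywords in the file name.
classify_file <- function(url) {
  name <- tolower(basename(url))
  # Order matters: "meta" is checked before "data" so that a file
  # called "metadata" is not filed under "data".
  patterns <- c(
    measures = "measures",
    CQIM = "cqim",
    dq = "dq",
    meta = "meta",
    data = "data"
  )
  for (type in names(patterns)) {
    if (grepl(patterns[[type]], name, fixed = TRUE)) return(type)
  }
  "miscellaneous"
}

# A made-up page containing four data files and one PDF.
example_html <- paste0(
  '<a href="https://example.org/msms-jan2021-exp-data.csv">Data</a>',
  '<a href="https://example.org/msms-jan2021-exp-measures.csv">Measures</a>',
  '<a href="https://example.org/msms-jan2021-exp-dq.csv">DQ</a>',
  '<a href="https://example.org/msms-jan2021-metadata.xlsx">Meta</a>',
  '<a href="https://example.org/msms-jan2021-summary.pdf">Summary</a>'
)

links <- extract_data_links(example_html)
file_types <- vapply(links, classify_file, character(1), USE.NAMES = FALSE)

# Where each file would be saved (nothing is downloaded in this sketch).
destinations <- file.path("data/msds_download", file_types, basename(links))
print(destinations)
```

Matching on keywords in the file name keeps the sorting logic in one place, which is helpful when naming conventions drift between monthly releases. A real scraper would more likely use an HTML parser than a regular expression.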

How to use

You can install from GitHub using the {remotes} package with:

# install.packages("remotes")
remotes::install_github("https://github.com/ThomUK/MSDSpipeline")

# Load the package
library(MSDSpipeline)

Using the package is a two-stage process. First, the data must be downloaded locally. Next, each of the 3 groups of data contained in MSDS (measures, data, and dq) must be joined together and tidied. Once tidied, the resulting dataframes are ready for use in your analysis.

  1. Download the data.
# Download the data to your local machine, or a destination of your choice.
# This will begin downloading 780MB+ and 300+ files to your machine.
# Files are also sorted into subfolders, according to the information contained in each file.
# The download can be cancelled in RStudio by clicking the red button in the console window.

msds_download_data(destination = "data/msds_download")
  2. Tidy the data you need.
# Tidy the data you need.  This will combine and tidy data, including consolidating column names,
# fixing date formats, and altering data and unit columns in a consistent way.
measures_data <- msds_tidy_measures()
exp_data <- msds_tidy_data()
dq_data <- msds_tidy_dq()
  3. Do your analysis. Some demo plotting functions are included below to illustrate the available data.
# Measure
plot_demo_measure(measures_data, "CQIMPreterm", "RX1")

# Exp-data
plot_demo_data(exp_data, "TotalBabies", "RX1")

# DQ
plot_demo_dq(dq_data, "RX1")
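
The tidy step above is described as consolidating column names and fixing date formats across monthly files. The base R sketch below shows that general idea on two tiny made-up monthly extracts. It is not the package's internal code: the column names, values, and the helpers `parse_mixed_dates()` and `tidy_month()` are hypothetical and chosen only for illustration.

```r
# Illustrative sketch only: the column names, values, and helper names
# below are hypothetical and are not taken from the real MSMS files.

# Two "monthly" extracts whose column names and date formats differ.
jan <- data.frame(
  Org_Code = c("RX1", "RA2"),
  ReportingPeriod = c("01/01/2021", "01/01/2021"),
  Value = c(10, 20),
  stringsAsFactors = FALSE
)
feb <- data.frame(
  org_code = c("RX1", "RA2"),
  period = c("2021-02-01", "2021-02-01"),
  value = c(12, 18),
  stringsAsFactors = FALSE
)

# Map every known variant of a column name onto one canonical name.
canonical_names <- c(
  Org_Code = "org_code", org_code = "org_code",
  ReportingPeriod = "period", period = "period",
  Value = "value", value = "value"
)

parse_mixed_dates <- function(x) {
  # Try ISO format first, then fall back to day/month/year.
  out <- as.Date(x, format = "%Y-%m-%d")
  missing <- is.na(out)
  out[missing] <- as.Date(x[missing], format = "%d/%m/%Y")
  out
}

tidy_month <- function(df) {
  # Rename columns to their canonical names, then standardise the dates.
  names(df) <- unname(canonical_names[names(df)])
  df$period <- parse_mixed_dates(df$period)
  df[, c("org_code", "period", "value")]
}

# Stack the tidied months into one time-series dataframe.
combined <- do.call(rbind, lapply(list(jan, feb), tidy_month))
print(combined)
```

An explicit lookup of known column-name variants makes each format change visible in one place, at the cost of needing an update whenever a new variant appears in a release.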

Example plot outputs

Measure

[Image: example measure plot]

Data

[Image: example data plot]

DQ

[Image: example DQ plot]

Contributing

I am always interested to hear from others working with maternity data. If you spot a problem, please raise an issue, or make a PR.

Similar work by others

Similar NHS source data relating to Maternity

This source data could be collated by a project similar to this one, but no such project currently exists.