DATA550 Data Science Toolkit; final project repository

DATA550 Final Project Repository

This repository contains the code and files for the DATA550 final project. The project analyzes the relationship between depression screening tests and polycystic kidney disease (PKD) in an example dataset.

Generating the Final Report

To generate the final report, follow these steps:

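  1. Ensure you have R and RStudio installed on your system, along with make, which the commands below rely on.
  2. Clone this repository to your local machine.
  3. Open RStudio and set the working directory to the root of the cloned repository.
  4. Run the command make final_report.html in the RStudio Terminal tab, or in any terminal opened at the repository root. make is a shell command, so it will not run in the R console.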

The final report will be generated as final_report.html. To clean the output after generating the report, run the command make clean in the terminal.

Building with Docker

To build the Docker image and generate the report using Docker, follow these steps:

  1. Ensure Docker is installed on your system.
  2. Open a terminal and navigate to the root of the cloned repository.
  3. Run the command make build to build the Docker image. This process will install all required R packages and prepare the environment. This step may take a few minutes.
  4. Once the image is built, run the command make run to execute the Docker container. This will generate the report and save it to the report directory on your local machine.

Code Description / Content

Variables

Variable (Type) Description

  • PatientICN: (Integer) Unique numeric identifier for each patient
  • SurveyName: (Character) Depression test; must be one of PHQ9, PHQ-2, or PHQ-2+I9
  • RawScore: (Integer) Score on the depression survey (0-6 for PHQ-2 and PHQ-2+I9, 0-27 for PHQ9)
  • CurrentAge: (Integer) Patient's current age
  • IndexAge: (Integer) Patient's age when they took their first depression test
  • SurveyGivenDateTime: (Date) Date the depression test was taken
  • Gender: (Character) Male or female
  • Race: (Character) Patient race (White, Black or African American, Asian, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, Unknown)
  • Ethnicity: (Character) Patient ethnicity (Hispanic or Latino, Not Hispanic or Latino, Unknown)
  • Alcohol: (Integer) Alcohol abuse (1 if yes, 0 if no)
  • Cancer: (Integer) Cancer (1 if yes, 0 if no)
  • Diabetes: (Integer) Diabetes (1 if yes, 0 if no)
  • Obesity: (Integer) Obesity (1 if yes, 0 if no)
  • NumeGFR: (Integer) Total number of eGFR measurements the patient received
  • Egfr.epi: (Integer) eGFR value at a given time
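
The score ranges listed above can be checked programmatically. The following is a minimal sketch in base R, not code from this repository: check_scores and score_limits are hypothetical names, and the example rows are made up for illustration.

```r
# Hypothetical lookup of the maximum RawScore documented for each SurveyName.
score_limits <- c("PHQ-2" = 6L, "PHQ-2+I9" = 6L, "PHQ9" = 27L)

# Hypothetical helper: returns TRUE for rows whose RawScore lies in the
# documented range for their SurveyName, FALSE otherwise.
check_scores <- function(df) {
  # Unknown survey names yield NA here and are treated as invalid below.
  max_allowed <- score_limits[as.character(df$SurveyName)]
  ok <- !is.na(max_allowed) & df$RawScore >= 0 & df$RawScore <= max_allowed
  unname(ok)
}

# Made-up rows: three valid scores, one PHQ-2 score above 6, one unknown survey.
example <- data.frame(
  SurveyName = c("PHQ9", "PHQ-2", "PHQ-2+I9", "PHQ-2", "PHQ-8"),
  RawScore   = c(27L, 6L, 3L, 7L, 2L)
)
valid <- check_scores(example)
print(valid)
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

A check like this could run right after the data set is created, so that out-of-range scores are caught before the table and figure are built.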

code/00_make_dataset.R

  • generates example data with 7,000 rows and 15 columns. The data is longitudinal with multiple observations per person. The columns created are described above.
  • saves data set as data.rds in output/ folder
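
As a rough illustration of what "longitudinal with multiple observations per person" means here, the sketch below simulates a much smaller data set in base R and saves it with saveRDS. It is not the repository's script: the sizes, seed, probabilities, and the subset of columns are assumptions chosen to keep the example short, and the file is written to a temporary directory rather than output/.

```r
set.seed(550)  # arbitrary seed so the sketch is reproducible

n_patients <- 100L  # illustrative sizes, far smaller than the real 7,000 rows
n_rows     <- 400L

# Patient-level attributes are drawn once per patient...
patients <- data.frame(
  PatientICN = seq_len(n_patients),
  Gender     = sample(c("Male", "Female"), n_patients, replace = TRUE),
  Diabetes   = rbinom(n_patients, 1, 0.3)
)

# ...then each row picks a patient at random, so patients recur across rows.
visits <- data.frame(
  PatientICN = sample(patients$PatientICN, n_rows, replace = TRUE),
  SurveyName = sample(c("PHQ9", "PHQ-2", "PHQ-2+I9"), n_rows, replace = TRUE)
)

# Draw each score within the range documented for its survey type.
max_score <- ifelse(visits$SurveyName == "PHQ9", 27L, 6L)
visits$RawScore <- vapply(max_score, function(m) sample(0:m, 1L), integer(1))

# Attach the patient-level columns to every observation.
dat <- merge(visits, patients, by = "PatientICN")

out_file <- file.path(tempdir(), "data.rds")  # the real script writes to output/
saveRDS(dat, out_file)
```

Drawing patient-level attributes once and merging them onto the visit rows keeps values such as Gender constant within a patient, which is what a longitudinal layout requires.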

code/01_make_table.R

  • reads data set saved by code/00_make_dataset.R
  • creates labels for variables to present nicely in a table
  • creates a table of the variables by survey name
  • saves the table as table_one.rds in output/ folder
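
The README does not say which package builds the table, so the sketch below uses base R only as a stand-in for the same steps: label the variables, summarize by survey name, and save the result. The toy data, the labels, and the choice of summaries are all assumptions for illustration.

```r
# Toy data standing in for output/data.rds (values are made up).
dat <- data.frame(
  SurveyName = c("PHQ9", "PHQ9", "PHQ-2", "PHQ-2", "PHQ-2+I9"),
  RawScore   = c(10L, 20L, 2L, 4L, 3L),
  Diabetes   = c(1L, 0L, 0L, 0L, 1L)
)

# Human-readable labels, later applied as column names of the summary.
var_labels <- c(RawScore = "Mean raw score",
                Diabetes = "Proportion with diabetes")

# One row per survey name, with the mean of each variable.
table_one <- aggregate(cbind(RawScore, Diabetes) ~ SurveyName,
                       data = dat, FUN = mean)
names(table_one)[match(names(var_labels), names(table_one))] <- var_labels

# Number of observations per survey name, aligned to the summary rows.
table_one$N <- as.vector(table(dat$SurveyName)[as.character(table_one$SurveyName)])

# The real script saves to output/; a temporary directory is used here.
saveRDS(table_one, file.path(tempdir(), "table_one.rds"))
print(table_one)
```

Saving the table as an .rds file, as the script list describes, lets the report read the finished object back instead of recomputing it.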

code/02_make_figure.R

  • reads data set saved by code/00_make_dataset.R
  • uses ggplot to create four box plots of the counts of the four co-morbidities for the entire population
  • creates one figure combining the four plots
  • saves the figure as figure.png in output/ folder
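
The sketch below shows one way four box plots can be combined into a single ggplot2 figure. It rests on several assumptions: the per-patient counts are made-up toy values, the reading of "counts" as one count per patient per co-morbidity is a guess, and facet_wrap is used for combining even though the actual script may use a different approach or package.

```r
library(ggplot2)

# Toy long-format data: one row per patient per co-morbidity, where `count`
# is a made-up number of that patient's records with the flag set to 1.
set.seed(1)
comorbidities <- c("Alcohol", "Cancer", "Diabetes", "Obesity")
counts <- data.frame(
  PatientICN  = rep(1:50, times = length(comorbidities)),
  comorbidity = rep(comorbidities, each = 50),
  count       = rpois(200, lambda = 2)
)

# One box plot per co-morbidity, shown as four panels of a single figure.
fig <- ggplot(counts, aes(x = comorbidity, y = count)) +
  geom_boxplot() +
  facet_wrap(~ comorbidity, scales = "free_x") +
  labs(x = NULL, y = "Count")
```

A figure object like this can then be written to disk with ggsave, for example ggsave("output/figure.png", fig).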

code/final_report.Rmd

  • loads the saved table and figure into a report, with descriptions of what you should see in each

code/04_render_report.R

  • renders the report (code/final_report.Rmd) in production mode
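
A render script of this kind is typically a thin wrapper around rmarkdown::render. The sketch below is an assumption about how "production mode" might be passed, not the contents of code/04_render_report.R: render_report is a hypothetical helper, and production_mode is a hypothetical parameter name that would have to be declared under params in the report's YAML header for the call to succeed.

```r
# Hypothetical wrapper around rmarkdown::render().
# Assumption: final_report.Rmd declares a logical parameter named
# `production_mode` in its YAML header; rendering fails if it does not.
render_report <- function(input = "code/final_report.Rmd",
                          production = TRUE) {
  rmarkdown::render(
    input,
    params = list(production_mode = production)
  )
}
```

Calling render_report() from the repository root would render the report with the parameter set to TRUE, while render_report(production = FALSE) would render it with the parameter set to FALSE, for example to show code chunks while drafting.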

Synchronize Package Repository

To synchronize the package repository and restore the package environment for this project, run the following command in a terminal at the repository root (for example, the RStudio Terminal tab):

make install