Dirty Data Project

Introduction

This project consists of tasks that aim to test approaches to data wrangling and cleaning. Each task has its own analysis document which answers various questions using cleaned data obtained by programmatically processing the raw data.

Two tasks were chosen:

  • Task 1 - Decathlon Data
  • Task 4 - Halloween Candy Data

Languages

The code is written in R and both tasks contain RStudio .Rproj files.

How to run

Both tasks require the cleaning script to be run before the analysis.

In each task, the cleaning script is located at data_cleaning_scripts/cleaning.R

Open the task's RStudio .Rproj file and run cleaning.R. The script generates new clean CSV data files in the clean_data folder. Once this step has completed successfully, open the relevant task analysis .Rmd file in the documentation_and_analysis folder.
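The steps above can also be run from the R console. The following is a minimal sketch, assuming the working directory is the task's project folder (as it is when the .Rproj file is open). It lists the generated files instead of naming them, since the names of the CSV and .Rmd files are not given here.

```r
# Sketch only: assumes the working directory is the task's project folder.
cleaning_script <- file.path("data_cleaning_scripts", "cleaning.R")

if (file.exists(cleaning_script)) {
  # Run the cleaning script; it should write CSV files to clean_data.
  source(cleaning_script)

  # Confirm that clean CSV files were produced.
  print(list.files("clean_data", pattern = "\\.csv$"))

  # Show the analysis documents that can now be opened or knitted.
  print(list.files("documentation_and_analysis", pattern = "\\.Rmd$"))
} else {
  message("Not found: ", cleaning_script, " (check the working directory)")
}
```

If the first listing comes back empty, the cleaning step did not complete and the analysis document is unlikely to knit.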

Data used

CodeClan provided test data to students. Because of file size concerns, the original source data for task 4 is not included in this repository. Similarly, the clean data generated by the cleaning scripts is not uploaded, but it can be regenerated by running the code.

CodeClan staff can find the source data files in CodeClan repository dr22_classnotes/week_03/day_5/dirty_data_project_raw_data/candy_ranking_data

For those outside CodeClan, the data can be obtained from the following sources:

Task 1 - Decathlon

decathlon: Performance in decathlon (data).
Department of statistics and computer science, Agrocampus Rennes

N.B. A copy of this file is in task1/raw_data/decathlon.rds
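As a quick check that the bundled copy is readable, the file can be loaded with base R. This is a sketch that assumes the working directory is the repository root; the object name decathlon is chosen here for illustration only.

```r
# Sketch only: assumes the working directory is the repository root.
decathlon_path <- file.path("task1", "raw_data", "decathlon.rds")

if (file.exists(decathlon_path)) {
  # readRDS() restores the single R object stored in the .rds file.
  decathlon <- readRDS(decathlon_path)
  str(decathlon)
} else {
  message("Not found: ", decathlon_path, " (check the working directory)")
}
```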

Task 4 - Halloween Candy

So Much Candy Data, Seriously. University of British Columbia

N.B. Before running any cleaning or analysis scripts, copy the three required source data files into the task4/raw_data folder. Full details are in the analysis document.
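A small pre-flight check can confirm the files are in place before the cleaning script is run. This sketch assumes the working directory is the repository root. It only counts the files in the folder, because the file names themselves are given in the analysis document, not here.

```r
# Sketch only: assumes the working directory is the repository root.
raw_dir <- file.path("task4", "raw_data")
expected_count <- 3  # three source data files are required

# list.files() returns an empty vector if the folder is missing or empty.
raw_files <- list.files(raw_dir)

if (length(raw_files) < expected_count) {
  message("Found ", length(raw_files), " file(s) in ", raw_dir,
          "; ", expected_count, " are needed before cleaning.")
} else {
  message("Raw data files found:")
  print(raw_files)
}
```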

Packages

The following R packages are required to run the code. The version numbers used at the time of the original project are shown.

Task 1

Package      Version used for analysis
janitor      "2.2.0"
tidyverse    "2.0.0"

Task 4

Package      Version used for analysis
assertr      "3.0.0"
here         "1.0.1"
readxl       "1.4.3"
tidyverse    "2.0.0"
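To compare a local setup against the versions listed above, something like the following sketch may help. It only reports differences and does not install anything. The vector name required is introduced here for illustration.

```r
# Versions recorded for the original project (both tasks combined).
required <- c(
  janitor   = "2.2.0",
  tidyverse = "2.0.0",
  assertr   = "3.0.0",
  here      = "1.0.1",
  readxl    = "1.4.3"
)

for (pkg in names(required)) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # Report the installed version next to the version used originally.
    installed <- as.character(utils::packageVersion(pkg))
    message(pkg, ": installed ", installed,
            ", project used ", required[[pkg]])
  } else {
    message(pkg, ": not installed; install.packages(\"", pkg,
            "\") would add it")
  }
}
```

Newer versions of these packages will probably work, but the listed versions are the only ones the code is known to have been run against.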