DHIS2_data_cleaning: An HTML repository from Yalemzewod

Public health data face a range of constraints because data sources, collection methods, and volumes are heterogeneous. These constraints are often described in terms of data quality dimensions, such as:

Completeness: Captured but not reported.
Timeliness: Late reporting.
Availability: Captured and reported but not accessible for use.
Incomplete/Poor Recording: Some important variables or attributes not captured.
Consistency: The data should always tell the same fact or story.
Aggregated: Masks important information relevant for decision-making.
Big Data: An ambiguous dimension.
Name Mismatching: Misspelled facility or area names.
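
As an illustration of the last dimension, the sketch below flags reported names that do not match a reference list and suggests the closest official spelling by edit distance. It uses only base R (adist() from the utils package). The district names and the objects official_names and reported_names are made up for this example and are not taken from the training dataset.

```r
# Hypothetical reference list of official district names (illustrative only).
official_names <- c("North District", "Lake District", "River District")

# Hypothetical names as they appear in the reported data, some misspelled.
reported_names <- c("North Distrct", "Lake District", "Rivr District")

# Names that do not exactly match any official name.
mismatched <- setdiff(reported_names, official_names)

# Edit distance between each mismatched name and each official name:
# one row per mismatched name, one column per official name.
distances <- adist(mismatched, official_names)

# For each mismatched name, pick the official name with the smallest distance.
suggested <- official_names[apply(distances, 1, which.min)]

data.frame(reported = mismatched, suggested = suggested)
```

A suggestion from edit distance is only a candidate; it is worth reviewing each pair by hand before overwriting any name.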

Note! A small amount of useful data that arrives in real time and is well used is better than a large amount of data that arrives slowly and is poorly used. The choice is yours! As data grow, more advanced skills and tools become necessary.

Reading and exploring data

In R, you can import various types of data. Common file types include .csv, .dta, and .xlsx. To import a CSV file into R, use the built-in function read.csv(). Assign the imported dataset to an object (e.g., “routine_data”) using the <- operator.
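
A minimal sketch of the import step follows. Because the actual file path for the training data is not given here, the example first writes a tiny CSV to a temporary file. The column names (district, month, test_u5, conf_u5) and the values are assumptions made for illustration, not the real DHIS2 extract.

```r
# Hypothetical stand-in for the real file: write a tiny CSV to a temporary path.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("district,month,test_u5,conf_u5",
    "District A,2023-01,120,35",
    "District B,2023-01,98,"),   # last field left blank on purpose
  csv_path
)

# Import the CSV and assign it to an object with the <- operator.
routine_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# First look at the imported data.
head(routine_data)             # first rows
str(routine_data)              # column names and types
summary(routine_data)          # basic summaries per column
colSums(is.na(routine_data))   # number of missing values per column
```

With your own data, replace csv_path with the path to the file exported from DHIS2. Blank cells in numeric columns are read as NA, which is why colSums(is.na(...)) is a useful first check.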

For this training, we’re using a dataset called “routine_data,” extracted from DHIS2. It contains variables related to malaria tests and confirmed cases, stratified by age and sex, reported by districts.

For data management, consider using the tidyverse (install it once with install.packages("tidyverse"), then load it in each session with library(tidyverse)). The tidyverse is a collection of packages (such as dplyr, ggplot2, and tidyr) that make data manipulation and visualization easier.
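
As a small illustration of what dplyr adds, the sketch below totals tests and confirmed cases per district. It assumes dplyr is already installed. The data frame is a made-up stand-in for “routine_data”, and its column names (district, month, test_u5, conf_u5) are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for routine_data (values are illustrative only).
routine_data <- data.frame(
  district = c("District A", "District A", "District B", "District B"),
  month    = c("2023-01", "2023-02", "2023-01", "2023-02"),
  test_u5  = c(120, 110, 98, NA),   # NA marks a month with no reported tests
  conf_u5  = c(35, 30, 20, 18)
)

# One row per district: totals that ignore missing values,
# plus a count of months where the test figure is missing.
district_summary <- routine_data %>%
  group_by(district) %>%
  summarise(
    total_tests          = sum(test_u5, na.rm = TRUE),
    total_confirmed      = sum(conf_u5, na.rm = TRUE),
    months_missing_tests = sum(is.na(test_u5))
  )

district_summary
```

Keeping a count of missing months next to the totals makes it clear when a low total reflects missing reports and not fewer tests.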

Happy data exploration! 📊🔍👩‍💻
