dataRC

dataRC is an R package designed to bring efficient data management technologies to everyone. It aims to make data handling more efficient by providing easy-to-use tools for converting files to Apache Parquet format, unifying heterogeneous databases, offering templates for data processing, and more. Whether you have little to no programming experience or are an advanced user, dataRC simplifies repetitive processes and boosts your productivity.

Note: dataRC has been released in its most basic form, and several features are currently inactive or under development. This includes supplementary materials such as vignettes, the website, and tutorials, which will be completed or added in future updates. We are also preparing the package for submission to CRAN to ensure broader accessibility and stability for users. Thank you for your patience as we continue to improve and expand dataRC to meet your data management needs.

Installation

At present, installation of the package is only supported from GitHub.

# install.packages("devtools")
devtools::install_github("jdrengifoc/dataRC")

If you would also like to install the vignettes (see the Usage section for more details), use the following command. If you have already installed the dependencies, feel free to remove dependencies = TRUE, or skip the updates when prompted in the console.

# install.packages("devtools")
devtools::install_github("jdrengifoc/dataRC", dependencies = TRUE, build_vignettes = TRUE)

Usage

To help you learn how to use all of dataRC’s features, we provide different kinds of study material, as shown in the following table.

Material	Status
Documentation	Complete
README	Available
Vignettes	Available
Website	Available
Video Tutorial	Not started

The documentation provides comprehensive information on each function. To view it, load the library and use the ? operator.

library(dataRC)
?convert_files
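
If you are not sure which function names to look up, the package's exported names can be listed first. The following is a minimal sketch using base R only; the helper list_exports is hypothetical (it is not part of dataRC), and it returns an empty vector when the package is not installed.

```r
# Hypothetical helper (not part of dataRC): return the names a package
# exports, or an empty character vector if the package is not installed.
list_exports <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed.")
    return(character(0))
  }
  sort(getNamespaceExports(pkg))
}

# Names you can pass to `?`, e.g. ?convert_files
exports <- list_exports("dataRC")
print(exports)
```

Once the package is installed, help(package = "dataRC") shows the same index in the help pane.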

The README needs no introduction, since you are reading it. Here you can find a simple usage example of three of dataRC’s functions!

library(dataRC)

# Convert all the .dta, .txt, and .csv files in the current folder into Parquet
# format and store them in the folder ./parquet_files.
convert_files(
  folder = ".", files = list.files(pattern = '\\.(dta|txt|csv)$'),
  new_extension = "parquet", new_folder = '/parquet_files')

# Create a partial dictionary to ease data homogenization without making
# unexpected changes to original data.
dict_path <- 'dict.xlsx'
create_partial_dictionary(
  folder = '/parquet_files', files = list.files(), dict_path, verbose = FALSE)
# Add descriptive statistics and sort the partial dictionary for final manual
# review.
sort_partial_dictionary(dict_path, overwrite = TRUE)
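
Because the pattern argument of list.files() is a regular expression, it can help to preview which files will be selected before converting anything. The sketch below uses base R only and throwaway files in a temporary folder (the file names are made up for illustration). It compares a pattern anchored on the literal dot with an unanchored one, which also picks up names that merely end in the same letters.

```r
# Create a scratch folder with a few empty, made-up files.
tmp <- file.path(tempdir(), "datarc_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("a.dta", "b.txt", "c.csv", "notes.md", "metadta")))

# Anchored on the dot: only real .dta, .txt, and .csv extensions match.
strict <- list.files(tmp, pattern = "\\.(dta|txt|csv)$")
print(strict)
#> [1] "a.dta" "b.txt" "c.csv"

# Unanchored: any name ending in those letters matches as well.
loose <- list.files(tmp, pattern = "(dta|txt|csv)$")
print(setdiff(loose, strict))
#> [1] "metadta"

# Remove the scratch folder.
unlink(tmp, recursive = TRUE)
```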

Vignettes, in turn, are guides that showcase complete example workflows. They can be accessed through the website or directly in RStudio. For the latter, you need to install the vignettes properly (see the Installation section). Once this is done, you can list the names of all available vignettes with vignette(package = 'dataRC'). Once you have identified the name of a vignette, say process_data_with_partial_dict, use the following command to view it. The vignette will render in the help pane.

vignette('process_data_with_partial_dict', package = 'dataRC')
#> Warning: vignette 'process_data_with_partial_dict' not found
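
A "not found" warning like the one above usually means the vignettes were not built at installation time. As a minimal sketch, you can check what is installed before trying to open a vignette. The helper installed_vignettes is hypothetical (it is not part of dataRC) and relies only on the utils::vignette() function that ships with R.

```r
# Hypothetical helper (not part of dataRC): names of the vignettes installed
# for a package, or an empty vector if the package or its vignettes are missing.
installed_vignettes <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(character(0))
  unname(vignette(package = pkg)$results[, "Item"])
}

topic <- "process_data_with_partial_dict"
if (topic %in% installed_vignettes("dataRC")) {
  # Opens the vignette in the help pane or browser.
  print(vignette(topic, package = "dataRC"))
} else {
  message("Vignette '", topic, "' is not installed; ",
          "reinstall with build_vignettes = TRUE.")
}
```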

Finally, explore the complete project documentation, supplementary materials, and additional resources on the website.

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. If you don’t know how to do this, or if you have any suggestions, feel free to email jdrengifoc@eafit.edu.co. Please include the word dataRC in the subject line.