dataRC is an R package designed to bring efficient data management
technologies to everyone. It aims to make data handling more efficient
by providing easy-to-use tools for converting files to Apache Parquet
format, unifying heterogeneous databases, providing templates for data
processing, and more. Whether you have little to no programming
experience or are an advanced user, dataRC simplifies repetitive
processes and boosts your productivity.
Note: dataRC has been released in its most basic form, but several
features are currently inactive or under development. This includes
supplementary materials such as vignettes, the website, and tutorials,
which will be completed or added in future updates. Additionally, we are
in the process of preparing the package for submission to CRAN to ensure
broader accessibility and stability for users. Thank you for your
patience as we continue to improve and expand dataRC to meet your data
management needs.
At present, installation of the package is only supported from GitHub.
# install.packages("devtools")
devtools::install_github("jdrengifoc/dataRC")
If you would also like to install the vignettes (see the Usage section
for more details), use the following command. If you have already
installed the dependencies, feel free to delete dependencies = TRUE, or
skip the updates when asked in the Console.
# install.packages("devtools")
devtools::install_github("jdrengifoc/dataRC", dependencies = TRUE, build_vignettes = TRUE)
To help you learn how to use all of dataRC’s features, we provide
different kinds of study material, as shown in the following table.
Material | Status |
---|---|
Documentation | Complete |
README | Available |
Vignettes | Available |
Website | Available |
Video Tutorial | Not started |
The documentation provides comprehensive information on each function.
To view it, load the library and type ? followed by the function name.
library(dataRC)
?convert_files
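Before opening an individual help page, it can be useful to see which functions the package exports. The sketch below uses only base R and assumes dataRC is already installed; the helper name list_exports is hypothetical and is not part of dataRC.

```r
# Hypothetical helper (not part of dataRC): list the names a package exports.
list_exports <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed.")
  }
  sort(getNamespaceExports(pkg))
}

# Only runs if dataRC is installed; otherwise nothing is printed.
if (requireNamespace("dataRC", quietly = TRUE)) {
  print(list_exports("dataRC"))
  # help(package = "dataRC") opens the package's help index instead.
}
```

Any name in the printed list can then be passed to ? to open its help page.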
The README needs no introduction, since you are reading it. Here you can
find a simple usage example of three of dataRC’s functions!
library(dataRC)
# Convert all the .dta, .txt, and .csv files in the current folder into Parquet
# format and store them in the folder ./parquet_files.
convert_files(
  folder = ".", files = list.files(pattern = '\\.(dta|txt|csv)$'),
  new_extension = "parquet", new_folder = '/parquet_files')
# Create a partial dictionary to ease data homogenization without making
# unexpected changes to original data.
dict_path <- 'dict.xlsx'
create_partial_dictionary(
  folder = '/parquet_files', files = list.files(), dict_path, verbose = FALSE)
# Add descriptive statistics and sort the partial dictionary for final manual
# review.
sort_partial_dictionary(dict_path, overwrite = TRUE)
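The pattern argument of list.files() is a regular expression, so it is worth checking which file names it actually selects before converting anything. The following base-R sketch compares a loose pattern with one anchored on a literal dot; the scratch folder and file names are made up for illustration.

```r
# Create a scratch folder with a few empty files to try the patterns on.
scratch <- file.path(tempdir(), "pattern_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("a.dta", "b.txt", "c.csv", "notes_csv", "d.xlsx")))

# The loose pattern matches any name that merely ends in dta, txt, or csv.
loose <- list.files(scratch, pattern = "(dta|txt|csv)$")

# Anchoring on an escaped dot restricts the match to real file extensions.
strict <- list.files(scratch, pattern = "\\.(dta|txt|csv)$")

print(loose)
print(strict)
```

Here notes_csv is picked up by the loose pattern but not by the strict one, which is usually the safer choice when the selected files are passed on for conversion.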
For their part, vignettes are guides that showcase full examples of
workflows. They can be accessed through the website or directly in
RStudio. For the latter, you need to install the vignettes properly (see
the Installation section). Once this is done, you can list the names of
all available vignettes with vignette(package = 'dataRC'). Once you have
identified the name of a vignette, let’s say
process_data_with_partial_dict, use the following command to view it.
The vignette will render in the Help pane.
vignette('process_data_with_partial_dict', package = 'dataRC')
#> Warning: vignette 'process_data_with_partial_dict' not found
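A "not found" warning like the one above usually means the vignettes were not built at install time. As a minimal sketch, one way to check before opening a vignette is shown below; has_vignette is a hypothetical helper, not part of dataRC, and it relies only on the results matrix that utils::vignette() returns when called without a topic.

```r
# Hypothetical helper (not part of dataRC): TRUE if the installed package
# `pkg` provides a vignette called `name`.
has_vignette <- function(name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  # The "Item" column of the results matrix holds the vignette names.
  items <- vignette(package = pkg)$results[, "Item"]
  name %in% items
}

if (has_vignette("process_data_with_partial_dict", "dataRC")) {
  vignette("process_data_with_partial_dict", package = "dataRC")
} else {
  message("Vignette not found; try reinstalling with build_vignettes = TRUE.")
}
```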
Finally, explore the complete project documentation, supplementary materials, and additional resources on the website.
If you encounter a clear bug, please file an issue with a minimal
reproducible example on GitHub. If you don’t know how to do this or have
any suggestions, please feel free to write an email to
jdrengifoc@eafit.edu.co. Please include the word dataRC in the subject.