Reproducible code
eshom opened this issue · 7 comments
I think that, as a standard, all scripts should be completely independent and reproducible. That is, people should be able to copy and paste code into their R REPL session without errors. This is currently not the case for many scripts in this repo. Instead of supplying example data, many algorithms are written as "templates" where users have to supply their own data. However, there's no information about what the data structure should even be.
R has many built-in datasets, which can be used to run the algorithms. If a script is just a function definition, it should include an example usage of the function.
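A minimal sketch of what such a self-contained script could look like, using the built-in mtcars dataset. The function name fit_linear_model and the choice of columns are only illustrative and are not taken from any script in the repo.

```r
# Hypothetical example: a function definition followed by example usage.
# `fit_linear_model` is an illustrative name, not an existing script in the repo.
fit_linear_model <- function(data, response, predictor) {
  # Build the formula from column names, e.g. mpg ~ wt
  model_formula <- stats::reformulate(predictor, response = response)
  stats::lm(model_formula, data = data)
}

# Example usage with a dataset that ships with R, so the script
# can be pasted into a fresh session as-is.
model <- fit_linear_model(mtcars, response = "mpg", predictor = "wt")
print(coef(model))
```

Because the data comes from base R, nothing has to be downloaded or prepared before the script runs.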
I could list here all scripts that need to be written this way.
What do you think?
- ./Data-Preprocessing/lasso.R
- ./Data-Preprocessing/K_Folds.R
- ./Data-Preprocessing/data_processing.R
- ./Data-Preprocessing/dimensionality_reduction_algorithms.R
- ./Classification-Algorithms/lasso.R
- ./Classification-Algorithms/decision_tree.R
- ./Classification-Algorithms/KNN.R
- ./Classification-Algorithms/gradient_boosting_algorithms.R
- ./Classification-Algorithms/LightGBM.R
- ./Classification-Algorithms/SVM.R
- ./Classification-Algorithms/xgboost.R
- ./Classification-Algorithms/naive_bayes.R
- ./Classification-Algorithms/random_forest.R
- ./Clustering-Algorithms/K-Means.R
- ./Clustering-Algorithms/dbscan_clustering.R
- ./Clustering-Algorithms/gmm.R
- ./Clustering-Algorithms/pam.R
- ./Clustering-Algorithms/kmeans_raw_R.R
- ./Association-Algorithms/apriori.R
- ./Regression-Algorithms/logistic_regression2.R
- ./Regression-Algorithms/logistic_regression.R
- ./Regression-Algorithms/linear_regression.R
- ./Regression-Algorithms/KNN.R
- ./Regression-Algorithms/gradient_boosting_algorithms.R
- ./Regression-Algorithms/LightGBM.R
- ./Regression-Algorithms/ANN.R
- ./Regression-Algorithms/multiple_linear_regression.R
- ./Regression-Algorithms/linearRegressionRawR.R
- ./Data-Manipulation/OneHotEncode.R
- ./Data-Manipulation/LabelEncode.R
How can the scripts be tested if they don't accept data as arguments? I think we need to add unit tests instead. They will test our code and provide users with examples at the same time.
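One possible shape for this, as a sketch only: the algorithm takes its data as an argument, and a small test both checks it and shows how to call it. The example sticks to base R's stopifnot to stay dependency-free; a testing package such as testthat could fill the same role. The function min_max_scale is made up for illustration.

```r
# Hypothetical algorithm that accepts its data as an argument
min_max_scale <- function(x) {
  stopifnot(is.numeric(x))
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant vector has no spread; map it to zeros
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# The test doubles as a usage example on a built-in dataset
scaled <- min_max_scale(iris$Sepal.Length)
stopifnot(
  length(scaled) == nrow(iris),
  min(scaled) == 0,
  max(scaled) == 1
)
```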
I think this can be part of the documentation solution we talked about in #59. Using knitr we can turn scripts into HTML reports, which would nicely incorporate example output. Errors caused by bad scripts can be handled, printed, and reviewed. I can write R code for this, but I'm not sure how to set up GitHub Actions correctly.
So you suggest having algorithms separated from data and unit tests that will show usage of the algorithms? And the tests can be transformed into HTML reports for convenience? Sounds good to me
Hmm, not exactly. What I mean is that specially formatted scripts can be turned into HTML reports (https://rdrr.io/cran/knitr/man/spin.html). Data would still need to be part of the algorithms. Because this function runs the actual script while compiling the report, errors would be thrown if there's any problem with the script. That error can be part of a test. At the same time, good scripts would compile to nice HTML reports.
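A rough sketch of the "run the script and treat an error as a test failure" idea. The check itself uses base R's source() as a stand-in rather than a full report build, and the helper name script_runs_cleanly and the two throwaway scripts are assumptions for illustration. The knitr::spin() call at the end only runs if knitr is installed.

```r
# Hypothetical helper: returns TRUE if the script runs without error.
script_runs_cleanly <- function(path) {
  tryCatch({
    # Evaluate in a fresh environment so scripts do not affect each other
    source(path, local = new.env(parent = globalenv()))
    TRUE
  }, error = function(e) {
    message("Error in ", path, ": ", conditionMessage(e))
    FALSE
  })
}

# Two throwaway scripts: one valid (with a spin-style #' comment), one that fails
good_script <- tempfile(fileext = ".R")
writeLines(c("#' # Summary of mtcars", "summary(mtcars$mpg)"), good_script)
bad_script <- tempfile(fileext = ".R")
writeLines("stop('no data supplied')", bad_script)

stopifnot(script_runs_cleanly(good_script))
stopifnot(!script_runs_cleanly(bad_script))

# If knitr is installed, the same script can be converted to R Markdown.
# knit = FALSE only writes the .Rmd file next to the script without running it.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::spin(good_script, knit = FALSE)
}
```

Calling spin() with its default knit = TRUE would also execute the script and build the report. Whether an error then stops the build or is printed into the report depends on knitr's error chunk option, so that setting would need to be checked before relying on it in CI.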
It would make more sense once we have a prototype running in https://github.com/Panquesito7/R/tree/documentation_stuff
I agree with you on this fundamental issue; for linearRegressionRaw.R, I replaced a reference to the diamonds dataset with a specifically simulated and reproducible (via a set seed) synthetic dataset.
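For illustration, a seeded synthetic dataset might look like the sketch below. This is not the actual data used in the script mentioned above; the linear relationship, sample size, and noise level are assumptions.

```r
# Assumed setup: a simple linear relationship y = 2 + 3x + noise.
set.seed(42)  # fixing the seed makes the simulated data identical on every run
n <- 100
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)
synthetic <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = synthetic)
print(coef(fit))  # estimates should land near the true values 2 and 3
```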
Half of the challenge here is going to be eliminating extraneous library calls, such as with the tidyverse functions and datasets.
I personally don't mind if third-party packages are used, but either the include.only argument of library() should be used, so that only the objects that appear in the code are attached to the search path, or preferably library() should be replaced entirely with the double colon operator to make everything more explicit.
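The two options can be compared in a short sketch. It uses the tools package, which ships with R, purely so the example runs without installing anything; the file name is illustrative.

```r
# Option 1: attach only the names that the script uses (requires R >= 3.6.0).
# `tools` ships with R, so this runs without installing anything.
library(tools, include.only = "file_ext")
ext <- file_ext("linear_regression.R")

# Option 2: skip attaching and qualify each call with the double colon operator.
stem <- tools::file_path_sans_ext("linear_regression.R")

print(c(extension = ext, name = stem))
```

With the double colon form, a reader can see which package every function comes from without scrolling back to the top of the script.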
In either case, some check should be done if packages are installed. Something like:
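# Install ggplot2 only if it is missing, then attach it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
# The rest of the code
# ...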