Adding `assertr` to ROpenSci
tonyfischetti opened this issue ยท 36 comments
-
- What does this package do? (explain in 50 words or less)
The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline to protect against common data errors and instances of bad data.
- What does this package do? (explain in 50 words or less)
-
- Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).
Package: assertr
Type: Package
Title: Assertive Programming for R Analysis Pipelines
Version: 1.0.0
Authors@R: person("Tony", "Fischetti", email="tony.fischetti@gmail.com",
role = c("aut", "cre"))
Maintainer: Tony Fischetti <tony.fischetti@gmail.com>
Description: Provides functionality to assert conditions
that have to be met so that errors in data used in
analysis pipelines can fail quickly. Similar to
'stopifnot()' but more powerful, friendly, and easier
for use in pipelines.
URL: https://github.com/tonyfischetti/assertr
BugReports: https://github.com/tonyfischetti/assertr/issues
License: MIT + file LICENSE
LazyData: TRUE
Imports:
dplyr,
MASS,
lazyeval
Suggests:
knitr,
testthat,
magrittr
VignetteBuilder: knitr
-
- URL for the package (the development repository, not a stylized html page)
https://github.com/tonyfischetti/assertr
- URL for the package (the development repository, not a stylized html page)
-
- What data source(s) does it work with (if applicable)?
Any. Mostly in the form ofdata.frames
- What data source(s) does it work with (if applicable)?
-
- Who is the target audience?
Anyone who has ever struggled with bad data
- Who is the target audience?
-
- Are there other R packages that accomplish the same thing? If so, what is different about yours?
Theensurer
package attempts to solve the very same problem. To a certain extent, theassertive
package also offer some similar capabilities. The difference betweenassertr
and these other packages is the grammar of usage and the way that assertions of different types can be easily combined to express arbitrarily complex assertions in a very readable way.
- Are there other R packages that accomplish the same thing? If so, what is different about yours?
-
- Check the box next to each policy below, confirming that you agree. These are mandatory.
- This package does not violate the Terms of Service of any service it interacts with.
- The repository has continuous integration with Travis and/or another service
- The package contains a vignette
- The package contains a reasonably complete readme with devtools install instructions
- The package contains unit tests
- The package only exports functions to the NAMESPACE that are intended for end users
-
- Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
Yes. All the user-facing functions are in snake case, but the internal functions sometimes use dots (.
) as separators. I'm open to changing that. Also, the package doesn't have a code of conduct yet but I think it's a good idea to include.
- Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
- Are there any package dependencies not on CRAN?
No - Do you intend for this package to go on CRAN?
It already is - Does the package have a CRAN accepted license?
You bet - Did
devtools::check()
produce any errors or warnings? If so paste them below.
No warnings -
- Please add explanations below for any exceptions to the above:
-
- If this is a resubmission following rejection, please explain the change in cirucmstances.
I have a use case for this and have already looked through the code, so am happy to review if that is useful (at the same time I can't review until the 4th January at the earliest as I will be travelling over the break).
I've been meaning to do one, so I can be the second.
thanks jenny, assigned
@tonyfischetti Excellent! Thanks for submitting! Looking forward to the reviews and adding this to the suite. ๐
cool - (p.s. that comment was your friendly heroku robot https://github.com/ropenscilabs/heythere)
The descent towards manuscript central begins... ๐
unless you're volunteering to remind everyone manually
Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.
Maybe we should send hand-written notes?
And yes, duly noted, that I need to bust a move on this.
Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.
heythere
!= MS central
Maybe we should send hand-written notes?
Yes!
General comments
โ
The assertr
package provides a generalised framework for defensive programming around data.frames. This sits somewhere between stopifnot
and testthat
in terms of flexibility and complexity and as such forms a useful building block for data analysis workflow, which I believe is under-tooled at the moment. I really like the idea of having packages that are primarily focussed on data workflows rather than restricting people to ideas that were developed for software engineering (such as formal unit tests).
โ
The package is very tight -- it exports the minimum set of functionality and conforms to the "do one thing and do it well" school of thought. The functions are well documented, the vignette is readable and less dry than most. I appreciate the split into NSE and SE versions of all core functions.
โ
Accordingly, most of my comments focus on design decisions and therefore may all be out of line because the author will have thought about this more than I have.
โ
The main entrypoints are difficult to differentiate
โ
My biggest concern is that I found the three main entry points very difficult to keep straight. And when I put the package down over Christmas I had to re-remember them again.
โ
assert(data, predicate, ...)
verify(data, expr, ...)
insist(data, predicate_generator, ...)
โ
The difference is primarily in the properties of the second argument and I wonder if there's a way of specifying that some other way than three functions that have such similar names? The current approach is extremely elegant but at the cost of being a bit too opaque to the user -- especially because the three function names are essentially synonyms of each other there is nothing to jog your memory. I presume the problem is it is difficult to detect the difference between the three argument types before they are evaluated (and the correct evaluation depends on the type).
The custom handler routine is inflexible
โ
The custom handler is a really great addition, but could be improved. testthat
has a similar handler approach that allows storage of a bunch of repeated assertions (pass or fail). My use-case for this package is to replicate something like this, so I'd want to pass the same handler in to all the functions in a pipeline:
โ
mtcars %>%
verify(nrow(mtcars) > 10, error_fun=my_error_fun) %>%
verify(mpg > 0, error_fun=my_error_fun) %>%
insist(within_n_sds(4), mpg, error_fun=my_error_fun) %>%
assert(in_set(0,1), am, vs, error_fun=my_error_fun) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
โ
What would be heaps nicer is if I could register a handler; change the assert
function to something like:
โ
assertr <- function(data, predicate, ..., error_fun=getOption("assertr.handler", assertr_stop)) {
}
โ
(or eqivalently use the package-environment trick like testthat
does). This would allow:
โ
options(assertr.handler=my_error_fun)
mtcars %>%
verify(nrow(mtcars) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
โ
(As a related comment, the usage definitions reference the unexported function assertr_stop
which some may find confusing. Additionally, is there a reason why verify
uses error_fun=stop
not assertr_stop
?)
โ
Minor comments
โ
- The
dplyr
dependency, which is used soley fordplyr::select_
seems a potentially heavy dependency for one function; if it is straightforward to swap out for independent imlementation that would decrease the package footprint (I can imagine using this in contexts where I do not have dplyr installed such as container-based workflows). But I can totally see the advantage of sticking with something that is known to work.
โ - Think about the non-pipe people. All the examples and the vignette make heavy use of the
%>%
operator (which is fine), but as someone who uses this little or who imagines using this mostly in packages where I'd be avoiding so much weird evaluation, I would appreciate a few more pipe-less examples. This is potentially confusing in places like:
โ
our.data %>% assert(within_bounds(0,Inf), mpg) # and so on
โ
when ?assert
says that usage is
โ
assert(data, predicate, ..., error_fun = assertr_stop)
โ
This confused me because in usage it looks like predicate, data
, but of course the data argument comes from the pipe and the mpg
is the column name is passed through to the ...
argument. While the use with pipes is very elegant, I think the package has use outside of that scope too. The examples within the package are actually really good like this.
โ
- Classed exceptions would make the error handling more flexible. Related to the second major point above; it would be nice to distinguish between errors that were because the input to assertr was incorrect, and errors that are raised because the data failed the assertions. R's classed errors provide a nice framework for this. Then
tryCatch
andwithCallingHandlers
can dispatch appropriately based on the sort of error.
โ - A reporting framework would be fantastic If you don't write this, I will -- but given you have written this package I figure you should get right of refusal. Related to the point above, I would like to use the underlying bits you have here in a package for automated testing of upstream data sources that tend to be misbehaved. I can imagine a testthat-like reporting framework where a bunch of tests are run and the failures reported.
Thanks for the kind words and really great feedback @richfitz.
The main entrypoints are difficult to differentiate
As pointed out, this is somewhat a consequence of my making the API as elegant as possible but at the expense of some opaque-ness. The problem is (again, as you pointed out) there's no elegant way that I can see to programmatically detect the argument types. So we have verbs that are literally synonyms (I chose their names using a thesaurus). Unfortunately, I can't think of a better naming mechanism without really long function names like take_a_predicate_generator_and_apply_to_each_column()
. I'm open to suggestions, but it may not be the end of the world since there are only three main verbs and the docs are good.
The custom handler routine is inflexible
I like the idea of using a testthat
style options
mechanism. I'll implement this
Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)
Nope, that's an error. Good catch!
The dplyr dependency
That was a difficult choice. It's a hell of a heavy dependency, but I was wary of implementing dplyr::select
myself. Especially since I would have to reference dplyr
's implementation so heavily that it would likely constitute code theft. I'd love to drop the dependency though, if anyone thinks they can implement it without copying code.
A reporting framework would be fantastic
I need this feature, and I have a few great ideas on how to implement it. I'd like to talk to you more (@richfitz) about your particular use case in case my solution only suits my use case. I think this feature can potentially be the most powerful and useful capability of assertr
So I'm going to run into a little more free time in the near future and I'd like to get back from my learning hiatus back into improving assertr
--particularly because people are telling me its really useful for them. Because of the learning hiatus, I have some fresh new ideas for improvement, but I'd like to run them past some of you for further input...
The main entrypoints are difficult to differentiate
As mentioned before, there is assert
, insist
, insist_rows
, and assert_rows
for representing a wide range of tasks. However, if I wanted to add the ability to, say, declare that the whole data set should have no more than 15% missing values (and I do want to add that), the semantics of that would require another specialized function... and I'm running out of synonyms for "assert"!
Principled though that solution was, I'm not sure its the correct thing to do going forward; it's always been one of R's strengths from a user's point of view to use a familiar generic function (mean
, plot
, etc...) with all sorts of input and have the object system dynamically dispatch the correct functionality.
So how about this... creating an S3S4 generic function (proclaim
perhaps?) that will handle all of the different semantics for the user. Concretely, the function returned from within_n_mads
can be labeled with class assertr_dynamic
(because the predicate is dynamically generated). Then, proclaim
would dispatch on the second argument (the first arg is the data frame) and call what is currently referred to as insist
. In the same way, maha_dist
would be classed something like assertr_dynamic_rows
and a user calling proclaim(df, maha_dist, within_n_mads(10), ...)
would transparently dispatch insist_rows(maha_dist, within_n_mads(10), ...)
This would improve assertr
's extensibility greatly; for example, adding semantics to check the supplied data.frame as a whole would only require writing another S3S4 method of the proclaim
generic and making sure that the predicate function was correctly classed.
This solution is perhaps a little unconventional, but it completely obviates the requirement that a useR remember all the verbs.
The custom handler routine is inflexible
The most common suggestion for assertr
is to be able to warn (not error) on violation. Additionally, even if it does error on violation, there should be semantics in place for the entire chain of assertions to run so that the final error message will contain the complete report of the data errors.
The reason I needed to take my time with this one is that I need the warnings to be concatenated through a assertr
chain in a principled manner. To do this without using dynamic variables (eww), it requires that--along with returned the data frame given--the assertr
verbs need to return the warnings so that they can be concatenated with the warnings further in the assertion pipeline. Up until recently, I thought that I would have to implement something that would be tantamount to a Haskell monad in order to do this. Another possibility is to use S4 (hence the S3 strikeouts in the paragraphs above) in order to get proclaim
(or whatever it is) to dispatch on both the second argument, and the first argument. If the first argument is a data.frame
the current semantics stand... if the first argument is something else (some class that holds a data.frame
and a running list of errors) then the wanted behavior can be dispatched.
The big complication is what happens at the end of the chain. There needs to be something that tells assertr
that the chain is ending so that it can take the data.frame
out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?
I'd appreciate any feedback on these ideas for two reasons (a) it's now (or will hopefully be soon) ROpenSci's project not just mine, and (b) I'd like to get the input of some talented developers :)
To review, none of the proposed additions have to break backwards compatibility :) It would just make everything much easier for the user.... the example in the README would go from this:
mtcars %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
insist_rows(maha_dist, within_n_mads(10), everything()) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
to this
mtcars %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0) %>%
assert(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
assert(num_row_NAs, within_bounds(0,2), everything()) %>%
assert(maha_dist, within_n_mads(10), everything()) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
(verify
wouldn't be able to be replicated under a assert
S4 generic)
Since I'm the one who is yet to review ... @tonyfischetti do you have recommendations of a good dataset to run through assertr
? I.e. one that you think should show it off but ... there's enough uncertainty that it would be interesting to see how things go? Also to see how a new user manages with it. I have one idea that I'll fall back on if nothing immediately comes to mind.
@jennybc Nothing immediately comes to mind but I'm sure I can dig up one of the examples that inspired me to get into this package in the first place :)
@aammd politely reminded me I have some really ugly data asserting/cleaning code in the private STAT 545 instructors repo, so that's my plan B ๐ฌ.
@jennybc wellll i would not say ugly, but rather "pre-assertr". it represents "how we did this before assertr" and highlights improvements in the UI of this package (improvements I tried to show in my lesson about assertr
)
@tonyfischetti some thoughts on:
The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?
with help of @smbache - we have a way to detect whether a piped command is the last one or not. If it is the last, do X (e.g., execute some other fxn, print data, etc.) instead of passing to the next command. You can see the helper fxns here https://github.com/ropensci/jqr/blob/master/R/pipe_helpers.R and usage here https://github.com/ropensci/jqr/blob/master/R/index.R#L42
@jennybc - hey there, it's been 89 days, please get your review in soon, thanks ๐บ
OK I promise you I will not be able to face you an unconf w/o this being totally done.
@jennybc - hey there, it's been 151 days, please get your review in soon, thanks ๐บ (ropensci-bot)
It must be the temptation of ensurer that is causing delay ๐ Hehe
I am, and have been, halfway done for ages. I have a PR ready for the vignette. It's the incredibly insightful overall comments that need to be written. ๐ณ Will do.
@smbache I didn't know ensurer
was being considered
It's not. I was joking ;)
@jennybc - hey there, it's been 159 days, please get your review in soon, thanks ๐บ (ropensci-bot)
@tonyfischetti approved!
- Add the footer to your README:
[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)
- Update installation of dev versions to
ropenscilabs/assertr
and any urls for the github repo toropenscilabs
instead oftonyfischetti
- Update any links to the package from
tonyfischetti/assertr
toropenscilabs/assertr
(though even if they aren't github will redirect to the new location :) ) - Go to the Repo Settings --> Transfer Ownership and transfer to
ropenscilabs
- Note that all our newer pkgs go toropenscilabs
first, then when more mature we'll move toropensci
I tried to do the last thing and it says I don't have admin rights to ropenscilabs
:(
you should have received an invitation from ropenscilabs, did you get that email?
Idiotic move on my part not checking the mail :) It's done
sweet!