Adding `assertr` to ROpenSci

Question

tonyfischetti opened this issue 9 years ago · 36 comments

1. What does this package do? (explain in 50 words or less)
  The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline to protect against common data errors and instances of bad data.
1. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).

Package: assertr
Type: Package
Title: Assertive Programming for R Analysis Pipelines
Version: 1.0.0
Authors@R: person("Tony", "Fischetti", email="tony.fischetti@gmail.com",
  role = c("aut", "cre"))
Maintainer: Tony Fischetti <tony.fischetti@gmail.com>
Description: Provides functionality to assert conditions
    that have to be met so that errors in data used in
    analysis pipelines can fail quickly. Similar to
    'stopifnot()' but more powerful, friendly, and easier
    for use in pipelines.
URL: https://github.com/tonyfischetti/assertr
BugReports: https://github.com/tonyfischetti/assertr/issues
License: MIT + file LICENSE
LazyData: TRUE
Imports:
    dplyr,
    MASS,
    lazyeval
Suggests:
    knitr,
    testthat,
    magrittr
VignetteBuilder: knitr

sckott commented 8 years ago

sweet!

Answer 1 · 2015-12-24T19:40:33.000Z

I have a use case for this and have already looked through the code, so am happy to review if that is useful (at the same time I can't review until the 4th January at the earliest as I will be travelling over the break).

Answer 2 · 2015-12-24T19:42:50.000Z

Reviewers: @richfitz @jennybc

Answer 3 · 2015-12-24T19:48:48.000Z

I've been meaning to do one, so I can be the second.

Answer 4 · 2015-12-24T19:51:59.000Z

thanks jenny, assigned

Answer 5 · 2015-12-24T19:53:12.000Z

@tonyfischetti Excellent! Thanks for submitting! Looking forward to the reviews and adding this to the suite. 😃

Answer 6 · 2016-01-22T20:30:25.000Z

@richfitz @jennybc - hey there, it's been 29.0 days, please get your review in soon, thanks 😺

Answer 7 · 2016-01-22T21:02:12.000Z

Sorry - I have been meaning to (and also #25). Next week for both I hope.

Answer 8 · 2016-01-22T21:03:23.000Z

cool - (p.s. that comment was your friendly heroku robot https://github.com/ropenscilabs/heythere)

Answer 9 · 2016-01-22T21:06:03.000Z

The descent towards manuscript central begins... 😁

Answer 10 · 2016-01-22T21:07:50.000Z

unless you're volunteering to remind everyone manually

Answer 11 · 2016-01-22T21:10:05.000Z

Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

Answer 12 · 2016-01-22T21:17:20.000Z

Maybe we should send hand-written notes?

Answer 13 · 2016-01-22T21:17:39.000Z

And yes, duly noted, that I need to bust a move on this.

Answer 14 · 2016-01-22T21:22:34.000Z

> Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

heythere != MS central

> Maybe we should send hand-written notes?

Yes!

Answer 15 · 2016-01-25T09:45:41.000Z

General comments

The assertr package provides a generalised framework for defensive programming around data.frames. This sits somewhere between stopifnot and testthat in terms of flexibility and complexity and as such forms a useful building block for data analysis workflow, which I believe is under-tooled at the moment. I really like the idea of having packages that are primarily focussed on data workflows rather than restricting people to ideas that were developed for software engineering (such as formal unit tests).

The package is very tight -- it exports the minimum set of functionality and conforms to the "do one thing and do it well" school of thought. The functions are well documented, the vignette is readable and less dry than most. I appreciate the split into NSE and SE versions of all core functions.

Accordingly, most of my comments focus on design decisions and therefore may all be out of line because the author will have thought about this more than I have.

The main entrypoints are difficult to differentiate

My biggest concern is that I found the three main entry points very difficult to keep straight. When I put the package down over Christmas, I had to relearn them.

assert(data, predicate, ...)
verify(data, expr, ...)
insist(data, predicate_generator, ...)

The difference is primarily in the properties of the second argument, and I wonder if there's a way of specifying that other than three functions with such similar names? The current approach is extremely elegant but at the cost of being a bit too opaque to the user: because the three function names are essentially synonyms of each other, there is nothing to jog your memory. I presume the problem is that it is difficult to detect the difference between the three argument types before they are evaluated (and the correct evaluation depends on the type).

The custom handler routine is inflexible

The custom handler is a really great addition, but could be improved. testthat has a similar handler approach that allows storage of a bunch of repeated assertions (pass or fail). My use-case for this package is to replicate something like this, so I'd want to pass the same handler in to all the functions in a pipeline:

mtcars %>%
  verify(nrow(mtcars) > 10, error_fun=my_error_fun) %>%
  verify(mpg > 0, error_fun=my_error_fun) %>%
  insist(within_n_sds(4), mpg, error_fun=my_error_fun) %>%
  assert(in_set(0,1), am, vs, error_fun=my_error_fun) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

What would be heaps nicer is if I could register a handler; change the assert function to something like:

assert <- function(data, predicate, ..., error_fun=getOption("assertr.handler", assertr_stop)) {
  # existing body unchanged
}

(or equivalently use the package-environment trick like testthat does). This would allow:

options(assertr.handler=my_error_fun)
mtcars %>%
  verify(nrow(mtcars) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

(As a related comment, the usage definitions reference the unexported function assertr_stop which some may find confusing. Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)
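The option-based registration suggested above can be sketched in base R. This is only an illustration of the pattern, not assertr's code: `check_all`, `default_handler`, and the option name `demo.handler` are hypothetical stand-ins. The point it shows is that a default argument is evaluated when the function is called, so a handler registered through `options()` beforehand is picked up without being passed explicitly.

```r
# Hypothetical default handler, standing in for assertr_stop.
default_handler <- function(message) stop(message, call. = FALSE)

# Hypothetical check. The default for error_fun is evaluated at call time,
# so a handler registered with options() before the call takes effect.
check_all <- function(values, predicate,
                      error_fun = getOption("demo.handler", default_handler)) {
  if (!all(predicate(values))) {
    error_fun("assertion violated")
  }
  invisible(values)
}

# Register a collecting handler once instead of passing it to every call.
failures <- character(0)
options(demo.handler = function(message) failures <<- c(failures, message))

check_all(mtcars$mpg, function(x) x > 0)    # passes, nothing is recorded
check_all(mtcars$mpg, function(x) x > 20)   # fails, recorded instead of raised

# Remove the option again so the default handler applies.
options(demo.handler = NULL)
```

After the two calls, `failures` holds one message, and once the option is cleared the same failing check raises an ordinary error again.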

Minor comments

The dplyr dependency, which is used solely for dplyr::select_, seems a potentially heavy dependency for one function; if it is straightforward to swap out for an independent implementation, that would decrease the package footprint (I can imagine using this in contexts where I do not have dplyr installed, such as container-based workflows). But I can totally see the advantage of sticking with something that is known to work.
Think about the non-pipe people. All the examples and the vignette make heavy use of the %>% operator (which is fine), but as someone who uses it little, or who imagines using this package mostly inside other packages where I'd be avoiding so much weird evaluation, I would appreciate a few more pipe-less examples. This is potentially confusing in places like:

our.data %>% assert(within_bounds(0,Inf), mpg) # and so on

when ?assert says that usage is

assert(data, predicate, ..., error_fun = assertr_stop)

This confused me because in usage it looks like predicate, data, but of course the data argument comes from the pipe, and mpg, the column name, is passed through to the ... argument. While the use with pipes is very elegant, I think the package has use outside of that scope too. The examples within the package are actually really good like this.

Classed exceptions would make the error handling more flexible. Related to the second major point above: it would be nice to distinguish between errors raised because the input to assertr was incorrect and errors raised because the data failed the assertions. R's classed errors provide a nice framework for this. Then tryCatch and withCallingHandlers can dispatch appropriately based on the sort of error.
A reporting framework would be fantastic. If you don't write this, I will -- but given you have written this package I figure you should get right of refusal. Related to the point above, I would like to use the underlying bits you have here in a package for automated testing of upstream data sources that tend to be misbehaved. I can imagine a testthat-like reporting framework where a bunch of tests are run and the failures reported.
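The classed-exception comment in the list above can be illustrated with base R conditions. This is a minimal sketch under assumptions: the class names `demo_usage_error` and `demo_data_error` and the helpers `demo_condition`, `check_positive`, and `classify` are hypothetical and are not part of assertr. It shows `tryCatch` choosing a handler by condition class, so a caller can tell a bad call apart from bad data.

```r
# Hypothetical constructor for an error condition with an extra class.
demo_condition <- function(message, subclass) {
  structure(
    list(message = message, call = NULL),
    class = c(subclass, "error", "condition")
  )
}

# Hypothetical check that raises a different class for each kind of problem.
check_positive <- function(data, column) {
  if (!column %in% names(data)) {
    stop(demo_condition(paste("no such column:", column), "demo_usage_error"))
  }
  if (!all(data[[column]] > 0)) {
    stop(demo_condition(paste("non-positive values in", column), "demo_data_error"))
  }
  invisible(data)
}

# tryCatch picks the handler whose name matches the condition's class.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "ok"
    },
    demo_usage_error = function(e) "bad call",
    demo_data_error = function(e) "bad data"
  )
}

valid_result <- classify(check_positive(mtcars, "mpg"))   # all mpg values > 0
usage_result <- classify(check_positive(mtcars, "nope"))  # column does not exist
data_result  <- classify(check_positive(mtcars, "am"))    # am contains zeros
```

The same classes would also work with `withCallingHandlers`, which is the route to take when a handler should log a failure and let execution continue.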

Answer 16 · 2016-02-20T20:15:06.000Z

Thanks for the kind words and really great feedback @richfitz.

> The main entrypoints are difficult to differentiate

As pointed out, this is somewhat a consequence of my making the API as elegant as possible but at the expense of some opaque-ness. The problem is (again, as you pointed out) there's no elegant way that I can see to programmatically detect the argument types. So we have verbs that are literally synonyms (I chose their names using a thesaurus). Unfortunately, I can't think of a better naming mechanism without really long function names like take_a_predicate_generator_and_apply_to_each_column(). I'm open to suggestions, but it may not be the end of the world since there are only three main verbs and the docs are good.

> The custom handler routine is inflexible

I like the idea of using a testthat-style options mechanism. I'll implement this.

> Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)

Nope, that's an error. Good catch!

> The dplyr dependency

That was a difficult choice. It's a hell of a heavy dependency, but I was wary of implementing dplyr::select myself. Especially since I would have to reference dplyr's implementation so heavily that it would likely constitute code theft. I'd love to drop the dependency though, if anyone thinks they can implement it without copying code.

> A reporting framework would be fantastic

I need this feature, and I have a few great ideas on how to implement it. I'd like to talk to you more (@richfitz) about your particular use case in case my solution only suits my use case. I think this feature can potentially be the most powerful and useful capability of assertr.

Answer 17 · 2016-03-12T22:39:37.000Z

So I'm going to run into a little more free time in the near future and I'd like to get back from my learning hiatus and into improving assertr--particularly because people are telling me it's really useful for them. Because of the learning hiatus, I have some fresh new ideas for improvement, but I'd like to run them past some of you for further input...

The main entrypoints are difficult to differentiate

As mentioned before, there are assert, insist, insist_rows, and assert_rows for representing a wide range of tasks. However, if I wanted to add the ability to, say, declare that the whole data set should have no more than 15% missing values (and I do want to add that), the semantics of that would require another specialized function... and I'm running out of synonyms for "assert"!
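For concreteness, the whole-data-set check mentioned here (no more than 15% missing values) could look roughly like the following in base R. `check_missing_fraction` is a hypothetical helper written for this illustration; it is not an assertr function and says nothing about how the feature would eventually be designed.

```r
# Hypothetical helper: fail when more than max_fraction of all cells are NA.
check_missing_fraction <- function(data, max_fraction = 0.15) {
  # is.na() on a data frame gives a logical matrix; its mean is the NA share.
  observed <- mean(is.na(data))
  if (observed > max_fraction) {
    stop(sprintf("%.1f%% of cells are missing (limit %.1f%%)",
                 100 * observed, 100 * max_fraction))
  }
  invisible(data)
}

complete <- data.frame(a = 1:4, b = c(1.5, 2.5, 3.5, 4.5))
sparse   <- data.frame(a = c(1, NA, NA, 4), b = c(NA, 2, 3, 4))  # 3 of 8 cells NA

check_missing_fraction(complete)                        # passes silently
outcome <- tryCatch(check_missing_fraction(sparse),     # 37.5% missing, fails
                    error = function(e) conditionMessage(e))
```

Unlike a column-wise predicate, this check needs the whole data frame at once, which is why it does not fit the existing verbs.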

Principled though that solution was, I'm not sure it's the correct thing to do going forward; it's always been one of R's strengths from a user's point of view to use a familiar generic function (mean, plot, etc...) with all sorts of input and have the object system dynamically dispatch the correct functionality.

So how about this... creating an ~~S3~~ S4 generic function (proclaim perhaps?) that will handle all of the different semantics for the user. Concretely, the function returned from within_n_mads can be labeled with class assertr_dynamic (because the predicate is dynamically generated). Then, proclaim would dispatch on the second argument (the first arg is the data frame) and call what is currently referred to as insist. In the same way, maha_dist would be classed something like assertr_dynamic_rows and a user calling proclaim(df, maha_dist, within_n_mads(10), ...) would transparently dispatch insist_rows(maha_dist, within_n_mads(10), ...)

This would improve assertr's extensibility greatly; for example, adding semantics to check the supplied data.frame as a whole would only require writing another ~~S3~~ S4 method of the proclaim generic and making sure that the predicate function was correctly classed.

This solution is perhaps a little unconventional, but it completely obviates the requirement that a useR remember all the verbs.
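A rough base R sketch of the dispatch idea follows. It uses S3 rather than S4, since S3's `UseMethod` can already dispatch on a named object other than the first argument, and it takes column names as strings to avoid non-standard evaluation. Everything here is an assumption made for illustration: `proclaim` is only the name floated above, `within_sds` is a hypothetical stand-in for a predicate generator, and none of this is assertr's implementation.

```r
# Hypothetical generic: dispatch on the class of the SECOND argument.
proclaim <- function(data, predicate, ...) UseMethod("proclaim", predicate)

# Ordinary predicates get assert-like semantics: apply directly to each column.
proclaim.default <- function(data, predicate, columns, ...) {
  for (column in columns) {
    if (!all(predicate(data[[column]]))) {
      stop("assertion failed on column: ", column)
    }
  }
  data
}

# Classed predicate generators get insist-like semantics: build a predicate
# from each column first, then apply it to that same column.
proclaim.assertr_dynamic <- function(data, predicate, columns, ...) {
  for (column in columns) {
    column_predicate <- predicate(data[[column]])
    if (!all(column_predicate(data[[column]]))) {
      stop("assertion failed on column: ", column)
    }
  }
  data
}

# Hypothetical generator; the class attribute is what drives the dispatch.
within_sds <- function(n) {
  generator <- function(column) {
    centre <- mean(column)
    spread <- sd(column)
    function(x) abs(x - centre) <= n * spread
  }
  structure(generator, class = c("assertr_dynamic", "function"))
}

static_result  <- proclaim(mtcars, function(x) x > 0, c("mpg", "wt"))
dynamic_result <- proclaim(mtcars, within_sds(4), "mpg")
```

Dispatching on both the data argument and the predicate, as discussed for carrying an error log, is where S4 multiple dispatch would come in; this S3 version only covers the single-argument case.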

The custom handler routine is inflexible

The most common suggestion for assertr is to be able to warn (not error) on violation. Additionally, even if it does error on violation, there should be semantics in place for the entire chain of assertions to run so that the final error message will contain the complete report of the data errors.

The reason I needed to take my time with this one is that I need the warnings to be concatenated through an assertr chain in a principled manner. To do this without using dynamic variables (eww), the assertr verbs need to return the warnings along with the given data frame, so that they can be concatenated with the warnings further down the assertion pipeline. Up until recently, I thought that I would have to implement something that would be tantamount to a Haskell monad in order to do this. Another possibility is to use S4 (hence the S3 strikeouts in the paragraphs above) in order to get proclaim (or whatever it is) to dispatch on both the second argument and the first argument. If the first argument is a data.frame the current semantics stand... if the first argument is something else (some class that holds a data.frame and a running list of errors) then the wanted behavior can be dispatched.

The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?
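One way to picture the composite data.frame/error_log object is with an explicit terminating step, sketched below in base R. This sidesteps the question rather than answering it: the chain ends only because the user calls a final function. The names `start_chain`, `check`, `end_chain`, and the class `checked_data` are hypothetical and chosen only for this illustration.

```r
# Hypothetical wrapper: a data frame plus a running log of failures.
start_chain <- function(data) {
  structure(list(data = data, errors = character(0)), class = "checked_data")
}

# Hypothetical check step: record a failure instead of stopping,
# then hand the wrapper on to the next step.
check <- function(chain, predicate, column, label) {
  stopifnot(inherits(chain, "checked_data"))
  if (!all(predicate(chain$data[[column]]))) {
    chain$errors <- c(chain$errors,
                      sprintf("column '%s' violates: %s", column, label))
  }
  chain
}

# Hypothetical terminator: report everything at once, then unwrap the data.
end_chain <- function(chain, error_fun = stop) {
  if (length(chain$errors) > 0) {
    error_fun(paste(c("assertion failures:", chain$errors), collapse = "\n"))
  }
  chain$data
}

state <- start_chain(mtcars)
state <- check(state, function(x) x > 0, "mpg", "mpg > 0")                    # passes
state <- check(state, function(x) x < 30, "mpg", "mpg < 30")                  # fails
state <- check(state, function(x) x %in% c(0, 1), "gear", "gear in {0, 1}")   # fails

# With error_fun = warning, the full report is issued and the data comes back.
unwrapped <- suppressWarnings(end_chain(state, error_fun = warning))
```

Passing `error_fun = stop` (the default here) would raise a single error listing both failures, which matches the "complete report" behaviour described above.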

I'd appreciate any feedback on these ideas for two reasons (a) it's now (or will hopefully be soon) ROpenSci's project not just mine, and (b) I'd like to get the input of some talented developers :)

Answer 18 · 2016-03-12T23:09:21.000Z

To review, none of the proposed additions have to break backwards compatibility :) It would just make everything much easier for the user.... the example in the README would go from this:

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

to this

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      assert(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

(verify wouldn't be able to be replicated under an assert S4 generic)

Answer 19 · 2016-03-12T23:12:12.000Z

Since I'm the one who is yet to review ... @tonyfischetti do you have recommendations of a good dataset to run through assertr? I.e. one that you think should show it off but ... there's enough uncertainty that it would be interesting to see how things go? Also to see how a new user manages with it. I have one idea that I'll fall back on if nothing immediately comes to mind.

Answer 20 · 2016-03-12T23:17:00.000Z

@jennybc Nothing immediately comes to mind but I'm sure I can dig up one of the examples that inspired me to get into this package in the first place :)

Answer 21 · 2016-03-12T23:41:55.000Z

@aammd politely reminded me I have some really ugly data asserting/cleaning code in the private STAT 545 instructors repo, so that's my plan B 😬.

Answer 22 · 2016-03-13T00:20:03.000Z

@jennybc wellll i would not say ugly, but rather "pre-assertr". it represents "how we did this before assertr" and highlights improvements in the UI of this package (improvements I tried to show in my lesson about assertr)

Answer 23 · 2016-03-14T23:19:34.000Z

@tonyfischetti some thoughts on:

> The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?

with help of @smbache - we have a way to detect whether a piped command is the last one or not. If it is the last, do X (e.g., execute some other fxn, print data, etc.) instead of passing to the next command. You can see the helper fxns here https://github.com/ropensci/jqr/blob/master/R/pipe_helpers.R and usage here https://github.com/ropensci/jqr/blob/master/R/index.R#L42

Answer 24 · 2016-03-22T00:30:22.000Z

@jennybc - hey there, it's been 89 days, please get your review in soon, thanks 😺

Answer 25 · 2016-03-22T04:22:56.000Z

OK I promise you I will not be able to face you at unconf w/o this being totally done.

Answer 26 · 2016-05-23T14:43:38.000Z

@jennybc - hey there, it's been 151 days, please get your review in soon, thanks 😺 (ropensci-bot)

Answer 27 · 2016-05-23T15:14:59.000Z

It must be the temptation of ensurer that is causing delay 😆 Hehe

Answer 28 · 2016-05-23T15:33:08.000Z

I am, and have been, halfway done for ages. I have a PR ready for the vignette. It's the incredibly insightful overall comments that need to be written. 😳 Will do.

Answer 29 · 2016-05-25T14:13:57.000Z

@smbache I didn't know ensurer was being considered

Answer 30 · 2016-05-25T14:14:36.000Z

It's not. I was joking ;)

Answer 31 · 2016-05-31T00:31:05.000Z

@jennybc - hey there, it's been 159 days, please get your review in soon, thanks 😺 (ropensci-bot)

Answer 32 · 2016-06-08T16:50:43.000Z

@tonyfischetti approved!

Add the footer to your README:

[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)

Update installation of dev versions to ropenscilabs/assertr and any urls for the github repo to ropenscilabs instead of tonyfischetti
Update any links to the package from tonyfischetti/assertr to ropenscilabs/assertr (though even if they aren't github will redirect to the new location :) )
Go to the Repo Settings --> Transfer Ownership and transfer to ropenscilabs - Note that all our newer pkgs go to ropenscilabs first, then when more mature we'll move to ropensci

Answer 33 · 2016-06-14T21:14:47.000Z

I tried to do the last thing and it says I don't have admin rights to ropenscilabs :(

Answer 34 · 2016-06-14T21:17:03.000Z

you should have received an invitation from ropenscilabs, did you get that email?

Answer 35 · 2016-06-14T21:21:07.000Z

Idiotic move on my part not checking the mail :) It's done