ropensci/software-review

Skimr

Closed this issue Β· 63 comments

elinw commented

Summary

  • What does this package do? (explain in 50 words or less):
    Creates compact and flexible data summaries that are pipeable and display nicely in the console. For a given data frame or vector the skim function provides a useful set of summary statistics (based on the class of the individual vector/column) that allows users to skim their data to get an overall sense of what is included, extent of missing values, and similar information. The skim generic can be extended by users to other data structures.

  • Paste the full DESCRIPTION file inside a code block below:

Package: skimr
Title: Compact and Flexible Summaries of Data 
Version: 0.9.92
Authors@R: c(
  person("Amelia", "McNamara", email="amcnamara@smith.edu", role = "aut"),
  person("Eduardo", "Arino de la Rubia", email="earino@gmail.com", role = "aut"),
  person("Hao", "Zhu", email="haozhu233@gmail.com", role = "aut"),
  person("Julia", "Lowndes", email="lowndes@nceas.ucsb.edu", role = 'ctb'),
  person("Shannon", "Ellis", email="sellis18@jhmi.edu", role = "aut"),
  person("Elin", "Waring", email="elin.waring@gmail.com", role = "cre"),
  person("Michael", "Quinn", email="msquinn2@illinois.edu", role = "aut"),
  person("Hope", "McLeod", email="hmgit2@gmail.com", role = 'ctb'),
  person("Hadley", "Wickham", email="hadley@rstudio.com", role = 'ctb'),
  person("Connor", "Kirkpatrick", email="hello@connorkirkpatrick.com", role = 'ctb')
  )
Maintainer: Elin Waring <elin.waring@gmail.com>
Description: A simple to use summary function that can be used with pipes
    and displays nicely in the console. The default summary statistics may be 
    modified by the user as can the default formatting. Support for data frames 
    and vectors is included, and users can implement their own skim methods for
    specific object types as described in a vignette. Default summaries include
    support for inline spark graphs. Instructions for managing these on 
    specific operating systems are given in the Using skimr vignette and the 
    README.
Depends:
    R (>= 3.1.2)
Imports:
    dplyr (>= 0.7),
    magrittr,
    pander,
    purrr,
    rlang,
    stats,
    stringr,
    knitr,
    tibble (>= 0.6),
    tidyr (>= 0.7),
    tidyselect
Suggests:
    extrafont,
    rmarkdown,
    testthat,
    withr
License: GPL-3 + file LICENSE
Encoding: UTF-8
LazyData: true
URL: https://github.com/ropenscilabs/skimr
BugReports: https://github.com/ropenscilabs/skimr/issues
VignetteBuilder: knitr
RoxygenNote: 6.0.1
Collate: 
    'skimr-package.R'
    'formats.R'
    'stats.R'
    'functions.R'
    'options.R'
    'skim.R'
    'skim_print.R'
    'skim_v.R'
    'summary.R'

  • URL for the package (the development repository, not a stylized html page):
    https://github.com/ropenscilabs/skimr

  • Please indicate which category or categories from our package fit policies this package falls under *and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

"tools" Because skimr provides users with a way to get started with a new, unknown data set (by getting a quick overview (or skim) of the data that is readable and compact. It also serves as a good tool for reporting summary information about data.

  •   Who is the target audience and what are scientific applications of this package?  
  • People who want to get a quick overview of a data set while initially reviewing, iteratively cleaning, or sharing with others.
  • Beginning statistics students who need simple access to a set of basic descriptive statistics.

This is intended as an improved version of the r core summary functions available. Like summary, skim is a generic that users can extend to any R data object. It is designed to be somewhat like a more flexible version of fivenums() from stats or favstats() from the Mosaic package.

  •   If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coeveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal.
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

πŸ‘‹ @elinw
Thanks very much for this submission. I will assign an editor shortly and we will then work through the editor checklist before we assign reviewers. We normally strive to get the process started within 5 business days. But we are approaching winter break (I'll be offline all of next week). So in the chance that we are not able to assign an editor/reviewers in the next two days, you'll see an update to this thread in the first week of January.

@elinw thanks for your submission & sorry about the delay! I've seen development is still active, do you consider skimr stable enough for review? It'd be best not to change too many things while the reviewers are looking.

The tests now pass on my laptop using the latest testthat version, thanks @jimhester for the help!

He suggested to add testthat (>= 2.0.0) in DESCRIPTION.

elinw commented

Yes, as of now we have fixed the Windows issue in master. I'm about to push one more change and then I'm going to tag 1.0.1 sometime today. I'll definitely update the testthat minimum first.

And you can confirm that 1.0.1 will be good for review? I even have two reviewers lined up once you confirm that. πŸ˜„

elinw commented

Yes it will be.

Editor checks:

  • [x ] Fit: The package meets criteria for fit and overlap
  • [ x] Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • [ x] License: The package has a CRAN or OSI accepted license
  • [ x] Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks again for your submission, if I understand correctly the version for review will be ready today?

  • Add testthat minimal version in DESCRIPTION.

  • You don't need to fill the Maintainer field in DESCRIPTION, it'll get filled from Authors@R

goodpractice output for you and the reviewers to have in mind

It is good practice to

  βœ– write short and simple functions. These functions have high cyclomatic
    complexity:kable.data.frame (60).
  βœ– avoid long code lines, it is bad for readability. Also, many people prefer editor
    windows that are about 80 characters wide. Try make your lines shorter than 80
    characters

    R\skim.R:14:1
    R\skim.R:73:1
    R\stats.R:28:1
    R\stats.R:52:1
    R\stats.R:64:1
    ... and 20 more lines

Two typos identified via devtools::spell_check()

differnt      skimr-package.Rd:16
faciliates    skim_to_wide.Rd:19

Reviewers: @jimhester @jenniferthompson
Due date: 2018-01-30

@jimhester and @jenniferthompson thanks for accepting to review this package! Your reviews are due on the 2018-01-30. 😺

You'll find here the review template and our reviewing guide.

elinw commented

Thanks for the quick checks. I have done the easy ones, but there is one issue that we can't fix and two I'd like feedback on.

  1. The complex function one is a function that is brought in from knitr so we can't do anything about that. So in our code it is just
    kable.data.frame <- knitr::kable

  2. In terms of long lines, the ones that remain are in tests of the print function so they are the full printed line. If we don't do that we get the tibble wrap which defeats the idea of testing the whole line. On the other hand if we want we could split those strings into two parts and then paste0() them which would keep the lines under 80 but still give us a correct version that is the full line. Does it make sense to do that? Or does that actually make the tests themselves harder to read?

  3. We have some things that are only run for windows_os and those are the lines showing up without code coverage in good practices. We've excluded them from testing on Travis since the code will never be hit on Linux or Mac. Is it possible to write conditional exclusions so that the tests would run when built on Windows?

You can run covr on appveyor as well, and the coverage reports will be automatically merged by codecov.

elinw commented

Cool, just add the same line to appveyor.yml?
- Rscript -e 'covr::codecov()'

Appveyor needs to use double quotes and you can add it to on_success:

on_success:
  - Rscript -e "covr::codecov()"

And I think you may need to add your codecov token to the appveyor settings. You can get the token at https://codecov.io/gh/ropenscilabs/skimr/settings, and add it as CODECOV_TOKEN at https://ci.appveyor.com/project/ropenscilabs/skimr/settings/environment

Thanks Jim! It seems to be working after ropensci/skimr@c34cec7

elinw commented

Release 1.0.1 should be in CRAN at this point. @michaelquinn32 is doing some refactoring that should be invisible to end users, and we have plans for updating some tests and documentation (and some feature requests) but other than that things are stable.

@jimhester and @jenniferthompson friendly reminder that your reviews are due on Tuesday next week 😸

@michaelquinn32 @elinw FYI, I'm getting a "Project not found or access denied" error when I try to look at the Appveyor page via clicking the README badge. I'm not experienced in CI, so I'm not sure if that should have been addressed with the fix Jim mentioned a couple of weeks ago.

Also, if I've caught some typos in the vignettes/examples, would y'all like a PR or no?

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Review Comments

All review comments are based off of version 1.0.1. Overall I feel like the
package would be a welcome addition to rOpenSci after the review process is complete.

README

  • Get rid of ## Dev mode: on at the top.
  • devel installation instructions better written as
    devtools::install_github("ropenscilabs/skimr@develop") (not sure it is
    necessary to mention the develop branch at all here)
  • A markdown file should not have a yaml header and this file should really
    be built from a README.Rmd included in the repository.
    usethis::use_readme_rmd() is a good way to set this up.
  • Should have links to kable() and pander(), and possibly include
    examples of what 'enhanced' options are available and what they look like.

skim

Displaying the results of skim() as seemingly a table, but having the
underlying data as a table in a different format is quite surprising. I pretty
strongly think that you should have two functions, one which returns the
summary information in a wide format, the other in the long format. e.g. From
the print output I would expect to be able to run the following to retrieve
variables in the dataset with standard deviations greater than 1

skim(iris, Sepal.Length, Petal.Length) %>%
  filter(sd > 1) %>%
  pull(variable)

but this instead throws a pretty confusing error

Error in filter_impl(.data, quo) :
  Evaluation error: comparison (6) is possible only for atomic and list types.

Because the data is actually in long format.

skim(iris, Sepal.Length, Petal.Length) %>%
  filter(stat == "sd", value > 1) %>%
  pull(variable)

Gives you the expected output.

As the print method hides this dichotomy I think this will cause users
confusion which would be best remedied by having two different functions. A
simple solution would be to return the skim_to_wide() representation from
skim(). Or maybe instead have a series of functions

  • skim_wide()
  • skim_long()
  • skim_list()

Regardless of the exact API I think you should think hard about making it more
obvious the results of skim() are not simple rectangular tables.

I realize this is somewhat challenging because you have separate output for
different column types. Another possible solution which would require minimal
API changes is to modify the print method to make the heterogeneous parts more
clear.

skim_with

Designing skim_with() as a impure function which modifies global state seems
an odd choice when it could instead be written as a functional which returned
skim() functions with alternate defaults. Beside the impurity concerns it
also makes it more difficult to use different types of skim functions
interchangeably. If you really want the impure interface you could do so by
defining skim_with_defaults() with the same interface as skim_with() and an
implementation that simply calls the pure skim_with() function and assigns
the result as the new default.

Vignettes

Having the 'Using Fonts' vignette be a rmarkdown template is kind of strange,
also I think it might be worth listing a few fonts which do support all the
features needed for the sparklines / historgrams would be useful.

Miscellanea

get_funs() has a very weird implementation, particularly because it's behavior seems equivalent to double square bracket. e.g.
options$functions[["numeric"]] gives equivalent output to get_func("numeric"), likewise for get_func("xyz") (which is NULL).

The code inherited from
pillar

should have a comment nearby indicating such, so that you can more easily track
upstream changes if the pillar implementation changes in the future.

I really think you should reconsider changing min and max to p0 and
p100. min() and max() are self explanitory, but I would need to read the
quantile documentation to be sure that p0 is guaranteed to be the minimum
value in the distribution not subject to rounding error. I would actually
prefer dealing with -Inf and Inf in the all NA case to using
quantile(probs = 0).

Thanks for your review @jimhester! 😸

skimr Package Review

Jennifer Thompson

This review is of skimr version 1.0.2, downloaded from Github 2018-01-29. Other relevant info: R version is 3.4.3; devtools version is 1.13.4.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work.

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally Yes, but see below
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5


Review Comments

skimr is a very, very useful package that is generally well thought in terms of both simplicity and extensibility. I particularly appreciate a thorough vignette for the main functions, intuitive function names, and the attention to where many users are likely to run into trouble, along with the proactive explanation of potential solutions. skimr does a specific set of tasks and does them simply and well. The source code is easy to digest in both style and function, and the comments are abundant, clear and helpful.

I do agree with Jim that it's confusing to have the printed output be a completely different format from what is output to the console; I'm most excited about (and expect) the wide format printed, vs the long data.frame returned. Perhaps have the default skim() return a named list, with an element for each data type containing the "wide" format as a data.frame, and have a skim_long that returns the current long data frame.

Some specific suggestions:

  • README:
    • Clarify first bullet - "added missing, complete, n, sd" to what?
    • Suggested edit for brevity to statement of need: "skimr provides a frictionless approach to summary statistics which conforms to the principle of 'least surprise' [link to description of this principle], displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skimr object which can be included in a pipeline or displayed nicely for the human reader."
    • Remove YAML output at the top.
  • Really helpful vignette for skim() - great work showing common use cases and some really helpful functionality. Made it easy to add my own summary statistics. I did find a few typos and can include these in a PR if desired.
  • Examples in help files:
    • I'm not sure skim_format(.levels = list(nchar = 4)) is working. I changed the value of this argument several times, and skim(iris) looked the same every time, both printed and as a data.frame. Could be user error but I couldn't figure out how to make it affect the output.
    • ?skim: group_by example doesn't actually group by anything (works when I group by Species)
    • I'm not clear on the benefit of having the fonts vignette as an RMarkdown template. Also, it's not showing up as an option for me in RStudio (I only have HTML vignette and Github document, even with skimr loaded).
    • Would be helpful to show an example of skim_tee() - I'm not totally clear on when it's beneficial.
  • The inline histograms are really helpful for truly continuous variables. Is it possible to make them smarter/more flexible? When I skim a numeric discrete variable with 5 levels (skim(sample(c(1:5, NA), size = 20, replace = TRUE))), I get a histogram with what appears to be 8 bars.
  • Community guidelines: Provide links to contributing.md, Google style guide and CONDUCT.md.
  • The error message when a variable has infinite values (produced via dplyr::cut, I think) could be clearer - could be a good warning message to include up front.
  • I'd prefer p50 to median, particularly if you stick with p0 and p100. (In terms of workflow, I do prefer NAs to -Inf/Inf, though :) )

Overall I'm excited to use this package in my own work, and really appreciate the hard and thoughtful work of all the authors.

Thanks a lot for your review @jenniferthompson! πŸ˜ƒ

@elinw now both reviews are in!

@elinw I had forgotten to ask you to add the rOpenSci review badge to the README! πŸ™ˆ

[![](https://badges.ropensci.org/175_status.svg)](https://github.com/ropensci/onboarding/issues/175)

elinw commented

@jimhester @jenniferthompson @maelle Thank you so much for the reviews! They are very helpful and for the most part reflect issues @michaelquinn32 have been discussing often with users in the tracker (and Jennifer we'd be thrilled with a PR for typos).

A couple of quick thoughts. @jimhester The fonts thing is tricky because I want to vignette to actually use the fonts and also didn't want to suggest extrafont not put it in the requires list. Any thoughts about how to accomplish that? Also we don't inherit from pillar any more. Hadley told us to just put the code into our codebase since they are rapidly changing APIs and some parts are now internal. Hence Hadley and RStudio are now in the contributors list. I think though that it you're right that it's probably worth adding comments for when and if there are problems. Just thinking that we may also want to send them a note about how we addressed the UTF8 issue.

The NA versus Inf/p0,p50,p100 versus min, median, max issues I think is really interesting. All of us have opinions on "what the user expects" but actually we have no idea, really, what "typical" users expect and it is clear that there are very large portions that expect each. I did actually read the help on all of these so I know that p0 and p100 are the lowest and highest values, it's just that they do not enforce transitivity the way min and max do. Another option is to write a more complex function but that in a way makes the problem of being consistent with the core functions worse.

Lots to think about! Thanks!

elinw commented

Oops @jenniferthompson also mentioned the template (and had the same issue as someone in the issue tracker) so I'm trying to hunt that down. It is usually something goofy when they don't show up.

I think that with the histograms we need to come up with a more structured and probably less elegant approach when there are 2-9 separate values. We were more focused on the too many case than the too few.

I'm not sure what is going on with the help index, on my rstudioserver install from CRAN I have
screen shot 2018-01-31 at 12 17 58 pm

screen shot 2018-01-31 at 12 18 08 pm

But definitely don't have that when I build locally.

elinw commented

Some of the items listed are now in master (basic fixes to the readme, make the template work, add a bit to the fonts vignette, fixed the wrong code in the format documentation). @jenniferthompson About Inf values, you're talking about when the data being skimmed has Inf as opposed to when the skim produces it? That's definitely something we need to look at and handle. Probably overall edge case testing needs to b examined. I have a refactoring of tests project I'm working on so I'll include the cases I can think of.

About Inf values, you're talking about when the data being skimmed has Inf

@elinw Yes, exactly! I tried it on one of my datasets; my variable really shouldn't have had Inf as a value, so it was helpful in that it made me aware of it once I figured out what variable was causing the error. But I was thinking that maybe even more helpful would be either a check up front for Inf values with a more specific error message, or a workaround in the code to avoid the error but an informative warning message.

elinw commented

Thanks for the reviews everyone! I've tried to reduce it to a set of issues that we can start tracking. Please let me know if I missed anything:

  1. Fix readme. The bullet points are clear.
    • Get rid of ## Dev mode: on at the top.
    • devel installation instructions better written as
      devtools::install_github("ropenscilabs/skimr@develop") (not sure it is
      necessary to mention the develop branch at all here)
    • A markdown file should not have a yaml header and this file should really
      be built from a README.Rmd included in the repository.
      usethis::use_readme_rmd() is a good way to set this up.
    • Should have links to kable() and pander(), and possibly include
      examples of what 'enhanced' options are available and what they look like.
  • Clarify first bullet - "added missing, complete, n, sd" to what?
  • Suggested edit for brevity to statement of need: "skimr provides a frictionless approach to summary statistics which conforms to the principle of 'least surprise' [link to description of this principle], displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skimr object which can be included in a pipeline or displayed nicely for the human reader."
  1. Coalesce skim api around skim_long (aliased skim), skim_wide and skim list. The latter two should be support downstream computation, and not just the rendered values. This is going to be a bit tricking for stats with multiple levels, since we normally collapse them in the rendered output.

skim_long is considered the default because it is the most terse representation of relatively heterogeneous data. For any df with more than one data type, skim_wide will be very sparse. Jim's example is going to give one or more rows with lots of NA values where statistics weren't computed.

Skimming to a list is nice for some work but it doesn't play nice with magrittr/dplyr pipes. The other alternative, using nested data frames as a list within a single skimmed data frame, isn't a great idea until dplyr better supports accessing nested data.

x <- data.frame(id = 1:3, listcol = I(replicate(3, data.frame(a = 1, b =2), simplify = FALSE)))
dplyr::select(x, listcol$a)
# Throws an error
  1. Support something in the print that talks about the returned data object being long, wide or a list.
  2. (short-term) Rewrite skim_with_defaults() to use the same interface as skim_with(). Same for formatting functions.
  3. (long-term) Rewrite skim_with to support a functional implementation that does not depend on a global state.
  4. Better identify code from pillar. Link to it.
  5. Update get_funs documentation
  6. Fix group_by in skim docs (done). Thanks for pointing that out!
  7. Add a skim_tee example
  8. Support inline histograms for vectors with less than the default number of characters.
  9. Link to contributing.md, tidyverse style guide (replacing Google Style guide) and CONDUCT.md.
  10. Rename median to p50, compute using quantile for consistency.

-- Push back

  1. get_funs is vectorized, to work with S3 inheritance, which is just a vector of different classes. TBH this is all a little hacky, since we're basically reimplementing S3, but I would love a better idea that allows for dynamic registration of types.

The case that you gave above didn't work for us when we ran into ordered factors.

y <- ordered(letters)
class(y)
#> [1] "ordered" "factor"
z <- list(factor = "blue")
z[[class(y)]]
#> Error in z[[class(y)]] : no such index at level 1
  1. The default value of min for summary is also P0. It uses quantile to produce a five-number summary. I think it's appropriate to go this route as long as the returned statistics are consistently named, which they are. https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/base/R/summary.R#L42
  2. Vignettes, +1 to what Elin already said.
  3. For formatting the acceptable values for the .levels options are max_char = 3 and max_levels = 4. All apologies if the documentation suggested otherwise.
elinw commented

We've fixed a bunch of these issues,. @michaelquinn32 about the levels issue, the way it can work is really what you get when you go wide (spread) from the long, because the levels are already separate rows in the long form. I'm just wondering if it makes sense to think about why people want to do calculations, if indeed that is what they want to do. I think some people want the formatted because they are working on display, while other people want the numeric because they want to do things like filter based on sd or having non missing cases etc. Before we start implementing something we should think about what the ultimate goal is. That's why the question of doing selective skimming for a particular type is important too, because what does it mean when you have a mix of types? By the way the same whole list of concerns applies to many summary.x() methods so its good to try to think this through especially about what is not satisfying in summary() and its associated print methods. Just as an example, the print() for summary(lm(y~x)) really does not reflect what is in the summary object. We could even consider a complex object like that.

@elinw @michaelquinn32 thanks both for your work! 😸

Please create related issues in skimr repo and report back here your answers to reviewers and the specific questions you may have for them once you're ready. πŸ˜‰

elinw commented

I have a question for @jimhester, I did try moving the Rmd out of the docs folder to the top level (it's that way in develop right now) but I really feel it makes the top level a bit messy there since it doesn't really serve any purpose for github users to have it there (unlike all the rd files).

elinw commented

Okay I'm starting a list of issues from onboarding comments. I'll keep adding them here until I get a full list.

Having the Rmd in the top level is easier for contributors, as that is the standard convention for where to put it in R projects. Because you need to edit the *.Rmd file rather than the .md having both in the same directory makes this more obvious. For instance I had no idea the Rmd was even in version control originally in skimr. If you do want to include the Rmd in a different directory I would encourage you to have a comment in the md file mentioning this location, e.g. https://github.com/tidyverse/glue/blame/master/README.md#L2

elinw commented

An update.

As of today I think the easy 3 (that wouldn't involve API changes) have all been addressed. @michaelquinn32 has been working on a discussion paper about the API issues so that we can have a clear plan moving toward skimr 2.0 and our user community can chime in. That should be ready to share soon.

Thanks @elinw and @michaelquinn32 for the update and your work!

elinw commented

@jimhester In re-reading this morning I noticed this commen "Another possible solution which would require minimal API changes is to modify the print method to make the heterogeneous parts more
clear." That's something that we should be able to do with markup I think as an interim while we are deciding what to do with the larger API issue.
@jenniferthompson I still also need to look at the fewer than 8 bins issue.

elinw commented

Okay so we released 1.0.2 with all the bug and documentation fixes so far. I'm going to look at this possibility of using markup to make the display more clear to people.

Thanks for the update @elinw and congrats on the release! 🎊 I guess you'll soon write the "final" answer to reviewers then?

elinw commented

@maelle That's the plan. I wanted to add that we now have in the develop branch a solution to the other issue that I think folks here and in our tracker have wanted. That is now you can define your own complete list of functions with a name and use skim_with() to append or replace the default skimmers. So the way I think about this the use case would be that you have several different lists that you use for different situations e.g. my_funs1, my_funs2, my_funs3. Then you still have to do two steps but it is e.g.

skim_with(.list = my_funs1, append = FALSE)
skim(x)
skim_with(.list = my_funs2, append = FALSE)
skim(y)
skim_with(!!!my_funs3, append=FALSE)
skim_with_defaults()

I think if this were part of your usual workflow you might write your own functions so you don't need to do that append=FALSE and .list=.

There's also a PR making it so that if you define an empty list of functions for a type that type is removed from the analysis.

Cool work! 😸 Thanks for the update.

@elinw Any update? 😺

@elinw @michaelquinn32 any update? ☺

elinw commented

Ok thanks. Don't hesitate to notify the reviewers in person πŸ˜‰

elinw commented

Okay here's an update @jimhester @jenniferthompson @michaelquinn32
As of now all changes have been merged into the master branch, and I'm planning to do a release of 1.0.3 because I got a message from Brian Ripley about a typo in a test (I can't believe that all the eyes that looked at this before missed it).
I think we have responded to all of the issues except the possibility of changing the API. Michael has written a road map for skimr 2.0 that addresses that along with some architectural issues. But I think over the course of the discussion here and in the skimr repository we all came to agreement that cosmetic changes that make the fact that the skim print results do not represent a single data frame more clear would be okay for skimr 1.

Changes for 2.0 are in this doc.
https://docs.google.com/document/d/18lBStDZzd1rJq08O-4Sw2qHhuHEZ79QX4sBkeyzWNFY/edit?usp=sharing

I'm writing tests for the next version right now.

Thanks for the updates @elinw and @michaelquinn32, very cool to see such active development!

@jimhester and @jenniferthompson , are you happy with the current version of skimr?

LGTM

Very likely! I'm traveling and won't have time to look at the latest version till next week, but will do asap.

Thanks @jimhester!

@jenniferthompson, thank you, we'll wait for your input next week, happy travels!

Sorry for the delay, y'all (jet lag is very real) - just looked at the updated version and it looks super! πŸŽ‰ Awesome work, @elinw @michaelquinn32.

Going to check out the version 2.0 roadmap now πŸƒβ€β™€οΈ

I really appreciate all of the hard work everyone has put in. skimr has improved in so many ways thanks to this process. Thanks!

Approved! 🎊 Thanks a ton @jimhester and @jenniferthompson for your reviews and @elinw and @michaelquinn32 for your development work!

Now here is the list of things you have to do before I close this issue πŸ˜‰

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. I'll make you both admins once you do.
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We don't transfer Appveyor projects from personal accounts any more but since the project is under ropenscilabs we'll transfer it to ropensci account, once you've transferred the repo I'll activate the project.
  • add rOpenSci footer to README
    [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

@elinw @michaelquinn32 πŸ‘‹ could you soon tackle the list above? Thank you! πŸ˜ƒ

elinw commented
elinw commented
  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. I'll make you both admins once you do.
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We don't transfer Appveyor projects from personal accounts any more but since the project is under ropenscilabs we'll transfer it to ropensci account, once you've transferred the repo I'll activate the project.
    add rOpenSci footer to README
  • ropensci_footer
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

About Appveyor that currently doesn't have a badge because of some problem we had before but we do run it off of @michaelquinn32's account. So what to do? We're happy to start from scratch on it.

You can keep the Appveyor project under @michaelquinn32's account if you prefer!

elinw commented

I think it makes sense to put it in ropensci.

Closing this, we can sort out CI details elsewhere and this way the peer-review badge will become green. πŸ˜‰

@elinw Tired of writing about skimr yet? Or might you be interested in writing a post for the rOpenSci blog?

elinw commented

@jimhester @jenniferthompson @maelle I think we are finally close to release for v2 which addresses most of the issues we discussed, specifically the format of the skim object. We still need to rewrite vignettes and the readme. We'll be adding a pkgdown site although we're not making that as a blocker for the release. I think there is one important issue to resolve. (We'll probably do one more v1 release since we have fixed a few issues there in the past month.) @michaelquinn32 anything to add?

Once you have the vignette I'd recommend making the pkgdown website right away because it'll be little effort and will allow you to put the link in DESCRIPTION before you submit the new version to CRAN πŸ˜‰