ropensci/software-review

submission: piggyback

cboettig opened this issue · 48 comments

Summary

  • Allow large and binary data files to "piggyback" on top of your existing repositories: push and pull large-ish (< 2 GB) data files to and from GitHub repositories as attachments to a GitHub release.

  • Paste the full DESCRIPTION file inside a code block below:

Package: piggyback
Version: 0.0.0.9000
Title: Managing Larger Data on a GitHub Repository
Description: Because larger (> 50 MB) data files cannot easily be committed to git,
  a different approach is required to manage data associated with an analysis in a
  GitHub repository.  This package provides a simple work-around by allowing larger
  (up to 2 GB) data files to piggyback on a repository as assets attached to individual
  GitHub releases.  These files are not handled by git in any way, but instead are
  uploaded, downloaded, or edited directly by calls through the GitHub API. These
  data files can be versioned manually by creating different releases.  This approach
  works equally well with public or private repositories.  Data can be uploaded
  and downloaded programmatically from scripts. No authentication is required to
  download data from public repositories.
Authors@R: person("Carl", "Boettiger",
                  email = "cboettig@gmail.com",
                  role = c("aut", "cre", "cph"),
                  comment=c(ORCID = "0000-0002-1642-628X"))
URL: https://github.com/cboettig/piggyback
BugReports: https://github.com/cboettig/piggyback/issues
License: GPL-3
Encoding: UTF-8
LazyData: true
ByteCompile: true
Imports:
    gh,
    httr,
    jsonlite,
    git2r,
    fs,
    usethis,
    crayon,
    clisymbols
Suggests:
    readr,
    covr,
    testthat,
    datasets,
    knitr,
    rmarkdown
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
Roxygen: list(markdown = TRUE)

  • URL for the package (the development repository, not a stylized html page):

https://github.com/cboettig/piggyback

  • Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility). If you are unsure, we suggest you make a pre-submission inquiry:

reproducibility, because accessing the data being analyzed is essential for reproducible workflows, and yet we have no good solution for workflows with unpublished data or private workflows once the data is too large for version control (e.g. files > 50 MB).

  • Who is the target audience and what are scientific applications of this package?

The target audience is anyone working with data files on GitHub.

datastorr on ropenscilabs is the closest match; it takes a very different approach (from the user perspective -- on the back end both store data as GitHub release assets) to essentially the same problem. The Intro vignette discusses at greater length many of the alternative possible strategies and why I feel they have all fallen short of my needs, which led me to create this package.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
    • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
    • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

No errors, notes, or warnings.

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Rich FitzJohn, @richfitz, would be great based on his experience in this area and with datastorr. Jenny Bryan, @jennybc, since this package makes heavy use of usethis and GitHub interactions.

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Hi @cboettig 👋, thanks for your submission and great to be working with another of your packages. I believe I remember you talking about this problem. A nice solution to it!

Anyways, see below the result of my initial checks. I'm sure it's nothing that can't be fixed so I'm going to go ahead and approach reviewers. I'll take your suggestions onboard.

One small suggestion on the README: for storing the GITHUB_PAT environment variable, pointing users to usethis::edit_r_environ() would be a nice addition.
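For example (just a sketch of that suggestion; the token value is a placeholder):

# opens ~/.Renviron for editing; add a line such as GITHUB_PAT=<your token>
usethis::edit_r_environ()
# restart R (or run readRenviron("~/.Renviron")), then confirm it is picked up:
Sys.getenv("GITHUB_PAT")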

checks

OK apart from 1 failing test.

tests

  • One test failing for me. Here's the relevant output: Any ideas?
Testing piggyback
✔ | OK F W S | Context
✖ |  2 1   1 | Requiring Authentication [1.5 s]
────────────────────────────────────────────────────────────────────────────
test-with-auth.R:10: error: We can upload data
GitHub API error (404): 404 Not Found
  Not Found
1: pb_upload(repo = "cboettig/piggyback", file = "iris.tsv.gz", tag = "v0.0.1",
       overwrite = TRUE) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/tests/testthat/test-with-auth.R:10
2: gh("DELETE /repos/:owner/:repo/releases/assets/:id", owner = x$owner, repo = x$repo,
       id = ids[i], .token = .token) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/R/gh.R:157
3: gh_process_response(raw)

โ•โ• Results โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Duration: 9.2 s

OK:       15
Failed:   1
Warnings: 0
Skipped:  1

I believe in you!

spell check

flagging potential typos:

  WORD               FOUND IN
commitish          pb_new_release.Rd:17
delimeters         pb_list.Rd:26
gess               pb_delete.Rd:10, pb_upload.Rd:13
prerelease         pb_new_release.Rd:32
transfered         pb_pull.Rd:27

coverage

Unable to check because of test failure.

Also, could you add the review badge to the README? 👍

[![](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci/onboarding/issues/220)

@annakrystalli Thanks!

  • I just learned about devtools::spell_check(), very cool. Spelling errors fixed.
  • review badge added
  • I added a mention of usethis::edit_r_environ() to README
  • That test should be getting skipped if you're running locally, because you don't have my GitHub token... I'm guessing you have your own GITHUB_TOKEN or GITHUB_PAT already set? Can you try running checks without a .Renviron file that sets either env var?

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined (""), i.e. so it will skip on CRAN; Travis runs these tests since I can give Travis an encrypted token. Note the coverage will be lower than what codecov shows if you test locally due to these skips.

Aha 🤔, not sure how feasible it is, but it feels like the only people that test should run for are the repo owner and any collaborators. So perhaps the test condition check could include a check that the gh user matches the owner or a collaborator?

Also, all tests passing after disabling GITHUB_PAT 👌

Also, current coverage,

piggyback Coverage: 0.36%
R/push-pull.R: 0.00%
R/utils.R: 0.00%
R/gh.R: 0.77%

As you mentioned, some tests didn't run. What sort of coverage are you getting?

Sorry just found link to codecov. Ignore comment above.

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined (""), i.e. so it will skip on CRAN; Travis runs these tests since I can give Travis an encrypted token

@cboettig - I use "opt in" environment variables for tests like this, something like <PACKAGENAME>_USE_<FEATURE>, and set that in my .Renviron and .travis.yml; then the testthat helper functions have an equivalent skip_if_not_<feature> function that skips the test if the variable is not set to true.
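A minimal sketch of that pattern for piggyback (the variable name PIGGYBACK_USE_GITHUB and the helper name are hypothetical):

# tests/testthat/helper-piggyback.R
skip_if_not_github <- function() {
  if (!identical(Sys.getenv("PIGGYBACK_USE_GITHUB"), "true")) {
    testthat::skip("Set PIGGYBACK_USE_GITHUB=true to run GitHub API tests")
  }
}

# then call skip_if_not_github() at the top of each authenticated test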

Or universal local ones: I just set MPADGE_LOCAL=true. If I've got that, then GITHUB_PAT must be mine.

Thanks all, that's brilliant. And after a few boneheaded attempts I managed to pass the secure+public env vars to Travis correctly this time.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5hrs


Review Comments

Environment Variable Access

The environment variable used for the Personal Access Token (PAT), GITHUB_TOKEN, is not the common way the PAT is stored across development packages. Generally, the PAT is stored in the environment variable GITHUB_PAT. Thus, an additional environment variable check is warranted.

Note that gh and devtools both prefer GITHUB_PAT.

https://github.com/r-lib/gh/blob/8677f7d287ef76c33f33831ea09a6df89b31f88e/R/gh_request.R#L143-L146

https://github.com/r-lib/devtools/blob/26c507b128fdaa1911503348fedcf20d2dd30a1d/R/github.R#L86-L106

I think it would also be advantageous to mimic the github_pat function from devtools in place of get_token():

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/gh.R#L281-L283
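Something along these lines, perhaps (a hedged sketch of a devtools::github_pat()-style lookup that checks both conventions; the fallback behaviour is illustrative only, not piggyback's current code):

get_token <- function(quiet = FALSE) {
  # prefer GITHUB_TOKEN, fall back to GITHUB_PAT
  token <- Sys.getenv("GITHUB_TOKEN", Sys.getenv("GITHUB_PAT", ""))
  if (identical(token, "") && !quiet) {
    warning("No GITHUB_TOKEN or GITHUB_PAT found; unauthenticated requests are rate-limited.")
  }
  token
}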

Package API

The API for the package dances back'n'forth between treating the repository as the data and the data as a file. I can understand the use cases for both, but would it be better practice to emphasize the repository as the data? This would simplify the interface to mimic pre-existing APIs, e.g. devtools::install_* and download.file(url, dest). Alternatively, consider making the API more verbose, e.g. pb_download_repo(repo) or pb_download_repo_file(repo, file).

Package API:

  library("piggyback")
  lsf.str("package:piggyback")
#> pb_delete : function (repo = guess_repo(), tag = "latest", file, .token = get_token())  
#> pb_download : function (file = NULL, dest = ".", repo = guess_repo(), tag = "latest", 
#>     overwrite = TRUE, ignore = "manifest.json")  
#> pb_download_url : function (file = NULL, repo = guess_repo(), tag = "latest")  
#> pb_list : function (repo = guess_repo(), tag = "latest", ignore = "manifest.json")  
#> pb_new_release : function (repo = guess_repo(), tag, commit = "master", name = tag, 
#>     body = "Data release", draft = FALSE, prerelease = FALSE, .token = get_token())  
#> pb_pull : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_push : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_track : function (glob)  
#> pb_upload : function (file, repo = guess_repo(), tag = "latest", name = NULL, 
#>     overwrite = FALSE, .token = get_token())

Function Parameters

Another notable issue is the placement of a positional parameter after parameters with defaults.

e.g. file in pb_delete

pb_delete(repo = guess_repo(), tag = "latest", file, .token = get_token())

Consider putting file first or indicating missingness with file = NA.
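For instance, something like this (a sketch only; guess_repo() and get_token() are piggyback internals):

pb_delete <- function(file,
                      repo = guess_repo(),
                      tag = "latest",
                      .token = get_token()) {
  if (missing(file)) stop("Please specify the file(s) to delete.")
  # ... existing deletion logic ...
}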

Function Exports

It may be beneficial to also export and document guess_repo() as that is prominently featured in function signatures.

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/push-pull.R#L261-L268

Documentation

Outside of that, it seems there are a few artifacts of an older API in the documentation examples, and some exported functions are missing documentation:

Missing examples:

Outdated examples:

Wow! I'm sure this might be some sort of record @coatless! 👏 Thanks so much for such a super speedy response!

@cboettig I'm still looking for the second reviewer. Feel free to wait and respond to both reviews together, or respond to @coatless anytime before.

@annakrystalli apologies for jumping the gun. I had a use case in mind when I saw the package. So, I just quickly took down a few notes.

No worries at all @coatless! I'm not going to complain about having your review in! In fact, having a relevant use case handy often makes for a really useful review.

@coatless thanks, this is great! How'd your use case go?

I've fixed the missing and outdated examples. I've made most of the examples into donttest and made sure they run successfully with devtools::check(cran = FALSE, run_dont_test = TRUE), though that requires my PAT since the tests upload data to cboettig/piggyback. Not sure if there's a better strategy for documentation than that -- obviously this means copy-pasting examples won't work, but examples need a repo they can upload to. Ideas / suggestions welcome.
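For reference, the examples now look roughly like this (a schematic sketch rather than the exact text in the package):

#' @examples
#' \donttest{
#' ## requires a GITHUB_TOKEN with write access to the target repo
#' pb_upload("iris.tsv.gz", repo = "cboettig/piggyback", tag = "v0.0.1")
#' }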

Re the "file" vs "repo" interface, that's very intentional -- but perhaps I need to document it better. If I'm just trying to debug an example on some data, I'd rather use a file-based interface to quickly share the file over, say, a generic "scratch" repository (regardless of my current working repo, or if I'm not "in" any repo). Locally I can do:

pb_upload("mtcars.tsv.gz", repo  = "cboettig/scratch")
pb_download_url("mtcars.tsv.gz", repo  = "cboettig/scratch")

Then I can post a question like: "Hey Jim, why does this fail:"

read.table("https://github.com/cboettig/scratch/releases/download/data/mtcars.tsv.gz")

This needs the more file-based workflow of upload/download/download_url/list/delete. Other times, I'm working in a particular repository with data connected to that project, and I really want the more "git-lfs" style workflow with the ability to track, say, all *.tif or *.tsv.gz files. Does that make sense? Can I do a better job of explaining that somewhere?

Re GITHUB_PAT -- as you probably saw, piggyback will recognize that name too. But I do think that choice was one of Hadley's rare mistakes, GITHUB_TOKEN appears in some 86K code files on GitHub vs only 2K that use GITHUB_PAT, and gh package also recognizes both. I probably should do something like devtools does to provide a default public token though. I would import the devtools method directly but I'm hesitant to do so while the devtools diaspora still seems underway; not sure where that routine will end up. I've tried instead to build on usethis routines here whenever possible. Perhaps @jennybc has insight here?

Re exporting guess_repo(), I agree that's a useful pattern, I wish I could import it from usethis or git2r myself!, but I think I should keep it internal precisely because it's quite general -- I really wouldn't want people introducing a revdep on piggyback just for that function, and I don't want to guarantee that method is robust and general, rather than an internal method whose behavior might change.

Re: GitHub PAT

FWIW in usethis I have basically inlined gh::gh_token() (which is now exported by dev gh):

https://github.com/r-lib/usethis/blob/3cc50ae7c4877d5f10747d0e8f25469897206cfe/R/github.R#L261-L266

It's not great but I don't know what else to do.

Re: guess_repo()

FWIW the tidyverse/r-lib team has Git remote URL parsing all over the place, which we would dearly love to centralize. We plan to make a mini-package for this (https://github.com/r-lib/ghurl/issues). It seems incredibly specific but actually many other languages have such a thing. It actually comes up a lot! I think you could use your own function for now, but not encourage external use, and hopefully we'll make this other thing available to ourselves and all.

Thanks @jennybc, this is great. Yeah, I have almost the same inline call in

https://github.com/cboettig/piggyback/blob/ac5acf06f85e768347407c46b505792621fdb591/R/gh.R#L304-L306

I wonder if it would be worth including a default "public" token, the way devtools::github_pat() does, so that tasks not requiring auth don't run afoul of rate limiting on tokenless calls, or if that would create more confusion than benefit...

I'm 💯% for mini-packages for common utilities. Much better than duplicating the same utility code internally or exporting that functionality from a package in which it's orthogonal to the actual purpose/API of the package. ghurl 👀

Thanks to @mpadge for volunteering and using the package already, we now have both reviewers! 🎉


Reviewers: @coatless @mpadge
Due date: 2018-07-12

@cboettig also, I can't see the Under Review badge in the README anymore?

@annakrystalli Karthik says badge server got restarted for upgrades, looks like it is back now.

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

The package does not seem to have contributing guidelines in the README or as a separate file.

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

There are no references. Note that I previously submitted a reference-less paper.md to JOSS and was asked to please provide at least one reference.

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5


Review Comments

This is a great and (for me at least) highly necessary package that makes my life so much easier. I am currently collaborating on a large project with multiple repos hosting code to generate very large data sets, for which this package has provided an ideal solution to prevent repo-bloat. While this could all be done manually and laboriously, it is now possible to simply send links to collaborators who've got better things to do than learn the intricacies of git(hub), and to write in the README:

library (our_data_package)
download_all_data ()

That's awesome! Thanks @cboettig!

My major criticism

To get the negative out of the way first: the package as it stands is in many ways likely to be barely usable by non-coders, because it issues almost no informative error messages. Almost all arguments appear to be passed unchecked directly to the GitHub API, which then returns a variety of largely unexpected and unintelligible HTTP request errors.

Providing useful error messages is crucial for productive use of any package, and would in this case have made the job of reviewing it considerably easier. Let me illustrate what I consider to be some inadequately processed errors:

pb_new_release(repo = "mpadge/myrepo", tag = "v1", .token = NULL)
# Error in pb_new_release(repo = "mpadge/junk", tag = "v1", .token = NULL) :
#   Not Found (HTTP 404).

Please issue a useful message about the necessity of local tokens.

pb_upload ("non-existent-file.Rds", repo = "mpadge/myrepo")
# Error in httr::upload_file(file) : file.exists(path) is not TRUE

Please intercept such errors prior to actually calling httr, rather than relying on that package to issue your errors.

pb_upload ("existent-file.Rds", repo = "mpadge/myrepo",
           tag = "non-existent-tag")
# Error in gh("/repos/:owner/:repo/releases/tags/:tag", owner = r[[1]],  :
#               GitHub API error (404): 404 Not Found
#                 Not Found

That last one is particularly unhelpful, and would be much more useful with perhaps a dialog starting

# Tag 'v2' does not exist in that repository. Would you like it to be created?

(With suitable control over, or default values for, draft and pre-release options.)

pb_upload (file = "this-file-already-uploaded.Rds")
# Error in pb_upload(file = "data/x.Rds") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).
pb_upload (file = "this-file-already-uploaded.Rds", overwrite = TRUE) # okay

The first of these should generate a message/warning about the overwrite parameter, rather than a 422 error. Finally,

pb_new_release (tag = "this-tag-already-exists")
# Error in pb_new_release(tag = "this-tag-already-exists") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).

In short: Please pre-empt and process all errors within this package, rather than relying on external packages to detect and issue your error messages.
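One possible pattern (a rough sketch with a hypothetical wrapper, not piggyback code) would be to intercept API errors centrally and re-throw them with context:

gh_safely <- function(..., context = "GitHub API request") {
  tryCatch(
    gh::gh(...),
    error = function(e) {
      if (grepl("404", conditionMessage(e))) {
        stop(context, " failed: repository, tag, or asset not found. ",
             "Check the repo and tag names, and that GITHUB_TOKEN is set.",
             call. = FALSE)
      }
      stop(e)
    }
  )
}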

Back to the positive

Okay, now that that's out of the way, let me return to the positive tone in which I began. This really is a great and hugely useful package, and I hope that the remainder of this review can somehow contribute to making it even better.

Before proceeding, I note a general issue raised by @coatless regarding (in his words) the dance "between the repository as the data and the data as a file." I would extend that dance to a parallel and unavoidable one between git and GitHub, and specifically between tags and releases. The package utilises the mechanism of GitHub releases to enable clean interfaces with locally-hosted git repositories, which only understand tags. I am unsure whether this is a package designed to interface with a GitHub repo (via releases) or a local repo (via tags), although I do note the useful section of the README/vignette in this context. Consider nevertheless that pb_new_release() will construct both a new tag on GitHub and a corresponding new release, yet this tag is not fetched locally, and so my local version of the repo will not know about this tag.

While the package clearly has a primary GitHub (rather than git) focus, it is nevertheless largely built around the guess_repo() function - which in my opinion, to strongly concur with @coatless, really ought to be exported - and this function is actually built on entirely local (git) queries (via git2r::discover_repository()). Given this, I suggest it would be useful to at least have some kind of optional (if not default) trigger to fetch any remote tags created with pb_new_release() to the local instance.

@cboettig also has a comment in the current code of the gh.R/pb_list() function:

Should we report the tag name too? x$tag_name?

To which I absolutely suggest yes. It is perhaps useful to describe my current project to indicate a kind of use case not directly envisioned here. We have a variety of data representing a variety of stages from raw data to final output. There are several processing stages, and each of these now has - thanks to the ease and brilliance of this package - its own tag and corresponding GitHub release. We therefore generally need to know at all times what the tags actually are, and which tags each file is associated with. This currently requires a command-line git interface to discern the tags, and the piggyback interface to list the files. Having these two combined within a single pb_list() call would be supremely useful here.

I also feel that pb_list() should by default list all files, enabling the kind of insight I just mentioned, with tag then restricting the view to only those files associated with that tag (whether latest or not). It would also be very useful to have a simple pb_tags() function that listed all tags on the GitHub repo.

Primary functionality

The package is divided between two modes, the explicit pb_up/download() and pb_push/pull(). I find both of these very useful, yet suspect that a merging of the two may ultimately be even more useful. The main difference actually seems to be the creation by the latter of a local .pbattributes file, which is then used to track all files that are or can be pushed and pulled, enabling wholesale calls to pb_push/pull() without having to explicitly specify file names. On the other hand, pb_up/download() always require explicit filenames, and these calls never interact with .pbattributes at all.

Again, I do not think my current use case is likely to be particularly unusual, but we simply have several files of which only some are to be push/pull-ed or up/download-ed. This currently requires the latter, explicit interfaces, yet relying on these then prevents a simple pb_pull() call to download all data, which is a shame. One easy solution would be for pb_up/download() to also add entries used by pb_push/pull(), so that pb_pull() could be used, but I suspect a nicer solution might ultimately be to have a single pair of calls (say, pb_up/download(), although I have no preference for nomenclature either way), such that pb_upload("*.dat") functions as pb_push() currently does; pb_upload("x.dat") uploads that file only and adds the name to .pbattributes; and pb_upload() uploads all files listed in (or matching globs in) .pbattributes. The same for downloads. A rough sketch of this unified behaviour follows.
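Something like this, perhaps (hypothetical, not the package's current implementation; it assumes .pbattributes simply holds one glob per line):

pb_upload_tracked <- function(repo, tag = "latest") {
  patterns <- readLines(".pbattributes")              # e.g. "*.tsv.gz", "data/*"
  files <- unique(unlist(lapply(patterns, Sys.glob))) # expand globs locally
  for (f in files) {
    piggyback::pb_upload(f, repo = repo, tag = tag, overwrite = TRUE)
  }
  invisible(files)
}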

Miscellaneous other notes

pb_delete

The following call simply and mysteriously completes silently:

pb_delete ("these-data-do-not-exist.Rds")

As does this one:

pb_delete ("these-data-do-exist.Rds")

Please issue a message in the first instance. It would also be useful to be able to:

pb_delete (tag = "v1")

with the expected results then manifest both locally and on github.

pb_push/pull

Note that pb_push/pull() generally behave as expected, yet if a file is deleted locally, then calling pb_pull() no longer displays the name of the file prior to displaying the progress bar. This inconsistency ought to be rectified.

Further note that if that file is again removed locally, then calling pb_push() simply completes silently. It would be useful in this instance to check any files explicitly listed in .pbattributes, and to issue some kind of informative message that there are no local files to push.

I also suspect that some more thought might be productively put into exactly what role the .pbattributes file plays. If files are deleted locally, should they also be removed from .pbattributes? In this case, a call to pb_delete() could easily do all of these.

GITHUB_TOKEN

See the discussion of nomenclature here, and my personal, opinionated view that "GITHUB_PAT" is obfuscatory - "TOKEN" is a word; "PAT" is not.

tests

I note simply that when this package goes to CRAN, all tests will be skipped. I'm not sure of CRAN's current methods for actually examining the structure of tests, but it is certainly no longer acceptable to have a package without tests. If they actually examine whether tests are run (rather than just written), then the tests may need to be modified (purely in regard to CRAN, not at all in regard to this review).


Thanks @cboettig for a supremely useful and (lack of error messages notwithstanding) easy-to-use package. I look forward to it only becoming even easier to use through this review process.

Thanks for the super timely review @mpadge, and great to see the package already being put to good use and having some decent use cases thrown at it too.

@cboettig over to you!

@mpadge Thanks for this, fantastic review.

  • Completely agree about the error message issues (and associated tests), will fix that once I resolve the interface issues first.

  • Re TOKEN vs PAT, agree on TOKEN (as I had stated in the thread you linked, that's the dominant standard elsewhere), but lots of R users adopted PAT thanks to devtools; piggyback supports both (as do usethis and gh), so hopefully we can call that a truce.

  • tests: yeah, I'm supremely mistrustful of CRAN's servers to test things that require network connections, though I could add some more trivial tests. In principle all the tests-without-auth.R should work on CRAN, so maybe I should just let CRAN attempt those, but you know how it is.

  • interface. Yeah, this is the big issue I should address first, since it will probably impact how all the others are done. I think I could use some more input on this:

    • I agree that push/pull are essentially upload/download with a file list generated from .pbattributes, even though they aren't quite written like that currently (currently only push/pull handle the manifest / hashes, as you know).

    • I actually still cannot decide if the download / upload interface could support the manifest.json bit (at least as it is currently implemented). On the face of it, this sounds obvious and trivial to check hashes before downloading, but keeping an accurate manifest on GitHub isn't trivial, particularly if we're managing file-by-file instead of as a suite of files -- every download command would have to first download the manifest, and every upload would have to download, merge, and upload a merged manifest. delete would also have to download and upload a manifest, etc. (Of course none of this would be necessary if the GitHub API would compute hashes for us). I just worry this increases the error surface of what are otherwise super simple low-level functions for often very little gain (maybe I'm spoiled by high bandwidth; I've been impressed that GitHub is willing to download assets at > 200 MB/s, Zenodo doesn't get near those rates). More thoughts on this would be appreciated.

Hello @cboettig and @mpadge 👋.

Just wondering if any progress had been made with the issues highlighted in the review and feedback requested by @cboettig?

Hey @annakrystalli , thanks for the ping. I think this is mostly waiting for me to finish out ropensci/piggyback#8, which is about half-done on the timestamps branch (along with about 40% of the error message handling added there as well).

I take the thumbs up from @mpadge to mean he's okay with me leaving in push/pull and upload/download as separate APIs and not having the low-level functions wade into the additional complexity and possible failure points of dealing with a manifest, but maybe I'm just wishing that interpretation on it.

Though they may not be an ideal solution, support for timestamps will work in the low-level API (as well as the high-level); it might even be better than the manifest/hash approach of the push/pull API, effectively making it the simplification Mark is looking for?

I'll ping again once I've merged that and dealt with error messages, or if I have any additional questions about the error messages. Mark's list of specific failure modes is super helpful so I think I can at least improve on the errors and have a revised version to review by, say, the end of the month.

@mpadge Okay, I think I've hit all of these issues at this point.

You should get more helpful error messages now, see test-errors.R for examples.

  • In some cases I have opted for warnings rather than errors (where I can imagine things being looped over and not wanting breaking errors), maybe that should be enforced more strictly?
  • the messages should be much more helpful, but some fall short of your ideal. Mostly this is because I wanted to avoid introducing additional API calls merely to debug, since that both slows things down and introduces more possible errors.

Regarding the unified interface, I think the support for timestamps by default in pb_upload / pb_download accomplishes much of the interface you might want out of pb_pull() / pb_push(). To take this one step further towards making pb_push/pb_pull dispensable, I've also modified pb_track: pb_track() now invisibly returns a list of all tracked files, and can optionally be called with no arguments. This might facilitate common workflows such as:

pb_track("*.csv") %>% pb_upload()

Since it still writes to .pbattributes it will remember previously tracked patterns:

pb_track("*.csv")
pb_track("data/*")
write_csv(mtcars, "data/mtcars.csv") # note file doesn't have to exist when pattern was added

pb_track() %>% pb_upload()

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure is sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence. Feedback from you, @annakrystalli , @coatless etc would be great on that call as well.

I think I've hit all the little things mentioned as well. I have a few references in the paper.md (see paper.bib), using pandoc.

Thanks @cboettig. I'll take it out for a test ride in the coming week and report back then.

@cboettig thanks for the update. I'll check it out this weekend.

Hello @mpadge and @coatless! Any news on your test drives of the changes? Please let me know if you are happy that the changes address all your concerns.

@cboettig my personal feeling on:

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure is sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence.

is to remove pull and push if they are largely redundant and maybe return them later on if you are happy they robust enough? If not, perhaps noting this background info somewhere in the documentation (so users can easily make an informed choice on which workflow best suits their needs) could also work. Ultimately I'd weigh @mpadge and @coatless feedback on this more though.

Thanks @cboettig for the great improvements to the package, and sorry for the slight delay getting back to you. The following comments are mere suggestions that in the subjective opinions of this reviewer might just help to make it even more useful and/or easy to use.

Error messages and tags

As indicated by @cboettig, most previously hard-to-decipher error messages have been addressed, and the package is as a result considerably more user friendly. There nevertheless remain a couple of unhelpful error messages prompted by passing inappropriate arguments through to package dependencies, primarily httr. One simple one of these is:

pb_upload ("non-existent-file.Rds", repo = "mpadge/myrepo")

This still passes the non-existent file name through to httr and returns file.exists(path) is not TRUE. Please insert a simple catch at the start of pb_upload():

if (!file.exists (file))
    stop ("file ", file, " does not exist")

The remainder may be more effectively addressed by first extracting all tags for a repository and then using those throughout to check each tag parameter. The extraction of tags is very simple using the GraphQL interface. My suggestion would be to ask @maelle if you might kindly borrow her create_client function from the ghrecipes package:

create_client <- function(){
  token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
  if (token == "")
  {
    stop ("Please set your GITHUB_GRAPHQL_TOKEN environment variable.")
  } else
  {
    if (!exists ("ghql_gh_cli")) {
      ghql_gh_cli <- ghql::GraphqlClient$new (
        url = "https://api.github.com/graphql",
        headers = httr::add_headers (Authorization =
                                     paste0 ("Bearer ", token))
      )
      ghql_gh_cli$load_schema()
      ghql_gh_cli <<- ghql_gh_cli
    }
    return (ghql_gh_cli)
  }
}

A get_tags function could then simply be written like this:

get_tags <- function (owner, repo)
{
    qtxt <- paste0 ('{
                    repository(owner:"', owner, '", name:"', repo, '") {
                        tags:refs(refPrefix:"refs/tags/", last:50) {
                            edges {
                                tag:node {
                                    name
                                }
                            }
                        }
                    }
    }')
    qry <- ghql::Query$new()
    qry$query('gettags', qtxt)
    dat <- create_client()$exec(qry$queries$gettags) %>%
        jsonlite::fromJSON ()
    if (is.null(dat$data$repository)) stop(dat$errors$message)

    dat$data$repository$tags$edges
}
get_tags ("mpadge", "junk")

This obviously introduces another dependency (ghql), but I suspect this package will likely be improved at some stage through incorporating ghql calls anyway. These tags can be cached somehow on the first call for a given repo, and then re-used throughout a session. A simple

get_tags <- memoise::memoise (get_tags)
tags <- get_tags (owner, repo)

would do the trick (sorry, requiring yet another dependency, but I also really think the package will at some stage require some more sophisticated caching like this).

This would then allow the kind of prompt I suggested in the initial review along the lines of

pb_upload (..., tag = "non-existent-tag")
# Tag 'non-existent-tag' does not exist in that repository. Would you like it to be created?

And it would happen instantly, without recourse to any API calls.

This would also enable error messages for the following call:

pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists")

At the moment this still returns the entirely illegible:

# Error in pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).

other stuff

pb_list

This function still does not return tag information - did you intend to incorporate that? I would reiterate most of the comments I previously made regarding this function.

pb_delete

This call now returns an appropriate error message:

pb_delete (file = "non-existent-data", repo = "mpadge/junk", tag = "v1")
# non-existent-data not found on GitHub

As previously stated, it would still be nice to be able to delete an entire release. At the moment this is not possible:

pb_delete (repo = "mpadge/junk", tag = "v1")
# Error in asset_filename(file) :
#   argument "file" is missing, with no default

Any comments on potentially enabling that?
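If it were enabled, the underlying REST calls might look something like this (a hypothetical sketch; pb_delete_release is not a piggyback function):

pb_delete_release <- function(repo, tag, .token = Sys.getenv("GITHUB_TOKEN")) {
  r <- strsplit(repo, "/")[[1]]
  # look up the release attached to this tag
  rel <- gh::gh("/repos/:owner/:repo/releases/tags/:tag",
                owner = r[1], repo = r[2], tag = tag, .token = .token)
  # delete the release itself
  gh::gh("DELETE /repos/:owner/:repo/releases/:release_id",
         owner = r[1], repo = r[2], release_id = rel$id, .token = .token)
  # and remove the underlying git tag on GitHub
  gh::gh("DELETE /repos/:owner/:repo/git/refs/tags/:tag",
         owner = r[1], repo = r[2], tag = tag, .token = .token)
}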

pb_track

This does indeed work as @cboettig suggests, but I note that pb_upload is not vectorised:

pb_track ("data/*") # where * matches >= 2 files
pb_track () %>% pb_upload (repo = "mpadge/junk", tag = "v1")
# Error in vapply(elements, encode, character(1)) :
#   values must be length 1,
#  but FUN(X[[1]]) result is length 2

It would be very useful to simply make the current pb_upload some kind of pb_upload_one function, and then have pb_upload loop over that, as sketched below.
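i.e. roughly (a sketch only; pb_upload_one is hypothetical here, and guess_repo() is a piggyback internal):

pb_upload <- function(file, repo = guess_repo(), tag = "latest", ...) {
  # upload each file in turn, returning the list of results invisibly
  invisible(lapply(file, pb_upload_one, repo = repo, tag = tag, ...))
}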

I'll leave my comments there for the moment, and await feedback from @cboettig

Thanks @mpadge, really appreciate the detailed feedback and agree these things should be done. My apologies that most of the things in "other stuff" look like oversights on my part; they should be straightforward to implement.

The only thing I'm on the fence about is the tags etc., not because I disagree about doing it, but because I think it's going to require more thought on the right way to do this. Like you say, ultimately this package should really be caching some relatively detailed metadata about the repository. Of course this will require some more invasive surgery to implement...

I'm not sure memoise is the way to do this, since memoise invalidates the cache only when the function arguments change, and here the repo information is likely to change without that happening. Still, it would be a convenient way to make the cache persist through a session, so it might be a decent starting point.

I think I can get all the tags easily enough from the standard REST API, instead of ghql. Will have to dig into this...
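For what it's worth, a bare-bones REST version might look like this (a sketch only; pb_tags is not currently in the package):

pb_tags <- function(repo, .token = Sys.getenv("GITHUB_TOKEN")) {
  r <- strsplit(repo, "/")[[1]]
  tags <- gh::gh("/repos/:owner/:repo/tags",
                 owner = r[1], repo = r[2], .token = .token)
  vapply(tags, `[[`, character(1), "name")
}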

Mark, I'd also like to ask your permission to list you as a contributor to the package, if that's okay with you? Your detailed input has had a substantial contribution to the package.

๐Ÿ‘ Re: "ctb": ๐Ÿ‘ of course! Thanks!

Re: tags + caching: Yeah, some invasive surgery will be required, and yeah, memoise is perhaps not ideal for the reasons you state, but then again pb_new_release is the only function that should change tags (at least until, or unless, pb_delete(tag = ...) is incorporated), so consistent cache updating would just require re-running memoise(get_tags()) within pb_new_release.

And yeah, API v3 would also be entirely sufficient, because tag info will likely never run into rate limits. I just got personally converted to API v4 through @maelle's GraphQL evangelising, and don't plan on going back.

I hope all this back-and-forthing is okay with you, and I hope this doesn't come across as me being too picky. I really do just want to help make an already supremely useful package even cleaner and easier to use, and I strongly suspect that at least finding an initial system for interim caching will help make future package development much easier and smoother.

Hi @cboettig! Really happy to see the discussions and development of this package so far. Feels like it is on a good road. Was just wondering whether you and @mpadge feel that an acceptable version of the package has been reached and the review process can therefore be completed, with further discussed development noted for future versions?

@annakrystalli I think I still need to tick off the above items, at least the easy ones, and would like to run them by @mpadge one more time. Shouldn't be too much work, I've just been 🌊 by the start of the semester and need to get 🏊 again first!

OK, not to worry. I'll keep an eye on the thread for progress. 💪 🏊‍♂️

@mpadge Okay, think I've hit all of these. A few strategic questions remain before we close this out.

First, I now have just about all functions calling the internal method pb_info(), which can get all assets associated with all tags in a single API call. Still, I haven't done any caching of this, because I cannot figure out a robust strategy for it. The only time I think caching this would be safe is within a vectorized version of the functions, and maybe I should handle that case explicitly? Otherwise, one can never be sure whether a user, for instance, creates or deletes a release using the GitHub web interface in between two consecutive function calls in an interactive session. Thoughts on this?

Second, I'm warming more and more to the idea of just dropping the whole pb_push, pb_pull functions (while maybe keeping pb_track()?) e.g. this would eliminate the use of a manifest.json and hashes to know if a file had changed -- relying totally on timestamps instead. Yes / no?

@coatless , any thoughts on these?

@cboettig a few. I need to sit down and type them up. Unfortunately, I've been slammed with the semester start and meetings. I'm going to try to carve out time on Friday.

In response to the latest points raised...


Caching

Regarding caches, we must mention them in the frame of:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Having said this, I'm not sure caching needs to be attempted given the current rate limits. If the user is authenticated, which is a key requirement of many of piggyback's functions, the API can be queried 5000 times per hour. Users who do not supply a GH_TOKEN have 60 queries per hour (c.f. Rate Limiting Overview, Rate Limit Rules).

If you feel you must do some level of caching, the two heavy users of pb_info() are gh_download_asset(), used in pb_download(), and pb_upload_file(), called by pb_upload(). The easiest way to cache results would be with memoise, as @mpadge initially indicated; I say this because an option exists to supply a timeout (c.f. timeout()).

As an example consider:

# Create a memoised pb_info function that expires after 10 seconds
# Note: the presently exposed pb_info is renamed to local_pb_info in this case
library(memoise)
pb_info <- memoise(local_pb_info, ~timeout(10))

# Grab repo information
pb_info("cboettig/piggyback")

Though, to avoid another package dependency, the cache setup could be done by maintaining your own closure with a timeout procedure. Slight code sketch:

# Make shift caching
repo_cache = function(timeout = getOption("pb.cache_timeout", 120)) {
  
  # Create internal tracking variables
  start_time <- repo_data <- repo <- tag <- .token <- NULL
  
  # Initialize Clock
  start = function() { start_time <<- Sys.time() }
  
  # Split Time
  time_point <- function() {
    # Difference between NOW and start in seconds
    difftime(Sys.time(), start_time, units = "secs")
  }
  
  # Store the repo data
  cache_data <- function() {
    repo_data <<- piggyback:::pb_info(repo = repo, tag = tag, .token = .token)
    start()

    repo_data
  }

  # Create initial cache (public-facing)
  set_cache_info <- function(repo = piggyback:::guess_repo(), tag = NULL, .token = piggyback:::get_token()) {
    repo <<- repo; tag <<- tag; .token <<- .token
    
    cache_data()
  }
  
  # Compare to time stamp
  # If bad, then flush cache
  check_cache_state <- function() {
    if( time_point() > timeout ){
       cache_data()
    }
  }
  
  # Retrieve value and reset if need be
  get_cache <- function() {
    check_cache_state()
    repo_data
  }
  
  # Return functions
  list(set_cache_info = set_cache_info, get_cache = get_cache )
}

# Allocate the closure
pb_cache = repo_cache()

# Setup repo caching
pb_cache$set_cache_info("cboettig/piggyback")
pb_cache$get_cache()

Misc note: Glancing over memoise, it provides an impressive cloud backend to storing information on S3 + GCP that could potentially be harnessed; however, I feel that is out of the initial scope for piggyback.


Narrowing the API

I'm largely in agreement with sunsetting both pb_push() and pb_pull(). These functions heavily rely on an underlying mental model of git. At that point, I think the user would likely just commit into git already, since the barrier to entry is significantly higher due to the need for a manifest.json file.


Package style

Style-wise, the styler::style_pkg() found a few issues. I've spun up a PR for you to include the changes in ropensci/piggyback#14

Examples

Part of the piggyback repository is serving as a "checkpoint" for data (c.f. the data tag). I wonder if such an example would be better suited to a secondary repository. The secondary repository could then be linked from a code example in the vignette and/or README.md as an example of what is possible.


I have a few more remarks, but the night is no longer young. I'll aim to finish round 2 later in the week.

Thanks @coatless for the detailed review.

  • Caching: memoise with timeout sounds very compelling to me! I think I should just do this with a very short time-out. That would give you the benefit of caching when it matters most, like looping over a large number of piggyback calls; while not caching if someone just has the same R session open for a long time.

  • Sunsetting pb_push / pb_pull -- Yes, I largely agree with this. I wouldn't agree with "user should just commit to git" because I think the user should probably already do that if files are << 50 MB. If they are greater than 50 MB, committing to git immediately breaks one's ability to push to GitHub, and is not easy to remedy either; a mistake I've seen dozens of times and was hoping piggyback would fix (users don't have to know manifest.json exists -- the overhead involved there is more of a problem for me the developer).

I think this really comes down to what semantics are clearest for the user now: functionally, pb_track() %>% pb_upload() is now the moral equivalent of pb_push(). I thought invoking the git metaphor would be helpful to inspire users to make use of functionality like pb_track() to automatically list, say, all *.csv files in a project. pb_upload() seems to imply that a user somehow has to remember which files need to be uploaded. If we do kill pb_push, I think fewer users may use pb_track() to create the .pbattributes file of all tracked data patterns. But maybe I can address this with sufficient documentation.

  • styler PR merged, thanks! (I cannot get my styler installation to automatically create linebreaks... not sure what's up there, but glad I now know about this package!)

Let me know your remaining remarks, looking forward to it!

Okay, caching turned out to be a bit more of an adventure than I expected with memoise.

A 1 second cache on pb_info breaks a good number of tests and vignettes. Many of these issues were fixed as soon as I made sure functions that update the GitHub release (pb_upload, pb_release) would also break the cache automatically.

I also wanted a way to turn off caching; it turns out memoise::timeout(0) doesn't do what you'd expect, so I hacked around that. Now the cache duration can be set as an env var, piggyback_cache_duration, and setting it to 0 will turn caching off.
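Roughly, the idea is this (a simplified sketch of the approach applied to the internal pb_info(), not the exact internals):

cache_duration <- as.numeric(Sys.getenv("piggyback_cache_duration", "1"))
if (cache_duration > 0) {
  # invalidate the memoised results after cache_duration seconds
  pb_info <- memoise::memoise(pb_info, ~ memoise::timeout(cache_duration))
}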

I've also sunset pb_push and pb_pull. I've updated the docs to reflect this, also making the readme briefer and creating two separate vignettes; one with the longer intro to functions that was largely in the readme, and a separate vignette on alternative approaches to piggyback.

Remaining questions

One possibly significant issue I worry about for CRAN onboarding is the "write to temporary files" complaint. This is particularly likely to bite on pb_download(), and I'm not sure how best to address it without losing the convenient default behavior based on the git repo root. Maybe since it defaults to usethis::proj_get() and not "." they will be okay with it, or won't notice? Arguably, by setting an RStudio Project the user has declared a location. proj_get() will throw an error if the user is not in a project, so my argument is that this is not an actual default location (unlike here(), which doesn't error). Thoughts on this?

Thanks @cboettig for doing the hard work for us all on memoise::timeout() - very useful to know! And as you know, I think the retirement of pb_push/pull is a strong boost for ease and consistency of use - thanks.

As for file writing, I don't have any great insights, and would love to be assured that what you do were permissible. I note only that:

  1. I just scoured the code for usethis, and it seems they exclusively use file_temp() for writing with default tmp_dir = tempdir() at all times (yes, even when testing proj_get(), as in this example); and
  2. CRAN have very strong auto-checks for any attempts to write to non-standard locations which have caught me out previously, so current usage will definitely be noticed. Have you tried testing on all rhub platforms to see whether any of those flag anything?

An after-thought: how about having the default location as usethis::proj_get(), but enabling a manual override so all tests could then use tempdir()?

@mpadge Thanks for your thoughts on this. Yeah, pb_download already supports specifying dest = tempdir() while defaulting to dest = usethis::proj_get().

I've tried to consistently make sure tests, examples, and vignettes use tempdir() in the calls (well, at least the tests; I probably need another pass to add it to examples and vignettes). I don't really like this because it makes the examples look more cumbersome than they really are, but that is somewhat unavoidable I guess. (It's not just tempdir(); following @coatless's very excellent point that I should use a different repo for testing, all functions now include a manual declaration of repo = "cboettig/piggyback-tests" as well.)
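e.g. the examples and tests now read along these lines (illustrative only):

pb_download("iris.tsv.gz",
            repo = "cboettig/piggyback-tests",
            dest = tempdir())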

My question is, though: do you think that's good enough? I'm worried that having any default location at all is technically a violation of policy (e.g. in the base function download.file(), the destination argument has no default).

(Though I still feel this is not technically a default location, since proj_get() throws an error unless the user has set a project).

@annakrystalli @mpadge @coatless

Okay, I think I basically have things where I want them now in response to all the super excellent feedback. You have all made this a way more usable, useful, and robust package. I'm sure it's not perfect, but I think we've hammered out a nice API and functionality to get people started.

I think we have the 👍 from Mark to onboard, but still waiting on final confirmation from James, yes?

@annakrystalli Remind me of any additional steps I need to do for the JOSS part? I just go over there and open an issue pointing back here? Mark/Jim, are you 👍 with the JOSS submission?

๐Ÿ‘ to onboarding.

Approved! So happy to see so much great work done on this package through review. Many thanks @cboettig for submitting and @coatless & @mpadge for your reviews! 😺

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
    " [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)"
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative; see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • Activate Zenodo watching the repo
  • Tag and create a release so as to create a Zenodo version and DOI
  • Submit to JOSS using the Zenodo DOI. We will tag it for expedited review.

As for more details on the JOSS process, once you've got your DOI from Zenodo, use it to submit to JOSS here http://joss.theoj.org/papers/new.

I believe this triggers a [PRE-REVIEW] issue. I'll flag it as accepted to ropensci once the issue is created. 👍

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the 3rd section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

Thanks so much for the outstanding improvements @cboettig - I'm really looking forward to more tightly integrating this package within my workflow, and to contributing further issues to continue the constructive conversation. Congratulations!

@cboettig please do let me know if you'd like to do a post or tech note on this or arkdb!

@stefaniebutland Thanks, yes I'd love to do a blog post on piggyback, maybe combine a bit of arkdb. I was thinking I might focus on why I wrote these two little packages, more than the technical details (each package basically has only two functions, essentially data_in and data_out), so they are conceptually simple anyway.

why I wrote these two little packages

yes!!

Right now Tuesday 2018-10-16 or subsequent Tuesdays are available. Take your pick and then please submit via pull request a week prior to the publication date. Some guidelines are here: https://github.com/ropensci/roweb2#contributing-a-blog-post