yihui/knitr

Option to specify document output path?

cboettig opened this issue · 9 comments

Hi Yihui,

We were grappling with this issue at #rrhack last week and would really love your input. Currently we can specify the output path only as a function argument (as far as I can tell anyway), e.g.:

knit("doc.Rmd", "papers/doc.md")

or with render:

render("doc.Rmd", output_file = "papers/doc.html")

Ideally we'd be able to do:

opts_knit$set(output.dir="papers")

and just hit the knit2pdf / knit2html button in RStudio instead. Is that an option you would consider adding?

The annoying part is that this behavior is impossible to get from the knit button in RStudio, which assumes by default that the inputs (.Rmd) and outputs (.md, .pdf, or whatever rmarkdown has been told to create) are in the same directory. That seems to be in direct conflict with much standard advice on reproducible research. Having to tell people to avoid the RStudio knit button and type the command with its arguments instead is clearly not ideal.

Expecting input and output to be in the same directory is equally problematic though, for while it works well for small examples it really doesn't scale to projects that need some directory structure. Your knitr book illustrates this issue nicely, where you write a script to call knit with explicit paths such that the .Rmd and .md end up in different directories.

If this is interesting I'd be happy to send a PR, though you might implement it faster while ensuring that it doesn't break anything. (There are also some finer points to consider, e.g. whether figure URLs should or should not be made relative to the new output.dir, and whether output.dir is relative to root.dir or the source .Rmd dir, etc.)

This might not completely solve the working directory insanity but I think it would help. @jennybc and @hlapp might be able to put this better than I.

I understand the pain, and it is a pain for both you and me. Compiling a source Rmd document is not just running the code chunks inside it. I get a big headache when thinking about the directory problem when the code chunks have file input/output. I'd strongly recommend not touching the output argument of knit() or render(). It is easy to screw up unless you know well how knitr/rmarkdown works internally. My only suggestion is to use

owd = setwd('papers')
knit('../doc.Rmd')
setwd(owd)

This is the only reliable way to work with knitr, as far as I can tell. Do not think about root.dir or output.dir or any dir options, which will only make things worse. And if possible, keep all Rmd files in a flat structure, e.g. in a single directory.
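
One caveat with the three lines above: if knit() stops with an error, the session is left inside papers. A minimal sketch of a wrapper that always restores the working directory, assuming a hypothetical helper name in_dir() (it is not part of knitr), could look like this:

```r
# Hypothetical helper (not part of knitr): evaluate `expr` with `dir` as the
# working directory, then restore the previous one even if `expr` fails.
in_dir <- function(dir, expr) {
  owd <- setwd(dir)                # setwd() returns the previous directory
  on.exit(setwd(owd), add = TRUE)  # runs on normal exit and on error
  force(expr)                      # lazy argument: evaluated here, inside `dir`
}

# Usage, mirroring the lines above (needs knitr and an existing 'papers' dir):
# in_dir("papers", knitr::knit("../doc.Rmd"))
```

The withr package offers with_dir() for the same purpose, if an extra dependency is acceptable.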

I updated the Gist where I vent my pain and describe my current solution. Sorry it's so long -- it started as me just thinking out loud.

The bit that might merit further consideration is the connection I make to Jekyll and the path handling practices it encourages. Short version: in every file, you specify the path to website root in the YAML front matter. Then in the document you build paths and links relative to that.

I think that's really worth considering for knitr and rmarkdown. We're all used to writing the YAML front matter anyway.
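
To make the Jekyll-style idea concrete, here is a rough sketch. It assumes a hypothetical root field in the YAML front matter whose value has already been read into a variable; neither the field nor the helper from_root() exists in knitr or rmarkdown.

```r
# Assumption: `root` holds the value of a hypothetical `root:` field in the
# YAML front matter, i.e. the path from this document to the project root.
root <- ".."

# Hypothetical helper: build every path in the document relative to that root.
from_root <- function(...) file.path(root, ...)

from_root("data", "raw.csv")
```

Moving the document to another directory would then mean changing one value in the front matter, not every path in the chunks.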

PS I have also come to the conclusion that nothing good comes from using the args of knit() and render() that affect where intermediates or outputs go.

Let's also be clear that we're talking about two related but distinct things:

  • working directory during knit() and render()
  • destination of downstream outputs, e.g. figure files, .md, .html files

My Gist deals with the first and I use Makefiles to deal with the second post hoc, i.e. rename and move downstream products.
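
For readers who don't use Make, the post hoc "rename and move" step can be sketched in plain R. The helper name move_outputs() and the file names in the usage comment are illustrative assumptions, not something from the Gist:

```r
# Hypothetical helper: move generated files into `dest` after knitting.
move_outputs <- function(files, dest) {
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  target <- file.path(dest, basename(files))
  # Copy, then delete the originals that were copied successfully;
  # file.rename() alone can fail across file systems.
  copied <- file.copy(files, target, overwrite = TRUE)
  unlink(files[copied])
  invisible(target[copied])
}

# Usage after knitting in the source directory (illustrative file names):
# move_outputs(c("doc.md", "doc.html"), "papers")
```

Note that a moved .md file still refers to its figures by their original relative paths, so the figure directory has to move with it.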

@yihui Thanks much for the reply. I realize this isn't a trivial thing, but it's made substantially more problematic because it's not obvious at the start that it isn't a trivial thing. And @jennybc makes a great point to emphasize that there are at least two different (though related) issues here, probably best tackled separately. I meant to focus just on the second one: the destination of downstream output created by knitr or render using paths it chooses to invent itself, not the issue of input/output calls in chunks. I see your point about not messing with root.dir, and I agree that avoiding directories and keeping everything flat is a reliable way to avoid working directory problems, but not all projects are amenable to that. Given that users can already set the output dir anyway with the function argument, do you think it is reasonable to create an output.dir option? (Lacking your perspective on the internals I guess I don't see why that's such a bad idea per se, so happy to be enlightened.)

Like Jenny I tend to rely on a script or Makefile to move outputs around after the fact, since it is difficult to control them in a consistent way otherwise. This more-or-less works for me, but when we start talking about teaching knitr in the context of reproducible research it starts sounding like a pretty bad idea (since it just kicks the directory problem down the road and opens up another step to break things).

I believe adding an output.dir option will make things worse. Without it, you only need to ask yourself one question: what is my current working directory? (All output files are written relative to the current working directory.) With it, you will have to think "okay, what is the real output directory relative to the current working directory?", and on my side, I will also have to think about where to write the figure files and cache databases. This can drive me crazy when a project involves child documents, in which case I'll have to think about their path problems too.

I think Pandoc's self-contained mode helps alleviate this problem (embedding all external dependencies), but it may not be practical for all users.

Another solution is to turn your projects into R packages. Then the only thing you will ever need is system.file(), which is portable. Put data under data/, functions under R/, reports under vignettes/, and other files under inst/.
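
A small illustration of the system.file() approach. The first call runs against the stats package that ships with R; the package name mypaper and the file name in the last comment are made up for the example.

```r
# system.file() returns the full path of a file inside an installed package,
# or "" if the file is not found, regardless of the working directory.
desc <- system.file("DESCRIPTION", package = "stats")
file.exists(desc)

# For a project packaged as the hypothetical 'mypaper', a file kept under
# inst/extdata/ in the source tree would be located with:
# system.file("extdata", "measurements.csv", package = "mypaper")
```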

Hi Yihui,

Oh well, thanks for considering. I do understand that this would create too much of a headache on your end. Since it was already part of the API via the function call I figured it might not be a dramatic change, but it sounds like you might deprecate that part of the function?

Just for completeness, I think your knitr book, many Jekyll blogs that use Rmd input, and the general advice of separating input and output files all illustrate use cases that are not at all addressed by using an R package structure or self-contained mode. Of course we can continue to rely on scripts or Makefiles to deal with this through the function argument output_file, like you do in the knitr book.

Thanks again for your thoughts on this.

For Jekyll in particular, I have a solution, servr::jekyll(): https://github.com/yihui/servr (an example Jekyll repo: https://github.com/yihui/knitr-jekyll). You do not even need to click any buttons. The blog posts are compiled automatically on change, and you can call servr::jekyll(daemon = TRUE) so that it will not block your current R process. It is probably not very efficient at the moment for large Jekyll projects, but I can certainly improve it (caching should also help).

The only trick in this solution is the build script: https://github.com/yihui/knitr-jekyll/blob/gh-pages/build.R which is relevant to the discussion here. You can tweak base.dir and base.url in opts_knit to redirect figure files to other places, and figures are probably the only thing you need to worry about in a Jekyll project.
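
As a rough sketch of what tweaking these two options looks like (in the spirit of the build script, not a copy of it), assuming the site root is the current directory and an illustrative figure/ prefix:

```r
# `site_root` and the 'figure/' prefix are illustrative assumptions.
site_root <- normalizePath(".", winslash = "/")

knit_opts <- list(
  base.dir = site_root,  # figure files are written under this directory
  base.url = "/"         # prefix prepended to figure paths in the output
)
fig_path <- "figure/"    # chunk option: figure location relative to base.dir

# Apply the options only if knitr is installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  do.call(knitr::opts_knit$set, knit_opts)
  knitr::opts_chunk$set(fig.path = fig_path)
}
```

With these settings a plot would be saved under figure/ inside the site root and referenced in the output with a leading / in front of the figure path.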

Re: obtaining consistent behavior between the knit button and command-line calls to render, it looks like we can simply set the knit option to a custom function in the YAML metadata: ropensci-archive/reproducibility-guide#81. Thanks @lmmx, and cc @hlapp.
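
A sketch of what such a custom function might look like. The name knit_to_papers and the papers directory are illustrative, and this is not necessarily the function used in the linked issue; output_dir is an existing argument of rmarkdown::render().

```r
# Hypothetical custom knit function. RStudio's Knit button calls the function
# given in the `knit` field with the input file path and its encoding.
knit_to_papers <- function(inputFile, encoding) {
  rmarkdown::render(
    inputFile,
    encoding = encoding,
    # write the output to a 'papers' directory next to the source file
    output_dir = file.path(dirname(inputFile), "papers")
  )
}

# In the YAML front matter the function is written inline, along the lines of:
# knit: (function(inputFile, encoding) { ... })
```

Calling knit_to_papers("doc.Rmd", "UTF-8") from the console should then produce the same result as the button.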
