allefeld/atom-pandoc-pdf

Improve performance of pandoc in consecutive runs


I wonder whether rendering could be faster after the initial pandoc run if latexmk were configured to reuse the previous temporary folder.

So I tried to override the output directory of latexmk (the last definition wins):

pdf-engine-opts:
- "-xelatex"
- "-output-directory=cache"

I have the impression that consecutive updates on save then require only 2 latexmk runs instead of 4 or 5. Couldn't Pandoc/PDF set up such a directory itself and delete it automatically when the respective tab is closed? Maybe this could be made configurable in the Pandoc/PDF settings.

I wrote the package such that there is a persistent temporary directory for a given input file, even across Atom restarts. The temporary directory name is based on a hash of the input file path plus the input file name. Would -output-directory=cache improve on that? Also, I couldn't find any documentation on this cache option value.
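To illustrate the naming scheme, here is a minimal sketch of how such a per-file directory name could be derived. The helper name tempDirFor, the SHA-1 hash, the truncation to 8 characters, and keeping the file extension in the name are all assumptions for illustration; the package's actual implementation may differ.

```javascript
const crypto = require('crypto');
const os = require('os');
const path = require('path');

// Hypothetical helper: derive a stable temporary directory for an input file.
// Assumptions: SHA-1 of the containing directory, truncated to 8 hex characters,
// followed by the file name including its extension.
function tempDirFor(inputFile) {
  const absolute = path.resolve(inputFile);
  const hash = crypto
    .createHash('sha1')
    .update(path.dirname(absolute))
    .digest('hex')
    .slice(0, 8);
  // Same input path -> same directory, so it survives editor restarts.
  return path.join(os.tmpdir(), `pandoc-pdf-${hash}-${path.basename(absolute)}`);
}
```

Because the name depends only on the input path, two files called input.md in different folders get different directories, while the same file always maps to the same one.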

I decided not to delete the directory because after closing Atom, the PDF file there may still be open in the viewer tab, so if the viewer tab is recreated on the next Atom start the PDF can be displayed again. This way, cleaning up the temporary directory is left to the OS mechanism. On my system, /tmp is a tmpfs, i.e. a RAM-disk.

Files in the directory except the PDF are deleted only when Atom is closed / the package is deactivated, to leave behind less cruft. But this should not affect recompilation time within a continuous Atom session.
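The cleanup on deactivation could look roughly like the following sketch. It uses Node's built-in fs.rmSync rather than the removeSync call used in the package, and the function name removeAllButPdf is hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical cleanup: delete every entry in the directory except PDF files.
// Returns the names that were removed.
function removeAllButPdf(dir) {
  const removed = [];
  for (const name of fs.readdirSync(dir)) {
    // Keep the PDF so a viewer that still has it open keeps working.
    if (path.extname(name).toLowerCase() === '.pdf') continue;
    fs.rmSync(path.join(dir, name), { recursive: true, force: true });
    removed.push(name);
  }
  return removed;
}
```

As long as this runs only on deactivation, latexmk's auxiliary files stay in place for the whole editing session.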

Empirically speaking, when I work on a large document that takes a bit to compile, and then I save the file even though it was not changed, compilation is almost instantaneous, so I have the impression that latexmk's time-saving works under this setup.

PS: I have to correct myself a bit: There is one file that is explicitly deleted on every Pandoc run, and that is the intermediate LaTeX file generated by Pandoc. That is a workaround for a bug in Pandoc where it did not create a new version if the old still existed, even though the input had changed, see jgm/pandoc#6027 (comment).

The bug has since been fixed, but I keep the workaround for the time being in case users are on old versions. In my experience, the Pandoc run takes a fraction of the time of the latexmk run, so it shouldn't affect performance noticeably.
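A minimal sketch of that workaround, assuming the intermediate file is named after the input's base name with a .tex extension. The helper name removeStaleTex is hypothetical, and Node's built-in fs.rmSync stands in for the package's actual call.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: remove a stale intermediate .tex file before Pandoc runs,
// so that Pandoc is forced to write a fresh one.
function removeStaleTex(outputDir, baseName) {
  const texFile = path.join(outputDir, `${baseName}.tex`);
  // force: true makes this a no-op when the file does not exist yet.
  fs.rmSync(texFile, { force: true });
  return texFile;
}
```

Only the .tex file is removed here; the .aux, .fdb_latexmk and other auxiliary files remain, so latexmk can still skip unnecessary runs.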

The value cache is just an arbitrary folder name in the current working directory.
I checked the folder /tmp/pandoc-pdf-{hash}-{filename} and found that only the pdf is stored in there.

In my cache folder, many more files are kept:

cache
├── input.aux
├── input.bbl
├── input.bcf
├── input.blg
├── input.fdb_latexmk
├── input.fls
├── input.log
├── input.out
├── input.pdf
├── input.run.xml
├── input.tex
├── input.toc
└── input.xdv

0 directories, 13 files

Those are reused by latexmk in subsequent calls and help to reduce the number of required runs in many cases.

Do you delete those files in /tmp/pandoc-pdf-{hash}-{filename}?

No:

Files in the directory except the PDF are deleted only when Atom is closed / the package is deactivated, to leave behind less cruft.

So I removed my folder override.
Then I see two different tmp folders in the pandoc log:

  • /tmp/pandoc-pdf-{hash}-{filename}: contains only the final pdf
  • /tmp/tex2pdf.-{hash}/ used by latexmk with files: input.aux input.bbl input.bcf input.blg input.fdb_latexmk input.fls input.log input.out input.run.xml input.tex input.xdv

The second directory is cleaned up after pandoc processing has concluded. I use pandoc-pdf@0.1.0 with the following latexmk configuration:

pdf-engine: latexmk
pdf-engine-opts:
- "-xelatex"

I see in the pandoc log that pandoc-pdf sets the output directory via --pdf-engine-opt=-output-directory=/tmp/pandoc-pdf-{hash}-{filename}, but this does not seem to be taken into account. Maybe it is linked to quoting.

[Screenshot: Screenshot_20200403_122736]

Hence, I cannot reproduce your behaviour in my setup.

The additional temporary folder seems to be related to bibliography files / biber. I have never used biber, and such lines do not occur in my logs. Can you share a short document with which I can reproduce your case?

Maybe it is linked to quoting.

It shouldn't be, since the call parameters are passed directly to the Pandoc process. There is no shell involved, and therefore no need for quoting / escaping. Also, I find all temporary files and the final PDF in the directory specified by -output-directory=, so the option seems to work (for me). As I wrote before, they are only deleted (except for the PDF) when the package is deactivated, which usually happens on Atom exit.
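To show why quoting does not come into play without a shell, here is a small self-contained sketch. It does not call Pandoc; a child Node process stands in for it and simply echoes the arguments it receives. The helper name echoArgs and the temporary script are made up for this demonstration.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for the Pandoc call: start a child process without a
// shell and return the argument list exactly as the child received it.
function echoArgs(args) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'argv-demo-'));
  const script = path.join(dir, 'echo.js');
  fs.writeFileSync(script, 'console.log(JSON.stringify(process.argv.slice(2)));');
  try {
    // execFileSync passes the array elements as separate arguments; no shell
    // parses them, so spaces, quotes and $ need no escaping.
    const out = execFileSync(process.execPath, [script, ...args], { encoding: 'utf8' });
    return JSON.parse(out);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Calling echoArgs(['--pdf-engine-opt=-output-directory=/tmp/dir with spaces']) should hand the option back as a single, unmodified argument, which is the behaviour described above.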

It seems unlikely, but maybe the involvement of biber somehow changes the temporary directory used by latexmk?

Btw., what platform are you on?

If you want to check that file deletions happen as I said, you could comment out the removeSync commands on lines 102 and 162 of lib/pandoc-pdf-processor.js, and see whether it changes anything.