perrette/papers

Backup, Sync and git tracking

perrette opened this issue · 5 comments

Originally, the git tracking feature was added to make handling a global papers install safer.
Its implementation is now called into question by local installs. Local installs often live in directories that are themselves git-tracked, and nested git repos do not play well together. Worse, papers' git tracking might trigger commits in a directory where they are not expected (fortunately it is off by default, so enabling it still requires explicit user action). In the original implementation, the git directory could also be separate from the bibtex file. In that case, the bibtex was copied to the git directory upon saving, and a commit was made. That works, but using git commands to revert or reset to a previous commit then only affects the git repo, and not the original bibtex, which makes the overall behavior unintuitive. Clearly, some overhaul is needed.

It is not entirely clear to me yet how that feature should evolve, but the basic idea of using git to safeguard the bibtex, and undo unwanted changes, is still relevant IMO. Here are a few options:

  • use git as an internal tool in papers, without explicitly asking about it. papers undo (and a new command papers redo) could be used to navigate git history. The git repo would be saved in a central papers dir, using different branches to handle different bibtex locations (using a slug of the full bibtex path as branch name, for instance). That could work even without a proper installation. Maybe. Issue: bibtex rename would break the flow by creating a new branch. We could live with that.

  • propose hooks upon bibtex save. Here a whole workflow could be fine-tuned by users. Hooks could also be used internally to implement higher-level features.

  • add options to track files, sync with a remote server etc.
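For the first option, the branch name could be derived from the bibtex path roughly as follows. This is only a sketch: the function name `bib_branch_name`, the `bib/` prefix and the exact character mapping are assumptions for illustration, not something papers implements.

```shell
# Hypothetical helper: derive a git branch name from the full path of a bibtex file.
# The "bib/" prefix and the character mapping are illustrative assumptions.
bib_branch_name() {
    # Lowercase the path, turn every run of characters outside [a-z0-9]
    # into a single "-", then trim leading and trailing dashes.
    slug=$(printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cs 'a-z0-9' '-' \
        | sed -e 's/^-*//' -e 's/-*$//')
    printf 'bib/%s\n' "$slug"
}

bib_branch_name "/home/alice/My Project/papers.bib"
# prints: bib/home-alice-my-project-papers-bib
```

Note that such a mapping is lossy: two different paths (say `a/b.bib` and `a-b.bib`) end up on the same branch, so a real implementation would probably need a hash suffix or similar.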

For now I'll just leave this issue open to collect ideas. The current simplistic implementation works OK.

While there are many ways of implementing back-ups and git tracking, the git model of a local, self-contained folder is the most elegant in my opinion. It is easy to keep track of and to clean up (in contrast to a centralized repo with various branches for various files -- the number of branches would accumulate over time and be hard to maintain).

To avoid double-tracking and conflicts with an existing, larger git repo, it should be possible to simply add a .gitignore file next to the .papers directory (or append .papers to an existing .gitignore), and let the user choose whether to git-track or not in the first place. Git tracking will initially be opt-in, but could become opt-out if its usefulness outweighs other concerns, which I presume will be the case -- reliability is the number one concern when building a bibliography over time.
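Adding the ignore entry could look like the sketch below. The helper name `ignore_papers_dir` is made up, and writing the entry as `.papers/` (with a trailing slash, so that it only matches a directory) is my assumption.

```shell
# Hypothetical helper: make sure ".papers/" is listed in the .gitignore
# of the given directory (default: current directory). Safe to call repeatedly.
ignore_papers_dir() {
    gitignore="${1:-.}/.gitignore"
    # -x: whole-line match, -F: fixed string, -q: no output.
    if [ -f "$gitignore" ] && grep -qxF '.papers/' "$gitignore"; then
        return 0   # already ignored, nothing to do
    fi
    # If the file has content that does not end with a newline, add one first
    # so the new entry does not get glued to the last existing line.
    if [ -s "$gitignore" ] && [ -n "$(tail -c 1 "$gitignore")" ]; then
        printf '\n' >> "$gitignore"
    fi
    printf '.papers/\n' >> "$gitignore"
}
```

Calling `ignore_papers_dir /path/to/project` creates the .gitignore if it is missing and otherwise appends to it without duplicating the entry.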

To let papers handle git-tracking behind the scenes, any changes to the bibtex (and optionally, to the associated files) have to be mirrored to a dedicated git repo. If file-tracking is activated, the mirrored bibtex cannot be a mere copy, but needs to maintain its own "file" field pointing to local files. Hard links could be used for files to keep disk usage to a minimum -- at the expense of Windows users (workarounds, like a copy, could be found later for Windows users).
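The hard-link-with-fallback idea can be sketched in a few lines. The helper name `link_or_copy` is hypothetical; the fallback to a plain copy covers the cases where `ln` fails (different filesystem, or a platform without hard-link support).

```shell
# Hypothetical helper: mirror one file into the tracked directory,
# as a hard link when possible, as a regular copy otherwise.
link_or_copy() {
    src=$1
    dst=$2
    mkdir -p "$(dirname "$dst")"
    rm -f "$dst"                      # ln refuses to overwrite an existing file
    # A hard link costs no extra disk space; fall back to cp if ln fails.
    ln "$src" "$dst" 2>/dev/null || cp -p "$src" "$dst"
}
```

For example, `link_or_copy files/Some_Untidy_Name.pdf .papers/files/file1.pdf`. One caveat: a hard link shares its data with the original, so an in-place edit of the original also changes the mirrored file.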

For a local install, the resulting file structure would look like:

papers.bib          => that could be anywhere else
files/              => that could be anywhere else, or be an untidy collection of files
.gitignore          => so that no conflict arises with an already git-tracked repo
.papers/
    config.json
    papers.bib      => copy of bibtex with updated file links
    files/          => a tidy, renamed version of files
        file1.pdf   => could be a hard link toward the actual file, to save disk space
        ...
    .git            => yet another copy of papers.bib and files + history
    .gitattributes  => produced by `git lfs track files`

A global install would be pretty much the same, except that the .papers directory would be stored in some global location.

The model outlined above would ensure a solid backup whatever the user configuration. Restoring a previous bibtex would work with this sequence of commands:

cd .papers
git reset --hard HEAD^   # reset the git repo to the previous commit (or any other specific version)
cd ..
rm -f papers.bib         # start again from an empty bibtex
touch papers.bib
papers add .papers/papers.bib --rename --copy

The last line is not a perfect undo. It does keep track of the files, but it forces a rename.
This example shows that renaming may be a must for git-tracking of files.

The sequence of commands above can be used for undos until the beginning of time, but it cannot be used for redo. Here is an alternative sequence for papers undo, with a hack to keep track of future states (only the section between cd .papers and cd .. is written below):

git rev-parse HEAD >> futures   # remember the commit we are leaving
git reset --hard HEAD^

and for papers redo:

git reset --hard "$(tail -n 1 futures)"                      # jump to the most recently undone commit
sed '$d' futures > futures.tmp && mv -f futures.tmp futures  # drop that last line (portable, unlike head -n -1)

Any new modification to the bib would empty futures (no redo after branching out).
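Wrapped into functions, the undo/redo hack could look like the sketch below. The function names `papers_undo` and `papers_redo` are made up for illustration, and the guards (refusing to undo past the first commit, or to redo with an empty futures file) are my additions. Both functions assume they are run from inside the .papers repo and that futures is not itself tracked by git.

```shell
# Sketch only: run from inside the .papers git repo.
papers_undo() {
    # Refuse to undo past the very first commit.
    git rev-parse --verify --quiet HEAD^ >/dev/null \
        || { echo "nothing to undo" >&2; return 1; }
    git rev-parse HEAD >> futures          # push the current commit on the redo stack
    git reset --hard --quiet HEAD^
}

papers_redo() {
    # Refuse to redo when the stack is missing or empty.
    [ -s futures ] || { echo "nothing to redo" >&2; return 1; }
    git reset --hard --quiet "$(tail -n 1 futures)"
    # Pop the last entry from the stack.
    sed '$d' futures > futures.tmp && mv -f futures.tmp futures
}
```

Since `git reset --hard` leaves untracked files alone, futures survives the resets as long as it is never committed.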

Upon saving the bibliography, the following could work (a more efficient version would be needed to avoid moving files around when not necessary):

rm -rf .papers/papers.bib .papers/files
touch .papers/papers.bib
papers add papers.bib --bibtex .papers/papers.bib --filesdir .papers/files --no-check-duplicate
cd .papers
rm -f futures   # redo disabled; removed before staging so that futures never ends up in a commit
git add .
git commit -m 'action that triggered the change'
# maybe: git push remote --force
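One cheap improvement to the save sequence is to skip the commit when nothing actually changed, so that history (and the undo stack) is not filled with empty steps. A minimal sketch, with a hypothetical helper name `commit_if_changed`, meant to be run inside the .papers repo:

```shell
# Hypothetical helper: stage everything and commit only if something changed.
commit_if_changed() {
    git add -A
    # "git diff --cached --quiet" exits with 0 when nothing is staged
    # relative to the last commit, in which case there is nothing to record.
    if git diff --cached --quiet; then
        return 0
    fi
    git commit --quiet -m "$1"
}
```

It would replace the `git add .` / `git commit` pair, e.g. `commit_if_changed 'action that triggered the change'`.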

The model above is some kind of black box that leaves the implementation details to papers. Alternatively, a simpler, more transparent implementation would involve git tracking directly in the working directory:

papers.bib
files/
.papersconfig.json
.git
.gitattributes

Here plain git commands would work, without the need to move around bibtex and files each time the bibliography is saved.

Pros and cons of black-box, .papers model

  • Works regardless of the location of files and bibtex (=> will move/rename them anyways)
  • Minimal interference with an existing git-tracking (through .gitignore) => can keep parallel systems
  • Slower as full-size bibtex manipulation is necessary at every step
  • Somewhat counter-intuitively, that could be more universal despite the complexity, because locally we'd track the files in a standardized form.
  • Larger disk usage (but hard links can largely alleviate that issue)

Pros and cons of transparent, same-dir model

  • faster (no need to rewrite the whole bibliography each time)
  • less error prone (fewer actions needed, simpler)
  • easier to implement and maintain code-wise
  • cannot keep track of bibtex and files outside git directory
  • may interfere with an existing git install => can just give up on git tracking, or make the local install in a subfolder listed in .gitignore, or use a git submodule
  • does it add any benefit at all compared to just letting the user use git ?

While I am sensitive to the arguments of simplicity and maintenance, the very last point seems the strongest argument in favor of a black-box model. Or in favor of dropping the feature altogether. Since this issue is about doing something, let's discuss it further. In the case of an already-tracked project repo (which might be common for a local install), the only benefit of the transparent, same-dir model is to automate the commit / sync. That could also be addressed via some kind of hook on savebib, redo, undo (a set of commands stored in the config file). The black-box model, in contrast, would have a redo/undo system that operates regardless of whether the larger project is handled in git or not.
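To make the hook idea concrete, here is a rough sketch of a hook runner. Everything in it is an assumption: the commands are read from a plain-text file with one shell command per line (rather than from the JSON config), and the function name `run_hooks` and the `PAPERS_BIBTEX` variable are invented for the example.

```shell
# Hypothetical hook runner: execute each command listed in a hooks file.
#   $1: file with one shell command per line (blank lines and # comments skipped)
#   $2: path of the bibtex that was just saved, exposed to hooks as PAPERS_BIBTEX
run_hooks() {
    hooks_file=$1
    PAPERS_BIBTEX=$2
    export PAPERS_BIBTEX
    [ -f "$hooks_file" ] || return 0      # no hooks configured
    while IFS= read -r cmd || [ -n "$cmd" ]; do
        case $cmd in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        # stdin is redirected so that a hook cannot swallow the rest of the file.
        sh -c "$cmd" </dev/null || { echo "hook failed: $cmd" >&2; return 1; }
    done < "$hooks_file"
}
```

A user who wants the same-dir behavior could then put `git add "$PAPERS_BIBTEX"` and a `git commit` line in the savebib hooks file, and papers itself would not need to know about git at all.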

Now included in release 2.4.