tmalsburg/helm-bibtex

Cannot open PDF files with commas in the name

Closed this issue · 10 comments

This used to work, but for some time now I have been unable to open PDF files attached to a bib entry when the filename contains a comma. I use the file field to attach PDFs.

I can open files normally when the ASCII filename contains no : (colon), ; (semicolon), or , (comma). My understanding, however, was that only colons and semicolons were reserved in the file field and that commas were allowed. Indeed, I used commas for years with no issues.

Here's an example:

(setq bibtex-completion-library-path "/home/massimo/cloud/Papers/"
      bibtex-completion-pdf-field "file")

(setq
 examplegood '(("=key=" . "Bell2020AutomatingRegular")
               ("=type=" . "article")
               ("author" . "Zoe Bell")
               ("title" . "Automating Regular or Ordered Resolution is NP-Hard")
               ("pages" . "")
               ("journal" . "Electronic Colloquium on Computational Complexity {(ECCC)}")
               ("year" . "2020")
               ("volume" . "105")
               ("file" . ":Bell (2020) - Automating Regular or Ordered Resolution is NP-Hard.pdf:PDF")
               ("url" . "https://eccc.weizmann.ac.il/report/2020/105"))

 examplebad '(("=key=" . "GurevichShelah1996FiniteRigid")
              ("=type=" . "article")
              ("author" . "Yuri Gurevich and Saharon Shelah")
              ("title" . "On Finite Rigid Structures")
              ("file" . ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
              ("journal" . "J. Symb. Log.")
              ("year" . "1996")
              ("volume" . "61")
              ("number" . "2")
              ("pages" . "549\\nobreakdash--562")
              ("url" . "https://doi.org/10.2307/2275675")
              ("doi" . "10.2307/2275675")))

(bibtex-completion-get-value "file" examplebad)    ;;finds the path
(bibtex-completion-get-value "file" examplegood)   ;;finds the path

(bibtex-completion-find-pdf-in-field examplegood)  ;;finds the path
(bibtex-completion-find-pdf-in-field examplebad)   ;; returns nil

The likely culprit is the expression

(replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)

in the function bibtex-completion-find-pdf-in-field. It kills the commas because it considers them separators in the file field.
I don't know what the formal spec for the file field is, but I am pretty sure commas were not a problem until recently.
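
The effect of that expression can be checked in isolation. What follows is a minimal sketch using only built-in functions: the name `my/split-file-field-old` is made up for illustration, and `split-string` stands in for whatever splitting the package performs after the substitution.

```elisp
;; Hypothetical helper for illustration; not part of helm-bibtex.
(defun my/split-file-field-old (value)
  "Split VALUE the way the quoted expression would (illustration only)."
  (split-string
   ;; Any `;' or `,' not preceded by a backslash is replaced by the
   ;; separator character ^^ (ASCII 30) ...
   (replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)
   ;; ... and the value is then cut at every separator.
   "\^^"))

(my/split-file-field-old
 ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
;; => (":Gurevich" " Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
```

Neither piece names an existing file, which would explain why the lookup returns nil for the entry with a comma, while a value without commas or semicolons stays in one piece.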

Calibre uses commas to separate multiple PDFs. See #360. The trouble is that there's no standard syntax for the file field and every bibliography manager uses their own variant. That's the reason why I personally don't use the file field. In my setup, the filenames of PDFs follow the scheme [BibTeX-key].pdf. This also speeds up parsing of the bibliography a bit.

I'd probably just mass-replace commas by underscores (or similar). Easy to do with dired (C-x C-q).

Edit: Closing because I'm afraid there is nothing we can do to resolve this conflict. Feel free to reopen if you have an idea.

Thank you for the quick answer.

I'd rather not touch the paper file names for various reasons:

  1. there may be other references to it, since a filename is a "public API"
  2. they are more readable when I look for them outside emacs (tablet readers, ...)
  3. machines should adapt to human formats, not vice versa ;)
  4. I want to decouple file names and BibTeX keys because sometimes local fixes are needed on both sides in a long-term BibTeX DB

I will likely add a configuration variable: if we specify the method of attachment (i.e. the file field), why not also specify the convention that the given field uses (with backward-compatible defaults)? I'll do a pull request eventually and then you can decide what to do with it.

Question: where do you find the Zotero, Calibre, ... file field conventions? I did not even know that Calibre could export a bib file with attached documents.

Good reasons to stick with your current names. But I don't think a config option is the right way to go. The assumption of the current code is that there is a standard format that works for all users. But that assumption simply doesn't hold. What we need is a solution that recognizes the reality that every bibliography manager has its own dialect. Some kind of plug-in system. For instance, a function bibtex-completion-find-pdf-calibre and so on. Then users can select the right plug-in or even specify multiple plug-ins in case their bibliographies are messy (combined from multiple sources). Users would also be able to easily supply functions for other dialects that are not covered (yet).

Question: where do you find the Zotero, Calibre, ... file field conventions? I did not even know that Calibre could export bib file with attached documents.

There is no written convention that I'm aware of. It's all reverse-engineered. :)

The approach that I describe above (plug-ins) shouldn't be too difficult to implement actually and it would address many related issues that have accumulated over time. I just didn't find the time to implement this yet. If you feel inspired, let me know and I will be happy to provide input.
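
To make the plug-in idea concrete, here is a rough sketch of one possible shape. Everything prefixed `my/` is hypothetical and none of it exists in helm-bibtex. It assumes records of the form DESCRIPTION:PATH:TYPE, ignores escaped separators, and takes an optional existence predicate so that it can be tried without real files.

```elisp
;; Hypothetical plug-in sketch; none of these names exist in helm-bibtex.
(require 'cl-lib)

(defun my/record-path (record)
  "Return the PATH part of RECORD, assumed to look like DESCRIPTION:PATH:TYPE.
If RECORD does not have that shape, return it unchanged."
  (if (string-match "\\`[^:]*:\\(.*\\):[^:]*\\'" record)
      (match-string 1 record)
    record))

(defun my/parse-semicolon-records (value)
  "Dialect plug-in: records in VALUE are separated by semicolons."
  (mapcar #'my/record-path (split-string value ";" t)))

(defun my/parse-comma-records (value)
  "Dialect plug-in: records in VALUE are separated by commas."
  (mapcar #'my/record-path (split-string value "," t)))

(defvar my/file-field-parsers '(my/parse-semicolon-records)
  "Dialect plug-ins to try, in order.
Each takes the file field value and returns a list of candidate paths.")

(defun my/find-pdfs (value &optional exists-p)
  "Return the candidates from the first plug-in that finds existing files.
EXISTS-P defaults to `file-exists-p'."
  (let ((exists-p (or exists-p #'file-exists-p)))
    (cl-some (lambda (parser)
               (cl-remove-if-not exists-p (funcall parser value)))
             my/file-field-parsers)))

;; A messy bibliography could list several dialects.  The predicate below is
;; a stand-in for `file-exists-p' so that the example needs no files on disk.
(let ((my/file-field-parsers '(my/parse-comma-records
                               my/parse-semicolon-records)))
  (my/find-pdfs ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF"
                (lambda (file) (string-suffix-p ".pdf" file))))
;; => ("Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf")
```

Because each dialect lives in its own function, a user whose bibliography comes from a single tool would list only that tool's plug-in, and unsupported dialects could be covered by adding a function to the list.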

Well, let's start by reopening the issue. I'll try to give it a shot eventually.

Sorry for introducing the bug... I switched to a stricter regular expression so that it works both for comma-separated file records and for commas inside file names.

Hope this fixes the issue. #385

Thanks for the PR, @yuchen-lea. The diff is not terribly helpful (as is often the case for lisp code). Could you briefly summarize how you solved the problem? Thank you!

I changed the original

(replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)

to a function:

(defun bibtex-completion-get-file-record (pdf-field-value)
  "Return the list of file records split from PDF-FIELD-VALUE."
  ;; Zotero/Mendeley/JabRef format: records are separated by unescaped semicolons.
  (setq pdf-field-value (replace-regexp-in-string "\\([^\\]\\);" "\\1\^^" pdf-field-value))
  ;; Calibre format: a comma separates records only right after ".EXT:TYPE".
  ;; `\\.' matches a literal dot (a bare `\.' in a Lisp string is just `.').
  (setq pdf-field-value (replace-regexp-in-string "\\(\\.[A-Za-z0-9]+:[A-Za-z0-9]+\\)," "\\1\^^" pdf-field-value))
  (s-split "\^^" pdf-field-value))

This way, only the commas that separate multiple records are replaced, while commas inside file names are kept.
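
The difference can be tried out with a stand-alone restatement of the same two-step idea. This is a sketch only: `my/split-file-records` is a made-up name, `split-string` replaces `s-split` so that no external library is needed, and the comma-separated sample value is an assumed Calibre-style field inferred from the regular expression above, not taken from Calibre's documentation.

```elisp
;; Hypothetical restatement using built-ins only; not the code from the PR.
(defun my/split-file-records (value)
  "Split the file field VALUE into records (illustration only)."
  (let* ((sep "\^^")                    ; ASCII 30, used as a temporary marker
         ;; Step 1: a semicolon not preceded by a backslash ends a record.
         (value (replace-regexp-in-string
                 "\\([^\\]\\);" (concat "\\1" sep) value))
         ;; Step 2: a comma ends a record only right after ".EXT:TYPE".
         (value (replace-regexp-in-string
                 "\\(\\.[A-Za-z0-9]+:[A-Za-z0-9]+\\)," (concat "\\1" sep) value)))
    (split-string value sep)))

;; A comma inside a file name is left alone:
(my/split-file-records
 ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
;; => (":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")

;; A comma right after ".EXT:TYPE" still separates records:
(my/split-file-records ":a.pdf:PDF,:b.epub:EPUB")
;; => (":a.pdf:PDF" ":b.epub:EPUB")
```

The heuristic is still a guess about the dialect: a file name that itself contains something like ".v2:draft," would be cut at that comma.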

@yuchen-lea, sorry for the slow response. The code for finding PDFs is so incredibly messy (entirely my fault) that I hesitate to make further changes to it. In this particular case, I worry that we may fix things for some users and break things for others. It's become so hard to predict.

We really need a flexible plug-in approach which allows users to tailor finding PDFs to their own needs and bibliographies. The whole idea that there is a single approach that suits everyone was a mistake. I completely underestimated how many different formats there are. The good news is that it shouldn't be difficult to come up with a clean system, and it's probably also going to speed up loading the library because we only need to consider the relevant cases, not all possible cases.