Cannot open PDF files with commas in the name

Question

Closed this issue 3 years ago · 10 comments

I never had this problem until recently, but now I cannot open PDF files attached to a bib entry when the filename contains a comma. I use the file field to attach PDFs.

I can open files whose ASCII filenames contain no colon (:), semicolon (;), or comma (,). My understanding, however, was that only colons and semicolons are reserved in the file field and that commas are allowed. Indeed, I used commas for years with no issues.

Here's an example:

(setq bibtex-completion-library-path "/home/massimo/cloud/Papers/"
      bibtex-completion-pdf-field "file")

(setq
 examplegood '(("=key=" . "Bell2020AutomatingRegular")
               ("=type=" . "article")
               ("author" . "Zoe Bell")
               ("title" . "Automating Regular or Ordered Resolution is NP-Hard")
               ("pages" . "")
               ("journal" . "Electronic Colloquium on Computational Complexity {(ECCC)}")
               ("year" . "2020")
               ("volume" . "105")
               ("file" . ":Bell (2020) - Automating Regular or Ordered Resolution is NP-Hard.pdf:PDF")
               ("url" . "https://eccc.weizmann.ac.il/report/2020/105"))

 examplebad '(("=key=" . "GurevichShelah1996FiniteRigid")
              ("=type=" . "article")
              ("author" . "Yuri Gurevich and Saharon Shelah")
              ("title" . "On Finite Rigid Structures")
              ("file" . ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
              ("journal" . "J. Symb. Log.")
              ("year" . "1996")
              ("volume" . "61")
              ("number" . "2")
              ("pages" . "549\\nobreakdash--562")
              ("url" . "https://doi.org/10.2307/2275675")
              ("doi" . "10.2307/2275675")))

(bibtex-completion-get-value "file" examplebad)    ;;finds the path
(bibtex-completion-get-value "file" examplegood)   ;;finds the path

(bibtex-completion-find-pdf-in-field examplegood)  ;;finds the path
(bibtex-completion-find-pdf-in-field examplebad)   ;; returns nil

The likely culprit is the expression

(replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)

in the function bibtex-completion-find-pdf-in-field. It breaks on commas because it treats them as separators in the file field.
I don't know what the formal spec for the file field is, but I am pretty sure commas were not a problem until recently.
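
To make the failure concrete, the substitution can be run by hand on the file value of examplebad. This is only an illustration outside bibtex-completion: the function name my/split-like-current-code is hypothetical, and splitting on the inserted ^^ character with the built-in split-string is an assumption about what happens next, not a copy of the package's code.

```elisp
;; Hypothetical helper: apply the suspected substitution to a file field
;; value, then split on the ^^ control character that it inserts.
(defun my/split-like-current-code (value)
  "Split VALUE the way the suspected expression would mark it."
  (split-string
   (replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)
   "\^^"))

(my/split-like-current-code
 ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
;; => (":Gurevich" " Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
```

The comma after "Gurevich" is consumed as a separator, so neither piece names an existing file.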

Answer 1 · 2021-04-15T07:43:43.000Z

Calibre uses commas to separate multiple PDFs. See #360. The trouble is that there's no standard syntax for the file field and every bibliography manager uses their own variant. That's the reason why I personally don't use the file field. In my setup, the filenames of PDFs follow the scheme [BibTeX-key].pdf. This also speeds up parsing of the bibliography a bit.
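
As a rough sketch of the [BibTeX-key].pdf scheme: the path is derived from the entry key alone, so no file field has to be parsed. The function name my/pdf-path-for-key is hypothetical and only shows the mapping; it is not how bibtex-completion implements its library lookup.

```elisp
;; Hypothetical helper illustrating the KEY.pdf naming scheme.
(defun my/pdf-path-for-key (key library-dir)
  "Return the path of KEY.pdf inside LIBRARY-DIR."
  (expand-file-name (concat key ".pdf") library-dir))

(my/pdf-path-for-key "GurevichShelah1996FiniteRigid"
                     "/home/massimo/cloud/Papers/")
;; => "/home/massimo/cloud/Papers/GurevichShelah1996FiniteRigid.pdf"
```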

Answer 2 · 2021-04-15T07:47:10.000Z

I'd probably just mass-replace commas with underscores (or similar). This is easy to do in dired (C-x C-q).

Edit: Closing because I'm afraid there is nothing we can do to resolve this conflict. Feel free to reopen if you have an idea.
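
For anyone who does go the renaming route, the mass rename could also be scripted. This is a minimal sketch with hypothetical function names; it only touches the files on disk, so the file fields in the bibliography would still have to be updated separately.

```elisp
;; Hypothetical helpers; not part of dired or bibtex-completion.
(defun my/comma-free-name (name)
  "Return NAME with every comma replaced by an underscore."
  (replace-regexp-in-string "," "_" name t t))

(defun my/rename-comma-files (directory)
  "Rename each file in DIRECTORY whose name contains a comma.
Return a list of (OLD . NEW) pairs for the files that were renamed."
  (let (renamed)
    (dolist (old (directory-files directory t ","))
      (let ((new (expand-file-name
                  (my/comma-free-name (file-name-nondirectory old))
                  directory)))
        ;; `rename-file' signals an error if NEW already exists,
        ;; so nothing is overwritten silently.
        (rename-file old new)
        (push (cons old new) renamed)))
    (nreverse renamed)))
```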

Answer 3 · 2021-04-15T09:28:10.000Z

Thank you for the quick answer.

I'd rather not touch the paper file names for various reasons:

- there may be other references to it, since a filename is a "public API"
- they are more readable when I look for them outside Emacs (tablet readers, ...)
- machines should adapt to human formats, not vice versa ;)
- I want to decouple file names and BibTeX keys, because in a long-lived BibTeX database local fixes are sometimes needed on either side

I will likely add a configuration variable: if we specify the method of attachment (i.e. the file field), why not also specify the convention that the field uses (with backward-compatible defaults)? I'll open a pull request eventually, and then you can decide what to do with it.
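
A configuration variable along these lines might look roughly as follows. Both names are hypothetical and nothing like them exists in bibtex-completion; the default of "[;,]" is chosen only to mirror the separators in the expression quoted in the question.

```elisp
;; Hypothetical option; the default keeps the current behavior.
(defcustom my/file-field-separators "[;,]"
  "Regexp matching the characters that separate records in the file field."
  :type 'regexp)

(defun my/split-file-field (value)
  "Split VALUE into records using `my/file-field-separators'."
  (split-string
   (replace-regexp-in-string
    (concat "\\([^\\]\\)" my/file-field-separators) "\\1\^^" value t)
   "\^^"))

;; A user whose file names contain commas would set the option to ";":
(let ((my/file-field-separators ";"))
  (my/split-file-field
   ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF"))
;; => (":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
```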

Question: where do you find the Zotero, Calibre, ... file field conventions? I did not even know that Calibre could export bib file with attached documents.

Answer 4 · 2021-04-15T10:17:59.000Z

Good reasons to stick with your current names. But I don't think a config option is the right way to go. The assumption of the current code is that there is a standard format that works for all users. But that assumption simply doesn't hold. What we need is a solution that recognizes the reality that every bibliography manager has its own dialect: some kind of plug-in system, with, for instance, a function bibtex-completion-find-pdf-calibre and so on. Then users can select the right plug-in or even specify multiple plug-ins in case their bibliographies are messy (combined from multiple sources). Users would also be able to easily supply functions for other dialects that are not covered (yet).

> Question: where do you find the Zotero, Calibre, ... file field conventions? I did not even know that Calibre could export bib file with attached documents.

There is no written convention that I'm aware of. It's all reverse-engineered. :)

Answer 5 · 2021-04-15T10:20:50.000Z

The approach that I describe above (plug-ins) shouldn't be too difficult to implement actually and it would address many related issues that have accumulated over time. I just didn't find the time to implement this yet. If you feel inspired, let me know and I will be happy to provide input.
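
To give a feel for the idea, here is a very rough sketch of what user-selectable parsers could look like. Every name below is hypothetical, the two dialect parsers are deliberately naive (they ignore escaped separators), and the real design might differ considerably.

```elisp
;; All names below are hypothetical; none exist in bibtex-completion.

(defun my/file-records-semicolon (value)
  "Split VALUE on semicolons (Zotero/Mendeley/JabRef-style separator)."
  (split-string value ";" t))

(defun my/file-records-comma (value)
  "Split VALUE on commas (Calibre-style separator)."
  (split-string value "," t))

(defvar my/file-record-parsers '(my/file-records-semicolon)
  "Parser functions applied to the file field, as selected by the user.")

(defun my/file-records (value)
  "Return the candidate records found in VALUE by all selected parsers."
  (delete-dups
   (apply #'append
          (mapcar (lambda (parser) (funcall parser value))
                  my/file-record-parsers))))
```

With only the semicolon parser selected, a value with a comma in the file name comes back as a single record. With both parsers selected, the result contains the candidates from each dialect, and the caller would keep whichever ones point to existing files.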

Answer 6 · 2021-04-15T15:02:38.000Z

Well, let's start by reopening the issue. I'll try to give it a shot eventually.

Answer 7 · 2021-08-01T07:14:05.000Z

Sorry for introducing the bug. I switched to a stricter regular expression so that it works both for comma-separated records and for commas inside file names.

I hope this fixes the issue: #385

Answer 8 · 2021-08-01T11:38:21.000Z

Thanks for the PR, @yuchen-lea. The diff is not terribly helpful (as is often the case for lisp code). Could you briefly summarize how you solved the problem? Thank you!

Answer 9 · 2021-08-02T04:52:05.000Z

I changed the original

(replace-regexp-in-string "\\([^\\]\\)[;,]" "\\1\^^" value)

to a function:

(defun bibtex-completion-get-file-record (pdf-field-value)
  "Return the list of records obtained by splitting PDF-FIELD-VALUE."
  ;; Zotero/Mendeley/JabRef format: records are separated by semicolons.
  (setq pdf-field-value (replace-regexp-in-string "\\([^\\]\\);" "\\1\^^" pdf-field-value))
  ;; Calibre format: records are separated by commas. Only a comma that
  ;; directly follows ".EXT:TYPE" is treated as a separator.
  (setq pdf-field-value (replace-regexp-in-string "\\(\\.[A-Za-z0-9]+:[A-Za-z0-9]+\\)," "\\1\^^" pdf-field-value))
  (s-split "\^^" pdf-field-value))

This way, only a comma that separates two records is replaced, while a comma inside a file name is kept.
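
The behavior can be checked with a self-contained variant. This sketch makes a few assumptions: the name my/split-file-records is hypothetical, the dot is written as "\\." so that it matches a literal dot, the built-in split-string stands in for s-split to avoid the dependency, and the comma-separated sample value is only a guess at what a Calibre export looks like.

```elisp
;; Hypothetical stand-alone variant of the function described above.
(defun my/split-file-records (value)
  "Split the file field VALUE into a list of records."
  ;; Semicolons (not preceded by a backslash) always separate records.
  (setq value (replace-regexp-in-string
               "\\([^\\]\\);" "\\1\^^" value t))
  ;; A comma separates records only directly after ".EXT:TYPE".
  (setq value (replace-regexp-in-string
               "\\(\\.[A-Za-z0-9]+:[A-Za-z0-9]+\\)," "\\1\^^" value t))
  (split-string value "\^^"))

;; A comma inside the file name is preserved:
(my/split-file-records
 ":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")
;; => (":Gurevich, Shelah (1996) - On Finite Rigid Structures.pdf:PDF")

;; A comma after ".pdf:pdf" still splits (assumed Calibre-like value):
(my/split-file-records ":/a/b.pdf:pdf,:/a/c.epub:epub")
;; => (":/a/b.pdf:pdf" ":/a/c.epub:epub")
```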

Answer 10 · 2021-08-23T12:22:08.000Z

@yuchen-lea, sorry for the slow response. The code for finding PDFs is so incredibly messy (entirely my fault) that I hesitate to make further changes to it. In this particular case, I worry that we may fix things for some users and break things for others. It's become so hard to predict.

We really need a flexible plug-in approach which allows users to tailor finding PDFs to their own needs and bibliographies. The whole idea that there is a single approach that suits everyone was a mistake. I completely underestimated how many different formats there are. The good news is that it shouldn't be difficult to come up with a clean system, and it's probably also going to speed up loading the library because we only need to consider the relevant cases, not all possible cases.