retorquere/bibtex-parser

Export to CSL JSON

hubgit opened this issue · 7 comments

Is a mapping available anywhere from this tool's output format to the standard fields of CSL JSON?

This parser doesn't return a fixed list of fields -- whatever is in the bibtex (and that can vary wildly in practice) will be returned, and no effort is made to normalize this towards something that could easily be turned into CSL; you might instead be looking for https://github.com/dsifford/astrocite.

This library just parses the bibtex into the equivalent JSON (as close as possible), so if there's a month field in the bibtex, you'd find a month field in the JSON. It post-processes the astrocite-bibtex parser results to ensure that things like case protection work as intended, names are parsed properly, @string definitions are expanded, and a few other things. Currently the main target is to drive https://github.com/retorquere/zotero-better-bibtex/

I see, thanks for the explanation. I'd added both astrocite and this library to a comparison and hadn't spotted that both were using the same parser.

If there's any difference in the final output I might still look at a map to CSL, partly to ease comparison.
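A rough idea of what such a map could look like, as a minimal sketch only. Neither bibtex-parser nor astrocite ships this table; `FIELD_TO_CSL`, `TYPE_TO_CSL` and `to_csl` are hypothetical names, and the input is assumed to be a plain dict of already-cleaned field values rather than this library's actual output shape.

```python
# Hypothetical mapping from common BibTeX field names to CSL JSON variables.
# Illustration only; not part of bibtex-parser or astrocite.
FIELD_TO_CSL = {
    "title": "title",
    "journal": "container-title",
    "booktitle": "container-title",
    "publisher": "publisher",
    "address": "publisher-place",
    "location": "publisher-place",
    "volume": "volume",
    "number": "issue",  # right for articles; reports would want "number"
    "pages": "page",
    "doi": "DOI",
    "issn": "ISSN",
    "url": "URL",
    "abstract": "abstract",
    "note": "note",
    "language": "language",
}

TYPE_TO_CSL = {
    "article": "article-journal",
    "book": "book",
    "incollection": "chapter",
    "inproceedings": "paper-conference",
    "techreport": "report",
}


def to_csl(entry_type, key, fields):
    """Map plain-text fields onto a CSL JSON item; unknown fields are dropped."""
    fields = {name.lower(): value for name, value in fields.items()}
    # "document" is used here as a fallback type for unmapped entry types.
    item = {"id": key, "type": TYPE_TO_CSL.get(entry_type.lower(), "document")}
    for name, value in fields.items():
        csl_name = FIELD_TO_CSL.get(name)
        if csl_name == "page":
            item[csl_name] = value.replace("--", "-")
        elif csl_name:
            item[csl_name] = value
    # Only a bare year is handled; months and full dates are ignored.
    year = fields.get("year") or fields.get("date", "")[:4]
    if year.isdigit():
        item["issued"] = {"date-parts": [[int(year)]]}
    return item
```

Because the parser returns whatever fields the bibtex contains, any such table is necessarily incomplete; fields it does not know are silently dropped in this sketch.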

The output of both will still be different, that's why I do my own post-processing - the test case in the comparison linked to is fairly simple and won't highlight the differences. You could try the bib files in my test suite (https://github.com/retorquere/bibtex-parser/tree/master/tests/better-bibtex/import) to get more edge cases. These are not synthetic test cases (except one, I think); they are actual samples that people have handed to me.

Oh and to be clear - I don't post-process the CSL that astrocite yields, I parse the half-product that their bibtex-to-ast parser yields (part of which is by my hand). Bibtex parsing is ridiculously complicated if you go beyond just wanting the textual content, and both my parser and biblatex-csl-converter go to great lengths to make sure that the intent expressed in the bibtex (including sentence casing, which is a horror show, but also new-style biblatex name parsing) comes across in the parsing process.

Oh and also, my parser has error recovery. If astrocite hits a syntax error, nothing is returned.
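To illustrate the general idea of error recovery (not how this library actually implements it, which isn't described here), a minimal sketch: split the input at each `@` that starts a line, parse every chunk on its own, and report failures instead of aborting. `parse_entry` is a hypothetical stand-in that only checks the header and the brace balance.

```python
import re

ENTRY_START = re.compile(r"^\s*@", re.MULTILINE)
HEADER = re.compile(r"@(\w+)\s*[{(]\s*([^,\s]*)\s*,")


def parse_entry(chunk):
    """Hypothetical strict parser: checks the entry header and brace balance only."""
    header = HEADER.match(chunk.strip())
    if not header:
        raise ValueError("malformed entry header")
    if chunk.count("{") != chunk.count("}"):
        raise ValueError("unbalanced braces")
    return {"type": header.group(1).lower(), "key": header.group(2)}


def parse_with_recovery(text):
    """Parse entry by entry; a bad entry is recorded and the rest still parse."""
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    entries, errors = [], []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunk = text[begin:end]
        try:
            entries.append(parse_entry(chunk))
        except ValueError as err:
            errors.append({"source": chunk.strip(), "error": str(err)})
    return entries, errors
```

With a file of three entries where the second has an unclosed brace, this returns the first and third entries plus one error record, rather than nothing at all.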

(I also co-wrote biblatex-csl-converter)

bibtexParseJs, Citation.js and bibtex-parser simply fail to parse a number of (valid) entries from my test suite; bib2json and bibtex-parser are only usable for simple display cases because they don't break out creator names nor do they replace bibtex commands. It has to be said that my own parser takes a few shortcuts that make the parsed data easily consumed by Zotero; I don't break out particles (because Zotero holds them by convention in either the family or the given name), and I convert some TeX markup to HTML-ish code, because citeproc (which is what Zotero uses) understands it.

A couple of interesting cases:

parsing of \ocirc{u}

@incollection {MR1870153,
         AUTHOR = {Franc{\ocirc{u}}, Jan and Krej{\v{c}}{\'{\i}}, Pavel},
              TITLE = {Homogenization of scalar wave equation with hysteresis
                    operator},
       } 

bib(la)tex titles are assumed to be in Title Case, CSL titles are assumed to be in sentence case. Because no parser except biblatex-csl-converter and bibtex-parser deals with this, there's not much point in showing the intricacies of case (un)protection using braces, because there's really nothing to compare it to for the others.

@Book{Demus1984,
Title = {The Mosaics of {San Marco} in {Venice}},
Author = {Demus, Otto},
Date = {1984},
Location = {Chicago},
Publisher = {University of Chicago Press},
Annotation = {a "monumental" work on the mosaics of San Marco},
} 
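For the entry above, the basic Title Case to sentence case idea can be sketched as follows. This is a deliberately naive illustration, assuming only flat (non-nested) brace groups; `sentence_case` is a hypothetical helper, and it ignores the real intricacies (nested braces, commands such as `{\sc ...}` that do not protect case, capitals after a colon).

```python
import re

# Either a flat brace group (captured without its braces) or a run of plain text.
TOKEN = re.compile(r"\{([^{}]*)\}|([^{}]+)")


def sentence_case(title):
    """Lower-case unprotected text; keep brace-protected text verbatim."""
    out = []
    for protected, plain in TOKEN.findall(title):
        out.append(protected if protected else plain.lower())
    result = "".join(out)
    # Sentence case keeps the first character capitalised.
    return result[:1].upper() + result[1:]


print(sentence_case("The Mosaics of {San Marco} in {Venice}"))
# The mosaics of San Marco in Venice
```

Without the braces around "San Marco" and "Venice", both would be lower-cased, which is why brace protection has to survive the parse.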

EndNote at one point exported entries without keys:

@Book{
Title = {The Mosaics of {San Marco} in {Venice}},
Author = {Demus, Otto},
Date = {1984},
Location = {Chicago},
Publisher = {University of Chicago Press},
Annotation = {a "monumental" work on the mosaics of San Marco},
}  

@string definitions:

@String{pub-FRED = "Freds Publishing"}
@String{pub-FRED:adr = "London, UK"}
  
@Book{Bert:2001:SQL,
  author = "R. A. Bert",
  title = "SQL is great",
  publisher = pub-FRED,
  address = pub-FRED:adr,
} 
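A minimal sketch of what expanding these definitions involves, including the `#` concatenation used in the next example. The helper names are hypothetical, only double-quoted `@String` values are recognised, and splitting on `#` is naive (a `#` inside braces or quotes would break it).

```python
import re

STRING_DEF = re.compile(r'@string\s*\{\s*([^\s=]+)\s*=\s*"([^"]*)"\s*\}', re.IGNORECASE)


def collect_strings(bibtex):
    """Return {macro name: value} for every @String{name = "value"} definition."""
    return {name.lower(): value for name, value in STRING_DEF.findall(bibtex)}


def expand_value(raw, strings):
    """Expand one raw field value; parts joined with # are resolved and concatenated."""
    parts = []
    for part in raw.split("#"):
        part = part.strip()
        if part.startswith(("{", '"')):
            parts.append(part[1:-1])  # literal: drop the outer delimiters
        elif part.isdigit():
            parts.append(part)        # bare numbers stand for themselves
        else:
            # Bare word: look up the macro, falling back to the name itself.
            parts.append(strings.get(part.lower(), part))
    return "".join(parts)


source = '''
@String{pub-FRED = "Freds Publishing"}
@String{pub-FRED:adr = "London, UK"}
'''
strings = collect_strings(source)
print(expand_value("pub-FRED", strings))      # Freds Publishing
print(expand_value("pub-FRED:adr", strings))  # London, UK
```

Note that macro names may contain characters such as `-` and `:`, so a name pattern like `\w+` would not be enough.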

round braces are valid bibtex, \sc is a command, some of the parsers generate two spaces before CONNIVER, and {\sc ...} does not case-protect its content.

@String{MCDERMOTT = "McDermott, Drew V."}
@techreport(McDermott:72,
author =  MCDERMOTT # { and Gerald J. Sussman},
year =  {1972},
month = May,
institution = {MIT Artificial Intelligence  Laboratory},
title = {The {\sc CONNIVER} Reference Manual}
)

single-part names:

@article{sasson_increasing_2013,
  title = {Increasing cardiopulmonary resuscitation provision in communities with low bystander cardiopulmonary resuscitation rates: a science advisory from the American Heart Association for healthcare providers, policymakers, public health departments, and community leaders},
  volume = {127},
  issn = {1524-4539},
  shorttitle = {Increasing cardiopulmonary resuscitation provision in communities with low bystander cardiopulmonary resuscitation rates},
  doi = {10.1161/CIR.0b013e318288b4dd},
  language = {eng},
  number = {12},
  journal = {Circulation},
  author = {Sasson, Comilla and Meischke, Hendrika and Abella, Benjamin S and Berg, Robert A and Bobrow, Bentley J and Chan, Paul S and Root, Elisabeth Dowling and Heisler, Michele and Levy, Jerrold H and Link, Mark and Masoudi, Frederick and Ong, Marcus and Sayre, Michael R and Rumsfeld, John S and Rea, Thomas D and {American Heart Association Council on Quality of Care and Outcomes Research} and {Emergency Cardiovascular Care Committee} and {Council on Cardiopulmonary, Critical Care, Perioperative and Resuscitation} and {Council on Clinical Cardiology} and {Council on Cardiovascular Surgery and Anesthesia}},
  month = {mar},
  year = {2013},
  note = {{PMID:} 23439512},
  keywords = {Administrative Personnel, American Heart Association, Cardiopulmonary Resuscitation, Community Health Services, Health Personnel, Heart Arrest, Humans, Leadership, Public Health, United States},
  pages = {1342--1350}
}
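The point of this sample is that " and " and commas inside a braced name must not be treated as separators. A minimal sketch of that splitting rule, with hypothetical helper names and CSL-style `family`/`given`/`literal` keys as the assumed output; it does not attempt particles, suffixes, or new-style biblatex names.

```python
def to_name(name):
    """Turn one name string into a CSL-style name dict."""
    if name.startswith("{") and name.endswith("}"):
        return {"literal": name[1:-1]}  # single-part (e.g. institutional) name
    if "," in name:
        family, given = [part.strip() for part in name.split(",", 1)]
        return {"family": family, "given": given}
    given, _, family = name.rpartition(" ")
    return {"family": family, "given": given}


def split_names(author_field):
    """Split on ' and ' at brace depth 0 only, then classify each name."""
    names, depth, current, i = [], 0, "", 0
    while i < len(author_field):
        ch = author_field[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth == 0 and author_field[i:i + 5].lower() == " and ":
            names.append(current)
            current = ""
            i += 5
            continue
        current += ch
        i += 1
    names.append(current)
    return [to_name(n.strip()) for n in names if n.strip()]
```

Applied to the author field above, "{Council on Cardiopulmonary, Critical Care, Perioperative and Resuscitation}" comes out as one literal name instead of being split at its internal "and" or commas.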

math and sub/superscript:

@InProceedings{test_citation1,
  Title                    = {{T}est {T}itle {W}ith 100~$\mu${J}, 200\,{\mbox{$\mu$}}{J} {E}nergy, and $\pm$0.1\% {A}ccuracy, 0.2\,mm$^2$ {S}ize, and $-$50\,d{B} {A}ttenuation},
  Author                   = {Doe, J. and Smith, R.}, 
  Booktitle                = IEEE_J_PROC,
  Year                     = {2016},
  Month                    = {Feb},
  Pages                    = {300--301},
  Abstract                 = {Abstract here},
  DOI                      = {10.1109/JPROC.2016.2526118},
  ISSN                     = {0193-6530},
  Keywords                 = {Keyword 1; Keyword 2}
}

@InProceedings{test_citation2,
  Title                    = {{T}est {T}itle with space\quad quad, space\;semi-colon, space\:colon, space\,comma, space~nbsp},
  Author                   = {Doe, J. and Smith, R.},
  Booktitle                = IEEE_J_PROC,
  Year                     = {2016},
  Month                    = {Feb},
  Pages                    = {300--301},
  Abstract                 = {Abstract here},
  DOI                      = {10.1109/JPROC.2016.2526118},
  ISSN                     = {0193-6530},
  Keywords                 = {Keyword 1; Keyword 2}
}

name with accents:

@inproceedings{rehurek_lrec, 
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50}, 
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
url={http://is.muni.cz/publication/884893/en},
language={English}
}
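One way to resolve accent commands like the ones in this author field is to append the matching Unicode combining character and normalise to NFC. The sketch below covers only four commands and single ASCII letters; the table and function name are illustrative assumptions, not this library's implementation.

```python
import re
import unicodedata

# TeX accent command -> Unicode combining character (only a few shown).
COMBINING = {
    "v": "\u030c",  # caron
    "r": "\u030a",  # ring above
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
}

# Matches forms like {\v R}, {\v{c}} and {\'e}. Deliberately loose: it does
# not check that the optional inner braces are balanced.
ACCENT = re.compile(r"\{\\([vr'`])\s*\{?([A-Za-z])\}?\}")


def tex_accents_to_unicode(text):
    """Replace supported accent commands with precomposed characters."""
    def replace(match):
        command, letter = match.groups()
        return unicodedata.normalize("NFC", letter + COMBINING[command])
    return ACCENT.sub(replace, text)


print(tex_accents_to_unicode(r"Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka"))
# Radim Řehůřek and Petr Sojka
```

Commands that are not in the table, such as `\i` in the first example of this list, are left untouched by this sketch.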