el33th4x0r/crosstex

CrossTex fails on bib files with comments

Opened this issue · 5 comments

For instance, consider IEEEabrv.bib, distributed in the IEEE transactions template.

See http://tug.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/bibtex/IEEEabrv.bib

Hi,

That file doesn't look like a legitimate BIB file to me. In particular, none of the English comments are preceded by any indication that they are a comment; they are just interspersed through the document.

It is legitimate.

See the BibTeX documentation and sample files, btxdoc.pdf and btxdoc.bib. They ship with most TeX/LaTeX distributions, and certainly with TeX Live.

For BibTeX, anything that is not inside an entry is a comment.

Indeed, the BibTeX documentation explicitly says: "If you want to comment out an entry, simply remove the ‘@’ character preceding the entry type."

Thanks, I didn't realize that free-form text was considered a legitimate BibTeX comment. One more reason to avoid BibTeX!

I'll see if we can modify our parser to accommodate this, but it may not be so easy to deal with things like unmatched braces in the free-form text.

Indeed, some of the original design choices in BibTeX turned out to be really problematic (7-bit encoding, a poor scripting language for bst files... and free-form text). Unfortunately, a very large number of pre-built, downloadable bib files in the sciences take advantage of the free-form comment style, so I think a BibTeX replacement should really handle them properly when working in BibTeX mode.

Rather than modifying the parser, another idea would be to make it a two-step process: first strip away everything that is not inside an entry, then process the entries with your current parser.

Stripping away everything that is not inside an entry is easy with a state machine. Suppose you have two states, ON and OFF, and a counter. Start in OFF. In the OFF state, when a character comes in, check whether it is an @. If it is, copy it and move to the ON state; otherwise drop it. In the ON state, copy every incoming character to the output. If the character is a {, increment the counter. If it is a }, decrement the counter, and if the counter reaches 0, move back to the OFF state.
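Here is a rough Python sketch of that state machine, just to make the idea concrete. The function name `strip_free_text` is made up for illustration and is not part of CrossTeX. It assumes entries are delimited by braces only (so `@string(...)` with parentheses is not handled), it ignores a stray } seen before any {, and it appends a newline after each entry so consecutive entries do not run together.

```python
def strip_free_text(source):
    """Keep only the text from each '@' through the brace closing its entry."""
    out = []
    inside = False  # False is the OFF state, True is the ON state
    depth = 0       # number of currently open braces
    for ch in source:
        if not inside:
            if ch == "@":
                # An '@' switches the machine ON; keep it for the real parser.
                inside = True
                depth = 0
                out.append(ch)
            # Anything else in OFF is free-form comment text and is dropped.
            continue
        out.append(ch)
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # The entry's outermost brace just closed: back to OFF.
                inside = False
                out.append("\n")
    return "".join(out)
```

Note that an @ inside an entry (say, an email address in a note field) is copied through untouched, because the machine is already ON when it sees it.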

This is a bit rough, since there may be @s in comments (in an email address, for example), but I think it could be good enough. The proper way would be to enter the ON state only on a sequence like "@word{", which requires buffering some input.
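A sketch of the stricter "@word{" variant, again only as an illustration: `ENTRY_START` and `extract_entries` are hypothetical names, the regular expression is my guess at what counts as an entry start (letters after the @, optional whitespace, then an opening brace), and an unbalanced entry simply stops the scan. It is still a heuristic, since comment text like "foo@bar {" would fool it.

```python
import re

# Assumed shape of an entry start: '@', a run of letters, optional space, '{'.
ENTRY_START = re.compile(r"@[A-Za-z]+\s*\{")

def extract_entries(source):
    """Return the entries found in source, skipping free-form text between them."""
    entries = []
    pos = 0
    while True:
        m = ENTRY_START.search(source, pos)
        if m is None:
            break
        depth = 1      # the '{' matched by ENTRY_START is already open
        i = m.end()
        while i < len(source) and depth > 0:
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break      # unbalanced braces: stop rather than guess
        entries.append(source[m.start():i])
        pos = i
    return entries
```

With this version, an email address in the comment text does not open an entry, while one inside a field is kept as part of the entry.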

Be careful not to mishandle @s that appear inside an entry. These must be preserved, since they may belong to an @ command or be part of an email address in a note. Pybibtex (a competing attempt at building a BibTeX replacement in Python) currently chokes on them.