jgm/citeproc

Error in latex generation of portuguese quotes in CSLReferences field

Closed this issue · 15 comments

I'm getting an error with pandoc 2.19.2 and the --citeproc option when my document language is Portuguese (lang : pt-BR).
With lang : en it works fine and generates this:

\begin{CSLReferences}{1}{0}
\leavevmode\vadjust pre{\hypertarget{ref-autor2021}{}}%
LastName, First Name. 2021. {``Title''} Other things.

The {``Title''} looks correct, but when I switch to lang : pt-BR, it generates this instead: LastName, First Name. 2021. {"Title"} Other things.
This is wrong because of the {" and "} elements, and it produces this error:

Runaway argument?
! Paragraph ended before \language@active@arg" was complete.

My only workaround was to add this ugly post-pandoc script:

pandoc ... output.tex
csplit output.tex /begin{CSLReferences}/ '{*}'
sed -i 's/{"/{``/g' xx01
# double-quoted so that '' reaches sed intact (inside single quotes the shell would drop it)
sed -i "s/\"}/''}/g" xx01
cat xx00 xx01 > output.tex

That is, I split the generated .tex file in two and then replace {" and "} in the second part with their correct counterparts.
Is there any configuration that makes pandoc work with a non-English language in biblatex, without this fix?
Thanks!
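
The split-and-replace workaround above can also be sketched as a single Python function, which avoids the temporary xx00/xx01 files. This is only an illustration of the same idea, not anything pandoc provides; the function name fix_csl_quotes is hypothetical, and it assumes the bibliography starts at the first \begin{CSLReferences} and runs to the end of the file.

```python
# Hypothetical post-processing helper; not part of pandoc.
MARKER = r"\begin{CSLReferences}"


def fix_csl_quotes(tex: str) -> str:
    """Replace {" and "} with LaTeX quotes, but only after the CSLReferences marker."""
    head, sep, tail = tex.partition(MARKER)
    if not sep:
        # No bibliography block found: leave the document untouched.
        return tex
    # Same two substitutions as the sed commands, applied to the tail only.
    tail = tail.replace('{"', "{``").replace('"}', "''}")
    return head + sep + tail
```

Applied to the generated file, this could be used as `Path("output.tex").write_text(fix_csl_quotes(Path("output.tex").read_text(encoding="utf-8")), encoding="utf-8")` with `from pathlib import Path`.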

jgm commented

That's strange; I don't understand why it is doing this and not including the proper localized unicode characters; they are in the locale. I'll need to investigate.

Thanks for the reply. To make the investigation simpler, I'll share my setup, which luckily was built with Docker and the VS Code Dev Containers plugin, so it's 100% reproducible.

.devcontainer/devcontainer.json

{
	"name": "C++",
	"build": {
		"dockerfile": "Dockerfile"
	},
	"remoteUser": "root"
}

Dockerfile

FROM mcr.microsoft.com/devcontainers/cpp:0-debian-11

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
     && apt-get -y install \
        texlive texlive-base texlive-latex-extra texlive-latex-recommended \
        texlive-bibtex-extra biber \
        texlive-xetex texlive-fonts-extra \
        texlive-science \
        texlive-lang-portuguese
        
RUN wget https://github.com/jgm/pandoc/releases/download/2.19.2/pandoc-2.19.2-1-amd64.deb
RUN sudo dpkg -i pandoc-2.19.2-1-amd64.deb

RUN wget https://github.com/lierdakil/pandoc-crossref/releases/download/v0.3.13.0b/pandoc-crossref-Linux.tar.xz
RUN mv pandoc-crossref-Linux.tar.xz /usr/local/bin/
RUN (cd /usr/local/bin/ && tar xf pandoc-crossref-Linux.tar.xz)

I don't understand how this locale situation works, but this is the output of the locale command inside the container:

root@container $ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Pandoc source (.md):

---
link-citations: true
nocite: |
  @book_key, @article_key
linkcolor: teal
citecolor: teal
urlcolor: teal
cref : true
lang : pt-BR
title: Test
author:
- Igor Machado Coelho
date: 29/11/2021
linkReferences: true
nameInLink: true
toc-own-page: true
listings: true
codeBlockCaptions: True
...

# Introduction

Bla bla

Build script (based on this template: https://github.com/igormcoelho/pandoc-template-legrand-orange-book/blob/main/orangelegrand.latex ):

pandoc --listings -F pandoc-crossref --citeproc --template=orangelegrand.latex \
	  --top-level-division=part --bibliography bibliography.bib \
	  mytestbook.md -o livro.tex 
echo " => generated livro.tex"
echo "Fix to [htpb] in figures"
sed -i 's/begin{figure}$/begin{figure}\[htpb\]/g' livro.tex
echo "fix language"
# LINE=$(cat livro.tex | grep -ni "begin{CSLReferences}" | cut -d: -f1)
csplit livro.tex /begin{CSLReferences}/ '{*}'
sed -i 's/{"/{``/g' xx01
# double-quoted so that '' reaches sed intact (inside single quotes the shell would drop it)
sed -i "s/\"}/''}/g" xx01
#cat xx00 xx01 > livro.tex
echo "Converting with pdflatex x1"
pdflatex -interaction=nonstopmode livro > pdflatex1.log 
echo "Converting with pdflatex (makeindex)"
makeindex livro.idx -s StyleInd.ist
echo "Converting with pdflatex x2"
pdflatex -interaction=nonstopmode livro > pdflatex2.log
echo "Finished!"

jgm commented

Here's a repro with trypandoc

Try changing the lang to en-GB or fr-FR and you'll see the quotes change in an appropriate way.

By the way, this isn't LaTeX-specific -- this example has 'plain' output.

jgm commented

Mystery solved? I looked at chicago-author-date.csl, the style pandoc uses by default, and it contains overrides for Portuguese:

  <locale xml:lang="pt">
    <terms>
      <term name="editor" form="verb">editado por</term>
      <term name="editor" form="verb-short">ed.</term>
      <term name="container-author" form="verb">por</term>
      <term name="translator" form="verb-short">traduzido por</term>
      <term name="translator" form="short">trad.</term>
      <term name="editortranslator" form="verb">editado e traduzido por</term>
      <term name="and">e</term>
      <term name="no date" form="long">s.d</term>
      <term name="no date" form="short">s.d.</term>
      <term name="in">em</term>
      <term name="at">em</term>
      <term name="by">por</term>
      <!-- PUNCTUATION -->
      <term name="open-quote">"</term>
      <term name="close-quote">"</term>
    </terms>
  </locale>

I'm curious why these are here!

Removing the overrides for the quotes will fix this.
Or you may want to try a different style.
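
As a rough sketch of the first option, the quote overrides could be stripped from a local copy of the style with a few lines of Python. The function name strip_quote_overrides is hypothetical, and the sketch assumes the style uses the standard CSL namespace http://purl.org/net/xbiblio/csl. Note that the standard-library parser drops XML comments, so the output is not byte-identical to the input.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"
QUOTE_TERMS = {"open-quote", "close-quote"}


def strip_quote_overrides(csl_xml: str) -> str:
    """Return the CSL style with open-quote/close-quote term overrides removed."""
    # Keep CSL as the default namespace when serializing (no ns0: prefixes).
    ET.register_namespace("", CSL_NS)
    root = ET.fromstring(csl_xml)
    # Materialize the list first so the tree is not mutated while iterating.
    for terms in list(root.iter(f"{{{CSL_NS}}}terms")):
        for term in list(terms):
            if term.get("name") in QUOTE_TERMS:
                terms.remove(term)
    return ET.tostring(root, encoding="unicode")
```

The result can be written to a new file and passed to pandoc with --csl, so that the quotes fall back to the ones defined in the locale.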

Oh! Maybe someone "fixed" these incorrectly?
Now I see. It's interesting, because I wanted to change the style and didn't know how 😂
So I made my own fix by removing these quotes, and it works fine with --bibliography bibliography.bib --csl=chicago-author-date-fix-br.csl

Thanks a lot!

Indeed, someone really messed up 😂. The correct solution is supposed to be:

      <term name="open-quote">“</term>
      <term name="close-quote">”</term>

The correct one is here in this repo: https://github.com/citation-style-language/locales/blob/master/locales-pt-BR.xml#L163-L164
But chicago is wrong 😂 https://github.com/citation-style-language/styles/blob/master/chicago-author-date.csl#L68-L69

jgm commented

Yes, the default for citeproc is the "correct one," but someone altered the chicago-author-date style intentionally to override it. I have no idea why but perhaps if you search the git history of the styles repository, you'll find the answer.

Yep, I guess a change in February by a fellow Brazilian updated this during some translation work. I'll try to revert that part of the change there.

jgm commented

If your changes are accepted, let us know and we can use the new version in pandoc.

Ok @jgm , I think it may be fixed on that side ;)

By the way, I couldn't understand where the problem with " came from. Who translates CSL to LaTeX: is it pandoc or LaTeX itself? 🤔 And if it's pandoc, maybe a future strategy could be to detect straight " used as open and close quotes in the CSL and, if so, transform them into typographic quotes? This came up in another discussion there:
citation-style-language/styles#6317 (comment)

In any case, I really think the change in style there needs to be reverted, so no fix here will be needed, but I'm just trying to better understand how all these things work :)
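
To make the suggestion above concrete, here is a minimal sketch of what "ignore straight-quote overrides" could look like when a style's terms are merged over the locale's terms. It is purely illustrative and does not describe how pandoc or citeproc actually resolve terms; the function name merge_terms and the plain-dict representation are assumptions.

```python
# Hypothetical illustration; not how pandoc or citeproc resolve terms.
QUOTE_TERMS = {"open-quote", "close-quote", "open-inner-quote", "close-inner-quote"}
STRAIGHT_QUOTES = {'"', "'"}


def merge_terms(locale_terms: dict, style_overrides: dict) -> dict:
    """Merge style overrides over locale terms, skipping straight-quote overrides."""
    merged = dict(locale_terms)
    for name, value in style_overrides.items():
        if name in QUOTE_TERMS and value in STRAIGHT_QUOTES:
            # Keep the locale's typographic quote instead of the ASCII one.
            continue
        merged[name] = value
    return merged
```

With this rule, a style that sets open-quote to " would still get the locale's typographic quote, while a deliberate typographic override (for example «) would be respected.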

jgm commented

If citeproc says to use a ", pandoc will pass the " through to LaTeX.
At least with the current version of pandoc, it doesn't cause an error. I get the following with pdflatex:

[Screenshot: output with pdflatex]

and with xelatex:

[Screenshot: output with xelatex]

jgm commented

You probably need to adjust something in your custom template. Compare it with the default one in pandoc.

jgm commented

Let me know when the changes are accepted over there. So far the style has not changed.

Hi @jgm, the PR has been accepted at citation-style-language/styles#6317
The problematic lines have been removed, so pandoc can now pick up the same fix, if possible :)

Thanks a lot!