davidcarlisle/dpctex

Issues with mylatex.ltx and non-ASCII characters (since automatic "inputenc+utf8")

jfbu opened this issue · 7 comments

jfbu commented

It is difficult to get mylatex.ltx to work with filenames containing non-ASCII characters (with pdflatex), because those characters are active: the \everyjob execution, triggered when the first macro of mylatex.ltx is encountered while generating the preamble, gives non-ASCII characters their real inputenc+utf8 meanings.

(I have edited my initial wording, which was erroneous.)

$ etex -ini \&pdflatex mylatex.ltx ééé
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018) (INITEX)
 restricted \write18 enabled.
entering extended mode
(/usr/local/texlive/2018/texmf-dist/tex/latex/carlisle/mylatex.ltx
LaTeX2e <2018-04-01> patch level 5
) (/usr/local/texlive/2018/texmf-dist/tex/latex/tools/.tex File ignored)

! LaTeX Error: Missing \begin{document}.

See the LaTeX manual or LaTeX Companion for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
<*> &pdflatex mylatex.ltx é
                            éé
? X
No pages of output.
Transcript written on mylatex.log.
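
For context, the failure boils down to the UTF-8 lead bytes being active in current LaTeX formats. A quick probe (assumptions: pdfTeX with a 2018-04-01 or later format; 195 = 0xC3 is the first byte of é):

$ pdflatex '\typeout{\the\catcode195}\stop'

This prints 13, the catcode of active characters, rather than 12 (other), which is what the mylatex.ltx run above trips over.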

Very recently, commits have been pushed to AUCTeX, and the development version can now cope successfully with using mylatex.ltx to generate cached preambles even for filenames with non-ASCII characters and spaces (this concerns the pdftex engine only; xetex and luatex are of course not affected). Many years ago, a hack into \dump had already been put in place by the AUCTeX maintainer to cope with filenames containing (non-contiguous) spaces.

I am thus signalling this issue so that it perhaps gets documented, but I am not pushing too hard for a fix, as a fix would probably break the AUCTeX manoeuvres ;-)

davidcarlisle commented

Oh, most likely mylatex.ltx should (since the 2018-04-01 LaTeX release) restore the \everyjob settings that latex uses to read the command-line filenames safely. I'll look into it ...
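
A rough sketch of that idea (the name \savedeveryjob is made up here, and this is not the actual mylatex.ltx code): stash the format's \everyjob tokens before mylatex.ltx installs its own machinery, and put them back just before the dump, so a format dumped by mylatex.ltx reads its command line the same way the stock latex format does.

\newtoks\savedeveryjob
\savedeveryjob\expandafter{\the\everyjob}% stash LaTeX's command-line-reading code
% ... mylatex.ltx's preamble-capturing setup runs here ...
\everyjob\expandafter{\the\savedeveryjob}% restore before dumping the new format
\dump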

jfbu commented

The \everyjob settings in the LaTeX format (as recently updated by yourself) actually destroy the command-line filename safety:

$ pdflatex ééé

is OK, but

$ pdflatex \\input{ééé}

fails due to the \everyjob execution. (I am just clarifying this to myself; of course you know all this much better than I do.)

davidcarlisle commented

Well, without the \everyjob settings the first form wouldn't work either, so it isn't those settings that break \input{ééé} so much as the implicit utf8 handling. The \input form isn't so bad, as you can use \input{\detokenize{ééé}} (and we may see a way to make that work automatically at some point).
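
Spelled out at the shell (single quotes keep the shell from eating the backslashes):

$ pdflatex '\input{\detokenize{ééé}}'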

jfbu commented

Well, without the \everyjob settings the first form wouldn't work either

Would it make sense to delay the everyjob utf8-related activation (I mean by that the activation of the non-\string-ified actions), i.e. remove it from the format and transfer it to the LaTeX document classes? I suppose this puts a burden on third-party classes; on the other hand, users of those classes could still use inputenc in the preamble, and class maintainers could at some point adopt the LaTeX-team-provided code and announce to their users that they too can drop inputenc+utf8 from their preamble.
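
(For reference, the explicit opt-in such users would keep is just the pre-2018 idiom; someclass is a made-up class name:)

\documentclass{someclass}
\usepackage[utf8]{inputenc}% explicit opt-in, as before the 2018-04-01 release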

davidcarlisle commented

No, I think that would lead to massive fragmentation, as it would be impossible to know what is happening; there are thousands of university thesis classes lying on hard disks around the globe.

I am not sure what issue you are worried about. There have been almost no reported issues with the UTF-8 default handling, just the edge case of Windows using legacy file-system encodings rather than UTF-8, which was addressed in patch level 5.

jfbu commented

No, I don't have much of an issue, just a slight uneasiness about characters being active already on the command line. For example,

pdflatex \\input \\detokenize{é}#1.tex

works but is somewhat complicated, and the \detokenize could not have the # inside (like \write and \meaning, \detokenize doubles parameter tokens, so the # would come out as ##).
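
A minimal probe of that doubling (\tmp is a made-up name; any etex-based format will do):

\edef\tmp{\detokenize{é#1.tex}}
\show\tmp % shows something like macro:->é##1.tex, i.e. the # came out doubled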

You will ask: why \input? Well, not me, but AUCTeX used it by default (with braces, which here would raise other difficulties). Recent commits will, as far as I have followed, reduce the usage of \input.

In the context of a format created via mylatex.ltx, with its catcode regime and the fact that an active end-of-line is expected, it got a bit arduous to manage to apply \detokenize to a non-ASCII filename for a pdflatex run using this format (via the command line and &format, not first-line parsing). (In the past AUCTeX used no \input precisely in that case; and doing so is difficult with the \ being active and with \everyjob now being non-ASCII-unfriendly... but this "arduous" bit got solved in the current commits to the AUCTeX dev repo.) So this new situation creates trouble in certain specialized contexts.

Also, the whole way LaTeX activates non-ASCII characters (I am thinking of the LICR) looks like a legacy of the past. If we live in a UTF-8 world, why the LICR? If we need \unexpanded and \detokenize to handle the new situation, why not also use \protected in the definitions, so that non-ASCII characters expand to themselves when written to files and behave nicely in \edef with no extra precautions? I have not really thought this through, and I wrote it partly in jest; of course it borders on off-topic, but it was simply to express my feeling that LaTeX's UTF-8 handling is perhaps not as "modern" as it could be.
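
A one-character toy of what I mean (to be clear, this is nothing like LaTeX's actual UTF-8 definitions, just an illustration of the \protected behaviour):

\catcode`\~=13 % make ~ active, standing in for a UTF-8 byte
\protected\def~{(payload)}% \protected: the active character survives \edef
\edef\tmp{~} % \tmp's body is the active ~ itself, not (payload)
\show\tmp % shows: macro:->~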