Contributing a new Lua filter for Font styling and alignment
nandac opened this issue · 28 comments
Dear Folks,
I would like to contribute a new Lua filter to the community that is available here: https://github.com/nandac/fonts-and-alignment
This filter currently only works for LaTeX documents, but I will be making it work for HTML and ePub in the future.
As I am a beginner in Lua and LaTeX and would like to
- Get a review of the Lua code.
- Get suggestions on refactoring and improvement
- Get some comments on if I should be splitting this into two different filters etc.
I hope this would not be too much of an imposition on the community. The filter is not very large at present.
Thanks for this, that's a nice little filter. Below are a few small suggestions.
- Whether an element is a Span or Div can be tested by checking
elem.t
(orelem.tag
). - The filter uses
for
loops combined withif
statements to find the correct field in a table. A more idiomatic way would be to use table indexing. E.g.,local tag = elem.t local code_for_class = LATEX_CODES_FOR_TAGS[tag]
- It's not necessary to pass the tag to
create_handler
if the tag is taken fromelem.t
. The inner function could be namedhandler
and then be used directly:local function handler (elem) ... end Span = handler Div = handler
Keeping everything in a single file seems like a good choice at this moment.
Hope this is helpful.
This is very similar to a filter which I posted to the Pandoc mailing list recently https://v.gd/GQL7mL. You might want to look at how I did things. In particular:
-
It is probably a good idea to use
\textsf
and friends for spans whenever possible for spans, they being the LaTeX standard and set up to handle italic correction etc. automatically. -
You might want to include alternative classes corresponding the two-letter Plain TeX font switch commands
bf
,it
,rm
,sc
,sf
,sl
,tt
. They are less intrusive and faster to type, and should be clear enough for most people targeting LaTeX (and you can include a table in the documentation! :-) Something like this to keep it DRY:
local abbrev = {
bold = 'bf',
italic = 'it',
serif = 'rm',
smallcaps = 'sc',
sans = 'sf',
slanted = 'sl',
monospace = 'tt',
}
for _,style in pairs(LATEX_CODE_FOR_TAGS) do
for long, short in pairs(abbrev) do
style[short] = style[long]
end
end
-
You may want to include a shortcut
bolditalic
so that people won't have to type two classes for that combination, unless you include the short class names. -
You might want to include the commands from ulem which AFAIK Pandoc's LaTeX template loads anyway. I use the following abbreviations for them BTW:
Command Short \uline
u
\uuline
uu
\uwave
uw
\sout
so
\xout
xo
\dashuline
da
/dau
\dotuline
do
/dou
I would as you see create the RawInline and RawBlock objects only once during initialization and reuse them to save resources, unless @tarleb thinks that that may have adverse effects:
end_span = pandoc.RawInline('latex', '}')
local span = LATEX_CODES.Span
for class,code in pairs(span) do
code = pandoc.RawInline(code)
span[class] = { code, end_span }
end
local div = LATEX_CODES.Div
for class,code1 in pairs(div) do
local code2 = code1:gsub('begin','end')
div[class] = {
pandoc.RawBlock('latex', code1),
pandoc.RawBlock('latex', code2),
}
end
local function handler (elem)
local style = LATEX_CODES[elem.tag]
-- Unpack the stuff we need
local class = elem.classes
local content = elem.content
-- Looping over the classes in the element is more efficient!
-- And we loop backwards so the commands come in the same order as the classes!
for i=#class,1,-1 do
if style[class[i]] then
-- The content is a pandoc.List object
content:insert(1, style[class][1])
content:insert(style[class][2])
end
end
elem.content = content
return elem
end
Span = handler
Div = handler
I hope this helps!
@tarleb and @bpj thank you for your valuable comments.
@bpj I was using the code you posted on Pandoc discuss as a base as you were answering my question.
I value the opportunity to collaborate with both of you on this filter.
I will take up your suggestions and message you both once I am done.
I hope you won't mind giving me another review later.
I have made improvements to the filter according to your suggestions above.
- I am now predominantly using table indexes to access values and reduced the number of for loops considerably. Thank you so much for pointing out the capabilities of the
elem
object. - Inline elements now use the
\text*
LaTeX commands instead of the long hand commands. - Abbreviated classes are now supported.
- Fleshed out the README and created specimen files of the output using the filter.
What I have not done are the following:
- The addition of underlining capabilities has not been added because the Pandoc LaTeX template only conditionally adds the
ulem
package.
If the two of you don't mind giving my code another review I would appreciate it.
My main concern now is the maintainability of the LATEX_CODE_FOR_TAGS table. If I was using a language I am familiar with I would have broken it up into three tables. In Lua, I find that I would have to loop through three tables to find the correct class which I want to avoid as merging tables is a little involved.
In addition, I am open to suggestions for a better name for the filter that is more descriptive.
Thanks in advance.
Nice, looks good to me!
I agree that the LATEX_CODE_FOR_TAGS
is a bit verbose, but it's not that bad. However, if you want to remove duplication, how about doing something like
local latex_font_sizes = {
xxsmall = 'scriptsize',
xsmall = 'footnotesize',
-- ...
}
for k, v in pairs(latex_font_sizes) do
LATEX_CODES_FOR_TAGS.Span[k] = {'\\'.. v .. ' ', '}'}
LATEX_CODES_FOR_TAGS.Div[k] = {'\\begin{'.. v .. '}', '\\end{' .. v .. '}'}
end
Thanks, @tarleb for the code above, but considering the fact that the code is not the same for inlines and divs for LaTeX it might not be worth the work. I think I might end up with something even more unwieldy. So perhaps I should leave the code as it is right now.
I am now facing an issue where the expected output and the generated output do not match.
This only seems to occur when I specify more than one style for a piece of text for example:
::: {.sc .bf .it}
Some text
:::
The classes when accessed in the code do not seem to follow a set order as far as I can tell. Because the classes get applied in a different order each time the corresponding LaTeX code also gets applied in a different order, and as a result, the expected and actual do not match.
Is there a way to preserve the order of the classes and not have it change every time? I thought that because the .classes
attribute is a list it would follow a set order.
I hope you will be able to enlighten me on why this is the case and what would be a good workaround.
The reason for the non-determinism is that pairs
does not guarantee a specific order when called on tables. Only ipairs
ensures a certain order, but it only works with sequences.
The easiest would be to iterate over the elem.classes
and then check if code_for_class[class]
is non-nil.
Aside: we do guarantee a certain order when using pairs
on a pandoc.AttributeList
element, thanks to implementing a custom __pairs
metamethod. But that'd be overkill here.
Thanks, @tarleb the first method works but I need some explanation on how iterating over elem.classes
with pairs
preserves the order of the classes.
I think I am missing something here conceptually and would like to understand the mechanism being used here.
I have uploaded the new code to the repository.
The elem.classes
table is a sequence, so we can iterate over it with the deterministic ipairs
instead of pairs
. I'm actually surprised that using pairs
appears to be working as well.
Ah @tarleb that makes much more sense now. I have changed pairs
to ipairs
and everything looks right.
What would be the next steps now?
We still haven't decided how to deal with new filters in this repo.
You could announce the new filter on the pandoc-discuss mailing list, so others can start to use it.
@tarleb Before I announced the filter I wanted to make the change regarding separating the large style list into three and I am happy to say that I have succeeded in that endeavor.
I am now happy with the filter code and would like to request one more review of the code.
I have a question about the "magical" elem
variable is this something Pandoc gives us access to whenever we attach a handler to an element in the AST. I could not find much documentation about it as far as I could tell.
Sorry, I currently don't have time for another detailed review.
(I wrote this on Sunday but for some reason never got around to posting it. Anyway here it is...)
You really should loop through the classes of elem
in reverse linear order and then look up the class in code_for_class
rather than the other way around because
- There likely are far fewer classes on
elem
than there are items incode_for_class
, so the loop is much shorter. - Each call to
elem.classes:includes(class_name)
loops through the list of classes comparing each class toclass_name
, so its better to just do this loop once and avoid the outer loop entirely since table lookups are much more efficient than loops. - All the tables involved are static so there is no point in multiple repeated loops when lookups work.
- It is much better to loop through
classes
, or its indices, in linear order than to loop throughcode_for_class
in random order withpairs()
because the order in which the commands appears in the LaTeX code becomes predictable. - It is even better to loop through the indices in reverse order, because then the commands appear in the LaTeX code in the same order as the classes appeared in the span/div attributes.
Consider for example this span:
[foo]{.sf .bf .sc}
If you loop through the classes in forward order the LaTeX code looks like this, with the commands in the opposite order to the classes in the markdown:
{\textsc{\textbf{\textsf{foo}}}}
If you loop through the classes
indices in reverse order the LaTeX commands come in the order which the markdown author has all right to expect, and probably intends:
{\textsf{\textbf{\textsc{foo}}}}
but if you loop through code_for_class
with pairs()
the order of the commands will be unpredictable between runs, which is very bad because sometimes the order matters in LaTeX! so you should let the user decide it.
The reverse classes
index loop may look scary, but it arguably gives the only right result, in addition to being more efficient. (A forward index loop may be even more efficient, but in this case gives wrong results!)
Therefore you should replace line 93 through 102 with the following:
local class = elem.classes
for i=#class, 1, -1 do
if code_for_class[class[i]] then
local begin_code = code_for_class[class[i]][1]
local end_code = code_for_class[class[i]][2]
table.insert(elem.content, 1, raw('latex', begin_code))
if end_code then
table.insert(elem.content, raw('latex', end_code))
end
end
end
Of course the assignment of elem.classes
to a temporary variable class
is optional, but it makes the rest of the code a lot easier to read and write, which is also important.
You might want to include the full gamut of LaTeX font style commands, since if some of them are available people will expect all of them to be available:
Class | Span | Div |
---|---|---|
bf |
\textbf{ |
\begin{bfseries} |
em |
\emph{ |
\begin{em} |
it |
\textit{ |
\begin{itshape} |
lf |
\textlf{ |
\begin{lfseries} |
md |
\textmd{ |
\begin{mdseries} |
no |
\textnormal{ |
\begin{normalfont} |
rm |
\textrm{ |
\begin{rmfamily} |
sc |
\textsc{ |
\begin{scshape} |
sf |
\textsf{ |
\begin{sffamily} |
sl |
\textsl{ |
\begin{slshape} |
tt |
\texttt{ |
\begin{ttfamily} |
up |
\textup{ |
\begin{upshape} |
As for the ulem commands why not include them but comment them out so that users who load ulem unconditionally in a custom LaTeX template can just uncomment them if they wish?
I'll try to find time to do a review later today.
Thanks, @bpj for your review. With the latest version of the code, I am looping through elem.classes
because I fell into the pitfall you mentioned, however, I am not doing it in reverse order which I will add now.
The code has been modified quite a but now has fewer nested loops and uses table indexing where possible.
I am happy to add the underline style if there is a way for me to check if the ulem
package has been added/enabled in the Lua filter. Having users uncomment code to enable a feature is not a great usability experience in my humble opinion.
As for enabling the other styles the one that I am most concerned about is \lfseries
which is not adhered to by all fonts as some fonts define \lgseries
or some other nonstandard LaTeX directive.
I will try and incorporate the others.
@bpj I have added some of the styles but not all because I do not wish to add support for capabilities Pandoc Markdown already provides such as emph
which can be done by two underscores.
The LIght font feature is non-standard and slanted is not too different from italic so I have made a value judgment on what might be useful.
I have added medium and normal fonts though.
I have uploaded the changes to GitHub so they can be reviewed
First: this can work for Links as well as for Spans just by saying
latex_cmd_for_tags.Link = latex_cmd_for_tags.Span
At least in my opinion the first of these looks better than the second, and is probably more efficient in Pandoc:
[Link text](http://pandoc.org){.sc}
[[Link text]{.sc}](http://pandoc.org)
You could inject the \usepackage[normalem]{ulem}
statement into the metadata header-includes and the ulem commands into the main class-to-command mapping table if the user requests it through a metadata or environment variable. Why conditionally? Because if the user uses the -H
option or header-includes
is set in variable space pandoc will ignore the header-includes
in metadata space. (I certainly think that merging them would be better but it is probably too late to change that. Also you have to assume that if meta header-includes
is already set and is not a List whatever value it has is appropriate to include in a List. That is not a particularly risky assumption to make, but given these what-ifs it is better to somehow let the user decide without having the user modifying the filter code.
So here is some code which will do the injects if a metadata field is set to an appropriate value ("true", "yes" or true
) or if the metadata field is unset (is nil
) checks if an environment variable is set to "true" or "yes". The environment variable can be overridden by setting the meta variable to "false", "no" or false
. In fact just saying -M ulem_styles
on the command line will set it to true.
-- Save some typing and check that we got the pandoc libraries
local p = assert(pandoc, "Cannot find the pandoc library")
if not ('table' == type(p)) then
error("Expected variable pandoc to be table")
end
local u = assert(pandoc.utils, "Cannot find the pandoc.urils library")
local L = assert(pandoc.List, "Cannot get the pandoc.List class")
-- Stringify safely
local function stringify (val, accept_bool)
-- First try
local ok, res = pcall(u.stringify, val)
if ok and res then
return res
end
local vt = u.type(val)
if ('string' == vt) or ('number' == vt) or (accept_bool and ('boolean' == vt)) then
return tostring(val)
else
return error("Cannot stringify " .. vt)
end
end
-- Convert appropriate values in meta or env to boolean
local bool_for = {
["true"] = true, -- string to bool
["false"] = false,
yes = true
no = false
}
local function boolify (val)
str = stringify(val, true):lower() -- case insensitive
return bool_for[str]
end
-- Let the user request the ulem styles through metadata or environment
local function Meta (meta)
-- Check if we have a meta value
local want_ulem = meta.ulem_styles
-- If not try the environment variable
if nil == want_ulem then
-- The environment variable is either unset or a string
-- Default to a false boolean if unset
want_ulem = os.getenv("PANDOC_ULEM_STYLES") or false
end
-- Try to "convert" it to a boolean
want_ulem = boolify(want_ulem)
-- Validate the result: if it was a boolean or an appropriate string
-- it is now a boolean. If not it is now a nil, and probably a mistake.
if nil == want_ulem then
error("Expected meta.ulem_styles or environment variable to be 'true', 'false' or unset")
end
-- If user wants it make sure we have a list of header-includes
if want_ulem then
-- Get header-includes if any
local includes = meta['header-includes']
-- Default to a List
includes = includes or L({ })
-- If not a List make it one!
if 'List' ~= u.type(includes) then
includes = L({ includes })
end
-- Add the ulem use statement
includes:insert(p.RawBlock('latex', "\\usepackage[normalem]{ulem}"))
-- Make sure Pandoc gets our changes
meta['header-includes'] = includes
---- Inject the ulem classes/commands into the main table ----
-- First we make a mapping from commands to the multiple classes for each command
local ulem_styles = {
["\\uline{"] = { "uline", "u" },
["\\uuline{"] = { "uuline", "uu" },
["\\uwave{"] = { "uwave", "uw" },
-- ["\\sout{"] = { "sout", "so" },
["\\xout{"] = { "xout", "xo" },
["\\dashuline{"] = { "dashuline", "da", "dau" },
["\\dotuline{"] = { "dotuline", "do", "dou" }
}
-- Inject them into the main table by looping through each command
-- and its classes and assign each command to all of its classes
for cmd, classes in pairs(ulem_styles) do
for _, class in ipairs(classes) do
latex_cmd_for_tags.Span[class] = { cmd, '}' }
end
end
-- Make sure Pandoc gets our changes
return meta
end
-- No request, do nothing
return nil
end
-- Run the metadata "filter" before the main filter by returning a list of
-- filters at the end of the file
return {
{ Meta = Meta },
{ Span = handler,
-- Link = handler,
Div = handler
}
}
@bpj Thanks for the above. I will definitely look into incorporating the underline feature using the method you have suggested and see how far I get.
I have added support for more style and font sizes as per your previous suggestions and updated the code.
The only ones I am not adding are \lfseries
because of lack of standardization and \Huge
size because there is not a lot of difference between \huge and \Huge in a 12pt document.
The others have been incorporated and are working well.
I will get the underline things done and get in touch.
@bpj @tarleb I have done my final commits for the filter code considering feature complete.
I have allowed the ulem_styles
to be added using a metadata field but not through an environment variable as it is less explicit IMHO.
I would appreciate one more code review before announcing the filter on pandoc-discuss.
@tarleb Has there been any consensus reached on how Pandoc Lua filters will be managed in the future?
I would like to contribute this filter publically if possible.
Thank you for your patience @nandac. To be honest, I've put off writing an answer as I'm still not 100% sure about the best path. However, I currently strongly lean towards reducing the scope of this repo. Please publish the filter in a separate repository for the time being.
Feel free to let us know when you published it: I'll then "advertise" it on social media.
@tarleb Thank you for your reply.
I have completed the work for the Lua filter for now and I know of at least one person who is using it in his projects.
The repository is here: https://github.com/nandac/fonts-and-alignment
The GitHub tests are not running now due to inactivity but the last commit passed all tests.
I did announce it on Pandoc discuss a few months back.
However, please feel free to advertise it on social media and hopefully, this filter can get more exposure.
By the way, and completely off-topic: @nandac say hi for me to Jug and Magdalena at your place of work. I hope the math engine still does its job, I helped to build it.
@tarleb I believe you are talking about Jugdip. He left the company about a month or two ago. I don't know Magdalena could you tell me her last name and I could say hi on your behalf.
When did build the math engine?
Did you work for a company that Chegg later acquired?
You have piqued my curiosity.
By the way, do you have a LinkedIn page? I would like to connect.
Sad to hear that Jugdip left. He did great work, and I'm a big fan his management style.
I don't have LinkedIn, but I will send you an email!