Contributing a new Lua filter for Font styling and alignment

Question

Contributing a new Lua filter for Font styling and alignment

nandac opened this issue 3 years ago · 28 comments

Dear Folks,

I would like to contribute a new Lua filter to the community that is available here: https://github.com/nandac/fonts-and-alignment

This filter currently only works for LaTeX documents, but I will be making it work for HTML and ePub in the future.

As I am a beginner in Lua and LaTeX and would like to

Get a review of the Lua code.
Get suggestions on refactoring and improvement
Get some comments on if I should be splitting this into two different filters etc.

I hope this would not be too much of an imposition on the community. The filter is not very large at present.

Answer 1 · 2022-02-17T11:17:43.000Z

Thanks for this, that's a nice little filter. Below are a few small suggestions.

Whether an element is a Span or Div can be tested by checking elem.t (or elem.tag).
The filter uses for loops combined with if statements to find the correct field in a table. A more idiomatic way would be to use table indexing. E.g.,
```
local tag = elem.t
local code_for_class = LATEX_CODES_FOR_TAGS[tag]
```
It's not necessary to pass the tag to create_handler if the tag is taken from elem.t. The inner function could be named handler and then be used directly:
```
local function handler (elem)
  ...
end
Span = handler
Div = handler
```

Keeping everything in a single file seems like a good choice at this moment.

Hope this is helpful.

Answer 2 · 2022-02-17T15:58:43.000Z

This is very similar to a filter which I posted to the Pandoc mailing list recently https://v.gd/GQL7mL. You might want to look at how I did things. In particular:

It is probably a good idea to use \textsf and friends for spans whenever possible for spans, they being the LaTeX standard and set up to handle italic correction etc. automatically.
You might want to include alternative classes corresponding the two-letter Plain TeX font switch commands bf, it, rm, sc, sf, sl, tt. They are less intrusive and faster to type, and should be clear enough for most people targeting LaTeX (and you can include a table in the documentation! :-) Something like this to keep it DRY:

local abbrev = {
  bold      = 'bf',
  italic    = 'it',
  serif     = 'rm',
  smallcaps = 'sc',
  sans      = 'sf',
  slanted   = 'sl',
  monospace = 'tt',
}

for _,style in pairs(LATEX_CODE_FOR_TAGS) do
  for long, short in pairs(abbrev) do
    style[short] = style[long]
  end
end

You may want to include a shortcut bolditalic so that people won't have to type two classes for that combination, unless you include the short class names.
You might want to include the commands from ulem which AFAIK Pandoc's LaTeX template loads anyway. I use the following abbreviations for them BTW:

Command Short

\uline u

\uuline uu

\uwave uw

\sout so

\xout xo

\dashuline da/dau

\dotuline do/dou

I would as you see create the RawInline and RawBlock objects only once during initialization and reuse them to save resources, unless @tarleb thinks that that may have adverse effects:

end_span = pandoc.RawInline('latex', '}')
local span = LATEX_CODES.Span
for class,code in pairs(span) do
  code = pandoc.RawInline(code)
  span[class] = { code, end_span }
end
local div = LATEX_CODES.Div
for class,code1 in pairs(div) do
  local code2 = code1:gsub('begin','end')
  div[class] = {
    pandoc.RawBlock('latex', code1),
    pandoc.RawBlock('latex', code2),
  }
end

local function handler (elem)
  local style = LATEX_CODES[elem.tag]
  -- Unpack the stuff we need
  local class   = elem.classes
  local content = elem.content
  -- Looping over the classes in the element is more efficient!
  -- And we loop backwards so the commands come in the same order as the classes!
  for i=#class,1,-1 do
    if style[class[i]] then
      -- The content is a pandoc.List object
      content:insert(1, style[class][1])
      content:insert(style[class][2])
    end
  end
  elem.content = content
  return elem
end

Span = handler
Div  = handler

I hope this helps!

Answer 3 · 2022-02-17T16:15:24.000Z

@tarleb and @bpj thank you for your valuable comments.

@bpj I was using the code you posted on Pandoc discuss as a base as you were answering my question.

I value the opportunity to collaborate with both of you on this filter.

I will take up your suggestions and message you both once I am done.

I hope you won't mind giving me another review later.

Answer 4 · 2022-02-20T09:01:53.000Z

@tarleb and @bpj.

I have made improvements to the filter according to your suggestions above.

I am now predominantly using table indexes to access values and reduced the number of for loops considerably. Thank you so much for pointing out the capabilities of the elem object.
Inline elements now use the \text* LaTeX commands instead of the long hand commands.
Abbreviated classes are now supported.
Fleshed out the README and created specimen files of the output using the filter.

What I have not done are the following:

The addition of underlining capabilities has not been added because the Pandoc LaTeX template only conditionally adds the ulem package.

If the two of you don't mind giving my code another review I would appreciate it.

My main concern now is the maintainability of the LATEX_CODE_FOR_TAGS table. If I was using a language I am familiar with I would have broken it up into three tables. In Lua, I find that I would have to loop through three tables to find the correct class which I want to avoid as merging tables is a little involved.

In addition, I am open to suggestions for a better name for the filter that is more descriptive.

Thanks in advance.

Answer 5 · 2022-02-21T09:51:26.000Z

Nice, looks good to me!

I agree that the LATEX_CODE_FOR_TAGS is a bit verbose, but it's not that bad. However, if you want to remove duplication, how about doing something like

local latex_font_sizes = {
  xxsmall = 'scriptsize',
  xsmall = 'footnotesize',
  -- ...
}

for k, v in pairs(latex_font_sizes) do
  LATEX_CODES_FOR_TAGS.Span[k] = {'\\'.. v .. ' ', '}'}
  LATEX_CODES_FOR_TAGS.Div[k] = {'\\begin{'.. v .. '}', '\\end{' .. v .. '}'}
end

Answer 6 · 2022-02-21T14:53:28.000Z

Thanks, @tarleb for the code above, but considering the fact that the code is not the same for inlines and divs for LaTeX it might not be worth the work. I think I might end up with something even more unwieldy. So perhaps I should leave the code as it is right now.

I am now facing an issue where the expected output and the generated output do not match.

This only seems to occur when I specify more than one style for a piece of text for example:

::: {.sc .bf .it}
Some text
:::

The classes when accessed in the code do not seem to follow a set order as far as I can tell. Because the classes get applied in a different order each time the corresponding LaTeX code also gets applied in a different order, and as a result, the expected and actual do not match.

Is there a way to preserve the order of the classes and not have it change every time? I thought that because the .classes attribute is a list it would follow a set order.

I hope you will be able to enlighten me on why this is the case and what would be a good workaround.

Answer 7 · 2022-02-21T15:14:38.000Z

The reason for the non-determinism is that pairs does not guarantee a specific order when called on tables. Only ipairs ensures a certain order, but it only works with sequences.

The easiest would be to iterate over the elem.classes and then check if code_for_class[class] is non-nil.

Answer 8 · 2022-02-21T15:17:17.000Z

Aside: we do guarantee a certain order when using pairs on a pandoc.AttributeList element, thanks to implementing a custom __pairs metamethod. But that'd be overkill here.

Answer 9 · 2022-02-21T17:02:04.000Z

Thanks, @tarleb the first method works but I need some explanation on how iterating over elem.classes with pairs preserves the order of the classes.

I think I am missing something here conceptually and would like to understand the mechanism being used here.

I have uploaded the new code to the repository.

Answer 10 · 2022-02-21T17:37:38.000Z

The elem.classes table is a sequence, so we can iterate over it with the deterministic ipairs instead of pairs. I'm actually surprised that using pairs appears to be working as well.

Answer 11 · 2022-02-21T17:51:15.000Z

Ah @tarleb that makes much more sense now. I have changed pairs to ipairs and everything looks right.

What would be the next steps now?

Answer 12 · 2022-02-21T19:57:28.000Z

We still haven't decided how to deal with new filters in this repo.

You could announce the new filter on the pandoc-discuss mailing list, so others can start to use it.

Answer 13 · 2022-02-22T09:31:02.000Z

@tarleb Before I announced the filter I wanted to make the change regarding separating the large style list into three and I am happy to say that I have succeeded in that endeavor.

I am now happy with the filter code and would like to request one more review of the code.

I have a question about the "magical" elem variable is this something Pandoc gives us access to whenever we attach a handler to an element in the AST. I could not find much documentation about it as far as I could tell.

Answer 14 · 2022-02-22T09:52:49.000Z

Sorry, I currently don't have time for another detailed review.

Answer 15 · 2022-02-22T10:04:25.000Z

(I wrote this on Sunday but for some reason never got around to posting it. Anyway here it is...)

You really should loop through the classes of elem in reverse linear order and then look up the class in code_for_class rather than the other way around because

There likely are far fewer classes on elem than there are items in code_for_class, so the loop is much shorter.
Each call to elem.classes:includes(class_name) loops through the list of classes comparing each class to class_name, so its better to just do this loop once and avoid the outer loop entirely since table lookups are much more efficient than loops.
All the tables involved are static so there is no point in multiple repeated loops when lookups work.
It is much better to loop through classes, or its indices, in linear order than to loop through code_for_class in random order with pairs() because the order in which the commands appears in the LaTeX code becomes predictable.
It is even better to loop through the indices in reverse order, because then the commands appear in the LaTeX code in the same order as the classes appeared in the span/div attributes.

Consider for example this span:

[foo]{.sf .bf .sc}

If you loop through the classes in forward order the LaTeX code looks like this, with the commands in the opposite order to the classes in the markdown:

{\textsc{\textbf{\textsf{foo}}}}

If you loop through the classes indices in reverse order the LaTeX commands come in the order which the markdown author has all right to expect, and probably intends:

{\textsf{\textbf{\textsc{foo}}}}

but if you loop through code_for_class with pairs() the order of the commands will be unpredictable between runs, which is very bad because sometimes the order matters in LaTeX! so you should let the user decide it.

The reverse classes index loop may look scary, but it arguably gives the only right result, in addition to being more efficient. (A forward index loop may be even more efficient, but in this case gives wrong results!)

Therefore you should replace line 93 through 102 with the following:

local class = elem.classes
for i=#class, 1, -1 do
  if code_for_class[class[i]] then
    local begin_code = code_for_class[class[i]][1]
    local end_code = code_for_class[class[i]][2]
    table.insert(elem.content, 1, raw('latex', begin_code))
    if end_code then
      table.insert(elem.content, raw('latex', end_code))
    end
  end
end

Of course the assignment of elem.classes to a temporary variable class is optional, but it makes the rest of the code a lot easier to read and write, which is also important.

You might want to include the full gamut of LaTeX font style commands, since if some of them are available people will expect all of them to be available:

Class	Span	Div
`bf`	`\textbf{`	`\begin{bfseries}`
`em`	`\emph{`	`\begin{em}`
`it`	`\textit{`	`\begin{itshape}`
`lf`	`\textlf{`	`\begin{lfseries}`
`md`	`\textmd{`	`\begin{mdseries}`
`no`	`\textnormal{`	`\begin{normalfont}`
`rm`	`\textrm{`	`\begin{rmfamily}`
`sc`	`\textsc{`	`\begin{scshape}`
`sf`	`\textsf{`	`\begin{sffamily}`
`sl`	`\textsl{`	`\begin{slshape}`
`tt`	`\texttt{`	`\begin{ttfamily}`
`up`	`\textup{`	`\begin{upshape}`

As for the ulem commands why not include them but comment them out so that users who load ulem unconditionally in a custom LaTeX template can just uncomment them if they wish?

Answer 16 · 2022-02-22T10:15:21.000Z

I'll try to find time to do a review later today.

Answer 17 · 2022-02-22T11:04:33.000Z

Thanks, @bpj for your review. With the latest version of the code, I am looping through elem.classes because I fell into the pitfall you mentioned, however, I am not doing it in reverse order which I will add now.

The code has been modified quite a but now has fewer nested loops and uses table indexing where possible.

I am happy to add the underline style if there is a way for me to check if the ulem package has been added/enabled in the Lua filter. Having users uncomment code to enable a feature is not a great usability experience in my humble opinion.

As for enabling the other styles the one that I am most concerned about is \lfseries which is not adhered to by all fonts as some fonts define \lgseries or some other nonstandard LaTeX directive.

I will try and incorporate the others.

Answer 18 · 2022-02-22T15:43:02.000Z

@bpj I have added some of the styles but not all because I do not wish to add support for capabilities Pandoc Markdown already provides such as emph which can be done by two underscores.

The LIght font feature is non-standard and slanted is not too different from italic so I have made a value judgment on what might be useful.

I have added medium and normal fonts though.

I have uploaded the changes to GitHub so they can be reviewed

Answer 19 · 2022-02-24T16:05:19.000Z

First: this can work for Links as well as for Spans just by saying

latex_cmd_for_tags.Link = latex_cmd_for_tags.Span

At least in my opinion the first of these looks better than the second, and is probably more efficient in Pandoc:

[Link text](http://pandoc.org){.sc}
[[Link text]{.sc}](http://pandoc.org)

You could inject the \usepackage[normalem]{ulem} statement into the metadata header-includes and the ulem commands into the main class-to-command mapping table if the user requests it through a metadata or environment variable. Why conditionally? Because if the user uses the -H option or header-includes is set in variable space pandoc will ignore the header-includes in metadata space. (I certainly think that merging them would be better but it is probably too late to change that. Also you have to assume that if meta header-includes is already set and is not a List whatever value it has is appropriate to include in a List. That is not a particularly risky assumption to make, but given these what-ifs it is better to somehow let the user decide without having the user modifying the filter code.

So here is some code which will do the injects if a metadata field is set to an appropriate value ("true", "yes" or true) or if the metadata field is unset (is nil) checks if an environment variable is set to "true" or "yes". The environment variable can be overridden by setting the meta variable to "false", "no" or false. In fact just saying -M ulem_styles on the command line will set it to true.

-- Save some typing and check that we got the pandoc libraries
local p = assert(pandoc, "Cannot find the pandoc library")
if not ('table' == type(p)) then
  error("Expected variable pandoc to be table")
end
local u = assert(pandoc.utils, "Cannot find the pandoc.urils library")
local L = assert(pandoc.List, "Cannot get the pandoc.List class")

-- Stringify safely
local function stringify (val, accept_bool)
  -- First try
  local ok, res = pcall(u.stringify, val)
  if ok and res then
    return res
  end
  local vt = u.type(val)
  if ('string' == vt) or ('number' == vt) or (accept_bool and ('boolean' == vt)) then
    return tostring(val)
  else
    return error("Cannot stringify " .. vt)
  end
end

-- Convert appropriate values in meta or env to boolean
local bool_for = {
  ["true"]  = true,  -- string to bool
  ["false"] = false,
  yes = true
  no  = false
}
local function boolify (val)
  str = stringify(val, true):lower() -- case insensitive
  return bool_for[str]
end

-- Let the user request the ulem styles through metadata or environment
local function Meta (meta)
  -- Check if we have a meta value
  local want_ulem = meta.ulem_styles
  -- If not try the environment variable
  if nil == want_ulem then
    -- The environment variable is either unset or a string
    -- Default to a false boolean if unset
    want_ulem = os.getenv("PANDOC_ULEM_STYLES") or false
  end
  -- Try to "convert" it to a boolean
  want_ulem = boolify(want_ulem)
  -- Validate the result: if it was a boolean or an appropriate string
  -- it is now a boolean. If not it is now a nil, and probably a mistake.
  if nil == want_ulem then
    error("Expected meta.ulem_styles or environment variable to be 'true', 'false' or unset")
  end
  -- If user wants it make sure we have a list of header-includes
  if want_ulem then
    -- Get header-includes if any
    local includes = meta['header-includes']
    -- Default to a List
    includes = includes or L({ })
    -- If not a List make it one!
    if 'List' ~= u.type(includes) then
      includes = L({ includes })
    end
    -- Add the ulem use statement
    includes:insert(p.RawBlock('latex', "\\usepackage[normalem]{ulem}"))
    -- Make sure Pandoc gets our changes
    meta['header-includes'] = includes
    
    ---- Inject the ulem classes/commands into the main table ----
    -- First we make a mapping from commands to the multiple classes for each command
    local ulem_styles = {
      ["\\uline{"] = { "uline", "u" },
      ["\\uuline{"] = { "uuline", "uu" },
      ["\\uwave{"] = { "uwave", "uw" },
      -- ["\\sout{"] = { "sout", "so" },
      ["\\xout{"] = { "xout", "xo" },
      ["\\dashuline{"] = { "dashuline", "da", "dau" },
      ["\\dotuline{"] = { "dotuline", "do", "dou" }
    }
    -- Inject them into the main table by looping through each command
    -- and its classes and assign each command to all of its classes
    for cmd, classes in pairs(ulem_styles) do
      for _, class in ipairs(classes) do
            latex_cmd_for_tags.Span[class] = { cmd, '}' }
      end
    end
    -- Make sure Pandoc gets our changes
    return meta
  end
  -- No request, do nothing
  return nil
end

-- Run the metadata "filter" before the main filter by returning a list of
-- filters at the end of the file
return {
  { Meta = Meta },
  { Span = handler,
    -- Link = handler,
    Div  = handler
  }
}

Answer 20 · 2022-02-24T18:17:49.000Z

@bpj Thanks for the above. I will definitely look into incorporating the underline feature using the method you have suggested and see how far I get.

I have added support for more style and font sizes as per your previous suggestions and updated the code.

The only ones I am not adding are \lfseries because of lack of standardization and \Huge size because there is not a lot of difference between \huge and \Huge in a 12pt document.

The others have been incorporated and are working well.

I will get the underline things done and get in touch.

Answer 21 · 2022-02-28T06:28:43.000Z

@bpj @tarleb I have done my final commits for the filter code considering feature complete.

I have allowed the ulem_styles to be added using a metadata field but not through an environment variable as it is less explicit IMHO.

I would appreciate one more code review before announcing the filter on pandoc-discuss.

Answer 22 · 2022-06-07T01:59:14.000Z

@tarleb Has there been any consensus reached on how Pandoc Lua filters will be managed in the future?

I would like to contribute this filter publically if possible.

Answer 23 · 2022-06-10T21:06:40.000Z

Thank you for your patience @nandac. To be honest, I've put off writing an answer as I'm still not 100% sure about the best path. However, I currently strongly lean towards reducing the scope of this repo. Please publish the filter in a separate repository for the time being.

Feel free to let us know when you published it: I'll then "advertise" it on social media.

Answer 24 · 2022-06-10T23:54:13.000Z

@tarleb Thank you for your reply.

I have completed the work for the Lua filter for now and I know of at least one person who is using it in his projects.

The repository is here: https://github.com/nandac/fonts-and-alignment

The GitHub tests are not running now due to inactivity but the last commit passed all tests.

I did announce it on Pandoc discuss a few months back.

However, please feel free to advertise it on social media and hopefully, this filter can get more exposure.

Answer 25 · 2022-06-11T08:46:33.000Z

Tweet: https://twitter.com/pandoc_tips/status/1535543791888875520

Answer 26 · 2022-06-11T12:08:48.000Z

By the way, and completely off-topic: @nandac say hi for me to Jug and Magdalena at your place of work. I hope the math engine still does its job, I helped to build it.

Answer 27 · 2022-06-11T21:08:01.000Z

@tarleb I believe you are talking about Jugdip. He left the company about a month or two ago. I don't know Magdalena could you tell me her last name and I could say hi on your behalf.

When did build the math engine?
Did you work for a company that Chegg later acquired?

You have piqued my curiosity.

By the way, do you have a LinkedIn page? I would like to connect.

Answer 28 · 2022-06-12T06:42:50.000Z

Sad to hear that Jugdip left. He did great work, and I'm a big fan his management style.

I don't have LinkedIn, but I will send you an email!

Command	Short
`\uline`	`u`
`\uuline`	`uu`
`\uwave`	`uw`
`\sout`	`so`
`\xout`	`xo`
`\dashuline`	`da`/`dau`
`\dotuline`	`do`/`dou`