prosegrinder/pandoc-templates

Add a filter so em dashes don't appear at the beginning of lines

tomheon opened this issue · 6 comments

By default, pandoc converts a markdown em dash (---) into an em dash that allows line breaks before and after it, which can produce text like this in the docx:

he looked through the door
--the old, wooden door--and saw...

I would like to contribute a filter that prevents this by inserting a zero-width, non-breaking character (U+2060 WORD JOINER) before the em dash, so you end up with

he looked through the door--
the old, wooden door--and saw...

or

he looked through the
door--the old, wooden door--and saw...

depending on the line lengths.

If you're interested, I can open a pull request; otherwise, feel free to use the filter, which is short enough to paste below.

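return {
  {
    Str = function(elem)
      if string.find(elem.text, "—") then
        -- gsub returns the new string and a match count; the extra
        -- parentheses keep only the string. "\u{2060}" is WORD JOINER.
        return pandoc.Str((string.gsub(elem.text, "—", "\u{2060}—")))
      else
        return elem
      end
    end,
  }
}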
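
For anyone adapting the filter, here is a minimal sketch of a variant that can safely run more than once on the same text. It is not the version from the pull request. The helper name `join_dashes` is hypothetical, and the code assumes Lua 5.3 or later for the `\u{...}` escapes. It also relies on pandoc collecting global functions named after element types when the script returns no filter table.

```lua
-- Sketch only: an idempotent take on the filter above.
local WORD_JOINER = "\u{2060}" -- U+2060 WORD JOINER
local EM_DASH = "\u{2014}"     -- U+2014 EM DASH

-- Hypothetical helper: make sure each em dash has a word joiner before it.
local function join_dashes(text)
  -- Remove a joiner that already sits before a dash, then add one back,
  -- so running the filter twice does not stack extra joiners.
  local stripped = text:gsub(WORD_JOINER .. EM_DASH, EM_DASH)
  -- Parentheses drop gsub's second return value (the match count).
  return (stripped:gsub(EM_DASH, WORD_JOINER .. EM_DASH))
end

-- Global function named after the element type, used by pandoc as a filter
-- when the script does not return a filter table.
function Str(elem)
  elem.text = join_dashes(elem.text)
  return elem
end
```

Saved under a hypothetical name such as `emdash.lua`, it could be applied with `pandoc --lua-filter emdash.lua input.md -o output.docx`.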

Thank you for this project--I was about to start working on something similar for my own needs, and this saved me a lot of time.

Love it! Thanks for opening an issue. I'm absolutely open to PRs, and this would be a welcome contribution. Let me know if you have questions or need anything.

Sweet, PR is up.

I am a bit confused about the problem. What is the intended formatting in the source and what is the undesired result? Are you writing poetry where the formatting must be kept? Is this an issue with non-breaking hyphens?

The EM DASH is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH. Because EM DASHes are sometimes used in pairs instead of a single quotation dash, the default behavior is not to break the line between even though not all fonts use connecting glyphs for the EM DASH.

Some languages, including Spanish, use EM DASH to set off a parenthetical, and the surrounding dashes should not be broken from the contained text. In this usage there is space on the side where it can be broken. This does not conflict with symmetrical usages, either with spaces on both sides of the em-dash or with no spaces.

Unicode Line Breaking Algorithm

I'm not sure this helps, or is even prescriptive, but it's the closest I could find to some "ruling" on breaking around em-dash.
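
To make the quoted behavior concrete, here is a rough sketch of where breaks around an em dash would be allowed, and how a WORD JOINER changes that. It is a deliberately simplified model and not an implementation of the Unicode Line Breaking Algorithm. The helper name `dash_breaks` is hypothetical, and it only looks at em dashes, word joiners, and plain spaces. It assumes Lua 5.3 or later for the `utf8` library.

```lua
-- Simplified model only; it ignores every rule not mentioned in the quote.
local EM_DASH = 0x2014
local WORD_JOINER = 0x2060
local SPACE = 0x20

-- Hypothetical helper: returns the 1-based code point positions i where a
-- line break is allowed between code point i and code point i + 1.
local function dash_breaks(text)
  local cps = {}
  for _, cp in utf8.codes(text) do
    cps[#cps + 1] = cp
  end
  local breaks = {}
  for i = 1, #cps - 1 do
    local a, b = cps[i], cps[i + 1]
    local touches_dash = (a == EM_DASH or b == EM_DASH)
    local joined = (a == WORD_JOINER or b == WORD_JOINER) -- joiner forbids breaks
    local dash_pair = (a == EM_DASH and b == EM_DASH)     -- paired dashes stay together
    local before_space = (b == SPACE)                     -- never break before a space
    if touches_dash and not joined and not dash_pair and not before_space then
      breaks[#breaks + 1] = i
    end
  end
  return breaks
end
```

Under this model, a bare dash between two letters can break on either side, while a dash preceded by a word joiner can only break after it, which is the effect the filter is after.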

@tomheon - sorry for the super late follow-up. Looks like you closed out your PR but left this issue open. If this isn't needed anymore, would you mind closing the issue? If it is still needed, can you provide any updates? Thank you!

Can do!