Table of contents fails on international characters
jaimeiniesta opened this issue · 4 comments
Detected on https://rocketvalidator.com/s/4ec0b390-f775-4db1-87a4-ced8f5ece619/p/4686420/i
It looks like the tables of contents for lessons does not generate the headings when those contain non-English characters.
While in the legacy site it works fine:
https://elixirschool.com/es/lessons/specifics/mnesia/#inicialización-de-datos-y-migración
the beta site doesn't work OK:
https://beta.elixirschool.com/es/storage/mnesia
Headings containing non-English characters like "inicialización" are not being generated.
I took a look at this and the culprit appears to be the @headers_regex
in the SchoolHouse.Content.Lesson
module. The regex scans the body of the lesson and finds header tags to construct the TOC. Right now, the regex does not include accented characters in the main capture group.
I am not familiar with capturing accented characters in regex so I went to Stack Overflow of course and found this suggestion for matching all characters (accented and unaccented): A-Za-zÀ-ÖØ-öø-ÿ
. If we update the regex used in the Lesson
module, it would look like:
# current regex
@headers_regex ~r/<(h\d)>(["\w\s?!\.\/\d]+)(?=<\/\1>)/i
# proposed regex
@headers_regex ~r/<(h\d)>(["A-Za-zÀ-ÖØ-öø-ÿ\s?!\.\/\d]+)(?=<\/\1>)/i
I can go ahead and create a pr for this and add tests for this use case specifically, but before I do I want to see if there are any suggestions or improvements we can make to this regex update.
Sounds fine, here are some accented character I can think of, to include in your tests:
áéíóú
àèìòù
âêîôû
äëïöü
ñ ç
and in uppercase:
ÁÉÍÓÚ
ÀÈÌÒÙ
ÂÊÎÔÛ
ÄËÏÖÜ
Ñ Ç
Also, found this - there's a regex match for international characters!
Thanks for sharing that link, that's a much better approach. I tried using \p{Letter}
and got an error but I was able to find the solution pretty quickly: \p{L}