elixirschool/school_house

Table of contents fails on international characters

jaimeiniesta opened this issue · 4 comments

Detected on https://rocketvalidator.com/s/4ec0b390-f775-4db1-87a4-ced8f5ece619/p/4686420/i

It looks like the tables of contents for lessons does not generate the headings when those contain non-English characters.

While in the legacy site it works fine:

https://elixirschool.com/es/lessons/specifics/mnesia/#inicialización-de-datos-y-migración

the beta site doesn't work OK:

https://beta.elixirschool.com/es/storage/mnesia

Captura de pantalla 2021-07-29 a las 13 22 03

Headings containing non-English characters like "inicialización" are not being generated.

I took a look at this and the culprit appears to be the @headers_regex in the SchoolHouse.Content.Lesson module. The regex scans the body of the lesson and finds header tags to construct the TOC. Right now, the regex does not include accented characters in the main capture group.

I am not familiar with capturing accented characters in regex so I went to Stack Overflow of course and found this suggestion for matching all characters (accented and unaccented): A-Za-zÀ-ÖØ-öø-ÿ . If we update the regex used in the Lesson module, it would look like:

# current regex
@headers_regex ~r/<(h\d)>(["\w\s?!\.\/\d]+)(?=<\/\1>)/i

# proposed regex
@headers_regex ~r/<(h\d)>(["A-Za-zÀ-ÖØ-öø-ÿ\s?!\.\/\d]+)(?=<\/\1>)/i

I can go ahead and create a pr for this and add tests for this use case specifically, but before I do I want to see if there are any suggestions or improvements we can make to this regex update.

Sounds fine, here are some accented character I can think of, to include in your tests:

áéíóú
àèìòù
âêîôû
äëïöü
ñ ç

and in uppercase:

ÁÉÍÓÚ
ÀÈÌÒÙ
ÂÊÎÔÛ
ÄËÏÖÜ
Ñ Ç

Also, found this - there's a regex match for international characters!

https://davidwalsh.name/regex-accented-letters

Thanks for sharing that link, that's a much better approach. I tried using \p{Letter} and got an error but I was able to find the solution pretty quickly: \p{L}