brailcom/speechd

speechd should respect locale, number format

Closed this issue · 7 comments

if one tries the following, all but the first two are spoken incorrectly:

spd-say --wait "this one is with gaps: 1234444.2451, this one with american style: 1,234,444.245,1, 3 further examples using international varieties: recent german, 1.234.444,245.1, arabic separators: 1٬234٬444.245٬1 (UTF-8 taken from here: https://www.compart.com/en/unicode/U+066C), old german handwriting: 1'234'444,245'1, using quote as the arabic separator is difficult to type, especially with currencies also written with dot, 1'234'444.245'1 (this prevailed in switzerland, and many programming languages). modern programming languages like kotlin also allow 1_234_444.245_1."

❯ locale
LANG=en_US.UTF-8
LC_NUMERIC=de_CH.UTF-8
...

Obtained behavior

speechd only knows the dot as decimal separator and the comma as thousands separator.

Expected behavior

speechd should have three modes:

  1. when LC_NUMERIC is "en", it should know six thousands separators: arabic thousands separator, thin space, single quote, apostrophe, underscore, comma. it should know two decimal separators: dot, arabic decimal separator.
  2. when LC_NUMERIC is not "en", it should know six thousands separators: arabic thousands separator, thin space, single quote, apostrophe, underscore, dot. it should know two decimal separators: comma, arabic decimal separator.
  3. when LC_NUMERIC is canada, luxembourg, peru or switzerland, it should know five thousands separators: arabic thousands separator, thin space, single quote, apostrophe, underscore. it should accept any one of three decimal separators: comma, dot, arabic decimal separator.

the options are taken from digit grouping
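The three modes proposed above can be sketched as a lookup from an LC_NUMERIC value to two sets of characters. This is only an illustration of the proposal, not anything speechd does: the helper name `separators_for`, the reading of "single quote, apostrophe" as `'` and U+2019, and the rule that the region check wins over the language check (so `en_CA` lands in mode 3) are all assumptions.

```python
# Hypothetical helper illustrating the proposed three modes; not part of speechd.
ARABIC_THOUSANDS = "\u066C"  # ARABIC THOUSANDS SEPARATOR
ARABIC_DECIMAL = "\u066B"    # ARABIC DECIMAL SEPARATOR
THIN_SPACE = "\u2009"

# Assumption: "single quote, apostrophe" means ' and U+2019.
SHARED_GROUPING = frozenset({ARABIC_THOUSANDS, THIN_SPACE, "'", "\u2019", "_"})

# Regions named in mode 3: canada, luxembourg, peru, switzerland.
MIXED_REGIONS = frozenset({"CA", "LU", "PE", "CH"})


def separators_for(lc_numeric):
    """Return (grouping_separators, decimal_separators) for a locale name."""
    # "de_CH.UTF-8@euro" -> "de_CH"
    name = lc_numeric.split(".", 1)[0].split("@", 1)[0]
    lang, _, region = name.partition("_")
    if region.upper() in MIXED_REGIONS:
        # Mode 3: comma and dot are both decimal separators, neither groups.
        return SHARED_GROUPING, frozenset({",", ".", ARABIC_DECIMAL})
    if lang.lower() == "en":
        # Mode 1: comma groups, dot is the decimal separator.
        return SHARED_GROUPING | {","}, frozenset({".", ARABIC_DECIMAL})
    # Mode 2: dot groups, comma is the decimal separator.
    return SHARED_GROUPING | {"."}, frozenset({",", ARABIC_DECIMAL})
```

With the locale from this report, `separators_for("de_CH.UTF-8")` returns the five shared grouping characters and treats both comma and dot as decimal separators. Names such as `C` or `POSIX` fall through to mode 2 in this sketch, which is probably not what one would want in practice.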

Mmmm, I don't think we can actually support that: the usual knob that we can tune in speech synthesizers is only the language, and not details of the processing... Which means we can only support LANG, and not LC_* values that differ from it. What you can do, however, is use SSML tagging within the text to change the language on the fly.
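As a rough illustration of the SSML suggestion, the sketch below builds an SSML string in which one fragment is marked as a different language. The helper names `lang_span` and `speak` are made up for this example, and it assumes the synthesizer honors `xml:lang` on a `<voice>` element (SSML 1.0 defines that attribute, but support varies between synthesizers).

```python
from xml.sax.saxutils import escape, quoteattr


def lang_span(text, lang):
    """Wrap text in a <voice> element carrying an xml:lang attribute."""
    return f"<voice xml:lang={quoteattr(lang)}>{escape(text)}</voice>"


def speak(*fragments):
    """Join already-escaped fragments into one SSML document."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(
    escape("The agglomeration has "),
    lang_span("1,4", "de"),  # ask for this fragment to be read as German
    escape(" million inhabitants."),
)
```

The resulting string would then be sent in SSML mode, for example through the SSML option of spd-say (`-x` / `--ssml` in the versions I am aware of; check `spd-say --help`). Note that this only switches the language; how a number is then read is still decided by the synthesizer.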

@sthibaul speechd is wrong not only for LC_NUMERIC, but also for LANG. set:

❯ set LANG de_CH.UTF-8
❯ spd-say "1'130,01"
spoken: eins einhundertdreissig komma null eins. wrong, should be: eintausend einhundertdreissig komma null eins

but LANG is not something that can be used here anyway. take this example: english is a global language, so many sites use it, but use of course the local formatting, and write: "Zürich is the largest city in Switzerland with a population of over 428'700, an increase of 19'500 since year 2000. 1,4 million people live in Zürich agglomeration." (from facts and figures)

if we switch the language over to german, it would speak: "population of over vierhundertachtundzwanzig siebenhundert an increase .." which is the same error as in english, but in german. the same holds for swiss french.

so, if you do not want to take into account LC_NUMERIC for number formatting, what about considering at least undisputed ones, independent of language and formatting?

  • digit grouping, in addition to the ones you currently have: thin space, arabic thousands separator, quote, apostrophe, underscore.
  • decimal separator, in addition to the one you currently have: arabic decimal separator.

this should already fix many edge cases. the real challenge, i can well imagine, is the inversion of comma and dot depending on the region. that creates headaches.
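A minimal sketch of what handling only the "undisputed" separators listed above could look like as a text preprocessing step. It is not how speechd or any synthesizer works; the function name `normalize_numbers` is hypothetical, and so is the decision to treat a separator as digit grouping whenever it sits directly between two digits. Comma and dot are deliberately left untouched.

```python
import re

# Grouping characters from the list above: thin space (U+2009), arabic
# thousands separator (U+066C), quote ('), apostrophe (U+2019), underscore.
_GROUPING = re.compile(r"(?<=\d)[\u2009\u066C'\u2019_](?=\d)")

# Arabic decimal separator (U+066B) between two digits.
_ARABIC_DECIMAL = re.compile(r"(?<=\d)\u066B(?=\d)")


def normalize_numbers(text):
    """Drop grouping characters between digits; map U+066B to a dot."""
    text = _GROUPING.sub("", text)
    return _ARABIC_DECIMAL.sub(".", text)
```

For example, `normalize_numbers("a population of over 428'700")` gives `"a population of over 428700"`. The same rule would also merge `5'10` (feet and inches) into `510`, which shows how easily such rewriting changes the meaning of the text.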

> many sites use it, but use of course the local formatting

That doesn't seem very common to me.

> if you do not want to take into account LC_NUMERIC for number formatting

The problem is not that I don't want to take it into account, but that speech synthesizers don't provide the interface to do so.

> considering at least undisputed ones

It's the synthesizers which implement this, so that is where this suggestion should be reported.

> It's the synthesizers which implement this, so that is where this suggestion should be reported.

oh, really? is this what i have installed synthesizers? what would speechd then do, it has no influence on the text, like it can (should) not remove digit groupings?

❯ paru -Q | grep espe
espeak-ng 1.51.1-2
espeakup 0.90-2

> is this what i have installed synthesizers

Yes, espeak-ng is a synthesizer.

> it has no influence on the text, like it can (should) not remove digit groupings?

It'd be very fragile to do something about digit grouping, since synthesizers may have different behaviors in different languages. I would be terribly cautious with trying to mangle the content.