allo-media/text2num

"relaxed" for German - expected behavior?

fquirin opened this issue · 14 comments

Hi Romuald,

I'm working on some bug fixes, tweaks and ordinal support for German and was wondering what to expect from the "relaxed" setting. So far I have found only one case (across all languages) that actually makes use of it and, as far as I understand, it converts "quatre vingt" into "quatre-vingt" (or treats them equally).

In German we have "zweiundzwanzig" (22) and have to split this internally into "zwei und zwanzig" (2 and 20) for parsing. By default we accept "zwei und zwanzig" in any situation. Should the "relaxed" option allow or disallow that? (It will be hard to handle, so I'm just asking for now ^^).

@fquirin There is one more point: ordinals like 'erste' (which means 1.) or 'zweiundzwanzigste' (which means 22.). This issue still persists; they are not recognized and not converted to digits.

Hi @Tortoise17
This should be fixed by the pending pull request #60 from earlier today 🙂 . I've added ordinal support ('erste', 'zweite', ...) and exceptions for "ein Buch, eine Sache, ..." (if 'eine', 'ein' is alone).

rtxm commented

The relaxed setting is used to introduce tolerance to spelling errors. For example, in French, usage of the dash in numbers is not always well understood or applied (even more so in case of automatic speech to text).
So, if the spelling of the input is OK, "quatre-vingt" and "quatre vingt" mean two different things: "80" and "4 20" respectively. alpha2digits will correctly translate them in strict mode.
If unsure, the relaxed mode will treat them equally as "80" unless there is an explicit separation like a comma ("4, 20"). If you process speech transcripts, you would use a voice gap threshold instead.
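
To make the distinction concrete, here is a minimal toy sketch of the strict/relaxed behavior described above. It does not use text2num: `toy_convert_fr` is a hypothetical helper that only knows the words "quatre" and "vingt", and it is only meant to illustrate the rule, not how the library implements it.

```python
def toy_convert_fr(text: str, relaxed: bool = False) -> str:
    """Toy model of strict vs. relaxed handling of French "quatre vingt".

    Hypothetical helper for illustration only; not part of text2num.
    """
    # The hyphenated spelling always means 80.
    text = text.replace("quatre-vingt", "80")
    if relaxed:
        # Tolerate a missing hyphen: "quatre vingt" is read as 80 too.
        # "quatre, vingt" does not match, so a comma still separates them.
        text = text.replace("quatre vingt", "80")

    # Whatever is left is converted word by word.
    words = {"quatre": "4", "vingt": "20"}
    out = []
    for token in text.split(" "):
        core = token.rstrip(",")          # keep a trailing comma aside
        out.append(words.get(core, core) + token[len(core):])
    return " ".join(out)


print(toy_convert_fr("quatre-vingt"))                  # 80
print(toy_convert_fr("quatre vingt"))                  # 4 20
print(toy_convert_fr("quatre vingt", relaxed=True))    # 80
print(toy_convert_fr("quatre, vingt", relaxed=True))   # 4, 20
```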

I don't know German spelling rules, but if the spelling "zweiundzwanzig" ("22") is intended to differentiate from "zwei und zwanzig" (as "2 und 20"), then yes, maybe strict mode should follow that.

I don't know German spelling rules, but if the spelling "zweiundzwanzig" ("22") is intended to differentiate from "zwei und zwanzig" (as "2 und 20"), then yes, maybe strict mode should follow that.

I thought about this a bit more and I agree that "strict" mode should indeed generate "2 und 20" instead of "22", but it's a bit tricky to realize at the moment, and a possible solution cannot follow the current list-of-exceptions approach since the number of possibilities is endless.
In fact I think "strict" mode would mean we only accept numbers that are given as one single word ... 🤔 ... hmm ... 🤔 ... maybe it's not that complicated after all 😅 . On the other hand, I believe this is not a good default setting and in 99% of the cases probably not what people want (just my guess).

rtxm commented

You are the German language expert here, so we'll follow your advice 😉

Just to make sure, I double-checked the official rules 😅 and they say everything is written as one word ... if it's smaller than a million 🦊 🤦 , so '127.987.654.321.532' becomes 'einhundertsiebenundzwanzig Billionen neunhundertsiebenundachtzig Milliarden sechshundertvierundfünfzig Millionen dreihunderteinundzwanzigtausendfünfhundertzweiunddreißig Euro' ... mon dieu! 🙈.

Well it is what it is. I'll try to apply this rule somehow to the split function, making it dependent on the relaxed parameter.
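
A rough sketch of what "making the split dependent on the relaxed parameter" could look like, under heavy assumptions: `parse_compound` and `toy_convert_de` are hypothetical names, the vocabulary only covers 2 to 9 and 20 to 99, and this is not the code from the pull request. In strict mode only the one-word spelling is read as a single number; in relaxed mode "zwei und zwanzig" is glued back together first.

```python
UNITS = {"ein": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
         "sechs": 6, "sieben": 7, "acht": 8, "neun": 9}
TENS = {"zwanzig": 20, "dreißig": 30, "vierzig": 40, "fünfzig": 50,
        "sechzig": 60, "siebzig": 70, "achtzig": 80, "neunzig": 90}


def parse_compound(word):
    """Return the value of a one-word number such as 'zweiundzwanzig', else None."""
    if word in UNITS and word != "ein":   # a lone 'ein' is left alone (article)
        return UNITS[word]
    if word in TENS:
        return TENS[word]
    unit, sep, ten = word.partition("und")
    if sep and unit in UNITS and ten in TENS:
        return UNITS[unit] + TENS[ten]
    return None


def toy_convert_de(text, relaxed=False):
    """Hypothetical strict/relaxed conversion for a tiny German vocabulary."""
    tokens = text.split()
    if relaxed:
        # Relaxed: glue "<unit> und <ten>" back into one word before parsing.
        merged, i = [], 0
        while i < len(tokens):
            if (i + 2 < len(tokens) and tokens[i] in UNITS
                    and tokens[i + 1] == "und" and tokens[i + 2] in TENS):
                merged.append("".join(tokens[i:i + 3]))
                i += 3
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    out = []
    for tok in tokens:
        value = parse_compound(tok)
        out.append(tok if value is None else str(value))
    return " ".join(out)


print(toy_convert_de("zweiundzwanzig"))                    # 22
print(toy_convert_de("zwei und zwanzig"))                  # 2 und 20
print(toy_convert_de("zwei und zwanzig", relaxed=True))    # 22
```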
We have to make sure that this behavior is well documented because I think in most cases people need relaxed=True.

@fquirin Thank you so much, that is a great help. Some words are now converted, but others are still left unchanged:
'erste', 'zweite', 'dritte' still come out as 'erste', 'zweite', 'dritte',
but
'siebte' and 'achte' are working and come out as 7. and 8.

@Tortoise17 That's the ordinal threshold ;-)

Try alpha2digits("erste, zweite, dritte", "de", ordinal_threshold=0) and alpha2digits("erste, zweite, dritte", "de", ordinal_threshold=2)

Since 2.4.0 the default is 3.

@fquirin Thank you. With threshold 1 everything is converted perfectly except 'erste', which needs a different threshold. In random text we cannot use two thresholds, I think.

@fquirin Thank you again. By chance I tried threshold 0, and everything works perfectly now.

@Tortoise17 The 'threshold' value includes the given number as exception so "1" means "transform everything from 2 on" ;-)
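
The threshold rule can be illustrated with a small stand-alone sketch. `toy_ordinals_de` is a hypothetical helper with a tiny lookup table, not the library function; it only mimics the "convert ordinals strictly greater than the threshold" rule and the "7." style of output mentioned earlier in the thread.

```python
ORDINALS = {"erste": 1, "zweite": 2, "dritte": 3, "vierte": 4,
            "fünfte": 5, "sechste": 6, "siebte": 7, "achte": 8}


def toy_ordinals_de(text, ordinal_threshold=3):
    """Convert German ordinals whose value is greater than ordinal_threshold.

    Hypothetical illustration only; the default of 3 mirrors the default
    mentioned in the thread.
    """
    out = []
    for token in text.split(" "):
        core = token.rstrip(",.")             # set trailing punctuation aside
        value = ORDINALS.get(core)
        if value is not None and value > ordinal_threshold:
            out.append(f"{value}." + token[len(core):])
        else:
            out.append(token)                 # at or below threshold: keep word
    return " ".join(out)


print(toy_ordinals_de("erste, zweite, dritte", ordinal_threshold=0))  # 1., 2., 3.
print(toy_ordinals_de("erste, zweite, dritte", ordinal_threshold=1))  # erste, 2., 3.
print(toy_ordinals_de("erste, zweite, dritte", ordinal_threshold=2))  # erste, zweite, 3.
print(toy_ordinals_de("erste, zweite, dritte"))                       # unchanged
```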

There is one more thing: in the latest pull request I've implemented parsing of decimals, e.g. "drei komma eins vier" -> 3.14 🙂
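
As a rough idea of how such decimal parsing might work, here is a stand-alone sketch. It assumes a single spoken digit before "komma" and a run of spoken digits after it; `toy_decimal_de` is a hypothetical name and this is not the implementation from the pull request.

```python
DIGITS = {"null": "0", "eins": "1", "zwei": "2", "drei": "3", "vier": "4",
          "fünf": "5", "sechs": "6", "sieben": "7", "acht": "8", "neun": "9"}


def toy_decimal_de(text):
    """Turn '<digit> komma <digit> <digit> ...' into a decimal string.

    Hypothetical illustration; only single spoken digits are supported.
    """
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        if (tokens[i] in DIGITS and i + 2 < len(tokens)
                and tokens[i + 1] == "komma" and tokens[i + 2] in DIGITS):
            # Collect every spoken digit that follows "komma".
            j, frac = i + 2, ""
            while j < len(tokens) and tokens[j] in DIGITS:
                frac += DIGITS[tokens[j]]
                j += 1
            out.append(DIGITS[tokens[i]] + "." + frac)
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(toy_decimal_de("drei komma eins vier"))   # 3.14
```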

I've been working on parsing things like "5. 8. 2021" to "05.08.2021" and "acht Uhr dreißig" -> "8:30 Uhr", but I've not included it in this library since I've only implemented it for German (and partly English) and felt it's too specific for text2num.
If you are interested, you can check out my text_processor.py I've developed for the SEPIA STT-Server (make sure to check the 'dev' branch, it's still in test-mode ^^).
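
For readers wondering what the date case involves, a minimal regex-based sketch follows. It is not taken from text_processor.py; `normalize_date_de` is a hypothetical name, and the sketch only pads day and month with zeros without checking that the date is valid.

```python
import re

# day. month. four-digit year, with optional spaces after the dots
DATE_RE = re.compile(r"\b(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})\b")


def normalize_date_de(text):
    """Rewrite dates like '5. 8. 2021' as '05.08.2021' (no validation)."""
    def pad(match):
        day, month, year = match.groups()
        return f"{int(day):02d}.{int(month):02d}.{year}"
    return DATE_RE.sub(pad, text)


print(normalize_date_de("5. 8. 2021"))   # 05.08.2021
```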

@rtxm let me know if you think time and date parsing could be something for text2num ;-)

rtxm commented

@fquirin The scope of text2num is limited to transliteration of numbers only.

Interpretation of content (detecting and formatting phone numbers, dates, time, quantities, money sums,…) is yet another subject requiring some kind of Named Entity Recognition and Natural Language Understanding and is better left to specialized libraries above text2num. That's another project I work on at Allo-media, but not as Open Source 😉

That's another project I work on at Allo-media, but not as Open Source

Too bad, but I understand ;-)

@fquirin At the moment, I would be fine with it if you could share that time and date parser for German only. Your work is really great.