scrapinghub/number-parser

Edge cases that are not working

noviluni opened this issue · 4 comments

Some edge cases that should be fixed:

On the other side, when having "5 hundred" I think it should return "500". What do you think?

Example:

parse("one million")                                                              
>>> '1000000'
parse("1 million")                                                              
>>> '1 1000000'

Please, add a test with each of these strings when fixing it :)

On the other hand, I think there are some currently working cases that should be added to the tests:

  • word that includes a number but it's not a number and shouldn't be translated. Examples: "weight" (includes "eight") or "network" (includes "two").
  • "twelve hundred" notation (returns "1200" which is correct).

Let me know if you have any doubt. 🙂

I was thinking a bit about these edge cases

  1. twentyone - This seems to be a specific alternate form of the number thus we should probably a dictionary of hard-coded exceptions that would be language specific to handle such cases.

  2. 5 hundred - I do agree that 500 is the more natural and expected parsing, This might involve a lot of major code changes so I might push it for a bit later, with first working on incorporating languages because as discussed it shouldn't be too customized to English.

I feel that we should only handle base_number (in digits) multiplier (in words) in this way.

5 hundred = 500
267 thousands = 267000
3 million = 3000000

(These treated normally)
hundred 73 = 100 73
sixty 5 = 60 5
5 nine thousand = 5 9000 

Opinions ?

  1. The usage of 'a' , I am not very sure about the correct usage of 'a' before a number.
He has a hundred percent chance of succeeding

Do we remove the 'a' or not ?. Also is it the same behavior with other numbers and the position of 'a' in the beginning of the sentence or middle impacts parsing?

a one in six chance 
  1. Agree
  2. Agree, I like what I see in your examples. If anyone has another idea just raise it.
  3. I think that we should "remove" the "a" in the same cases we omit the "one". That is: "a hundred", "a million", "a billion", etc. Removing the "a" from "a one" doesn't have any sense for me.

However, I think that this ticket shouldn't be prioritized as we should first accept other languages (#10) to avoid adapting this library too much to the English language.

Regarding

He has a hundred percent chance of succeeding

I think the right wording would be:

He has a one hundred percent chance of succeeding

Where removing the ‘a’ makes sense.

Hey @noviluni @Gallaecio

The issue mentioned about parsing '1 million' to '1000000'

Previously
parse('1 million')
>>> '1 1000000'

so to change this to '1000000' we can make few changes to the algorithm.

My Approach for this issue

1.)According to me,
In the number_parser/parser.py while checking the validity(function name: _check_validity) of the current token with the previous one , we can check if the previous token is a numeric one(using isnumeric() ) and the previous token's integer value is less than the current token's (mapped value in the dictionary) mapped integer value. Then we can continue to build the current token with the previous number. And then the actual value resulting till now would be multiplying both integer values (previous and current one).

Intuition

a.) '1 million' should be parsed as '1000000'.
b.) 'million 1' or '1000000 one' should and will parse as '1000000 1'.
c.) Also keeping in mind that 'million one' (both require lookups to its dictionary values) will be parsed as '1000001' (already handled previously)

2.)
The reason we check if the previous token's integer value is less than the current one's is because if the both are string values (ex:- "one"), the previous token's integer value is bigger than current one's and both are in string ordinals without any numeric values in it, they both will add up .for ex: 'million one' will parse '1000001'. So we will only be moving forward for the one where the previous one is numeric and current is string and previous is smaller. Rest all case handlings are all managed previously.

I know this needs a more efficient and deeper approach.
I am a beginner, so please give your valuable views. Also am thinking to take part in GSoC next year through this sub-org project . Please let me know what you think.