scrapinghub/number-parser

Add support for multiple locales

noviluni opened this issue · 3 comments

This library should support multiple languages.

As a first approach, we could support English (default) plus three more languages: Spanish, Russian, and Hindi, as they are widely used and have different alphabets.


We could use an approach similar to the one used in dateparser.

It works like this:

  • JSON files coming from CLDR
  • YAML files containing language-specific exceptions
  • Python files merging both sources.

The script used in dateparser is this one: https://github.com/scrapinghub/dateparser/blob/master/dateparser_scripts/write_complete_data.py but it's not a good example, as there is a lot to be improved and it contains some bad practices.

To support this, we could just add a locale (or similar) argument to the parse() function, defaulting to English. I don't expect it to autodetect the language, at least not in this first iteration.
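For illustration, usage could look something like the sketch below; the argument name `language` is just a placeholder, not a settled interface:

```python
from number_parser import parse

# Hypothetical interface sketch: `language` is a placeholder name for the
# proposed locale argument, defaulting to English.
parse("two hundred and forty-three")                # English (default)
parse("doscientos cuarenta y tres", language="es")  # Spanish
parse("двести сорок три", language="ru")            # Russian
```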

@arnavkapoor feel free to implement it in the way you think is best. We can also split this across separate PRs; there's no need to do it all in one.

I have created a draft PR #12 (I still need to test it properly and clean the code up a bit). I would appreciate input on the approach and on the feasibility of expanding it to even more languages. This is an overview of the approach followed.

The main approach rests on creating six dictionaries for each language (a sketch of what these might contain for English follows the list):

  • UNIT_NUMBERS -> Numbers from 1 to 9.

  • BASE_NUMBERS -> These contain uniquely defined numbers, i.e. those that don't use any prefix. The maximum possible range is [10, 99], and the actual range changes between languages.

    • English -> The range is [10, 19] (ten, eleven, twelve, ..., nineteen).
    • Hindi -> The range is [10, 99]; unique words exist all the way up to 100.
    • Spanish -> The range is [10, 29].
  • MTENS -> These are multiples of ten from 20 to 90 that are used along with unit numbers to form the complete number. This might be empty for certain languages like Hindi. For English this list is twenty, thirty, forty, ..., ninety.

  • MHUNDREDS -> These are multiples of a hundred from 200 to 900. This is a new set, added because it wasn't needed for English or Hindi; however, it is widely used in Russian, Spanish, and probably other languages too.

    • This includes words like doscientos (200), quinientos (500), пятьсот (500), двести (200).
      An alternative approach would be to parse substrings instead, e.g. splitting doscientos into 'dos' (two) and 'cientos' (hundred). However, the lack of delimiters would mean a major upheaval in the logic. Also, words like quinientos have no root word (5 is cinco), and the Russian suffix differs by number, e.g. сти for 200 but сот for 500.
      Thus I decided to create this dictionary rather than parse substrings.
  • MULTIPLIERS -> These are simply powers of 10, e.g. for English: hundred, thousand, and so on.

  • VALID_TOKENS -> Certain words that are ignored when they appear between numbers: 'and' for English, 'y' for Spanish, and so on.
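As mentioned above, here is a rough sketch of what these dictionaries might contain for English. The values are illustrative; the exact representation in the PR may differ:

```python
UNIT_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9}

BASE_NUMBERS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
                "fourteen": 14, "fifteen": 15, "sixteen": 16,
                "seventeen": 17, "eighteen": 18, "nineteen": 19}

MTENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

MHUNDREDS = {}  # empty for English; e.g. {"doscientos": 200, ...} for Spanish

MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}

VALID_TOKENS = {"and"}

# e.g. "two hundred and forty three" tokenizes to 2, 100, ("and" ignored),
# 40, 3, which combine as (2 * 100) + 40 + 3 = 243
```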

  1. Creation Procedure
    These dictionaries were populated by parsing CLDR data as well as user-specified information (the approach is quite dependent on the latter).
    The important point to note is that the data for any language is quite small (about 100 values at most). This data also won't need to be updated, so filling it in manually once and using it in perpetuity is a viable option. The CLDR data source is CLDR-RBNF (Rule-Based Number Formatting), a set of rules for converting numbers to words, e.g. 23 -> twenty-three. From this, I extracted most of what the dictionaries need. The data that wasn't easily parsed (there are many ways to write the rules, and I only parsed the common base rules across the board) I filled in manually in the supplementary files.
    The script for data parsing is at scripts/parse_data_from_cldr.py. It takes the JSON from CLDR-RBNF (downloaded into raw_cldr_translation_data), parses it, and merges it with the supplementary translation data before saving the results in number_parser/translation_data_merged. This is what the parser uses.
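Conceptually, the merge step works something like the sketch below. This is a simplification: the actual RBNF parsing in scripts/parse_data_from_cldr.py is more involved, and the supplementary directory name here is an assumption:

```python
import json
from pathlib import Path

SECTIONS = ("UNIT_NUMBERS", "BASE_NUMBERS", "MTENS", "MHUNDREDS",
            "MULTIPLIERS", "VALID_TOKENS")

def merge_language_data(language: str) -> None:
    """Merge CLDR-derived data with hand-written supplementary data.

    File layout is assumed/simplified; supplementary values win on
    conflict so that manual corrections override the CLDR extraction.
    """
    cldr = json.loads(Path("raw_cldr_translation_data",
                           f"{language}.json").read_text(encoding="utf-8"))
    extra = json.loads(Path("supplementary_translation_data",  # assumed name
                            f"{language}.json").read_text(encoding="utf-8"))

    merged = {key: {**cldr.get(key, {}), **extra.get(key, {})}
              for key in SECTIONS}

    out = Path("number_parser", "translation_data_merged", f"{language}.json")
    out.write_text(json.dumps(merged, ensure_ascii=False, indent=2),
                   encoding="utf-8")
```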

  2. Drawbacks / Issues remaining

  • Languages without delimiters. Japanese and Chinese (Simplified and Traditional) and possibly other East Asian languages don't have any delimiter, e.g. 九千九百九十九 (9999 in Japanese). These actually have a very similar structure to English, but the lack of a delimiter makes them tougher.

  • I also looked at German and French, and both have idiosyncrasies that might require logic changes. For French, the use of quatre-vingt for 80 followed by a number from 1 to 19 as a suffix, e.g. quatre-vingt-dix-neuf for 99 (80 + 19), would need to be handled. With German, it's a more fundamental issue, as the units come before the tens: achtundzwanzig (28) is literally "eight and twenty". Can refer to this for more details about the issue. Also, there isn't a delimiter as such (I think), so that's also concerning. One approach in mind for the delimiter problem is reading words character by character and inserting a space as soon as we have a match against any known word; after this pre-processing step, we can follow the same logic (see the sketch after this list). This does increase the complexity to O(string_length ^ 2), which shouldn't be a major issue, I believe, and we can apply this function only to languages without delimiters.

  • Another thing is the different forms of numbers, which is also prevalent in a few languages. For example, Russian has different forms of numbers (masculine, feminine, feminine-dative, feminine-accusative, and so on). Since the raw JSON had most of these forms, I have parsed most of them, though some might be missing (especially those that were added manually in the supplementary data source). Similarly, una (feminine) and un (masculine) will both map to 1 in Spanish.

  • A lot of languages treat exact large powers of 10 ('100', '100000', etc.) differently. E.g. in Spanish, ciento is used for numbers from 101 onward, but exactly one hundred is cien. Similarly, 'millón' is used for one million, but greater values use 'millones'. A simple fix would be to put both millón and millones in the multipliers dictionary; however, we would then get false positives, where the invalid 'dos millón' would also parse as 2000000. So this also needs to be fixed, but it isn't as major an issue as the previous ones.
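To make the pre-processing idea from the German/French point above concrete, here is a minimal sketch. The toy vocabulary and the first-match splitting rule are illustrative assumptions; a real implementation would probably need longest-match handling for multi-character number words:

```python
def insert_delimiters(text: str, vocabulary: set) -> str:
    """Pre-processing step for languages without delimiters.

    Scans the text character by character and inserts a space as soon as
    the accumulated characters match a known number word. Roughly
    O(len(text) ** 2) in the worst case, as noted above.
    """
    tokens, current = [], ""
    for char in text:
        current += char
        if current in vocabulary:   # first match wins, per the idea above
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)      # trailing residue, if any
    return " ".join(tokens)

# e.g. with a toy Japanese vocabulary:
VOCAB = {"九", "千", "百", "十"}
print(insert_delimiters("九千九百九十九", VOCAB))  # -> "九 千 九 百 九 十 九"
```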

In addition to all the points above, I would appreciate opinions on using JSON files as the final data source instead of .py files as dateparser does. In my main parser code, I would load the corresponding JSON based on the language and populate the six dictionaries from it. Are there any drawbacks to this method (speed, e.g. loading the JSON on each new call to parse())?
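On the speed question specifically, one option would be to load each language's JSON once and cache it, so repeated parse() calls don't re-read the file. A minimal sketch, assuming the merged-data layout described above:

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def load_language_data(language: str) -> dict:
    """Load the merged translation data for a language exactly once.

    Subsequent calls for the same language hit the cache, so repeated
    parse() calls pay the JSON-loading cost only on first use.
    """
    path = Path("number_parser/translation_data_merged") / f"{language}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```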

Hi @arnavkapoor !

The most valuable things in this project at this moment are:

  • Tests
  • Tests (yes, they are quite important)
  • Raw data

Believe me, we will probably need to rewrite all the code, but the important thing is keeping good tests and making sure that they pass.

The interface will be really important too, as we need it to build the tests, but for now, it doesn't matter if it changes a lot. Once we release a new version to PyPI this will change: after that, it shouldn't change too much.

With this said, I think that we shouldn't worry too much about Japanese, Chinese, German, and French. It's a really good idea to check them (as by doing so we will see potential future issues), but for now, what we should do is:

  • Improve the "interface" (more on this in the next code review of the currently open PR: #12).
  • Add language-specific tests (issue: #15)

After having good coverage for those languages, we can work on adding support for other languages without breaking anything for the currently supported ones.

I have just created a new ticket to track point 2.1: #18

However, as mentioned before, don't worry too much about the other languages and focus on adding support for Hindi, Spanish, and Russian (apart from English). I'm sure we will be able to add the other languages in the future 😄