unicode-org/inflection

The best lexicon type/format to use

Opened this issue · 7 comments

Lexicons are a critical part of the inflection project. They need to be used at runtime, and will also be used by our tools for potential ML training.

We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:

  1. Is the format open, or under a friendly license?
  2. Can other lexicons be converted into that format, so we have consistent data?
  3. Is the format efficient, to reduce size & allow quick lookup?
  4. Are there quality existing tools for operating on the lexicon?
  5. Can the lexicon data be easily pruned to what the user needs, to reduce deployment size?

An example tool and format, used in some universities (see languages):

  1. Unitex/GramLab from university in France, https://unitexgramlab.org/ (LGPL)
  2. Unitex lexicons (22 languages, with varied coverage), https://unitexgramlab.org/language-resources (LGPLLR)

They use the DELA class of dictionaries (I couldn't find a better link describing the DELA format).

What are other options we can use? Other criteria for selecting a lexicon/dictionary?

Another approach is to use the UniMorph package.

Before answering @nciric's great questions, we also need to decide what our end goal is:

(1) Do we want to build/store a lexicon? (e.g. store "house, n: house,sing/houses,plur")
(2) Do we want to be able to generate inflected forms based on a "lemma" and grammatical info? (e.g. input: house + plural, output: houses)
(3) Do we want to be able to analyse an inflected form and provide its lemma and grammatical info? (e.g. input: houses, output: house, plur)
(4) ... other?
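The three candidate capabilities above can be sketched in a few lines. This is purely illustrative, using the "house" example from the list; none of these names are a proposed API.

```python
# (1) Stored lexicon: lemma -> {grammemes: surface form} (illustrative shape)
LEXICON = {
    "house": {("n", "sing"): "house", ("n", "plur"): "houses"},
}

# (2) Generation: lemma + grammatical info -> inflected form
def generate(lemma, grammemes):
    return LEXICON[lemma][grammemes]

# (3) Analysis: inflected form -> all matching (lemma, grammatical info) pairs
def analyze(form):
    return [(lemma, g) for lemma, forms in LEXICON.items()
            for g, f in forms.items() if f == form]

print(generate("house", ("n", "plur")))  # houses
print(analyze("houses"))                 # [('house', ('n', 'plur'))]
```

Note that (2) and (3) are inverse operations over the same data, which is relevant to the FST discussion below in the thread.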

If we can clarify these points first, then we can choose the lexicon format.

Hello everyone, Jelena Mitrovic here. I am very excited to have been invited by @nciric to join this effort.

@BrunoCartoni I have worked with UNITEX and DELA dictionaries for Serbian during my PhD (a while back). If we choose to go this route, the answers to your questions would be:

(1) The lexica DELAF (for simple words) and DELAC (for compounds) already have all the forms available alongside the lemma.

(2) The dictionary format contains this already. The problem is that dictionaries are limited, domain-specific and project-specific, so people build their own and do not share them. We would thus have to find people who are willing to supplement the existing resources, or to share the ones they have.

(3) This would be ideal, and again possible with DELA dictionaries.

Regarding the UniMorph package, I do not have experience with it, but it seems to pair well with Universal Dependencies. Inflection is dependent on syntax, so it might make sense to include at least the simplest UD annotations for each language.

The issues we are dealing with here are quite complex, and I hope to have a better understanding of the overall requirements for Unicode after our meeting.

Thanks Bruno and Jelena, see my answers below:

  1. We will need to build/store some words in the lexicon - e.g. exceptions to the rules.
  2. I would prefer to generate inflected forms where possible, to reduce the size of the lexicon (and lookup time). For example, for Serbian we would need 14 forms, including the lemma; generating them would reduce the stored entries to only one.
  3. This is a secondary goal at the moment, but it's something we may get for free using Finite State Transducers, as they often work both ways. I am sure there are a number of exceptions that would have to be stored in the lexicon. We need to see what trade-off we need to make, if any, when deciding on this point.
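Point 2 above (generate where possible, store only exceptions) can be sketched with English plurals. This is a toy rule-plus-exceptions scheme, not a real FST; the names are hypothetical.

```python
# Exceptions stored explicitly in the lexicon; everything else is generated
# by rule, so only irregular words cost storage.
EXCEPTIONS = {"child": "children", "mouse": "mice"}

def pluralize(lemma):
    if lemma in EXCEPTIONS:
        return EXCEPTIONS[lemma]
    # Simplified English pluralization rules for illustration only.
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    return lemma + "s"

print(pluralize("house"))  # houses
print(pluralize("box"))    # boxes
print(pluralize("child"))  # children
```

A real FST-based implementation would compile such rules and exceptions into a single transducer, which is what makes running it in the analysis direction (point 3) nearly free.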

My use case for the library is:

  1. Take a MessageFormat message with placeholders. Placeholders have annotated grammatical features, e.g. VOC, SINGULAR.
  2. Take in the parameters from the user, e.g. Beograd (Belgrade), Rim (Rome)
  3. Look up grammatical info from the lexicon, e.g. Beograd -> masculine, inanimate, Rim -> masculine, inanimate
  4. Pass the parameters to our new API, inflect("Beograd", grammatical_info), same for Rim, or directly to message format (which would then automatically do the necessary call)
  5. Get the formatted message
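The five steps above can be sketched end to end. The lexicon entries, the grammeme names, and `inflect()` are all assumptions for illustration; the actual API shape is a separate discussion.

```python
# Hypothetical lexicon entries carrying grammatical info and a vocative form.
LEXICON = {
    "Beograd": {"gender": "masc", "animacy": "inan", "VOC": "Beograde"},
    "Rim":     {"gender": "masc", "animacy": "inan", "VOC": "Rime"},
}

def inflect(word, case):
    # Steps 3-4: look up grammatical info and produce the requested case,
    # falling back to the unmodified word if no form is stored.
    return LEXICON.get(word, {}).get(case, word)

def format_message(template, arg, case):
    # Steps 1, 2 and 5: fill the case-annotated placeholder and return
    # the formatted message.
    return template.format(city=inflect(arg, case))

print(format_message("Zdravo, {city}!", "Beograd", "VOC"))  # Zdravo, Beograde!
```

In the integrated version, MessageFormat itself would read the VOC annotation from the placeholder and make the `inflect` call automatically.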

We can think of other scenarios, like building an index for search and asking for lemmatization, where your point 3) holds.

It would be helpful to see a summary of what formats are available out there. Using a format that is compatible with the Unicode license is preferable.

Mihai did point out DMLex, which seems promising.

I like the idea of interoperability and leveraging existing repositories. On the other hand, I don't like restrictions on adding grammemes and other morphological information. For example, it's important to know phonetic information, such as whether a word starts or ends with a vowel or consonant. That's important for grammatical agreement in several languages, including English, French and Korean. So if adding such information is difficult, I'd like to steer clear of such restrictions.
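A toy example of why a lexicon needs room for phonetic annotations: English article choice depends on the initial *sound*, not the spelling, so a stored flag beats a naive first-letter check. The data and names here are illustrative.

```python
# Hypothetical lexicon flag: does the word start with a vowel sound?
# "hour" and "university" are the classic cases where spelling misleads.
STARTS_WITH_VOWEL_SOUND = {"hour": True, "university": False, "apple": True}

def article(word):
    vowel_sound = STARTS_WITH_VOWEL_SOUND.get(
        word,
        word[0].lower() in "aeiou")  # fallback: naive spelling check
    return "an" if vowel_sound else "a"

print(article("apple"))       # an
print(article("hour"))        # an  (a spelling check alone would say "a")
print(article("university"))  # a   (a spelling check alone would say "an")
```

French liaison/elision (le/l') and Korean particle alternation (은/는, 이/가) need the same kind of phonological flag on the *preceding or following* word, which is why the format must allow adding such grammemes freely.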

If a lexicon format can help sort a list of English adjectives correctly, that would be a strong format to consider, but it's not a requirement. Adjective order in English is a helpful problem to solve.

Conceptually, I'd like the data structured in a way that associates a lemma with all its surface forms, with each surface form annotated with grammemes to differentiate it from the other surface forms under the same lemma. As an example, Wiktionary has katt in Swedish, and it has a well-annotated declension table and pronunciation.
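One possible shape for that structure, using a subset of the katt paradigm from its Wiktionary declension table. The schema itself is a sketch, not a proposed format.

```python
# A lemma entry keyed to its surface forms, each annotated with grammemes.
KATT = {
    "lemma": "katt",
    "pos": "noun",
    "forms": [
        {"form": "katt",     "grammemes": {"number": "sing", "definiteness": "indef"}},
        {"form": "katten",   "grammemes": {"number": "sing", "definiteness": "def"}},
        {"form": "katter",   "grammemes": {"number": "plur", "definiteness": "indef"}},
        {"form": "katterna", "grammemes": {"number": "plur", "definiteness": "def"}},
    ],
}

def surface(entry, **grammemes):
    """Return the first surface form matching all requested grammemes."""
    for f in entry["forms"]:
        if all(f["grammemes"].get(k) == v for k, v in grammemes.items()):
            return f["form"]

print(surface(KATT, number="plur", definiteness="def"))  # katterna
```

The open-ended `grammemes` map is deliberate: it leaves room for phonological or other per-form annotations without changing the schema.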

I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer API and code discussion to a separate topic too.

> Mihai did point out DMLex, which seems promising.

DMLex also sounds promising, thanks for linking.

> I'd also like to defer API and code discussion to a separate topic too.

I opened #3 for discussing APIs (and use cases).