
Add some way to auto-lemmatize (auto-assign parents) a book.


Summary

The current method of defining all terms could be streamlined by some form of "lemmatization", i.e., finding the root form ("lemma") of each word.

Currently, Lute treats every word as distinct: e.g., "blancas" and "blancos" are different terms, though both have the same parent term "blanco", as are "escribo" and "escribieron", though both are forms of the verb "escribir". When I first started out, I didn't mind manually making all of these mappings, but as I progress, that's become a hassle. I often want to have the parent images available for the child terms, just for my own enjoyment.

It would be nice to have an "auto-lemmatize" feature that can take a given text or book, and automatically map terms to existing parents.

Currently, the only functionality around parent terms, but a significant one in my experience, is the ability to see a bunch of sentences for a term when looking at its references. E.g., for me the term "albergado" is linked to the parent term "albergar", and when I click the "sentences" link of "albergado" I get an extensive list of sentences with albergar, albergaba, albergó, albergado, etc., which is great because I can see the term in my readings. In the future, I can also see this being useful for something like "create Anki cards for only parent terms, with examples of child terms".

First iteration: create a mapping file outside of Lute, then import.

This iteration would be good enough for me, at present!

  • Lemmatizing could, at first, be handled outside of Lute, using a tool like spaCy. This could generate a mapping file of terms in a given text/book, child -> parent. See code below.
  • The resulting file could be imported into Lute, and the mappings applied. New children (terms with status "unknown") could be created and auto-assigned to their parent, with the same status as the parent (or with status = 1, maybe).
  • Potentially, new parents could also be made ... but that gets into new term creation, which I'm really not sure how much I want to get into!
  • The lemmatization could also be applied after the fact to existing terms, but then things might get weird: terms created with a given status could end up mapped to parents with a different status ... not sure! For the first iteration, it could just work when importing a new book, perhaps.

Sample code using spacy-stanza

This only reports lemmas that differ from the original term.

import stanza
import spacy_stanza

# Download the stanza model if necessary
# print("downloading model ...");
# stanza.download("es")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("es")

text = """
Los acomodé contra las paredes, pensando en la comodidad y no en la estética.
"""

# with nlp.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']):
doc = nlp(text)

# for token in doc:
#     print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
# print(doc.ents)
# Keep only the tokens whose lemma differs from the surface form.
lemmatized = [token for token in doc if token.text != token.lemma_]
for token in lemmatized:
    print(token.text, token.lemma_)

Run with python3 -W ignore ex.py (with all dependencies installed in a Python venv).

Output:

Los él
acomodé acomodar
las el
paredes pared
pensando pensar
la el
la el

The lemmatizing pipeline takes a while to load due to the large model data, but that's OK: if people run the process outside of Lute, they'll understand the processing needs. And this is a first-pass idea anyway.

This data could be written to a file and then passed back to Lute for the magic processing.
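For example, continuing the script above, a minimal sketch of that export step (the lemma_mapping.tsv filename and the tab-separated child/parent format are assumptions made here, not an existing Lute import format):

import csv

# Collect unique (child, parent) pairs; lowercasing both is an
# assumption about how Lute matches terms.
pairs = sorted({(t.text.lower(), t.lemma_.lower()) for t in lemmatized})

# Write one "child<TAB>parent" row per pair.
with open("lemma_mapping.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(pairs)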


Future iterations

Obviously, having Lute manage this would be great, but it implies a full installation of some form of Python and spaCy or similar. This could also be done with Docker containers, managed by Docker Compose.

I don't think this would need a constantly running server for the lemma process; it could just be a "docker run"-style microcontainer that processes some input (a list of terms) and returns the mapping.
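As a sketch of what such a container's entrypoint might look like (the one-term-per-line stdin/stdout contract is invented here, and single-word Spanish terms are assumed):

import sys
import spacy_stanza

# Hypothetical run-once entrypoint: read one term per line on stdin,
# write "term<TAB>lemma" on stdout when the lemma differs.
nlp = spacy_stanza.load_pipeline("es")

for line in sys.stdin:
    term = line.strip()
    if not term:
        continue
    lemma = nlp(term)[0].lemma_
    if lemma and lemma != term:
        print(f"{term}\t{lemma}")

One caveat with this shape: lemmatizing isolated terms loses sentence context, so accuracy would likely be lower than running the pipeline over the full text.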

However, it might be nice in the future to do the lemmatization on the fly, which would need some kind of REST API server running. This might require a bunch of config, though, to get the corpora necessary for users' specific languages.
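If it ever went that way, the server side could stay quite small; a hedged Flask sketch (the /lemmatize endpoint and its JSON shape are invented for illustration):

from flask import Flask, jsonify, request
import spacy_stanza

app = Flask(__name__)
pipelines = {}  # language code -> loaded pipeline

def get_pipeline(lang):
    # Load each language's model on first use so startup stays cheap.
    if lang not in pipelines:
        pipelines[lang] = spacy_stanza.load_pipeline(lang)
    return pipelines[lang]

# Invented request shape, purely illustrative:
# POST /lemmatize  {"lang": "es", "terms": ["blancas", "escribieron"]}
@app.route("/lemmatize", methods=["POST"])
def lemmatize():
    payload = request.get_json()
    nlp = get_pipeline(payload["lang"])
    mapping = {}
    for term in payload["terms"]:
        lemma = nlp(term)[0].lemma_
        if lemma and lemma != term:
            mapping[term] = lemma
    return jsonify(mapping)

if __name__ == "__main__":
    app.run(port=5001)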

Pushed branch parent_mapping with some starting code: a service layer for doing the mapping. Still need everything else ... could even do this with a "Symfony command" or script just to start.

More detail on the lemmatizing I have in mind (a code sketch follows the list):

  • if an existing Term ("dogs") has a root form ("dog"), and that root form exists as a Term, it should be set as the parent ("dogs" gets "dog" as its parent).
  • if an existing Term ("dogs") has a root form per the mapping file or function, and that root form does not exist, create the root form and map it.
  • if a new term in a book ("cats") has a root form ("cat") and that root form exists, the new term will be created and linked to the existing parent, with a note in the new term saying that it was auto-created.
  • if a new term in a book ("parrots") has a root form ("parrot"), but that root form does not exist, don't do anything!

Done in the develop branch, docs on this are at https://github.com/jzohrab/lute/wiki/Bulk-Mapping-Parent-Terms. Will work with it for a bit on my instance before launching, but I think it's good to go.

Launched in v2.0.2.