Add some way to auto-lemmatize (auto-assign parents) a book.
jzohrab opened this issue · 4 comments
Summary
The current method of defining all terms can be streamlined by some form of "lemmatization", i.e., finding root terms of words.
Currently, Lute treats every word as distinct: e.g., "blancas" and "blancos" are separate terms, though both have the same parent term "blanco", as are "escribo" and "escribieron", though both are forms of the verb "escribir". When I first started out, I didn't mind making all of these mappings manually, but as I progress it feels like a hassle. I often want the parent's images available for the child terms, just for my own enjoyment.
It would be nice to have an "auto-lemmatize" feature that can take a given text or book, and automatically map terms to existing parents.
Currently, the only functionality around parent terms, though a significant one in my experience, is the ability to see a batch of sentences for a term when looking at its references. E.g., for me, the term "albergado" is linked to the parent term "albergar", and when I click the "sentences" link of "albergado" I get an extensive list of sentences with albergar, albergaba, albergó, albergado, etc., which is great because I can see the term in my readings. In the future, I can also see this being useful for something like "create Anki cards for only parent terms, with examples of child terms", etc.
First iteration: create a mapping file outside of Lute, then import.
This iteration would be good enough for me, at present!
- Lemmatizing could, at first, be handled outside of Lute, using a tool like spaCy. This could generate a mapping file of terms in a given text/book, child -> parent. See code below.
- The resulting file could be imported into Lute, and the mappings applied. New children (status = unknown) could be created and auto-assigned to the parent, with the same status as the parent (or with status = 1, maybe).
- Potentially, new parents could also be made ... but that gets into new term creation, which I'm really not sure how much I want to get into!
- The lemmatization could also be applied after-the-fact to existing terms, but then things might get weird, with people creating terms with a given status that get mapped to parents with a different status ... not sure! For the first iteration, it could just work when importing a new book, perhaps.
Sample code using spacy-stanza
This only finds lemmas that differ from the original term.
```python
import stanza
import spacy_stanza

# Download the stanza model if necessary
# print("downloading model ...")
# stanza.download("es")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("es")

text = """
Los acomodé contra las paredes, pensando en la comodidad y no en la estética.
"""

# with nlp.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']):
doc = nlp(text)

# for token in doc:
#     print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
# print(doc.ents)

# Keep only tokens whose lemma differs from the surface form.
lemmatized = [token for token in doc if token.text != token.lemma_]
for token in lemmatized:
    print(token.text, token.lemma_)
```
Run with `python3 -W ignore ex.py` (with all dependencies installed in a Python venv).
Output:
```
Los él
acomodé acomodar
las el
paredes pared
pensando pensar
la el
la el
```
The lemmatizing code takes a while to load because the model data is large, but that's OK. If people run the process outside of Lute, they'll understand the processing needs. And this is a first-pass idea anyway.
This data could be loaded into a file and then passed back to Lute for magic processing.
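For example, continuing from the sample script above, the pairs could be written out as a tab-separated file (the file name and format here are just assumptions, since the Lute import side isn't defined yet):

```python
# Write child -> parent pairs to a tab-separated mapping file.
# "mapping.tsv" and the TSV layout are assumptions for illustration.
with open("mapping.tsv", "w", encoding="utf-8") as f:
    for token in lemmatized:
        f.write(f"{token.text}\t{token.lemma_}\n")
```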
Reference links for spaCy:
- https://spacy.io/usage
- https://spacy.io/usage/models
- https://stackoverflow.com/questions/60534999/how-to-solve-spanish-lemmatization-problems-with-spacy
- https://github.com/explosion/spacy-stanza
- https://stackoverflow.com/questions/59636002/spacy-lemmatization-of-a-single-word
Future iterations
Obviously, having Lute manage this would be great, but it implies a full installation of some form of Python and spaCy or similar. This could be done with Docker containers too, managed by compose.
I don't think this would need a constantly running server for the lemma process; it could be a "docker command" style microcontainer that just processes some input (a list of terms) and returns the mapping.
However, it might eventually be nice to do the lemmatization on the fly, which would need some kind of REST API server running. That might require a bunch of configuration, though, to get the corpora necessary for each user's specific languages.
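As a rough sketch of that on-the-fly idea (Flask, the endpoint name, and the request shape are all my own assumptions here, not anything Lute defines):

```python
# A minimal sketch of an on-the-fly lemmatizing service: accepts a JSON
# list of terms and returns a child -> parent mapping for terms whose
# lemma differs from the surface form. Flask and the /lemmatize endpoint
# are assumptions for illustration only.
import spacy_stanza
from flask import Flask, jsonify, request

app = Flask(__name__)
nlp = spacy_stanza.load_pipeline("es")  # slow: load once at startup

@app.route("/lemmatize", methods=["POST"])
def lemmatize():
    terms = request.get_json()["terms"]
    doc = nlp(" ".join(terms))
    mapping = {t.text: t.lemma_ for t in doc if t.text != t.lemma_}
    return jsonify(mapping)

if __name__ == "__main__":
    app.run(port=5001)
```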
Pushed branch parent_mapping with some starting code: a service layer for doing the mapping. Still need everything else ... could even do this with a "symfony command" or script just to start.
More detail on the lemmatizing I have in mind:
- if an existing Term ("dogs") has a root form ("dog"), and that root form exists as a Term, it should be set as the parent ("dogs" has "dog" as parent)
- if an existing Term ("dogs") has a root form in the mapping file or function, and that root form does not exist, create the root form and map it
- if a new term in a book ("cats") has a root form ("cat"), and that root form exists, the new term will be created and mapped to the existing parent, with a note in the new term saying that it was auto-created
- if a new term in a book ("parrots") has a root form ("parrot"), but that root form does not exist, don't do anything!
Done in the develop branch; docs on this are at https://github.com/jzohrab/lute/wiki/Bulk-Mapping-Parent-Terms. Will work with it for a bit on my instance before launching, but I think it's good to go.
Launched in v2.0.2.