appledora/mwparserfromhtml

Ensure clear connection between HTML nodes and plaintext

appledora opened this issue · 0 comments

In GitLab by @geohci on Sep 15, 2022, 16:05

Many use cases for plaintext extracted from HTML require knowing where each word came from. For example, knowing which part of a sentence is a link or was italicized in the HTML can be crucial when training models for link prediction. The plaintext methods should let us see which type of node contributed each character or word, while still making it easy to join the pieces together into a plain string.
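
To make the request concrete, here is a minimal sketch of the kind of output this could produce. It is not this library's API: it uses only Python's standard `html.parser`, and the names `Segment`, `TAG_TO_TYPE`, `SegmentCollector`, `html_to_segments`, and `segments_to_text` are hypothetical, as are the node-type labels. The idea is to emit plaintext as a list of (text, node type) pairs that can be inspected individually or joined into a single string.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple


class Segment(NamedTuple):
    text: str       # plaintext contributed by this piece of the HTML
    node_type: str  # hypothetical label: "Text", "Link", or "Italic"


# Hypothetical mapping from HTML tag name to a node-type label.
TAG_TO_TYPE = {"a": "Link", "i": "Italic", "em": "Italic"}


class SegmentCollector(HTMLParser):
    """Collects text chunks, each tagged with its innermost tracked tag."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[Segment] = []
        self._stack: List[str] = []  # node types of currently open tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_TYPE:
            self._stack.append(TAG_TO_TYPE[tag])

    def handle_endtag(self, tag):
        # Assumes well-formed nesting; a real implementation would need
        # to handle mismatched or unclosed tags.
        if tag in TAG_TO_TYPE and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        node_type = self._stack[-1] if self._stack else "Text"
        self.segments.append(Segment(data, node_type))


def html_to_segments(html: str) -> List[Segment]:
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return collector.segments


def segments_to_text(segments: List[Segment]) -> str:
    # Joining the pieces recovers the pure string.
    return "".join(segment.text for segment in segments)


html = '<p>The <a href="./Cat">cat</a> is <i>very</i> small.</p>'
segments = html_to_segments(html)
for segment in segments:
    print(repr(segment.text), segment.node_type)
print(segments_to_text(segments))
```

Keeping the node type alongside each chunk, rather than returning a bare string, means downstream code can filter or label spans (for example, mark link tokens for a link-prediction model) and still get the plain string with a single join.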