appledora/mwparserfromhtml

add functions to extract plaintexts to library

appledora opened this issue · 4 comments

In GitLab by @appledora on Jul 12, 2022, 15:45

In GitLab by @geohci on Jul 15, 2022, 22:23

a few thoughts based on https://public.paws.wmcloud.org/User:Appledora/plaintext_examples.ipynb:

  • Potentially interesting plaintext does appear outside of <p> tags, though the vast majority of it seems to be found in <p> tags. Maybe <span> tags too? Tables seem to mostly contain facts/data rather than fully-formed sentences.
  • However, lots of non-interesting text appears within <p> tags too (e.g., the stub template text), so restricting to <p> tags alone is not a sufficient filter.
  • Knowing whether a <p> element came from a template or not is an obvious filter that would help reduce the redundant text without needing to build a database of sentences and how often they appear.
  1. To address this we can traverse all the tags inside the body (except for style, meta, etc.), identify their types, and keep or ignore specific tags. Tables are trivial to identify because they always start with the <table> tag.
  2. It is also trivial to identify the stubs because we have identified the specific class associated with them.
  3. As a rule of thumb, we have observed so far that templates usually have an about attribute whose value has the form #mwtN (N being a number). This can be approached in two ways, I think:
    • we can traverse each node/tag and check if it's a template
    • right at the start we can rip out all the templates the same way we remove all the useless tags like style and meta.
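
The filtering rules above could be sketched roughly as follows. This is only an illustration, not the code on the branch: it uses Python's standard-library html.parser rather than the library's own classes, it takes the first approach (checking each node while traversing), and the names ParagraphTextExtractor, extract_paragraph_text and the stub class name "stub" are hypothetical placeholders, since the actual stub class is not named in this thread.

```python
import re
from html.parser import HTMLParser

# Assumption from the discussion: template output carries about="#mwtN".
TEMPLATE_ABOUT = re.compile(r"^#mwt\d+$")
# Tags whose whole subtree is ignored.
SKIP_TAGS = {"style", "script", "table"}
# Hypothetical placeholder; the real stub class is not named in this thread.
STUB_CLASSES = {"stub"}
# Void elements never get an end tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphTextExtractor(HTMLParser):
    """Collect text from <p> elements that pass the filters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._stack = []      # (tag, skipped) for every open element
        self._buffer = None   # text pieces of the current kept <p>

    def _is_skipped(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            return True
        if TEMPLATE_ABOUT.match(attrs.get("about") or ""):
            return True
        classes = set((attrs.get("class") or "").split())
        return bool(classes & STUB_CLASSES)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # An element is skipped if it or any ancestor is skipped.
        parent_skipped = bool(self._stack) and self._stack[-1][1]
        skipped = parent_skipped or self._is_skipped(tag, attrs)
        self._stack.append((tag, skipped))
        if tag == "p" and not skipped:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in [t for t, _ in self._stack]:
            return
        # Pop up to the matching open tag, tolerating unclosed children.
        while True:
            open_tag, skipped = self._stack.pop()
            if open_tag == tag:
                break
        if tag == "p" and not skipped and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is None:
            return  # not inside a kept <p>
        if self._stack and self._stack[-1][1]:
            return  # inside a skipped element nested in the <p>
        self._buffer.append(data)


def extract_paragraph_text(html):
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = """
<style>p {color: red}</style>
<meta charset="utf-8">
<p>Dhaka is the capital of <a href="./Bangladesh">Bangladesh</a>.</p>
<table><tr><td><p>Area: 306 km2</p></td></tr></table>
<p about="#mwt12" class="stub">This article is a stub.</p>
<p>It lies on the <span about="#mwt7">[1]</span> Buriganga River.</p>
"""
result = extract_paragraph_text(sample)
```

Here result holds one string per kept paragraph, which is one possible answer to the output-structure question below; the second approach (removing all template nodes up front) would give the same text but needs a mutable tree rather than a streaming parser.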

But overall, what remains most confusing for me is how we should structure the output of this method.

created branch 32-add-functions-to-extract-plaintexts-to-library to address this issue

In GitLab by @martingerlach on Aug 18, 2022, 14:00

mentioned in commit d48e18f