appledora/mwparserfromhtml

feature: extract external links - [merged]

appledora opened this issue · 59 comments

Merges 13-extlink-extraction -> main

Closes #13

requested review from @geohci

added 1 commit

  • 335744c - update: added documentation

Compare with previous version

added 1 commit

Compare with previous version

added 1 commit

Compare with previous version

added 2 commits

  • cea5ccb - update: added transclusion
  • 11d2bb0 - Merge branch '13-extlink-extraction' of...

Compare with previous version

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 17

thanks for adding this documentation! this makes me realize that we probably want to adhere to mwparserfromhell names where possible to make usage as consistent as possible. for wikilink, they have a title and text attributes that we could mirror (and then obviously we'll have a bunch of attributes not in mwparserfromhell). https://github.com/earwig/mwparserfromhell/blob/7738752b016b4c8f4dd3b381ca125724ac2a3af7/src/mwparserfromhell/nodes/wikilink.py#L28

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 24

to be more clear: "...if the wikilink was transcluded onto the page."

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 21

to be more clear "...if the wikilink leads to a disambiguation page"

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 51

I think this should actually be something like self.bare (enwiki vocabulary I've seen) or self.autolinked (HTML spec vocabulary) as I think it captures links that don't have brackets.

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 26

My inclination is to not include attributes like this that don't contain new information -- i.e. are derived directly from a combination of other attributes. Open to thoughts about the pros/cons: in some cases, having a shortcut can be worth the overhead if it's widely used/understood but I'm not sure in this case if we know what specific combinations of attributes folks will care about.

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 78

what other things do we see in the about property? should this be a stricter if html_string["about"].startswith("mwt"): to avoid false positives and be slightly more efficient? or is the startswith assumption not always true?

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/utils.py line 171

low priority but a few quick thoughts:

  • good catch not to use self but type is a built-in so we probably don't want to use that either.
  • I guess this is not so much link normalization as page title normalization? should we change the function/parammeter names?
  • how do you envision the type parameter being used?

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/article.py line 76

can you add a test case or two to test the new function? as we expand out the library, keeping tests and classes/functions aligned will be more important

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/article.py line 1

what's this do?

yeah, I will create an issue for it and work on it separately? what do you think?

possibly, auto imported by my ide. removing this.

Usually, the about attribute appears like this about="#mwtn" in the HTML where n is a numerical value. So, making it stricter with .startswith() makes sense.

yes, should definitely change the type param and also rename the function. I noticed that, link/title patterns are different for different element (different prefixes, different space separators etc.). Primarily, I only identified for Category and Template, and wondering whether other different patterns for other different elements also exist and normalize the titles/links based on that separately?

so instead of self.standard, we use self.bare/autolinked, right?

added 1 commit

Compare with previous version

changed this line in version 6 of the diff

changed this line in version 7 of the diff

added 1 commit

Compare with previous version

changed this line in version 8 of the diff

added 1 commit

Compare with previous version

changed this line in version 9 of the diff

added 1 commit

Compare with previous version

an updated version of this method is actually present on the other MR xD

changed this line in version 10 of the diff

added 1 commit

Compare with previous version

added 1 commit

Compare with previous version

Good idea, honestly. I should have kept this in my mind. I will go through this, before making new changes.

would it be odd if I just don't mention one attribute i.e: standard in the docstring?

In GitLab by @geohci on Jul 1, 2022, 04:28

Commented on src/parse/elements.py line 51

yep -- i don't have a strong preference but maybe autolinked is more descriptive?

In GitLab by @geohci on Jul 1, 2022, 04:29

Commented on src/parse/elements.py line 26

sorry, i commented on the docstring but i intended to ask more broadly (keep it as an attribute or not)

In GitLab by @geohci on Jul 1, 2022, 04:30

Commented on src/parse/article.py line 76

yep, that's fine by me. thanks!

Ohhh... I think I get it now. That Wikilinks (or similar elements) should be by default "standard", while their other states or attributes may change. Users would expect them to be normal by default, but their interests might lie in whether such links are transcluded or point to a disambiguation link. Am I going in the right direction with this?

changed this line in version 12 of the diff

added 1 commit

Compare with previous version

In GitLab by @geohci on Jul 1, 2022, 20:47

Commented on src/parse/elements.py line 26

yep! this isn't a hard rule that everything has a default that we don't make explicit -- e.g., technically if an external link is neither numbered or named, you know it's autolinked. But it's still useful to have autolinked as an attribute because the user shouldn't need to know that there are only three types of external link formats etc. but my feeling with this one is that we can leave it as implicit that a "standard" blue link is the combination/absence of a few explicit properties.

In GitLab by @geohci on Jul 1, 2022, 20:47

Commented on src/parse/elements.py line 17

thanks -- do you want to do this as part of this merge request or do it more broadly in a future MR? if the former, let me know and we can probably merge this.

Tbh, the change is really small so i can fix it right away. But the only issue being, for wikilinks, would there be any difference between a title and the text?

Although, in the mwparserfromhell, for the text attribute they do mention "if any". So we can take the same approach, i guess?

added 1 commit

  • 30ef201 - update: aligned with mwparserfromhell

Compare with previous version

added 1 commit

  • 74c4eb7 - update: aligned with mwparserfromhell

Compare with previous version

resolved all threads

In GitLab by @geohci on Jul 1, 2022, 21:44

Commented on src/parse/elements.py line 8

i'd take this out -- we can do the strip() part when we convert to plaintext but here we should preserve the raw data

In GitLab by @geohci on Jul 1, 2022, 21:44

Commented on src/parse/elements.py line 12

a few things for these functions:

  • I don't think we need the .strip() parts for title/text (see above about preserving raw data except for when we do the plaintext conversion) and we control the classnames and they shouldn't have any weird spaces in them so .strip() won't be necessary
  • all three of these then would just be returning attributtes. i'm less familiar with python class norms -- if we make the above changes, should we still have get methods that only return attributes that the user could directly invoke?

changed this line in version 15 of the diff

changed this line in version 15 of the diff

added 1 commit

Compare with previous version

well, yeah. We can get rid of it.

resolved all threads

changed this line in version 16 of the diff

added 1 commit

Compare with previous version

added 1 commit

Compare with previous version

changed this line in version 18 of the diff

added 1 commit

Compare with previous version

In GitLab by @geohci on Jul 1, 2022, 22:35

mentioned in commit d810eda