feature: extract external links - [merged]
appledora opened this issue · 59 comments
added 2 commits
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 17
thanks for adding this documentation! this makes me realize that we probably want to adhere to mwparserfromhell names where possible to make usage as consistent as possible. for wikilink, they have a title and text attributes that we could mirror (and then obviously we'll have a bunch of attributes not in mwparserfromhell). https://github.com/earwig/mwparserfromhell/blob/7738752b016b4c8f4dd3b381ca125724ac2a3af7/src/mwparserfromhell/nodes/wikilink.py#L28
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 24
to be more clear: "...if the wikilink was transcluded onto the page."
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 21
to be more clear "...if the wikilink leads to a disambiguation page"
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 51
I think this should actually be something like self.bare (enwiki vocabulary I've seen) or self.autolinked (HTML spec vocabulary) as I think it captures links that don't have brackets.
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 26
My inclination is to not include attributes like this that don't contain new information -- i.e. are derived directly from a combination of other attributes. Open to thoughts about the pros/cons: in some cases, having a shortcut can be worth the overhead if it's widely used/understood but I'm not sure in this case if we know what specific combinations of attributes folks will care about.
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 78
what other things do we see in the about
property? should this be a stricter if html_string["about"].startswith("mwt"):
to avoid false positives and be slightly more efficient? or is the startswith assumption not always true?
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/utils.py line 171
low priority but a few quick thoughts:
- good catch not to use self but
type
is a built-in so we probably don't want to use that either. - I guess this is not so much link normalization as page title normalization? should we change the function/parammeter names?
- how do you envision the
type
parameter being used?
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/article.py line 76
can you add a test case or two to test the new function? as we expand out the library, keeping tests and classes/functions aligned will be more important
yeah, I will create an issue for it and work on it separately? what do you think?
possibly, auto imported by my ide. removing this.
Usually, the about
attribute appears like this about="#mwtn"
in the HTML where n is a numerical value. So, making it stricter with .startswith()
makes sense.
yes, should definitely change the type
param and also rename the function. I noticed that, link/title patterns are different for different element (different prefixes, different space separators etc.). Primarily, I only identified for Category and Template, and wondering whether other different patterns for other different elements also exist and normalize the titles/links based on that separately?
so instead of self.standard
, we use self.bare/autolinked
, right?
changed this line in version 6 of the diff
changed this line in version 7 of the diff
changed this line in version 8 of the diff
changed this line in version 9 of the diff
an updated version of this method is actually present on the other MR xD
changed this line in version 10 of the diff
Good idea, honestly. I should have kept this in my mind. I will go through this, before making new changes.
would it be odd if I just don't mention one attribute i.e: standard
in the docstring?
In GitLab by @geohci on Jul 1, 2022, 04:28
Commented on src/parse/elements.py line 51
yep -- i don't have a strong preference but maybe autolinked is more descriptive?
In GitLab by @geohci on Jul 1, 2022, 04:29
Commented on src/parse/elements.py line 26
sorry, i commented on the docstring but i intended to ask more broadly (keep it as an attribute or not)
In GitLab by @geohci on Jul 1, 2022, 04:30
Commented on src/parse/article.py line 76
yep, that's fine by me. thanks!
Ohhh... I think I get it now. That Wikilinks (or similar elements) should be by default "standard", while their other states or attributes may change. Users would expect them to be normal by default, but their interests might lie in whether such links are transcluded or point to a disambiguation link. Am I going in the right direction with this?
changed this line in version 12 of the diff
In GitLab by @geohci on Jul 1, 2022, 20:47
Commented on src/parse/elements.py line 26
yep! this isn't a hard rule that everything has a default that we don't make explicit -- e.g., technically if an external link is neither numbered or named, you know it's autolinked. But it's still useful to have autolinked as an attribute because the user shouldn't need to know that there are only three types of external link formats etc. but my feeling with this one is that we can leave it as implicit that a "standard" blue link is the combination/absence of a few explicit properties.
In GitLab by @geohci on Jul 1, 2022, 20:47
Commented on src/parse/elements.py line 17
thanks -- do you want to do this as part of this merge request or do it more broadly in a future MR? if the former, let me know and we can probably merge this.
Tbh, the change is really small so i can fix it right away. But the only issue being, for wikilinks, would there be any difference between a title
and the text
?
Although, in the mwparserfromhell
, for the text
attribute they do mention "if any". So we can take the same approach, i guess?
resolved all threads
In GitLab by @geohci on Jul 1, 2022, 21:44
Commented on src/parse/elements.py line 8
i'd take this out -- we can do the strip() part when we convert to plaintext but here we should preserve the raw data
In GitLab by @geohci on Jul 1, 2022, 21:44
Commented on src/parse/elements.py line 12
a few things for these functions:
- I don't think we need the .strip() parts for title/text (see above about preserving raw data except for when we do the plaintext conversion) and we control the classnames and they shouldn't have any weird spaces in them so .strip() won't be necessary
- all three of these then would just be returning attributtes. i'm less familiar with python class norms -- if we make the above changes, should we still have get methods that only return attributes that the user could directly invoke?
changed this line in version 15 of the diff
changed this line in version 15 of the diff
well, yeah. We can get rid of it.
resolved all threads
changed this line in version 16 of the diff
changed this line in version 18 of the diff