feature: template extraction method - [merged]
appledora opened this issue · 47 comments
In GitLab by @geohci on Jul 7, 2022, 23:13
just a comment -- the big changes in const.py are just code formatting so I'm accepting them. none of the content changed as far as I can tell.
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/article.py line 89
i think this code is a mistake copy-paste error? not reachable
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/elements.py line 12
is there a strong reason to keep it in the base Element class? maybe just default to empty strings or None and let the individual Wikilink, Template, etc. override the attributes? i assume tables and some of the other wiki elements yet to be implemented might introduce further complexities if we try to keep it here
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/elements.py line 122
per our discussion, maybe leave out params as its own attribute because I don't think any functions we'd include in this library would ever need to access it and just allow that information to be accessed if desired from the raw HTML string attribute?
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/elements.py line 7
should the html_string be stored as an attribute too by default? that way if folks want to access information that we don't parse out from the raw HTML, it's easy to do so?
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/utils.py line 47
nothing you need to do but I'm honestly confused why the diff shows this information as being added because it's already in the library: https://gitlab.wikimedia.org/repos/research/html-dumps/-/blob/main/src/parse/utils.py#L50
I assume it won't break anything but we can always clean up later if needed...
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/article.py line 112
this will find all the nodes with a data-mw attribute that that has template appear somewhere in it. sometimes that'll be a template key but i see it appearing elsewhere in our examples -- e.g., data-mw='{"name":"templatestyles",...
also gets matched. i could see two potential routes to fix this:
- update the regex to be more exact. Simple but might miss other edge cases
- write a slightly more complicated lambda function instead that checks for data-mw attr and specifically looks for the template key in the parts dictionary of it. this would presumably be more exact and less prone to error though perhaps more complicated / overhead.
In GitLab by @geohci on Jul 8, 2022, 24:43
Commented on src/parse/article.py line 119
any narrowing down of this search we can do? e.g., only show up under certain keys? or can templates really appear anywhere essentially under data-mw?
Yes, this actually happened i think, when I was resolving merge conflicts after rebase.
Yeah, at first I thought this will be a default attribute for all the elements. But for now, keeping empty or None seems like a much better option.
yes, good call :D
I may have caused this while figuring out rebasing :3
So I have tried to adopt the second approach. Let me know if it's what you were thinking.
usually appears inside the parts
key, but I can't say with certainty that it is always the same. For the 3 test files I am using, it does seem to follow this pattern.
Adding this. But additionally, all elements that can be transcluded
contains the about
(id) attribute too. So, I am including this attribute for rest of such elements (Wikilinks, externallinks, categories)
changed this line in version 3 of the diff
changed this line in version 3 of the diff
changed this line in version 3 of the diff
added 2 commits
changed this line in version 3 of the diff
In GitLab by @martingerlach on Jul 11, 2022, 18:02
Commented on src/parse/elements.py line 91
Is there a reason we only use the normalization for category-titles?
In GitLab by @martingerlach on Jul 11, 2022, 18:04
Commented on src/parse/elements.py line 93
we are using this in each class. should we add a function to utils so we do the same in each case? (maybe its not worth but just a thought)
From our observations, Categories (which appear inside link
tags) do not contain any title
attribute or enclose any plaintext. The rest of the elements (implemented, so far) contain either titles or plaintexts or both.
Initially, we were using this check to identify transclusions
, currently, we also use this to store ID for an element(if any). However items like templates don't have any such transclusion
attribute (because templates ARE transclusions). There might be other elements with similar edge cases and not have IDs either. This is why implementing this in the BaseClass
might create complexities in the future.
In GitLab by @martingerlach on Jul 11, 2022, 18:15
Commented on src/parse/elements.py line 91
I see. My worry was that here, using the title normalization, we also replace "_" by " ". Is that consistent with the title-attributes of the other elements?
In GitLab by @martingerlach on Jul 11, 2022, 18:16
Commented on src/parse/elements.py line 106
probably I am missing something (sorry). Where is the data_dictionary coming from? I couldnt find it anywhere else in the code.
data_dictionary
is passed from the get_templates()
methods inside the article.py
. Since templates have a pretty asymmetric structures compared to the other elements, we have to pass both the HTML_string and the processed tempalte data to create a class instance.
In GitLab by @martingerlach on Jul 11, 2022, 18:26
Commented on src/parse/elements.py line 93
Yes, keep the transclusion attribute here. I thought more whether to have a function to extract the id since that repeats in different places?
The title attributes usually appear in the correct format (no '_' to represent space). Is this what you were referring to?
In GitLab by @martingerlach on Jul 11, 2022, 18:28
Commented on src/parse/elements.py line 91
Yes, got it. makes all sense now. please resolve.
oh yeah, that makes sense. But IDs and transclusions are kind of tied together (if it's transcluded, it has an ID inside the about
attribute of HTML), I think. Will it be okay to separate the ID extraction part to the BaseClass
, while this may not be applicable for all the elements? Additionally, I have also sometimes seen an id
HTML attribute either by itself or together with the ID inside about
tag. I don't know yet, how these two compare or relate.
In GitLab by @martingerlach on Jul 11, 2022, 19:11
Commented on src/parse/elements.py line 106
ah ok, got it. makes perfect sense.
In GitLab by @martingerlach on Jul 11, 2022, 19:25
Commented on src/parse/elements.py line 93
I thought of it as a function similar to "title_normalization" in utils.
If the extraction is not applicable we could return None.
If too complex, we can leave as is now and refine later.
yes, I think I get it now. I am implementing this.
This makes me wonder, whether we should also make the transclusion-check a utils function?
changed this line in version 4 of the diff
added 1 commit
- 3c88a5c - update: id and transclusion attribute moved to base elements class
In GitLab by @martingerlach on Jul 11, 2022, 21:08
Commented on src/parse/elements.py line 93
To me this solution looks good.
Just thinking whether we want to be more specific about the naming of "id". I could imagine there are other ids. So maybe we call it a transclusion-id? maybe in short tid? so we would have self.tid
and get_tid
.
In GitLab by @martingerlach on Jul 11, 2022, 21:30
Commented on src/parse/article.py line 119
If this covers all the known patterns now, lets keep it like this. we can always file a new issue and improve if we realize, we are missing something.
yeah, for now searching inside the parts
key
I was also thinking about this. Making the changes accordingly and resolving.
resolved all threads
In GitLab by @martingerlach on Jul 11, 2022, 22:00
approved this merge request