appledora/mwparserfromhtml

feature: template extraction method - [merged]

appledora opened this issue 2 years ago · 47 comments

appledora commented 2 years ago

Merges 14-template-extraction -> main

Closes #14

appledora commented 2 years ago

requested review from @geohci

appledora commented 2 years ago

added 2 commits

ae30f5f - add static namespace list and utility for generating it to help with namespace...
d85f393 - update: refactor+rebase

Compare with previous version

appledora commented 2 years ago

In GitLab by @geohci on Jul 7, 2022, 23:13

just a comment -- the big changes in const.py are just code formatting so I'm accepting them. none of the content changed as far as I can tell.

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/article.py line 89

i think this code is a mistake copy-paste error? not reachable

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/elements.py line 12

is there a strong reason to keep it in the base Element class? maybe just default to empty strings or None and let the individual Wikilink, Template, etc. override the attributes? i assume tables and some of the other wiki elements yet to be implemented might introduce further complexities if we try to keep it here

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/elements.py line 122

per our discussion, maybe leave out params as its own attribute because I don't think any functions we'd include in this library would ever need to access it and just allow that information to be accessed if desired from the raw HTML string attribute?

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/elements.py line 7

should the html_string be stored as an attribute too by default? that way if folks want to access information that we don't parse out from the raw HTML, it's easy to do so?

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/utils.py line 47

nothing you need to do but I'm honestly confused why the diff shows this information as being added because it's already in the library: https://gitlab.wikimedia.org/repos/research/html-dumps/-/blob/main/src/parse/utils.py#L50

I assume it won't break anything but we can always clean up later if needed...

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/article.py line 112

this will find all the nodes with a data-mw attribute that that has template appear somewhere in it. sometimes that'll be a template key but i see it appearing elsewhere in our examples -- e.g., data-mw='{"name":"templatestyles",... also gets matched. i could see two potential routes to fix this:

update the regex to be more exact. Simple but might miss other edge cases
write a slightly more complicated lambda function instead that checks for data-mw attr and specifically looks for the template key in the parts dictionary of it. this would presumably be more exact and less prone to error though perhaps more complicated / overhead.

appledora commented 2 years ago

In GitLab by @geohci on Jul 8, 2022, 24:43

Commented on src/parse/article.py line 119

any narrowing down of this search we can do? e.g., only show up under certain keys? or can templates really appear anywhere essentially under data-mw?

appledora commented 2 years ago

Yes, this actually happened i think, when I was resolving merge conflicts after rebase.

appledora commented 2 years ago

Yeah, at first I thought this will be a default attribute for all the elements. But for now, keeping empty or None seems like a much better option.

appledora commented 2 years ago

yes, good call :D

appledora commented 2 years ago

I may have caused this while figuring out rebasing :3

appledora commented 2 years ago

So I have tried to adopt the second approach. Let me know if it's what you were thinking.

appledora commented 2 years ago

usually appears inside the parts key, but I can't say with certainty that it is always the same. For the 3 test files I am using, it does seem to follow this pattern.

appledora commented 2 years ago

Adding this. But additionally, all elements that can be transcluded contains the about (id) attribute too. So, I am including this attribute for rest of such elements (Wikilinks, externallinks, categories)

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

added 2 commits

9e7c3fe - update: refactor
cff23d9 - update: review updates

Compare with previous version

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:02

Commented on src/parse/elements.py line 91

Is there a reason we only use the normalization for category-titles?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:04

Commented on src/parse/elements.py line 93

we are using this in each class. should we add a function to utils so we do the same in each case? (maybe its not worth but just a thought)

appledora commented 2 years ago

From our observations, Categories (which appear inside link tags) do not contain any title attribute or enclose any plaintext. The rest of the elements (implemented, so far) contain either titles or plaintexts or both.

appledora commented 2 years ago

Initially, we were using this check to identify transclusions , currently, we also use this to store ID for an element(if any). However items like templates don't have any such transclusion attribute (because templates ARE transclusions). There might be other elements with similar edge cases and not have IDs either. This is why implementing this in the BaseClass might create complexities in the future.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:15

Commented on src/parse/elements.py line 91

I see. My worry was that here, using the title normalization, we also replace "_" by " ". Is that consistent with the title-attributes of the other elements?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:16

Commented on src/parse/elements.py line 106

probably I am missing something (sorry). Where is the data_dictionary coming from? I couldnt find it anywhere else in the code.

appledora commented 2 years ago

data_dictionary is passed from the get_templates() methods inside the article.py. Since templates have a pretty asymmetric structures compared to the other elements, we have to pass both the HTML_string and the processed tempalte data to create a class instance.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:26

Commented on src/parse/elements.py line 93

Yes, keep the transclusion attribute here. I thought more whether to have a function to extract the id since that repeats in different places?

appledora commented 2 years ago

The title attributes usually appear in the correct format (no '_' to represent space). Is this what you were referring to?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 18:28

Commented on src/parse/elements.py line 91

Yes, got it. makes all sense now. please resolve.

appledora commented 2 years ago

oh yeah, that makes sense. But IDs and transclusions are kind of tied together (if it's transcluded, it has an ID inside the about attribute of HTML), I think. Will it be okay to separate the ID extraction part to the BaseClass, while this may not be applicable for all the elements? Additionally, I have also sometimes seen an id HTML attribute either by itself or together with the ID inside about tag. I don't know yet, how these two compare or relate.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 19:11

Commented on src/parse/elements.py line 106

ah ok, got it. makes perfect sense.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 19:25

Commented on src/parse/elements.py line 93

I thought of it as a function similar to "title_normalization" in utils.
If the extraction is not applicable we could return None.
If too complex, we can leave as is now and refine later.

appledora commented 2 years ago

yes, I think I get it now. I am implementing this.

appledora commented 2 years ago

This makes me wonder, whether we should also make the transclusion-check a utils function?

appledora commented 2 years ago

changed this line in version 4 of the diff

appledora commented 2 years ago

added 1 commit

3c88a5c - update: id and transclusion attribute moved to base elements class

Compare with previous version

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 21:08

Commented on src/parse/elements.py line 93

To me this solution looks good.
Just thinking whether we want to be more specific about the naming of "id". I could imagine there are other ids. So maybe we call it a transclusion-id? maybe in short tid? so we would have self.tid and get_tid.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 21:30

Commented on src/parse/article.py line 119

If this covers all the known patterns now, lets keep it like this. we can always file a new issue and improve if we realize, we are missing something.

appledora commented 2 years ago

yeah, for now searching inside the parts key

appledora commented 2 years ago

I was also thinking about this. Making the changes accordingly and resolving.

appledora commented 2 years ago

added 1 commit

6ed791c - Update elements.py

Compare with previous version

appledora commented 2 years ago

added 1 commit

d1b8662 - Update utils.py

Compare with previous version

appledora commented 2 years ago

resolved all threads

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 11, 2022, 22:00

approved this merge request