feature: metadata extraction - [merged]

I do think we'll want to remove this attribute. Right now I think it's just misleading ('en' is the language, which we already have, not the page_namespace). What's the reasoning for keeping this in? Also, the page_namespace_id implementation looks good to me -- thanks!

Answer 7 · 2022-08-23T15:10:34.000Z

In GitLab by @geohci on Aug 23, 2022, 21:10

Commented on src/parse/utils.py line 248

let's add a quick comment explaining why we skip these

Answer 8 · 2022-08-23T15:14:55.000Z

Well, this attribute is actually used in the wikilink namespace extraction task. The NAMESPACE dictionary is nested. The first key is a namespace (I used to think these were only language acronyms, but they also include simple); inside this are the actual wiki namespaces like article and talk. To get to the actual namespace we have to do something like NAMESPACE[primary_namespace][secondary_namespace]. The primary namespace for the wikilinks comes from the page_namespace. I should actually call it something other than page_namespace. Suggestions?
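
To make the nested lookup concrete, here is a minimal sketch. The dictionary contents, the name NAMESPACE_SKETCH, the helper link_namespace_id, and the fall-back to 0 for unknown prefixes are all assumptions for illustration; the real NAMESPACE dictionary in the repo may be shaped differently.

```python
# Hypothetical, heavily trimmed stand-in for the nested NAMESPACE dictionary.
# Outer key: the wiki identifier (called page_namespace in this thread).
# Inner key: a namespace prefix as it appears in a wikilink.
# Inner value: the numeric MediaWiki namespace id.
NAMESPACE_SKETCH = {
    "en": {"Talk": 1, "User": 2, "Category": 14},
    "simple": {"Talk": 1, "User": 2, "Category": 14},
}


def link_namespace_id(primary_namespace: str, secondary_namespace: str) -> int:
    """Look up the namespace id for a wikilink prefix on a given wiki.

    Falls back to 0 (the main/article namespace) when either key is
    missing; that fallback is an assumption of this sketch.
    """
    return NAMESPACE_SKETCH.get(primary_namespace, {}).get(secondary_namespace, 0)


category_id = link_namespace_id("simple", "Category")
unknown_id = link_namespace_id("en", "NotAPrefix")
```

The point is only that the outer key selects the wiki and the inner key selects the namespace, so the two should not share a name.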

Answer 9 · 2022-08-23T15:17:05.000Z

added 1 commit

31a4370 - Update utils.py

Answer 10 · 2022-08-23T15:20:55.000Z

I just checked and found that the page_namespace and the page_namespace_id do not correspond to each other, so we DEFINITELY have to change it. Good grief =_=

Answer 11 · 2022-08-23T15:23:11.000Z

changed this line in version 8 of the diff

Answer 12 · 2022-08-23T15:23:12.000Z

added 2 commits

41e2fdb - update: page_namespace renamed to primary_namespace
d545fc8 - Merge branch '41-metadata-extraction' of...

Answer 13 · 2022-08-23T15:27:16.000Z

In GitLab by @geohci on Aug 23, 2022, 21:27

Commented on src/parse/article.py line 26

ahhh I think I understand now. I had forgotten about our prior conversation about this (https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/12#note_9977). sorry, let me try to clarify:

what you are calling secondary_namespace is what I mean by namespace. This can have two forms -- the numeric identifier (0) or the prefix/name (Main). I prefer that we record the numeric identifier, which is what you have under page_namespace_id. no change needed there.
what you are calling primary_namespace is the database name for the wiki. This is what you are currently extracting as self.page_namespace. You don't need to change the extraction but to avoid confusion, you actually want to call this self.wiki_db to match how it's usually referred to (i earlier suggested self.wiki but self.wiki_db is even clearer).
we also have self.language, which is not relevant for namespaces but we should continue to extract. It's usually the same as self.wiki_db but there are some crucial differences -- e.g., simple is the wiki_db but en is the language for Simple English Wikipedia. again, no changes needed there.
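
As a rough sketch of how the three attributes sit side by side: the class name ArticleMetadataSketch and its constructor are hypothetical and not taken from src/parse/article.py; only the attribute names and the Simple English values come from the comment above.

```python
from dataclasses import dataclass


@dataclass
class ArticleMetadataSketch:
    """Hypothetical container; the real class in article.py may differ."""

    wiki_db: str            # identifies the wiki; outer key for namespace lookups
    language: str           # content language; unrelated to namespaces
    page_namespace_id: int  # numeric namespace of the page itself (0 = Main)


# Simple English Wikipedia, using the values given in the comment above:
# the wiki_db and the language differ even though they usually match.
simple_page = ArticleMetadataSketch(
    wiki_db="simple",
    language="en",
    page_namespace_id=0,
)

# On most wikis the two fields coincide; this is the case where they do not.
fields_differ = simple_page.wiki_db != simple_page.language
```

Keeping wiki_db and language as separate fields is what lets namespace lookups use the wiki identifier while language-dependent processing still sees en.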

Answer 14 · 2022-08-23T15:33:12.000Z

added 1 commit

4ee97b6 - update: primary_namespace renamed to wiki_db

Answer 15 · 2022-08-23T15:33:41.000Z

made the renaming changes!

Answer 16 · 2022-08-23T15:58:54.000Z

In GitLab by @geohci on Aug 23, 2022, 21:58

Commented on src/parse/article.py line 26

perfect, thanks!

Answer 17 · 2022-08-23T15:58:54.000Z

In GitLab by @geohci on Aug 23, 2022, 21:58

resolved all threads

Answer 18 · 2022-08-23T18:57:48.000Z

In GitLab by @geohci on Aug 24, 2022, 00:57

mentioned in commit 03cb911