Git-Mediawiki/Git-Mediawiki

Support for fetching namespaces

moy opened this issue · 31 comments

moy commented

We currently fetch the main namespace by default, and the File: namespace if mediaimport is true, but we don't fetch e.g. the Template: or discussion namespaces.

I'd love to see support for fetching templates and talk pages... two big features of mediawiki that I use extensively. For me, just fetching content from the default namespace (regular articles) is not a good clone of my mediawiki.

moy commented

This should be rather straightforward to add, as the code internally works with namespaces already. I won't have time to implement it myself any time soon though.

It would be very convenient to be able to work with SemanticForms' namespaces (Template, Form, Property...), because it is a pain to work with them through the web interface.
I would implement it myself, but I don't know Perl well enough.

Well, I managed to make a patch which seems to work. It allows the user to explicitly add some categories and force them to load during repository initialization.
How can I submit the patch?

moy commented

> How can I submit the patch?

Git-Mediawiki is part of Git, so you should submit patches to the Git mailing list, Cc-ing me.

Read this first:
https://github.com/git/git/blob/master/Documentation/SubmittingPatches

kyv commented

@MarSoft, can you share the patch? In a gist, or a link to the submission to git? I need this too, and I can help test and fix some things if necessary.

@MarSoft I would be really interested in the patch too, and could help cleaning it up for integration!

Just as a note, it's important that different namespaces can contain identically named pages... so probably subdirs of the checkout would be appropriate...

Hello. Just found that patch on my computer. Posted it here: https://gist.github.com/MarSoft/ca00cecbd9d426d9e614

Hi guys. Just want to ask: was it implemented?

kyv commented

Here's a patch. My git may or may not be in sync with upstream, so the patch may or may not apply directly, but I think you will get the drift of it.

https://gist.github.com/kyv/9e3f4a1b447bf5e8f150

@kyv, thank you, it's working.

moy commented

@kyv: to integrate the patch, you need to follow the normal procedure to contribute to Git. See here: https://github.com/git/git/blob/master/Documentation/SubmittingPatches

Please, Cc: me when you send your patch.

Thanks,

kyv commented

@moy, I no longer use git-mediawiki, so do not have much interest in going through the procedure. I just put it there to be helpful.

moy commented

Submitting code to Git is fun, you should do it ;-).

More seriously, if you agree with Git's Developer's Certificate of Origin 1.1, can you add your Signed-off-by: line to your patch (see https://github.com/git/git/blob/master/Documentation/SubmittingPatches#L234)? This way, someone else (possibly me when I get time) can submit your code.

Thanks,

kyv commented

OK, I'll do that later then.

kyv commented

@moy, I created a new patch. I signed off on this one. I also generated it against current master and squashed what previously appeared as two commits into one.

https://gist.github.com/kyv/9e3f4a1b447bf5e8f150

@kyv that's great! can you send the patch to the mailing list? i believe you need to send it to git@vger.kernel.org

oh, and it seems that "all pages" doesn't actually fetch from all namespaces with that fix. that's an improvement that could be made to the patch. indentation also seems to be a little off.

finally, trying the modified version, i get this when trying to specify a namespace:

$ git  -c remote.origin.namespaces=Talk clone mediawiki::http://...
Cloning into '...'...
[...]
3: apunknown_apnamespace: Unrecognized value for parameter 'apnamespace': Talk

There seems to be a problem with the patch. Running it vanilla gave me:

git clone -c remote.origin.namespaces=Talk mediawiki::http://supertux.lethargik.org/wiki/
Cloning into 'wiki'...
Searching revisions...
No previous mediawiki revision found, fetching from beginning.
Fetching & writing export data by pages...
Listing pages on remote wiki...
3: apunknown_apnamespace: Unrecognized value for parameter 'apnamespace': Talk
Checking connectivity... fatal: bad object 0000000000000000000000000000000000000000
fatal: remote did not send all necessary objects

apnamespace seems to expect an id, not a string. Changing the line:

 apnamespace => $local_namespace,

to:

 apnamespace => get_mw_namespace_id($local_namespace),

seems to make things work.
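
To illustrate why the API rejects the string "Talk" but accepts an id, here is a minimal Python sketch of a name-to-id lookup. It assumes the namespace table has the shape that MediaWiki's `meta=siteinfo&siprop=namespaces` query returns in its legacy JSON format; `namespace_id` and `SAMPLE_NAMESPACES` are hypothetical names for illustration, and this is not necessarily how `get_mw_namespace_id()` is implemented.

```python
# Illustrative sample shaped like the "namespaces" object returned by
# api.php?action=query&meta=siteinfo&siprop=namespaces (legacy JSON format).
# The ids are the standard MediaWiki ones; the table itself is hand-written.
SAMPLE_NAMESPACES = {
    "0": {"id": 0, "*": ""},
    "1": {"id": 1, "*": "Talk", "canonical": "Talk"},
    "7": {"id": 7, "*": "File talk", "canonical": "File talk"},
}


def namespace_id(name, namespaces):
    """Return the numeric id for a namespace name, or None if unknown.

    Hypothetical helper: it matches both the local ("*") and the canonical
    name, which may differ from what get_mw_namespace_id() really does.
    """
    if not name:
        return None
    for entry in namespaces.values():
        if name in (entry.get("*"), entry.get("canonical")):
            return entry["id"]
    return None


# The numeric result is what would be passed as the apnamespace parameter.
print(namespace_id("Talk", SAMPLE_NAMESPACES))
# The underscore form is not a known name, so the lookup fails.
print(namespace_id("File_talk", SAMPLE_NAMESPACES))
```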

With that change in place, the way namespaces are split up would also need adapting, as namespaces frequently contain spaces and splitting is currently done by space or newline (e.g. it fails with "File talk"):

my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.${remotename}.namespaces"));

@Grumbel i updated the patch in https://gist.github.com/anarcat/f821fa285c6b8b6b16a5

but i am not sure i covered all the changes you described, could you clarify how the last change is done?

then we do need someone to carry this to the git mailing list...

I just did it the quick and dirty way and replaced the space in the regex with a comma:

my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.${remotename}.namespaces"));

to:

my @tracked_namespaces = split(/[,\n]/, run_git("config --get-all remote.${remotename}.namespaces"));

That was enough to make it work for my uses, but I don't know what the valid characters for namespaces are and comma might be one of them, so there might be a better way to handle the splitting.

hmm... the documentation in the file there says:

# Accept both space-separated and multiple keys in config file.
# Spaces should be written as _ anyway because we'll use chomp.

so it seems to me that the space-separated idea should stay... besides it would break every other config out there...

also, re-reading https://github.com/git/git/blob/master/Documentation/SubmittingPatches - we will need unit tests before this gets merged in, unfortunately.

The issue is that with the current code you can't check out namespaces that have spaces in them:

git clone -c "remote.origin.namespaces=File talk" mediawiki::http://supertux.lethargik.org/wiki/ 

will make it look for File and talk namespaces, not the namespace File talk. Using "_" instead of space doesn't help with namespaces, as:

git clone -c "remote.origin.namespaces=File_talk" mediawiki::http://supertux.lethargik.org/wiki/ 

will complain about File_talk not being found. The reason for that is that get_mw_namespace_id() checks for the canonical name and the canonical name contains a space, not a _ (not sure if that is guaranteed for all namespaces or just the case with the default ones).

curl "http://supertux.lethargik.org/wiki/api.php?action=query&format=json&meta=siteinfo&siprop=namespaces" |  python -m json.tool

        "9": {
            "*": "MediaWiki talk",
            "canonical": "MediaWiki talk",
            "case": "first-letter",
            "id": 9,
            "subpages": ""
        }

A fix for this would be to take the namespaces in the File_talk notation and then translate them to their canonical representation by replacing all _ with spaces:

my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.${remotename}.namespaces"));
for (@tracked_namespaces) { s/_/ /g; }
chomp(@tracked_namespaces);
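
The same split-then-translate idea can be sketched in Python to make the behaviour easy to check. This is only an approximation of the Perl above: `tracked_namespaces` is a hypothetical function name, the input stands in for the output of `git config --get-all`, and unlike Perl's `split` this version drops every empty field.

```python
import re


def tracked_namespaces(config_output):
    """Turn raw config output into a list of canonical namespace names."""
    # Split on spaces and newlines, mirroring the Perl split(/[ \n]/, ...).
    # Empty fields (e.g. from a trailing newline) are discarded here.
    names = [n for n in re.split(r"[ \n]", config_output) if n]
    # Translate File_talk notation into the canonical "File talk".
    return [n.replace("_", " ") for n in names]


print(tracked_namespaces("Talk File_talk\nTemplate\n"))
```

With this convention, a namespace containing a space has to be written with an underscore in the config value, since a literal space is still treated as a separator.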

Patch added to Gentoo git patchset (even if it's not perfect yet :) )

Thank you for implementing this feature :)

What is the recommended way to clone some user namespaces + the main namespace?
Since I was not able to do it with the patch from @anarcat, I added some minor changes:
https://gist.github.com/johannesloetzsch/910155f3ba70b6582906

hi all

started looking into this again, and got tired of the gisting... i published a branch on my fork here:

https://github.com/anarcat/git/tree/mediawiki-namespaces

which tries to merge in the patches from @kyv, @Grumbel and my own, along with a way to fetch the "main" namespace, an idea suggested by @johannesloetzsch but i used a slightly different approach: anarcat/git@17e1d97

with my approach, you specify "(Main)" as normal in the list of namespaces and it's simply treated differently in the namespace processor (because, dumbly, the MW API doesn't know how to translate that name). i used "(Main)" instead of "MAIN" because that is the name used in the documentation.
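
A rough Python sketch of that special case, for readers who want to see the idea in isolation. It is not taken from the commit: `resolve_namespace` and the `lookup` callback are hypothetical, and the only fact relied on is that MediaWiki's main namespace has id 0.

```python
MAIN_NAMESPACE_ID = 0  # MediaWiki's main (article) namespace has id 0


def resolve_namespace(name, lookup):
    """Resolve a configured namespace name to a numeric id.

    "(Main)" is handled before the normal lookup, because the main
    namespace has no name that a name-based lookup could match.
    `lookup` is any callable mapping a name to an id (or None).
    """
    if name == "(Main)":
        return MAIN_NAMESPACE_ID
    return lookup(name)


# Hypothetical name-to-id table standing in for the wiki's namespace list.
known = {"Talk": 1, "Template": 10}
print([resolve_namespace(n, known.get) for n in ["(Main)", "Talk", "Template"]])
```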

i seem able to fetch a full wiki with all namespaces with that approach.

and honestly, i think that should just be the default already - but that's another patch... there's a hint of how that could be done in anarcat/git@a624e45#diff-d1ae99a08192b4b3e5ad8570fdb59aa0R1337 - as soon as we fetched the namespace/id mapping, we know all the namespaces and we could just use that as a default. but meh. at this point, it's easier to just copy-paste the list...

i have also sent a modified patch series to the mailing list, in the hope of getting more traction on this:

https://public-inbox.org/git/20171029160857.29460-1-anarcat@debian.org/T/#m4c55498911654e05a3a84ab0754a34737a2d72ce

hopefully, we'll finally get this somewhere!

i believe this was merged to git master, so this can be closed.