Mirroring a wiki
Closed this issue · 2 comments
lahwaacz commented
References:
- https://github.com/WikiTeam/wikiteam/
- https://www.mediawiki.org/wiki/Manual:Grabbers
- https://www.mediawiki.org/wiki/API:Database_field_and_API_property_associations
- https://www.mediawiki.org/wiki/Incremental_dumps
- https://dumps.wikimedia.org/other/incr/
To do list:
- abstraction for working with an SQL database
- only a couple of tables of interest, with a schema similar to but not exactly the same as MediaWiki's (compatibility only via the API layer - see below)
- for inspiration:
- grabbers for fetching important tables from API
- namespace, namespace_name (custom tables)
- recentchanges, logging
- user, user_groups, ipblocks
- page, page_props, page_restrictions, protected_titles
- archive, revision, text: wait for the list=allrevisions module in MediaWiki 1.27 https://phabricator.wikimedia.org/T113885
- tags
- interwiki
- handle difficult actions involving DELETE or UPDATE queries as part of the syncing process:
- removing from user groups
- unblock
- unprotect
- delete (move from revision to archive)
- undelete (move from archive to revision, also check page_id)
- selective undelete
- merge (works assuming that both source and target page were not deleted before the sync)
- import (works assuming that the imported pages were not deleted/merged/whatever before the sync)
- delete/revision, delete/event
- tag/update (separately for recentchanges, logging, revision, archive)
- other log events: https://wiki.archlinux.org/api.php?action=help&modules=query%2Blogevents
- let the SQL database serve as a source of data instead of the API
- list=recentchanges
- list=logevents
- list=allpages
- list=protectedtitles
- list=allrevisions
- list=alldeletedrevisions
- titles=, pageids= for use with prop=
- common executor for the DB select queries (for easy profiling)
- framework for tests
- pytest fixture for web server (nginx)
- pytest fixture for php-fpm
- pytest fixture for MediaWiki installation (depends on nginx, php-fpm, postgresql + MW sources, config, initial SQL)
- write the tests...
- implement a double-source wrapper, which yields from the API and checks the DB selects, ignoring NotImplementedErrors etc. (usable for unit tests as well as real-world testing) (split into #50)
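
The delete and undelete items in the list above could be sketched roughly as follows. This is only an illustration under loud assumptions: it uses an in-memory SQLite database instead of PostgreSQL, the two tables are reduced to three hypothetical columns each, and `delete_page` and `undelete_page` are made-up names. The `current_page_id` argument stands for the "also check page_id" note, on the assumption that the page ID reported after undeletion may differ from the one stored in the archive.

```python
import sqlite3

# Hypothetical, heavily reduced schema; the real tables have many more columns.
SCHEMA = """
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_comment TEXT);
CREATE TABLE archive (ar_rev_id INTEGER PRIMARY KEY, ar_page_id INTEGER, ar_comment TEXT);
"""


def delete_page(conn, page_id):
    """Move all revisions of the page from revision to archive."""
    with conn:  # one transaction: committed on success, rolled back on error
        conn.execute(
            "INSERT INTO archive (ar_rev_id, ar_page_id, ar_comment) "
            "SELECT rev_id, rev_page, rev_comment FROM revision WHERE rev_page = ?",
            (page_id,),
        )
        conn.execute("DELETE FROM revision WHERE rev_page = ?", (page_id,))


def undelete_page(conn, archived_page_id, current_page_id):
    """Move archived revisions back, attaching them to current_page_id."""
    with conn:
        conn.execute(
            "INSERT INTO revision (rev_id, rev_page, rev_comment) "
            "SELECT ar_rev_id, ?, ar_comment FROM archive WHERE ar_page_id = ?",
            (current_page_id, archived_page_id),
        )
        conn.execute("DELETE FROM archive WHERE ar_page_id = ?", (archived_page_id,))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, 10, "created"), (2, 10, "edited"), (3, 11, "other page")],
)
delete_page(conn, 10)        # revisions 1 and 2 now live in archive
undelete_page(conn, 10, 12)  # and come back attached to page 12
```

Running both statements of each action in one transaction keeps a revision from being lost or duplicated if the sync is interrupted halfway.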
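
A minimal sketch of the double-source wrapper idea from the last item, assuming two hypothetical callables `api_query` and `db_query` that accept the same keyword parameters and return iterables of dicts. The names `double_source` and `SourceMismatch` are invented for this example and are not taken from the project.

```python
from itertools import zip_longest

_MISSING = object()  # marks the shorter of the two result sequences


class SourceMismatch(AssertionError):
    """Raised when the database result differs from the API result."""


def double_source(api_query, db_query, **params):
    """Yield the results of api_query(**params), checked against db_query(**params)."""
    api_results = api_query(**params)
    try:
        # Materialize the database results so that a NotImplementedError
        # raised lazily (e.g. inside a generator) is caught here as well.
        db_results = list(db_query(**params))
    except NotImplementedError:
        # The database layer does not support this query yet:
        # skip the comparison and use the API alone.
        yield from api_results
        return

    for api_item, db_item in zip_longest(api_results, db_results, fillvalue=_MISSING):
        if api_item is _MISSING or db_item is _MISSING or api_item != db_item:
            raise SourceMismatch(
                f"API returned {api_item!r}, database returned {db_item!r}"
            )
        yield api_item
```

Because the wrapper only ever yields API items, callers see the same data with or without the database check, which is what makes it usable both in unit tests and against a live wiki.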
lahwaacz commented
Upstream bug reports which block further improvement:
- archiving:
- merging:
lahwaacz commented
This is somewhat finished and working nicely, so it's time to close this.