lahwaacz/wiki-scripts

Migrate from JSON to SQL caching

Closed this issue · 22 comments

Use the stuff implemented in #35 in the scripts.

  • find-closed-discussions.py: prop=revisions
  • link-checker.py: prop=revisions
  • list-redirects-broken-fragments.py: prop=revisions
  • statistics_histograms.py
  • statistics_per_user.py
  • statistics.py
  • compare in the current JSON and SQL databases (they differ mostly because of the lost deleted revisions)
  • migrate extra data from JSON to SQL
  • remove JSON caching database

@kynikos I'm actually going to do this "soon" - are you ready? 😉

The biggest change is that the SQL database will not include the deleted revisions from before the big elephant entered our china shop. I don't plan on migrating these data into SQL, because we haven't recorded all attributes anyway. I've compared the data in my SQL and JSON databases (you should be able to run the script compare_json_to_sql.py too) and besides the lost revisions there should be almost no difference. (Some fields were actually wrong in JSON because it doesn't include incremental updates like SQL does.) The Statistics page will be affected most, but not too much... I'll prepare a diff.

I was born ready, always looking towards the future B-)

I've tried compare_json_to_sql.py but got:

Traceback (most recent call last):
  File "./compare_json_to_sql.py", line 152, in <module>
    db = Database.from_argparser(args)
  File "/home/dario/dev/arch/wiki-scripts/ws/db/database.py", line 97, in from_argparser
    raise ValueError("Cannot create database connection: db_name cannot be None")
ValueError: Cannot create database connection: db_name cannot be None

I have the feeling that I have to configure my database, hehe. I've had a glance at the docs to no avail, little help?

Now if you'll excuse me though, I'll be in my bedroom to cry my last tears in memory of those deleted revisions T_T

:P

The documentation seems to be very outdated, I should work on that... Basically you need a PostgreSQL database (and the related optional dependencies) and set the db-* settings in the config. For example, I have this:

db-dialect = postgresql
db-driver = psycopg2

db-user = wiki-scripts
db-password = password
db-host = localhost

[ArchWiki]
api-url = https://wiki.archlinux.org/api.php
index-url = https://wiki.archlinux.org/index.php
db-name = ws_archwiki

Note that the first synchronization will probably run a long time... compare_json_to_sql.py will not download any content/wikitext, but it still needs to fetch the metadata of all revisions.
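
As a quick sanity check, those db-* values should translate into a plain SQLAlchemy connection URL along these lines (just an illustration, not necessarily how the Database class assembles it internally):

import sqlalchemy

# db-dialect+db-driver://db-user:db-password@db-host/db-name
url = "postgresql+psycopg2://wiki-scripts:password@localhost/ws_archwiki"

engine = sqlalchemy.create_engine(url)
with engine.connect() as conn:
    # if this prints the server version, the db-* settings should be fine
    print(conn.execute(sqlalchemy.text("SELECT version()")).scalar())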

Now if you'll excuse me though, I'll be in my bedroom to cry my last tears in memory of those deleted revisions T_T

Don't worry, we can back them up in a text file just in case we ever need them 😉 The script compare_json_to_sql.py will filter them out.

Thanks, getting there I guess ^^'

% ./compare_json_to_sql.py
WARNING  API warning(s) for query {'list': 'allrevisions', 'arvprop': 'ids|timestamp|flags|user|userid|comment|size|sha1|contentmodel|tags', 'arvlimit': 'max', 'arvdir': 'newer', 'arvstart': '2019-12-17T15:20:49Z', 'action': 'query', 'continue': ''}:
* Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.
* Because "arvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.
INFO     Synchronization of the database took 5.03 seconds.
INFO     Loading data from /home/dario/.cache/wiki-scripts/wiki.archlinux.org/AllUsersProps.db.json.gz ...
INFO     Fetching properties of 398 possibly modified user accounts...
INFO     Fetching properties of 370 possibly modified user accounts...
INFO     Fetching properties of 370 possibly modified user accounts...
INFO     Loading data from /home/dario/.cache/wiki-scripts/wiki.archlinux.org/AllRevisionsProps.db.json.gz ...
INFO     Fetching revids 591854-592086
Invalid users in the JSON database:
[]
Traceback (most recent call last):
  File "./compare_json_to_sql.py", line 167, in <module>
    compare_users(db_allusers, json_allusers)
  File "./compare_json_to_sql.py", line 59, in compare_users
    assert db_user["userid"] == json_user["userid"]
AssertionError
./compare_json_to_sql.py  17.26s user 0.61s system 51% cpu 34.719 total

That exception doesn't sound new, it may be a déjà vu, but I think something at least similar happened a while ago with a particular username that broke another script. If you want I can upload my JSON file somewhere. Last time I synced it was 2019-12-15 and everything went fine...

I guess the lists being compared are just not sorted the same way... Can you try again with the last commit?
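
The idea being roughly just sorting both sides by a stable key before the element-wise comparison, something like:

def sort_users(users):
    # order both the SQL and the JSON user lists the same way, so that the
    # element-wise asserts in compare_users() compare matching entries
    return sorted(users, key=lambda u: u["userid"])

# e.g. compare_users(sort_users(db_allusers), sort_users(json_allusers))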

Nailed it, thanks :) Now I wish I hadn't seen the contents though...
Another question: I used to regularly back up the JSON database, what do you suggest as the best way to back up this new database?

P.S. Actually I may be interested in trying to contribute a migration script for the old database entries (reverse-engineering them to give them the same keys and value types as the latest ones, trying to fill in holes, convert values etc.). It wouldn't be the first time that I do something like that, I'll look into it this weekend, unless you already attempted something similar and want to discourage me after encountering really tough problems.

The problem with those entries is that they contain just the comment, revid, timestamp and user fields (and "minor" for minor edits); there is no userid (could be guessed) and parentid (not necessary), but most importantly, there is no namespace or page title. We would have to find them elsewhere, which is probably not possible after all this time, or create some dummy title for all those revisions - I think I would actually like that after all 😄
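
For illustration, one of those entries carries only something like this (values made up):

old_entry = {
    "revid": 12345,                       # made-up value
    "timestamp": "2007-11-08T16:46:38",   # made-up value
    "user": "SomeUser",
    "comment": "some edit summary",
    "minor": "",                          # present only for minor edits
}
# missing: userid (could be guessed), parentid (not necessary),
# and, most importantly, the namespace and the page title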

As for backups, probably easiest is to use pg_dump. PostgreSQL has more options though, see https://www.postgresql.org/docs/current/backup.html
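
For example, a minimal cron-able wrapper around pg_dump could look like this (database name and output path are just placeholders for your setup):

#!/usr/bin/env python3
# Dump the wiki-scripts database into a timestamped file using pg_dump's
# custom format (restorable later with pg_restore).
import datetime
import subprocess

dbname = "ws_archwiki"   # adjust to your db-name
outfile = "/path/to/backups/{}-{:%Y%m%d}.dump".format(dbname, datetime.date.today())

subprocess.run(
    ["pg_dump", "--username=wiki-scripts", "--host=localhost",
     "--format=custom", "--file=" + outfile, dbname],
    check=True,
)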

I've tried to write a very simple migration script, kynikos@4a5a085 (intended usage: migrate_json_db.py path/to/AllRevisionsProps.db.json > path/to/AllRevisionsProps.migrated.db.json, it requires manually ungzipping AllRevisionsProps.db.json.gz); however, I must be missing something: I can't find any revisions in my database that have a userid, page title or namespace, not even the latest ones.

Hmm, disregard the above, I misinterpreted your comment: you were referring to fields that exist in the new SQL database but were actually never collected at all in the JSON one. Also, more importantly, there's currently no script to insert any old JSON entries into the new database, so I should work on the diff JSON output by compare_json_to_sql.py.

In that case can you give me some quick hints on the new database's schema? Can the old revisions be imported into only one table independently, or are more related tables involved?

About the page title we can just make something up indeed, e.g. "Deleted archived revision (original title lost)". The namespace can default to 0 (Main). Usernames can be taken from AllUsersProps.db.json.
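
Something like this for the username lookup, I guess (I'm assuming AllUsersProps.db.json.gz is a gzipped JSON list of user dicts with "name" and "userid" keys, like the entries printed by compare_json_to_sql.py):

import gzip
import json

# defaults for the fields that were never recorded in the old JSON entries
DUMMY_TITLE = "Deleted archived revision (original title lost)"
DUMMY_NAMESPACE = 0   # Main

def load_userid_map(path):
    # build a username -> userid lookup from the JSON user cache
    with gzip.open(path, "rt", encoding="utf-8") as f:
        users = json.load(f)
    return {user["name"]: user["userid"] for user in users}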

You can find the schema here; it's based on the MediaWiki schema (as of 2016/17...). I think that importing only into the archive table should be enough.
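
For orientation, importing one recovered revision could boil down to a single insert into archive, roughly like this (the column names assume the MediaWiki-style ar_* schema; I haven't double-checked them against the actual tables here, so treat it as a sketch):

import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://wiki-scripts:password@localhost/ws_archwiki")

# column names are assumed from the MediaWiki-style schema (ar_*)
INSERT = sqlalchemy.text("""
    INSERT INTO archive (ar_rev_id, ar_namespace, ar_title, ar_user,
                         ar_user_text, ar_comment, ar_timestamp, ar_minor_edit)
    VALUES (:revid, :namespace, :title, :userid,
            :username, :comment, :timestamp, :minor)
""")

def import_revision(conn, entry, userid_map):
    conn.execute(INSERT, {
        "revid": entry["revid"],
        "namespace": 0,                              # default to Main
        "title": "Deleted archived revision (original title lost)",
        "userid": userid_map.get(entry["user"], 0),  # 0 for anonymous/unknown
        "username": entry["user"],
        "comment": entry.get("comment", ""),
        "timestamp": entry["timestamp"],
        "minor": int("minor" in entry),
    })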

I think I've found a solution, kynikos@41fdbef: the script is called import_json_compared_to_sql.py and is meant to be used something like this:

$ ./compare_json_to_sql.py > json_compared_to_sql.json
$ ./import_json_compared_to_sql.py json_compared_to_sql.json

There are some caveats:

  1. I've taken the shortcut of modifying compare_json_to_sql.py directly to only output the "Extra revisions from the JSON database:" section in JSON format; I can easily push a separate script just for the purpose if you prefer;
  2. The import script is not idempotent: calling it multiple times will insert the recovered revisions over and over (so have a test database and/or a backup ready);
  3. userid for MediaWiki default and several usernames recorded as IP addresses can't be found (they were all anonymous edits); I've assigned userid 0 to them, since it doesn't seem to be in use, and 0 seems to be returned by getId() in User.php for anonymous edits;
  4. userid for Thayer.w can't be found; that user was likely renamed to Thayer at some point, maybe even with a low-level query; anyway, I've given it Thayer's userid, which is 3583 (a rough sketch of this resolution logic follows the list).
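
Roughly, the userid resolution in 3. and 4. boils down to this (just a sketch of the logic, not the actual code in the commit):

import ipaddress

# Thayer.w was apparently renamed to Thayer, whose userid is 3583
RENAMED_USERS = {"Thayer.w": 3583}

def resolve_userid(username, userid_map):
    if username in RENAMED_USERS:
        return RENAMED_USERS[username]
    try:
        # usernames that are plain IP addresses were anonymous edits
        ipaddress.ip_address(username)
        return 0
    except ValueError:
        pass
    # "MediaWiki default" and anything else not in the map also falls back to 0
    return userid_map.get(username, 0)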

Test after importing the revisions, using the original version of compare_json_to_sql.py:

$ ./compare_json_to_sql.py
...
Invalid users in the JSON database:
[]
Extra users from the SQL database:
[{'editcount': 0,
  'groups': ['*', 'sysop', 'user'],
  'name': 'Abuse filter',
  'registration': datetime.datetime(2016, 6, 16, 11, 10, 14),
  'userid': 46974}]

EDIT: force-pushed a small fix

Great, it looks good! I'll have a closer look, but some quick comments:

  1. Calling the script multiple times shouldn't break the database, because the ar_rev_id column has a unique constraint. To avoid the crash, we can use update instead of insert, or just skip existing rows (see the sketch after this list).
  2. That's good, wiki-scripts assume that all anonymous revisions have userid 0.
  3. The user name Thayer.w should be renamed to Thayer; some time ago there were both names in my database and I had to manually run update revision set rev_user_text = 'Thayer' where rev_user = 3583; to avoid differences with the current values returned by the MediaWiki API. MediaWiki changed its behaviour with the actor migration, so the old username was discarded and only the new username is available.
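
For 1., PostgreSQL's ON CONFLICT clause would be enough to just skip rows that are already there, e.g. (same assumed ar_* column names as in the earlier sketch):

import sqlalchemy

# the unique constraint on ar_rev_id lets ON CONFLICT silently skip rows
# that were already imported by a previous run
INSERT = sqlalchemy.text("""
    INSERT INTO archive (ar_rev_id, ar_namespace, ar_title, ar_user,
                         ar_user_text, ar_comment, ar_timestamp, ar_minor_edit)
    VALUES (:revid, :namespace, :title, :userid,
            :username, :comment, :timestamp, :minor)
    ON CONFLICT (ar_rev_id) DO NOTHING
""")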

Oh good to know all that, I've improved the script accordingly then: kynikos@a0319ec (Thayer.w gets renamed to Thayer now).

I don't feel like complicating the query too much, personally I'm fine if the script just crashes on the constraint when running it a second time (I was so sure it would actually duplicate the records that clearly I didn't even try once, otherwise I'd have found out by myself ^^'). If however you still prefer changing them to upserts or something, just go ahead of course.

Another thing I was thinking is whether we should agree on the JSON diff to merge, rather than each of us importing his own. I've added my file to my branch (kynikos@acb49d6); maybe you can diff it with yours, and hopefully they just match.

I was also wondering whether this repo should just track the JSON diff in the master branch, because if somebody in the future starts using this bot and syncs the database from upstream, they won't have another way to find and merge those lost records too.

It's a good idea to keep the revisions in a JSON file around, I've got the same file after I manually sorted the keys (see my last commits in the branch). Now I'll try to actually import the data into PostgreSQL...

I've tried the import and it almost worked as expected; I just had to update the schema to make the SHA1 columns nullable (originally they weren't, and empty fields were recorded as empty strings, which caused some errors). You'll need to run alembic upgrade head to run the migration; it doesn't matter if you do it before or after importing the JSON file (and wiki-scripts should print an error with instructions if you forgot).
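
For the curious, the core of such a migration is tiny; something along these lines (the table and column names here are just illustrative MediaWiki-style names, check the actual migration in the repo):

import sqlalchemy as sa
from alembic import op

# revision identifiers etc. omitted; only the core of the migration is shown

def upgrade():
    op.alter_column("revision", "rev_sha1",
                    existing_type=sa.String(32), nullable=True)
    op.alter_column("archive", "ar_sha1",
                    existing_type=sa.String(32), nullable=True)

def downgrade():
    op.alter_column("archive", "ar_sha1",
                    existing_type=sa.String(32), nullable=False)
    op.alter_column("revision", "rev_sha1",
                    existing_type=sa.String(32), nullable=False)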

I think that's all for the migration of the lost revisions, thanks for the help! I'll wait until you migrate too (note that I made some changes to the import script) and then I'll start updating statistics.py and slowly removing the JSON cache support.

Perfect, I think everything succeeded and my pg database should be ready: I've done everything on d97b48b. However, as a last step I thought I'd test with compare_json_to_sql.py again, but got a long list of:

...
...
...

--- db_entry
+++ api_entry
@@ -2,4 +2,4 @@
  'minor': '',
  'revid': 32043,
  'timestamp': '2007-11-08T16:46:38',
- 'user': 'Thayer'}
+ 'user': 'Thayer.w'}

db_entry no. 31991:
{'comment': '',
 'minor': '',
 'revid': 32043,
 'timestamp': '2007-11-08T16:46:38',
 'user': 'Thayer'}
api_entry no. 31991:
{'comment': '',
 'minor': '',
 'revid': 32043,
 'timestamp': '2007-11-08T16:46:38',
 'user': 'Thayer.w'}

Traceback (most recent call last):
  File "/home/dario/dev/arch/wiki-scripts/grab.py", line 47, in _check_lists
    _check_entries(i, db_entry, api_entry)
  File "/home/dario/dev/arch/wiki-scripts/grab.py", line 35, in _check_entries
    assert db_entry == api_entry
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dario/dev/arch/wiki-scripts/grab.py", line 52, in _check_lists
    raise AssertionError from last_assert_exc
AssertionError
./compare_json_to_sql.py  19.81s user 0.53s system 65% cpu 31.028 total

I guess it's because the JSON database still has "Thayer.w" instead of "Thayer" in all those revisions. In order to make the test pass I've had to make this further change: kynikos@de6f8b5

Yes, it's because Thayer.w stayed in JSON, so nothing to worry about.

Edit: The "Thayer.w"/"Thayer" comparison should be done in compare_json_to_sql.py, though.

I've updated the statistics.py script and there are some differences - you can see them by comparing these revisions:

  1. 592714 (generated with the original script from the master branch)
  2. 592715 (generated with the updated script from 1fe56ba)
  3. 592716 (generated with the script from the master branch, but with b3f8dc8 applied)
  4. 592722 (generated with the updated script from 6c99763)

The difference between 1. and 3. is that new page creations did not count as a recent edit, so users who made just one such edit in the past 30 days were not included in the table in 1. (so it's a bug). For the differences between 2. and 3.: I was originally thinking to keep it simple and use only the revision and archive tables for counting recent edits (3. counts recent edits using the recentchanges table), so some "diffable" log actions like page protection changes and page moves were counted by 2. and not by 3. In 4. I used the recentchanges table again, so the only difference between 3. and 4. is Thayer's edit count.
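
For reference, counting recent edits per user from the recentchanges table boils down to a query along these lines (the rc_* column names and the numeric rc_type values follow MediaWiki's conventions, 0 = edit and 1 = new page; the actual wiki-scripts schema may store them differently):

import datetime
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://wiki-scripts:password@localhost/ws_archwiki")

QUERY = sqlalchemy.text("""
    SELECT rc_user_text, count(*) AS recent_edits
    FROM recentchanges
    WHERE rc_timestamp >= :cutoff
      AND rc_type IN (0, 1)   -- count both plain edits and new page creations
    GROUP BY rc_user_text
    ORDER BY recent_edits DESC
""")

cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=30)
with engine.connect() as conn:
    for name, count in conn.execute(QUERY, {"cutoff": cutoff}):
        print(name, count)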

I think you've done a great job!
I won't dare to run the script to save updates to the stats page or sort the maintainers until you merge json-drop into master though ;)

OK, thanks!