/wikiprojects

Scripts for Wikipedia WikiProjects

Primary LanguagePython

Table of Contents

Promoting and measuring editor participation in WikiProjects

Scripts and metrics for tracking participation in WikiProjects, groups of Wikipedia editors who work together, and on other editor engagement projects on wikis.

Wikitools patches

Right now, you have to patch your copy of Wikitools a bit to get some of this to work. In future, we should just make a class, but for now...

  • wikitools/wiki.py.setSiteInfo: apply this patch to line 109.
  • wikitools/page.py.edit: add 'sectiontitle' to the list of valid args.

Tables

MySQL database tables are used to track relevant metrics about your project, such as who was invited, who visited, and which wikipages are included.

invited_yourprojectname

This table provides a log of the users who have been invited to the Project. You populate the majority of the metadata with a query like this one.

 +-------------------+------------------+------+-----+---------+----------------+
 | Field             | Type             | Null | Key | Default | Extra          |
 +-------------------+------------------+------+-----+---------+----------------+
 | id                | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
 | user_id           | int(11)          | YES  | UNI | NULL    |                |
 | user_name         | varbinary(200)   | YES  |     | NULL    |                |
 | user_registration | varbinary(14)    | YES  |     | NULL    |                |
 | user_editcount    | int(11)          | YES  |     | NULL    |                |
 | sample_date       | datetime         | YES  |     | NULL    |                |
 | invited           | tinyint(1)       | YES  |     | 0       |                |
 | bot_invited       | tinyint(1)       | YES  |     | 0       |                |
 | bot_skipped       | tinyint(1)       | YES  |     | 0       |                |
 | user_talkpage     | int(11)          | YES  |     | NULL    |                |
 | ut_is_redirect    | tinyint(1)       | YES  |     | 0       |                |
 +-------------------+------------------+------+-----+---------+----------------+

The script that invites these users to your project then updates the invited, bot_invited and bot_skipped columns.

pages_yourprojectname

This table provides a log of the pages in your project space. Project members often create new pages (i.e. WikiProject_Cats/Members, Teahouse/Host_lounge/Announcements). Having an updated list of the pages in your project space makes it easier to keep track of who is editing what: for instance, it makes it easier to find out when users you invited to your project start participating.

 +-------------+------------------+------+-----+---------+----------------+
 | Field       | Type             | Null | Key | Default | Extra          |
 +-------------+------------------+------+-----+---------+----------------+
 | id          | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
 | page_id     | int(11)          | YES  | UNI | NULL    |                |
 | p_title     | varbinary(255)   | YES  |     | NULL    |                |
 | p_namespace | int(11)          | YES  |     | NULL    |                |
 +-------------+------------------+------+-----+---------+----------------+

You can keep this table up to date by running a query like this one on cron: insert ignore into th_pages (page_id, page_namespace, page_title, page_touched) select page_id, page_namespace, page_title, page_touched from enwiki.page where page_namespace in (4,5) and page_title like "YourProjectName/%";

articles_yourprojectname

Most WikiProjects are focused on editing articles within particular categories. If you are trying to get people to participate in your WikiProject, you probably want to know when they edit articles related to those WikiProjects. But reading through the entire categorylink table takes a long time. It may be easier to create a local lookup table that contains the names and pageids of all the articles you're interested in tracking. That table might look something like this:

 +-------------+------------------+------+-----+---------+----------------+
 | Field       | Type             | Null | Key | Default | Extra          |
 +-------------+------------------+------+-----+---------+----------------+
 | id          | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
 | page_id     | int(11)          | YES  | UNI | NULL    |                |
 | p_title     | varbinary(255)   | YES  |     | NULL    |                |
 | p_namespace | int(11)          | YES  |     | NULL    |                |
 +-------------+------------------+------+-----+---------+----------------+

You can then find edits to your project's articles by users you invited to the project with a fairly speedy query like this: SELECT * FROM somewiki.revision AS r, yourdb.invited_yourprojectname AS i, yourdb.articles_yourprojectname AS a WHERE r.rev_user = i.user_id AND r.rev_page = a.page_id;

You can populate this table with articles from multiple super- and sub-categories (e.g. WikiProject_Milkshake_articles, Milkshake_stubs) by reading the category tree. There are multiple ways to do this (API, somewiki.categorylinks, external tools). In Reading the Category Tree below, I describe how to do this by querying the Mediawiki database.

Scripts

findNewEditors.py

This script grabs a sample of editors to invite to a project, stores them in invited_yourprojectname, and then prints a report of these invitees to a wikipage.

Output
The output markup of the invitee report should look like this:

 ==Daily Report==

 ===Active new editors===

Below is a list of editors who registered within the last 24 hours, have since made more than 2 edits, and were not blocked at the time the report was generated.

 {| class="wikitable sortable plainlinks"
 |-
 ! Editor #
 ! User name
 ! Edit count
 ! Contribs
 ! Already invited?
 |-
 | 299
 | [[User_talk:Ironaxus|Ironaxus]]
 | 2
 | [[Special:Contributions/Ironaxus|contribs]]
 | 0
 |-
 | 300
 | [[User_talk:Daniel Steinman|Daniel Steinman]]
 | 12
 | [[Special:Contributions/Daniel Steinman|contribs]]
 | 0
 |-
 | 301
 | [[User_talk:Rickibraga|Rickibraga]]
 | 2
 | [[Special:Contributions/Rickibraga|contribs]]
 | 0
 |-
 | 302
 | [[User_talk:Rodrigocswm|Rodrigocswm]]
 | 2
 | [[Special:Contributions/Rodrigocswm|contribs]]
 | 0
 |-
 | 303
 | [[User_talk:TeofilHistoricul13|TeofilHistoricul13]]
 | 2
 | [[Special:Contributions/TeofilHistoricul13|contribs]]
 | 0
 |-
 | 304
 | [[User_talk:Manuel Quaresma de M. Júnior|Manuel Quaresma de M. Júnior]]
 | 13
 | [[Special:Contributions/Manuel Quaresma de M. Júnior|contribs]]
 | 0
 |-
 |}

A similar (live) report can be viewed at http://en.wikipedia.org/wiki/Wikipedia:Teahouse/Hosts/Database_reports

inviteNewEditors.py

This script sends invitations to the users who have recently been added to invited_yourprojectname. The invitation can be personalized adding an editor's username (see the curhosts list). You can also check whether the page contains templates that would preclude the user from receiving an invite, such as certain user warnings and nobots templates. After the invites are sent, the script updates invited_yourprojectname, indicating which users were invited and which ones were skipped.

Reading the category tree

It is in fact relatively easy (though in some cases somewhat time-consuming) to reconstruct a category tree from the MediaWiki database.

Reading down the tree
For example, lets say you want to get all the subcategories of WikiProject Anatomy Articles:

select * from categorylinks where cl_to = "WikiProject_Anatomy_articles" AND cl_type = "subcat";

You see that there are four subcategories. Now you want to get the subcategories of one of those categories, such as Anatomy articles by importance:

select * from categorylinks where cl_to = "Anatomy_articles_by_importance" AND cl_type = "subcat";

Cool! Now you want to see if one of those subcategories also has subcategories:

select * from categorylinks where cl_to = "Mid-importance_Anatomy_articles" AND cl_type = "subcat";

That returns 0 rows, because there are no subcategories to this category. But if you change the cl_type you're looking for to "page", you'll see that there are lots of pages in that subcategory, most of which will probably not be listed under the parent category (unless someone has explicitly added both a parent and child category to the same page, which does happen, so as you read down the tree you'll have to screen out duplicates and/or assign some pages both primary and secondary categories)

select * from categorylinks where cl_to = "Mid-importance_Anatomy_articles" AND cl_type = "page";

So... the moral of the story is that this can be scripted up to output all the pages and subcategories under any given parent category. You can put these into a table like articles_yourprojectname to more efficiently monitor who's doing what on the articles your project is interested in.

Reading up the tree
It's a little trickier, but still doable, to read the tree backwards. You can't look at cl_type: the only types are 'page', 'subcat' and 'file'.

select cl_type from categorylinks where cl_type NOT IN ("page", "subcat", "file") limit 1; ...returns 0 rows :(

There's no 'supercat', 'is_subcat', 'is_parent' or anything.

So you have to see if the sub-category page itself (cl_from) is listed categorized under a broader category, by matching page_id = cl_from. So for example this page: http://en.wikipedia.org/wiki/Category:Anatomy_articles_by_importance

...is categorized under 'WikiProject Anatomy Articles' and 'Wikipedia 1.0 assessments', which you can see in the database with a query like this:

select * from enwiki.page as p, enwiki.categorylinks as c where p.page_id = c.cl_from and p.page_namespace = 14 and p.page_title = "Anatomy_articles_by_importance";

Measuring participation

There are many interesting datapoints around project participation that you could track. But it is a good practice to know what you want to measure before you start collecting data. This not only saves you time, it also helps you avoid the common research pitfall of data dredging, where you hunt through a bunch of datapoints looking for significant correlations without a firm idea of which correlations are most important, or what the correlations you find actually mean.

When you invite someone to participate in a project on a community wiki, you usually want to be able to answer specific questions about how participating in that project might effect the editing behavior of the users you invite. You probably also want to know how many of the people you invited actually saw your invitation, how many acted on it by visiting your project, and what they did when they got there (e.g. posting on the talk page, adding their name to the member list). You may also want to compare the activity of your visitors with a control group.

Basic metrics
Below, I provide a few datapoints that you should consider tracking about the users you invite to your project, and describe why they are useful. Many of these datapoints, such as who was invited and whether they visited your project, can be stored in a single table (such as your invited_yourprojectname table) and updated automatically. Others, such as how many edits your new project members made to project articles, you may want to wait and only calculate after your project pilot has concluded, to feature in your project report.

  • users you invited: Which users did you invite to participate to your project? This data is stored by default in the invited_yourprojectname table.
  • users who visited: Of those who were invited, which ones ended up visiting your project? You can count someone as a 'visitor' is to measure how many of them made at least one edit to the project page, talk page, or sub-pages. If you have access to clickthrough or pageview data, you might also check to see how many of these people clicked on the link in your invitation message, even if they did not edit.
  • users who declined: It can be hard to tell if someone chose not to visit your project because they just weren't interested, or if they actually did not see the message you posted. Fortunately, most(?) Wikimedia Wikis use a little yellow notification banner to let you know if someone has posted a message on your user talk page since your last login. So if you have access to user session data that allows you to see whether someone has logged in since you delivered their invite, you can be somewhat confident that they at least saw your message. If you don't have access to session logs, you can still get an estimate of users who declined your invitation by checking to see whether they have made at least 1 edit somewhere on the wiki after receiving the invitation.
  • project page edits by visitors: How many times did these users edit your project pages? Which pages did they edit? You can even get a sense of what kinds of edits they made, without actually viewing the text of the revision. For example, if someone edits the project talk page, you can be reasonably certain that they posted a message there. Using edit comment metadata available on Wikimedia wikis, you can often tell which sections of a page they edited. For example, if they edited a section called /*project members*/, you can reasonably assume that they added their name to the project member list.
  • visitor edits by namespace: You can look at which namespaces the visitors edited most: for example, are they editing more article pages, talk pages or user pages?
  • articles edited/created: It is also useful to know which articles they edited, or what categories of articles. Do your visitors mostly edit articles that are related to your WikiProject? Do they create new articles in the same topic area?
  • edits over time: It is useful to know how long your project members continue to edit. On many Wikimedia wikis, most new editors stop editing within a few days or weeks (because they get bored, get frustrated, who knows?). But research has shown that joining a project with other editors can make new editors continue participating longer. So consider analyzing how many of your visitors make at least x edits per week for y weeks/months. You can set the edit threshold at different levels, but if you want to be able to compare your study to other studies of editor retention, it is probably best to use a common measure of activity, such as "active editors" (5 edits per month) or "very active editors" (100+ edits per month), which are used by stats.wikimedia.org.
Control groups
A control group is a sample of people (in this case, wiki editors) who did not receive a treatment. Establishing some sort of control group is necessary if you want to determine whether joining your project made a difference in what users did on Wikipedia (such as how many edits they made, how long they continued to edit, whether they vandalized articles, etc.). There are many ways of establishing a control group, but one fairly straightforward way is to only invite some of the editors who meet your criteria for invitation. For example, if every day on your wiki there are about 150 new users who register an account and make at least 5 edits, you could automatically invite 100 of these editors to your project, but not invite the other 50.

You can keep track of who's-who by creating a sample_type field in your <invited_yourprojectname></invited_yourprojectname> table, and populating it with "exp" or "con" (or 1 and 0), based on whether an editor was sent an invitation or not. This way, you can later compute the metrics listed above for both groups, and compare their averages. It may also be informative to use visitors who saw your message, but didn't participate in your project as a control group (see "users who declined" above). If visitors to your project remain active longer, if they create more articles, and/or if they have their edits reverted less, you may be able to infer that your project has a positive impact on new editor behavior. Just remember: correlation does not equal causation, and the way you set up your control group determines the claims you can make about your data!