A package for parsing, chunking and modifying wikimarkup in R.
Author: Oliver Keyes
License: MIT
Status: In development
Wikimarkup is the language used on Wikipedia and similar projects, and as such contains
a lot of valuable data both for scientists studying collaborative systems and people
studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a
user to filter down to specific types of tags, such as links or templates, and then
extract components of those tags.
library(mwparser)
library(magrittr)
wikitext <- "this is wikitext with \n [[a|link]] [[or|two]]"
link_paths <- parse_wikitext(wikitext) %>%
  get_wikilinks %>%
  wikilink_paths(text = TRUE)
link_paths
[1] "a" "or"
mwparser depends on two things: the reticulate R package and the Python library mwparserfromhell. To install the whole stack, assuming you have pip installed:
# In the terminal
pip install mwparserfromhell
# In R
# devtools is needed for install_github() below
install.packages(c("reticulate", "devtools"))
devtools::install_github("ropenscilabs/mwparser")
With that, you're good to go!
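If loading the package fails, one possible cause is that pip installed mwparserfromhell into a different Python than the one reticulate is using. The sketch below is one way to check, using reticulate's `py_module_available()` and `py_config()`; the variable name `has_parser` is just illustrative, and this check is not part of mwparser itself.

```r
library(reticulate)

# TRUE if the Python that reticulate is bound to can import the module
has_parser <- py_module_available("mwparserfromhell")

if (!has_parser) {
  # Show which Python installation reticulate is using, so the
  # module can be installed into that one
  print(py_config())
  message("mwparserfromhell was not found in the Python shown above.")
}
```

If the check reports the module as missing, re-running the pip command with the Python binary reported by `py_config()` is usually the simplest remedy.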
The library currently has accessors for extracting the most common types of tag and the components within them. The next step is exposing the rest of mwparserfromhell's functionality, which includes:
- More accessors;
- The ability to modify wikimarkup pages and their component elements;
- The ability to write out the resulting, modified markup.
Some time after that, the goal is to integrate MediaWiki's actual parser as a replacement for the mwparserfromhell dependency, using piton.