
A library for extracting and parsing Wikipedia talk pages

Primary language: Python. License: MIT.

WikiTalkParser

WikiTalkParser is a library for extracting and parsing Wikipedia talk pages. It identifies each comment together with its signature, date, and indentation level in the thread structure. In the current version, talk pages are retrieved from the Wikipedia API for a list of articles given as input. Only the English-language version is supported.
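To make the description above concrete, here is a minimal sketch of the two steps it mentions: building a request to the public MediaWiki API for the talk pages of a list of articles, and pulling the indentation, signature, and date out of each signed line. This is not the library's own code or API. The names `build_talk_query_url` and `parse_comments` and the regular expressions are hypothetical, the sketch targets Python 3 rather than the Python 2.7 the library was tested with, and it assumes the default English Wikipedia signature format with every comment on a single line.

```python
import re
from urllib.parse import urlencode

# Public endpoint of the English Wikipedia MediaWiki API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_talk_query_url(articles):
    """Build a URL asking for the latest wikitext of each article's talk page.

    Hypothetical helper: it only builds the URL and does not send the request.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        # Multiple titles are separated by "|" in the MediaWiki API.
        "titles": "|".join("Talk:" + title for title in articles),
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Assumed patterns for the default English Wikipedia signature.
USER_RE = re.compile(r"\[\[User(?: talk)?:([^|\]#/]+)", re.IGNORECASE)
DATE_RE = re.compile(r"\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")
INDENT_RE = re.compile(r"^[:*#]*")


def parse_comments(wikitext):
    """Return one dict per signed line: indentation depth, user name, date.

    Simplification: a comment is assumed to fit on one line, and lines
    without a timestamp (headings, unsigned text) are skipped.
    """
    comments = []
    for line in wikitext.splitlines():
        date = DATE_RE.search(line)
        if date is None:
            continue
        # The signing user is taken to be the last user link before the date.
        users = USER_RE.findall(line[:date.start()])
        comments.append({
            "indent": len(INDENT_RE.match(line).group(0)),
            "user": users[-1].strip() if users else None,
            "date": date.group(0),
        })
    return comments
```

The indentation depth is the number of leading `:`, `*`, or `#` characters, which is how replies are nested on talk pages. A comment's parent in the thread can then be taken as the closest preceding comment with a smaller depth.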

Language

Tested with Python 2.7

Authors

David Laniado and Riccardo Tasso

Limitations/TODO

  • The parser works only for the English Wikipedia. We are currently working to make it multilingual.
  • This version has only been tested with article talk pages. Support for user talk pages will be added.
  • Users are identified by user name and by a user id generated by the software (official Wikipedia user ids are not supported).
  • The "Outdent" command is not currently handled.
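As an illustration of the third point, software-generated user ids can be as simple as a counter keyed by user name. The sketch below is an assumption about how such ids might be assigned, not the library's actual scheme. The class name `UserIndex` is hypothetical, and the normalization step relies on the MediaWiki conventions that underscores and spaces are equivalent in user names and that the first letter is capitalized.

```python
class UserIndex(object):
    """Assign a stable, locally generated integer id to each user name."""

    def __init__(self):
        self._ids = {}

    @staticmethod
    def _normalize(user_name):
        # Treat "some_user" and "Some user" as the same account.
        name = user_name.replace("_", " ").strip()
        return name[:1].upper() + name[1:]

    def get_id(self, user_name):
        """Return the id for user_name, creating the next free id if new."""
        key = self._normalize(user_name)
        if key not in self._ids:
            # Ids start at 1 and grow in order of first appearance.
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]
```

Ids produced this way are only meaningful within one run or one stored index. They do not correspond to the official Wikipedia user ids.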

References

For further information, see the research paper: When the Wikipedians talk: network and tree structure of Wikipedia discussion pages.