PHP-Wikipedia-Syntax-Parser

Given the raw contents and title of a Wikipedia article, this outputs useful information extracted from it in an organized fashion.

JungleDB PHP Wikipedia Parser

This is an attempt at extracting useful information from raw Wikipedia page syntax, written as a portable PHP class. It was originally written for JungleDB. This repository contains the most recently updated version (2015-02-13) of the wiki_parser.php script, which is a significant improvement over the previous copy.

I don't expect to update this repository in the foreseeable future.

How to use

  1. $wikipedia_syntax_parser = new Jungle_WikiSyntax_Parser($raw_wikipedia_syntax, "George Harrison");

    $raw_wikipedia_syntax is the raw Wiki syntax from a database dump or from the Edit textarea of a given page. An example of this syntax is provided in sample_input.txt.

    "George Harrison" is a string containing the full Wiki page title (e.g. George Harrison, Template:Wikipedia Syntax, File:image.png). It is optional, but it helps determine the page_type (Main, Template, Special, File, ...).

  2. $parsed_wiki_syntax = $wikipedia_syntax_parser->parse();

    $parsed_wiki_syntax is now an array containing information about the Wiki page itself and the information extracted from its contents. An example of this output, produced by old_version/wiki_parser.php from sample_input.txt, can be found in sample_output.txt. No sample output is available for the latest revision, but its output is vastly improved and worth the effort of getting it working on your end.
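
The two steps above can be wrapped in a small helper. This is only a sketch: the function name parse_wiki_page is hypothetical and not part of this repository, and it assumes wiki_parser.php has already been loaded (for example with require_once) so that the Jungle_WikiSyntax_Parser class exists. It makes no assumption about which keys the returned array contains.

```php
<?php
/**
 * Hypothetical helper (not part of this repository): runs the two steps
 * described above and returns whatever parse() produces, which this
 * README describes as an array.
 *
 * Assumes wiki_parser.php was loaded beforehand, e.g.
 * require_once 'wiki_parser.php';
 */
function parse_wiki_page(string $raw_wikipedia_syntax, string $title = '')
{
    if (!class_exists('Jungle_WikiSyntax_Parser')) {
        throw new RuntimeException('Load wiki_parser.php before calling parse_wiki_page().');
    }

    // The title is optional; pass it when known, since it helps the
    // parser determine the page_type.
    $parser = ($title === '')
        ? new Jungle_WikiSyntax_Parser($raw_wikipedia_syntax)
        : new Jungle_WikiSyntax_Parser($raw_wikipedia_syntax, $title);

    return $parser->parse();
}
```

With the class loaded, a call such as print_r(parse_wiki_page(implode(file('sample_input.txt')), 'George Harrison')); should print the parsed array for the sample page.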

Notes

  • When reading Wiki syntax files from disk, make sure they are encoded in UTF-8. To read such files, use implode(file('WIKI_RAW_SYNTAX.TXT')), as file_get_contents('WIKI_RAW_SYNTAX.TXT') seems to mess up language-specific characters.
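
The note above can be sketched as a small loader. The function name read_wiki_syntax_file is hypothetical, and the UTF-8 validity check (preg_match with the u modifier) is an addition for illustration only, not something the parser itself is known to require.

```php
<?php
/**
 * Hypothetical helper (not part of this repository): reads a raw Wiki
 * syntax file the way the note above recommends and rejects content
 * that is not valid UTF-8.
 */
function read_wiki_syntax_file(string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read file: $path");
    }

    // file() returns the file as an array of lines with line endings
    // kept, or false on failure.
    $lines = file($path);
    if ($lines === false) {
        throw new RuntimeException("Cannot read file: $path");
    }

    // Join the lines back into a single string, as recommended above.
    $raw = implode('', $lines);

    // With the u modifier, preg_match() returns false when the subject
    // is not valid UTF-8; the empty pattern matches any valid string.
    if (preg_match('//u', $raw) !== 1) {
        throw new RuntimeException("File is not valid UTF-8: $path");
    }

    return $raw;
}
```

The returned string can then be passed as the first argument to the Jungle_WikiSyntax_Parser constructor.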

Usage

If you make use of all or any portion of this code, please add an attribution linking to this GitHub repository.