Primary language: C#. License: Apache License 2.0.

MwParserFromScratch

A .NET library for parsing wikitext into an AST (abstract syntax tree). The repository is still under development, but it can already handle most wikitext syntax.

FuGet Gallery: See library classes and API documentation.

Usage

This package is available on NuGet. You can install it with one of the following commands:

#  Package Manager Console
Install-Package CXuesong.MW.MwParserFromScratch -Pre
#  .NET CLI
dotnet add package CXuesong.MW.MwParserFromScratch -v 3.0.0-int.6

After adding a reference to this library, import the namespaces:

using MwParserFromScratch;
using MwParserFromScratch.Nodes;

Then pass the text to the parser:

var parser = new WikitextParser();
var text = "Paragraph.\n* Item1\n* Item2\n";
var ast = parser.Parse(text);

Now ast holds a Wikitext instance, the root of the AST.
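As a minimal sketch of what the root node can be used for, the snippet below collects every wiki link in a piece of wikitext. It uses only calls that appear elsewhere in this README (Parse, EnumDescendants, ToString). The LinkLister class and its ListLinks method are hypothetical names made up for this illustration, and the WikiLink node type is taken from the ParseAndPrint dump in this README.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

static class LinkLister // hypothetical helper, not part of the library
{
    // Returns the wikitext of every [[...]] link found in the given text.
    public static List<string> ListLinks(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        var parser = new WikitextParser();
        var ast = parser.Parse(text);
        return ast.EnumDescendants()      // walk the whole tree
            .OfType<WikiLink>()           // keep only wiki link nodes
            .Select(link => link.ToString())
            .ToList();
    }
}
```

Calling LinkLister.ListLinks("See [[Foo]] and [[Bar|baz]].") should yield one entry per link. ToString() is used here because, as noted in SimpleDemo, it gives the wikitext rather than the user-friendly text.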

You can also take a look at ConsoleTestApplication1, which contains several demos. SimpleDemo illustrates how to search and replace in the AST.

static void SimpleDemo()
{
    // Fills the missing template parameters.
    var parser = new WikitextParser();
    var templateNames = new [] {"Expand section", "Cleanup"};
    var text = @"==Hello==<!--comment-->
{{Expand section|
date=2010-10-05
}}
{{Cleanup}}
This is a nice '''paragraph'''.
==References==
{{Reflist}}
";
    var ast = parser.Parse(text);
    // Convert the code snippets to nodes
    var dateName = parser.Parse("date");
    var dateValue = parser.Parse(DateTime.Now.ToString("yyyy-MM-dd"));
    Console.WriteLine("Issues:");
    // Search and set
    foreach (var t in ast.EnumDescendants().OfType<Template>()
        .Where(t => templateNames.Contains(MwParserUtility.NormalizeTemplateArgumentName(t.Name))))
    {
        // Get the argument by name.
        var date = t.Arguments["date"];
        if (date != null)
        {
            // To print the wikitext instead of user-friendly text, use ToString()
            Console.WriteLine("{0} ({1})", t.Name.ToPlainText(), date.Value.ToPlainText());
        }
        // Update/Add the argument
        t.Arguments.SetValue(dateName, dateValue);
    }
    Console.WriteLine();
    Console.WriteLine("Wikitext:");
    Console.WriteLine(ast.ToString());
}

The console output is as follows:

Issues:
Expand section (2010-10-05)

Wikitext:
==Hello==<!--comment-->
{{Expand section|
  date=2017-02-26}}
{{Cleanup|date=2017-02-26}}
This is a nice '''paragraph'''.
==References==
{{Reflist}}

ParseAndPrint prints a rough outline of the parsed tree. Here is a sample run:

Please input the wikitext to parse, use EOF (Ctrl+Z) to accept:
==Hello==
* ''Item1''
* [[Item2]]
---------
<span style="background:red;">test</span>
^Z
Parsed AST
Wikitext             [==Hello==\r\n* ''Item1]
.Paragraph           [==Hello==\r]
..PlainText          [==Hello==\r]
.ListItem            [* ''Item1''\r]
..PlainText          [ ]
..FormatSwitch       ['']
..PlainText          [Item1]
..FormatSwitch       ['']
..PlainText          [\r]
.ListItem            [* [[Item2]]\r]
..PlainText          [ ]
..WikiLink           [[[Item2]]]
...Run               [Item2]
....PlainText        [Item2]
..PlainText          [\r]
.ListItem            [---------\r]
..PlainText          [\r]
.Paragraph           [<span style="backgro]
..HtmlTag            [<span style="backgro]
...TagAttribute      [ style="background:r]
....Run              [style]
.....PlainText       [style]
....Wikitext         [background:red;]
.....Paragraph       [background:red;]
......PlainText      [background:red;]
...Wikitext          [test]
....Paragraph        [test]
.....PlainText       [test]
..PlainText          [\r\n]
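A dump like the one above can be produced with a short recursive walk. The following is only a sketch, not the actual ParseAndPrint source: TreePrinter and Print are hypothetical names, and it assumes that Node exposes an EnumChildren() method returning its direct children. The layout (one dot per nesting level, at most 20 characters of wikitext per node, escaped line breaks) is inferred from the sample run.

```csharp
using System;
using MwParserFromScratch.Nodes;

static class TreePrinter // hypothetical helper; the real ParseAndPrint may differ
{
    public static void Print(Node node, int level = 0)
    {
        if (node == null) throw new ArgumentNullException(nameof(node));
        // Keep at most 20 characters of the node's wikitext, then escape line breaks.
        var snippet = node.ToString();
        if (snippet.Length > 20) snippet = snippet.Substring(0, 20);
        snippet = snippet.Replace("\r", "\\r").Replace("\n", "\\n");
        // One dot per nesting level, followed by the node's type name.
        var label = new string('.', level) + node.GetType().Name;
        Console.WriteLine("{0,-20} [{1}]", label, snippet);
        // Assumption: EnumChildren() yields the direct child nodes.
        foreach (var child in node.EnumChildren())
            Print(child, level + 1);
    }
}
```

Passing the result of parser.Parse(text) to TreePrinter.Print writes one line per node, starting with the root Wikitext node.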

That's fine, but where do I get wikitext?

You can use the MediaWiki API to fetch wikitext. For .NET programmers, I've written a client, WikiClientLibrary, which lives alongside this repository. Other MediaWiki API clients are listed at API:Client code.

ConsoleTestApplication1 also contains a simple demo that fetches and parses a page without depending on WikiClientLibrary:

/// <summary>
/// Fetches a page from en Wikipedia, and parses it.
/// </summary>
private static Wikitext FetchAndParse(string title)
{
    if (title == null) throw new ArgumentNullException(nameof(title));
    const string EndPointUrl = "https://en.wikipedia.org/w/api.php";
    var client = new HttpClient();
    var requestContent = new Dictionary<string, string>
    {
        {"format", "json"},
        {"action", "query"},
        {"prop", "revisions"},
        {"rvlimit", "1"},
        {"rvprop", "content"},
        {"titles", title}
    };
    var response = client.PostAsync(EndPointUrl, new FormUrlEncodedContent(requestContent)).Result;
    var root = JObject.Parse(response.Content.ReadAsStringAsync().Result);
    var content = (string) root["query"]["pages"].Children<JProperty>().First().Value["revisions"][0]["*"];
    var parser = new WikitextParser();
    return parser.Parse(content);
}

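You need the Newtonsoft.Json NuGet package to parse the JSON response. The snippet also requires using directives for System.Collections.Generic, System.Linq, System.Net.Http, and Newtonsoft.Json.Linq.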

Limitations

  • Table syntax is not supported yet, but I'll work on it.
  • Text inside parser tags (as opposed to normal HTML tags) is not parsed and is preserved as-is in ParserTag.Content. For certain parser tags (e.g. <ref>), you can parse the Content again to get its AST.
  • It may handle some pathological cases differently from the MediaWiki parser, e.g. {{{{{arg}} (see Issue #1).
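The second limitation can be worked around by feeding ParserTag.Content back into the parser. The snippet below is a minimal sketch of that idea: RefReparser and ParseRefContents are hypothetical names, and it assumes that ParserTag.Name holds the tag name without angle brackets and that Content may be null for self-closing tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

static class RefReparser // hypothetical helper, not part of the library
{
    // Parses the raw content of every <ref> tag into its own AST.
    public static List<Wikitext> ParseRefContents(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        var parser = new WikitextParser();
        var ast = parser.Parse(text);
        return ast.EnumDescendants()
            .OfType<ParserTag>()
            // Assumption: Name is the bare tag name, e.g. "ref".
            .Where(tag => string.Equals(tag.Name, "ref", StringComparison.OrdinalIgnoreCase))
            // Assumption: self-closing tags such as <ref name="a" /> have no Content.
            .Where(tag => tag.Content != null)
            .Select(tag => parser.Parse(tag.Content))
            .ToList();
    }
}
```

Each returned Wikitext is an independent tree, so edits made to it are not reflected in the original AST unless you write the result back to the tag's Content yourself.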