Add notice about Parse Wiki Text
Closed this issue · 5 comments
As you already know, parsing wiki text is very challenging, and Mediawiki Parser is just a work in progress. I'm afraid it will forever remain a work in progress, because wiki text was never designed to be handled by a formal parser. I have taken a different approach and developed a parser for wiki text that aims to correctly parse all wiki text, not just a subset, and is already production ready. It ships with hundreds of test cases for the most challenging edge cases, ensuring it parses wiki text exactly the same way MediaWiki does, even when MediaWiki is obviously buggy. It's also very fast. Please see the readme file for more information.
You can find Parse Wiki Text on GitHub or crates.io.
I have also created additional tools for working with data from wikis, such as Parse Mediawiki Dump (GitHub, crates.io).
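For anyone comparing the two crates, here is a minimal sketch of invoking Parse Wiki Text. The `Configuration::default().parse` entry point and the `nodes`/`warnings` fields follow the crate's documented example; treat the details as assumptions and check the crate's own docs. It requires the `parse_wiki_text` crate as a dependency.

```rust
// Minimal sketch of parsing wiki text with the `parse_wiki_text` crate.
// Assumes the crate's documented API: Configuration::default().parse(...)
// returning an Output with `nodes` (syntax tree) and `warnings`.
use parse_wiki_text::Configuration;

fn main() {
    // A leading space marks a line as preformatted text in MediaWiki.
    let wiki_text = "==Heading==\n preformatted line";
    let output = Configuration::default().parse(wiki_text);

    // Warnings flag constructs that MediaWiki itself would mishandle.
    for warning in &output.warnings {
        eprintln!("warning: {:?}", warning);
    }

    // The parsed syntax tree.
    println!("{:#?}", output.nodes);
}
```

For wikis with custom extension tags or magic words, `Configuration` can be built from site-specific settings instead of `default()`.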
I would like to ask you to add a notice to the readme file of Mediawiki Parser, suggesting that people who need a production-ready parser have a look at Parse Wiki Text. The readme file of Parse Wiki Text also contains a section describing the differences from Mediawiki Parser.
Thank you and happy hacking!
Thanks for the hint; I updated the readme file to make the project scope clearer. I would still argue with your description, where you claim that the subset is too small to parse actual wiki pages, since we use it for exactly that purpose. But we want it to fail if the user types something funny. This project is also not a linter.
Thank you anyway for putting up with all the MediaWiki weirdness, it's great we now have a crate for that!
Thanks for the update. If you feel my comparison is unfair, you're welcome to open an issue in Parse Wiki Text. I did, however, check Mediawiki Parser with three simple pages from Wiktionary (1, 2, 3) just to be sure. None of them failed, and none was parsed correctly:
- parameter name not parsed
- external link parsed where there is none
- preformatted text not parsed
- link trail not parsed
- link/category/image distinction not performed
- character entity not parsed
Yes, the external reference parsing is wrong here; I will open an issue for it. Could you please point out where preformatted text is not parsed correctly?
Regarding the link/category/image distinction and HTML entity parsing: this parser only builds a purely syntactic representation; I don't want semantic interpretation.
The preformatted text is there: https://cs.wiktionary.org/wiki/%C4%8Dlov%C4%9Bk#skloňování
Ah, thanks. I was not aware that MediaWiki treats space-indented lines as preformatted text.