r-three/common-pile

Wiki Data

Opened this issue · 23 comments

This issue tracks all wiki processing now that the different sources have been unified. Will be closing #7 and #1.

There are three sources of wiki data we will be using:

  • Internet Archive dumps (from the wikiteam)
  • Official wiki dumps (wikimedia, i.e. wikipedia and its sister projects; fandom; etc.)
  • v2: Wiki scrapes using the wikiteam tools

Sources to create a list of wikis:

  • Enumeration of the wikimedia sources (wikipedia, wikibooks, wikinews, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary)
  • Search on the internet archive
  • v2: List of wikis scraped from wikiindex (checked against the internet archive search)

Data Processing Pipeline:

Data Collection

Archive Sources:

  • Collect wiki metadata from the Internet Archive via search
    • Resulted in ~325,000 wikis based on looking for CC-BY, CC-BY-SA, and publicdomain
      • This is restricted to wikis that are in the wikiteam collection. When the wikicollections tag is allowed too, it jumps to ~4 million wikis.
    • Takes ~40 hours to download the metadata serially using the Internet Archive python library. This can be sped up by merging the results of mutually exclusive queries.
  • Download wikis from IA based on their metadata
    • Download of 4.4TB, extracted to 13TB, reduced to 8.8TB by removing unneeded things like images.
    • Dumps have multiple formats, multiple compression methods, and multiple places where the real data lives.
    • Can be parallelized by providing --worker_id and --num_workers
    • During downloading and similar steps, the metadata.identifier field is used as the key, but our actual "source" field will use the domain name.
    • Dispatch on special cases
      • When a wiki is dead, download the IA dump (most recent) -> wiki/archive
      • v2: When a wiki is live, if it has an official dump (like fandom), then download that dump -> wiki/dump
      • v2: When a wiki is live, if the IA dump is older than ...days, then rescrape with wikiteam tools -> wiki/scrape
      • Currently all data is downloaded from the IA.
  • In Progress: Converting the target directory naming from just ${target_dir}/${id} to ${target_dir}/${id}[:n]/${id}[n:2n]/.../${id}. Things like `os.path.exists`, used to check for data that has already been processed, are currently really slow (~40 minutes to check whether all ~325,000 dumps exist on disk); this should help speed things up (see the sketch after this list).
  • In Progress: Convert each dump to the dolma format such that the "text" field has wikitext in it.
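A minimal sketch of the sharded layout described in the in-progress item above, assuming 2-character shards and 2 levels (the real values of n and the depth may differ):

```python
import os

def sharded_path(target_dir, identifier, width=2, depth=2):
    """Map an identifier to ${target_dir}/${id}[:n]/${id}[n:2n]/.../${id} so
    that no single directory has to hold all ~325,000 dumps."""
    shards = [identifier[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(target_dir, *shards, identifier)

# e.g. sharded_path("dumps", "wikiexample") -> "dumps/wi/ki/wikiexample"
```

Existence checks then only have to look inside a couple of small directories instead of scanning one directory with ~325,000 entries.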

Dump Sources:

  • Download the enumeration of wikimedia sources
  • Extract the wikimedia dumps
  • Convert each dump to the dolma format such that the "text" field has wikitext in it (see the sketch after this list).
    • This is pretty fast; running a single thread on a CPU from 2017, it took <2 hours to extract all the pages from the wikipedia dump.
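For reference, a sketch of what one converted record could look like. The "text" and "source" fields are the ones mentioned above; the other field names and values are assumptions for illustration, not taken from the actual pipeline:

```python
import json

# One line of a .jsonl.gz shard: "text" holds raw wikitext at this stage and
# is converted to plain text later in the pipeline.
record = {
    "id": "enwiki-Main_Page",        # hypothetical identifier
    "source": "en.wikipedia.org",    # domain name, per the note above
    "text": "'''Wikipedia''' is a free [[encyclopedia]] that ...",
    "metadata": {"dump": "wikimedia"},  # assumed extra bookkeeping
}
print(json.dumps(record))
```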

It is currently unclear whether a dump grabbed based on IA metadata should be saved into the same area as the wikimedia dumps, where processing is based on what lives in the directories, or whether it should stay in the IA area, where the metadata file dictates what is processed.

Scrape Sources:

  • v2: Scrape wikis listed in wikiindex that aren't included in the IA metadata using the wikiteam3 tools (which are written in python3). Make sure to include the talk pages in the scrape.
  • v2: Scrape wikis from our manual list that aren't included in the IA metadata. One difficulty will be entity linking: the ids used in IA are not very consistent; some wikis have multiple dumps (from different times) available under one entity, while others have dumps from different points in time listed as separate entities with unique ids (which have the date in them).
  • v2: Include wikis from the IA that are found with the wikicollection tag in addition to the wikiteam tag

WikiText parsing:

Now that everything is in a unified sharded format, actual parsing of wikitext is essentially infinitely horizontally scalable.

  • Create a dolma parallel processor that converts the wikitext in the "text" field to plain text.
    • This currently uses wtf_wikipedia behind a simple nodejs server.
    • Dolma handles parallelization over the available worker cores (each worker processes one shard at a time)
    • The parallel processor makes calls to the nodejs server (see the sketch after this list).
    • Multiple instances of the nodejs server can be run behind a load balancer like HAProxy
    • My 4-core laptop can process ~700 documents/second with this setup (4 dolma workers and 4 server instances)
  • Modify wtf_wikipedia parsing to do what we want wrt math
    • Currently we are editing the math tags and templates in python in such a way that they aren't touched by wtf_wikipedia.
    • Conversion from wikimath to latex is mostly done; just 0.6% of templates are not handled (they just get removed)
    • In Progress v2?: Convert processing to a real grammar and convert to latex by munging the AST.
  • Spot check wtf_wikipedia output and add python post processing (for example, the headers for things like "external links" seem to be left in the pages).
  • Run wikitext conversion for all the wiki dumps
    • MediaWiki
    • WikiTeam3
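A minimal sketch of the worker-to-server pattern described above. The /parse endpoint, port, and response shape are assumptions for illustration; the real processor is built on dolma's parallel-processing machinery, which runs something like this per shard:

```python
import gzip
import json

import requests

# Hypothetical endpoint for the wtf_wikipedia nodejs server (or an HAProxy
# front-end when several server instances are running).
WTF_SERVER = "http://localhost:3000/parse"

def process_shard(in_path, out_path):
    """Send each document's wikitext to the server and write back plain text."""
    with gzip.open(in_path, "rt", encoding="utf-8") as fin, \
         gzip.open(out_path, "wt", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            resp = requests.post(WTF_SERVER, json={"wikitext": doc["text"]})
            resp.raise_for_status()
            doc["text"] = resp.json()["text"]  # assumed response shape
            fout.write(json.dumps(doc) + "\n")
```

Running one such worker per core, each on its own shard, is what gives the horizontal scaling mentioned above.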

The wikiarchive wikis are chugging along; I think the main bottleneck is I/O contention, so there isn't much we can do to speed it up.

The wikiarchive wikis don't tend to include the talk pages, which is another reason we will want to re-scrape in v2

The wikimedia data I have processed (wikipedia, wiktionary, etc.) didn't have talk pages either. I'm updating the code, redownloading, and reprocessing to include the talk pages (will post some stats about how much more data that is when done).

wikitext parsing updates

wtf_wikipedia seems to work well. One place it leaves weird artifacts is images with really long captions (possibly caused by newlines): it leaves the [[ in there. It also removes section names, but I have an easy fix to get them back.

I've also been working on handling math parsing. wtf_wikipedia is pretty inconsistent in how it handles math. It handles simple stuff well but strips out anything more complex. When processing 1 shard of the wikipedia dataset, 49% of the <math> tags got stripped out (of those, 91% of the more complex <math display="block"> sections were stripped). It does a bit better with the math template syntax that wikitext supports (but that might be because people tend to use it for less complex stuff). Only ~5% of those are totally stripped, but its processing of the ones that aren't removed is still wrong; the clearest sign is that it removes the <sup> and <sub> tags.

My current approach is to:

  1. Convert all the <math> tags to appropriate $ and $$. This converts them to latex delimiters and seems to cause wtf_wikipedia to leave them alone (see the sketch after this list)
  2. detect the {{math|...}} templates in the wikitext and either:
    a) remove them from the wikitext, leaving a simple marker; convert the template into latex; use wtf_wikipedia to convert the rest of the page; and substitute the converted templates back in at the correct places. An issue with this is that the templates can contain arbitrary wikitext like the link syntax, so we'd need to more or less reimplement some wikitext parsing in the math template -> latex conversion code.
    b) Do rewrites in the templates for things like <sub>/<sup> into a more latex-like format and then pass things into wtf_wikipedia. It is unclear how many of these rewrites will be needed to avoid stripping these templates (for example wtf_wikipedia can handle a template like {{math|1=F=ma}} but will strip a template like {{math|F {{=}} ma}}). We would also want to add $ around these parts.
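A minimal sketch of step 1, assuming the block/inline split is carried by the display attribute (the real preprocessing may handle more attribute variants):

```python
import re

def convert_math_tags(wikitext):
    """Rewrite <math> tags as $...$ / $$...$$ so wtf_wikipedia passes the
    contents through untouched."""
    # <math display="block">...</math> -> $$...$$
    wikitext = re.sub(
        r'<math\s+display\s*=\s*"block"[^>]*>(.*?)</math>',
        r'$$\1$$', wikitext, flags=re.DOTALL)
    # any remaining <math ...>...</math> -> $...$
    wikitext = re.sub(
        r'<math[^>]*>(.*?)</math>',
        r'$\1$', wikitext, flags=re.DOTALL)
    return wikitext
```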

Some math-like templates (for example {{mvar|X}}) can appear outside of the {{math|...}} template, so my plan is to find the {{math|...}} ones first, which will contain most of the other templates, and then pick up the remaining ones later.

An open question is the handling of unicode symbols. Lots of math articles have symbols like π or θ directly in the text. In a latex conversion, would we want to convert these to \pi and \theta? Similarly, there are fraction templates like {{sfrac|1|2}} that wtf_wikipedia converts to 1/2 but which might make more sense as \frac{1}{2}.

I might convert all the dumps with just wtf_wikipedia as a v0 to get approximate token counts while I work on refining the math handling.

The end of a wikipedia page is generally a long list of references, links to related pages, etc. wtf_wikipedia converts these into a more plaintext format, but it is still pretty long and unnatural. I'm thinking of looking into removing these final sections.

Nice!

I dug out my old code and notes on wtf_wikipedia processing. Apart from the math stuff you raised, I also found that it would dump out stuff like
File:V Train AC.JPG|Class 310
when there was an image, and that it dumped out wikitext tables (which we probably want to omit in the context of modeling natural text). There also was one example where there were a bunch of instances of

In:
Out:

Maybe that stuff has changed by now though.

Regarding standalone symbols, I think it's appropriate to leave them as unicode rather than try to convert them to LaTeX.

I think when I was messing with wtf_wikipedia before I also just had code that manually stripped out the references section. If it's not natural text, let's just remove it.

I've seen some of the image thing and am looking into how to fix it. I haven't seen the In:/Out: thing yet.

I've updated stuff to remove the references sections (also things like external links, see also, etc.)

I found an issue that comes up in how editors tend to style math: basically the indentation is handled wrong, and text from below an equation can end up before it. I opened an issue, spencermountain/wtf_wikipedia#577, but I don't know JS/wtf_wikipedia well enough to solve it myself right now. My current workaround is to add newlines between :indent lines and the following text lines, which seems to produce results in the correct order (see the sketch below).
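A minimal sketch of that workaround; the exact handling in the real preprocessing may differ:

```python
import re

def pad_indented_math(wikitext):
    """Insert a blank line after a ":"-indented (equation) line that is
    followed directly by normal text, so the text below the equation is not
    reordered in front of it."""
    return re.sub(r"(?m)^(:.*)\n(?=[^\s:])", r"\1\n\n", wikitext)
```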

hey! just stumbled on this, and am happy to help with any issues - it's really useful to find examples where things break.

There is a max-length for images - happy to increase this, if there are known issues.

I've been blessed to never learn latex, and this is probably the cause of the shy parsing. Happy to tweak a regex so more things pass through. I always assumed .text() would just kill that template, but it doesn't have to. Amazing what these new models can do.
cheers

Hi Spencer, I think the issue with images is just that in our use case we don't want the "File:V Train AC.JPG|Class 310" text. I can see why that would be reasonable to include otherwise though.

I have some math template munging that seems good enough for now. The general approach is to use a regex to find where to start editing, then iterate forward in the text to find the end of the scope; then you loop over a bunch of these edits (a sketch of the scope-finding scan follows the examples below). Only a few cases support nesting of the same template (mset and abs are the main ones that needed it), but it does support nesting of different templates (you can have an overline template inside of a strong template, for example).

Examples:

{{math|{{abs|''ε''(''x'')}} < 7.5·10<sup>−8</sup>}} -> |ε(x)| < 7.5·10^{−8}
{{math|''v''<sub>p</sub>&nbsp;{{=}}&nbsp;''ω''/''k''}} -> v_{p} = ω/k
{{math|''x''<sup>p<sub>i</sub></sup><sub>j</sub> {{ = }} {{Fraction|{{delta}}|{{abs|1 - {{abs|y}}}}}}}} -> x^{p_{i}}_{j} = \delta⁄|1 - |y||
{{math|x<sup>2<sup>4</sup></sup>}} -> x^{2^{4}}
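A minimal sketch of the scope-finding scan described above; the real code then dispatches to per-template rewrite rules once the span is known:

```python
def find_template_end(text, start):
    """Given the index of the "{{" that opens a template, return the index
    just past its matching "}}", tracking nested templates along the way."""
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # unbalanced template: drop it and keep processing the rest
```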

I extracted all the {{math|...}} templates from wikipedia and ran my fixes over them. There are 161,459 math templates, and only 971 of them now result in an empty string (0.6%). A few template types make up most of these errors. What they have in common is that they require parsing the template into parameters and doing different things based on those. For example, {{overset|~|a}} puts a tilde over the a. If we want to handle these, the approach I used will get really brittle and we probably need to switch over to an actual grammar (I'm going to try my hand at this; a help page claims wikitext isn't context-sensitive, it just has really complex error handling, but I'm not sure: having things like an escape that inserts a | which doesn't delimit parameters seems pretty context sensitive to me). The mis-handled templates include:

  • {{vec|...}}
  • {{font color|...}}
  • {{color|...}}
  • {{pars|...}}
  • {{nuclide|...}}
  • {{Subatomic Particle|...}}
  • {{music|...}}
  • {{thinspace|...}}
  • {{#if|....}}
  • {{gap|...}}
  • {{su|...}}
  • {{SubSup|...}}
  • {{braket|...}}
  • {{bra-ket|...}}
  • {{overset|...}}

There are also a few cases where the output for {{mset|...}} is a bit mangled, as you can use the different arguments to create sets that look like {x | x \in S}. There are also a lot of templates defined in wikitext that don't appear in wikipedia, so I have no code that handles them. Not sure if they appear in other wikis.

There are some cases that I don't handle atm; for example, one template is {{math|<big><big>(</big></big>...}}. This becomes just ( after running wtf_wikipedia. It would be more correct to translate it to some latex like \bigg(, but currently none of the code exposes a nesting depth count. I figured this was good enough.

There are also a few symbols that get stripped out of {{math|...}} templates that seem like they shouldn't. I noticed :, #, * when they are by themselves and () when there isn't anything between them.

This seems reasonable for v1!

I've been fixing a bug where a malformed template stops processing of all the other templates later in the document. Now the malformed one is removed and the rest are processed correctly.

Working on running the preprocessor on all the data next

I've finished processing the MediaWiki wikis and am working on the wikiteam archives next (delayed by having to move data around). The data so far is here: https://huggingface.co/datasets/blester125/wiki

The MediaWikis (which include talk pages) have ~14 billion tokens.

Some Stats:

  • On my desktop (Ryzen 9 7950X) it took ~4 hours 15 minutes to process the 64 million documents.
  • 36 documents caused issues with wtf_wikipedia (mostly crashes from trying to look up an attribute on undefined; I assume a regex didn't match and that case isn't handled. 1 crashed from "Invalid String Length"; it seems like this is from a document being too long for the maximum string length that v8 allows lol)
  • 12 documents caused OoM issues (I've tried to avoid these by tweaking things like heap sizes, but it still happens)
  • 83 documents took so long (>3 minutes) to do wtf_wikipedia parsing that our server setup timed them out
  • ~1000 documents ended up blank after wtf_wikipedia parsing; I haven't looked into the quality of these documents before they went in (i.e. should they be saved by special-case handling)

Nice. It seems like there's a lot of markup in the text field though, is that expected? e.g.

[[Image:CormacportraitBig.jpg|thumb|100px|right|Greetings!]] Hi, my real name is Cormac Lawler. I've been closely involved with setting up and developing Wikiversity. This process has been the focus of my ongoing PhD (which began in September 2006, and which I will be submitting early 2011): to participate in the definition of what this space is, what it means in the context of the wider world of education, and what its opportunities and limitations are. *Real name: Cormac Lawler *Email: cormaggio (at) gmail (dot) com *Cormaggio on [[w:user:Cormaggio|Wikipedia]] | [[m:User:Cormaggio|Meta]] | [[commons:User:Cormaggio|Commons]] | [[b:User:Cormaggio|Wikibooks]] and many other places :-) (It's an old nickname that I resurrected for an email address, which has subsequently stuck as an online moniker) *[http://www.cormaggio.org/wiki/index.php?title=User:Cormac more detailed bio and interests] *[http://cormaggio.org blog] {{#babel:en|es-1}} ==Research== * [[Wikiversity:Developing Wikiversity through action research|Developing Wikiversity through action research]] - main research page I'm using a [http://cormaggio.org/ blog] and a [http://www.cormaggio.org/wiki/index.php?title=Main_Page wiki] to help document and facilitate my research, but I also intend to carry out much (if not most) of this on Wikiversity itself, particularly as we have developed this project with the potential for actually ''doing'' [[Wikiversity:Research|research]] (though these [http://beta.wikiversity.org/wiki/Wikiversity:Scope_of_research/En guidelines] still need to be developed). ==Courses== * I participated for a while in [[Action research/AREOL25|AREOL25]] (course home page [http://www.scu.edu.au/schools/gcm/ar/areol/areolhome.html here]), until I lost track of a few weeks' emails, and gave up. :-( * Hoping to participate in David Wiley's [http://opencontent.org/wiki/index.php?title=Intro_Open_Ed_Syllabus Open Ed Intro course] - but didn't :-( * [[Composing free and open online educational resources]] - beginning March 3rd 2008 ==Conferences== * [[Learning and learning about learning in Wikiversity]] (Paper at [[Wikimania/2007|Wikimania 2007]]) * [[User:Cormaggio/ALT-C|ALT-C resources]] ([http://www.alt.ac.uk/altc2007/timetable/abstract.php?abstract_id=1318 details]) * I'll be participating in a [http://wiki.cetis.ac.uk/ECSIG_May08_meeting ECSIG meeting] on "openness", in which I'll be outlining Wikiversity and its implications for education (see [http://wiki.cetis.ac.uk/Creating_open_content_for_Wikiversity notes]) * [[Collaborative research on Wikiversity]] (paper for Wikimania 2008) ==Statistics== At '''{{CURRENTTIME}}, [[{{CURRENTMONTHABBREV}} {{CURRENTDAY2}}]], [[{{CURRENTYEAR}}]]''' ([[UTC]]): {{NUMBEROFPAGES}} pages, {{NUMBEROFARTICLES}} articles, {{NUMBEROFFILES}} files, {{NUMBEROFUSERS}} users & {{NUMBEROFADMINS}} admins; version {{CURRENTVERSION}} * At 22:27, Jan 23, 2007 (UTC): 20,606 pages, 1,628 articles, 892 files, 5,725 users & 14 admins; version 1.10alpha (r19597) ==Bureaucrat== I was made an English Wikiversity bureaucrat as a temporary measure on the project's first day, and held that lonesome position for quite a while (unnecessarily long as I subsequently found out), since when I've been joined by [[User:Sebmol|Sebmol]]. If you need any assistance, you can '''[http://en.wikiversity.org/w/index.php?title=User_talk:Cormaggio&action=edit&section=new leave me a message]''' on my [[User_talk:Cormaggio|talk page]]. 
If it's a general custodian (ie "admin") query, you might be better off with [[Wikiversity:Request custodian action]], as this is monitored by all custodians, not just me. ==Some things I'm interested in doing== *Write research-related guidelines (eg. [[Wikiversity:Research]] and [http://beta.wikiversity.org/wiki/Wikiversity:Scope_of_research/En on Beta]) *Help out on providing a comprehensive and easy-to-follow tutorial on '''what Wikiversity is, and how you can contribute and/or benefit from it'''. *Help in providing a framework for building Wikiversity, which I don't really think exists yet (so far, it's a bit piece-meal, and particular learning materials are connected with particular pedagogies, without making that explicit) **Develop [[School:Education]], [[Wikiversity:Learning]], [[Wikiversity:Disclosures]] *Work on [[Media literacy]] (and all my other interests), hopefully learning about what will work on Wikiversity through that process **[[Making an educational video]] *Metadata - what's happening on that front, I wonder..? *Participate in [[Portal:Audio engineering|Topic:Audio Engineering]] - learn about making music ==[[Portal:Learning Projects|Projects]]== *[[Learning to learn a wiki way]] *[[Wikiversity:Wiki as a tool for learning]] *[[Wikiversity the Movie]] ==Interests== *[[Portal:Media]] *[[Wikiversity:Research|Research]] *[[School:Education|Education]] ==Workspaces== *[[User:Cormaggio/My? research]] *[[User:Cormaggio/Visions of Wikiversity]] *[[User:Cormaggio/Issues for Wikiversity]] *[[User:Cormaggio/Ongoing discussions]] ==Contacts== *[http://www.teachforward.org/ Teachforward] and [http://wikimania2006.wikimedia.org/wiki/Proceedings_talk:KD1 Rob and Kevin's Wikimania talk talkpage] *[http://wikigogy.org/Main_Page Wikigogy] (yet to contact) *[http://www.cnx.org/ cnx.org] Very interesting resource - see [[Wikiversity and Connexions collaboration]] *[http://www.teachingforthefuture.com/ Teaching for the future] (blog on technology and education) *[http://academia.wikia.com/wiki/User:MartinY MartinY at academia wikia (Yorkshireman)] [[de:Benutzer:Cormaggio]] [[es:Usuario:Cormaggio]] [[fr:Utilisateur:Cormaggio]] [[it:Utente:Cormaggio]]

🤔 which shard/example is this?

It looks like huggingface was picking up the .../raw/documents/*.jsonl.gz files to display in the dataset preview.

I uploaded a metadata file that restricts the dataset viewer to the `.../v0/documents/*.jsonl.gz` files; now it seems to be showing the cleaned versions.

It doesn't seem to be updated for me. Here's the first entry in the dataset viewer:

Main Page Featured books • Wikijunior • Cookbook • Browse books • Help • Forum • Using Wikibooks Welcome to Wikibooks, the open-content textbooks collection that anyone can edit. books with pages. <div style="flex: 1 0 50%; width:50%; min-width:10em; float: right; box-sizing: border-box; font-size:95%; display: flex; flex-wrap: wrap;"> <div style="clear:both; width:100%; margin: 25px 0 0 0; display: flex; flex-wrap: wrap; box-sizing: border-box; " id="mp-content"> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> * Computing * Engineering * Humanities * Languages * Mathematics * Miscellaneous * Recreational activities * Science * Social sciences * Standard curricula * Wikijunior * All subjects

It looks like wtf_wikipedia only removed some of the inline html/css. The original version has a lot more markup:

<div id="mf-main" style="width:100%; margin:0px; padding:0px 5px 5px 5px; "><!-- Opening div -->
<div id="mf-help" style="clear:both; text-align:center; font-size:95%; margin:0em 1em 0em 1em;">

[[Wikibooks:Featured books|Featured books]] •
[[Wikijunior|Wikijunior]] •
[[Cookbook:Table of Contents|Cookbook]] •
[[Wikibooks:Card Catalog Office|Browse books]] •
[[Help:Contents|Help]] •
[[Wikibooks:Reading room|Forum]] •
[[Using Wikibooks]]
</div>
<div style="width:40%; clear:both; float:left; text-align:center; padding-bottom: 1em; ">
<!----><div style="font-size:162%; padding:0.1em;">[[Wikibooks:Welcome|Welcome]] to [[Wikibooks:What is Wikibooks|Wikibooks]],</div>
<!----><div style="font-size:95%; padding-top:0.2em;">the open-content textbooks collection that [[Help:Contributing|anyone can edit]].</div>
<!----><div id="pagecount" style="font-size:85%;">[[Wikibooks Stacks/Departments|{{NUMBEROFBOOKS}} books]] with [[Special:Allpages|{{NUMBEROFARTICLES}} pages]].<!-- above div --></div>
</div>
<div style="flex: 1 0 50%; width:50%; min-width:10em; float: right; box-sizing: border-box; font-size:95%; display: flex; flex-wrap: wrap;">
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Computing|Computing]]
* [[Department:Engineering|Engineering]]
* [[Department:Humanities|Humanities]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Languages|Languages]]
* [[Department:Mathematics|Mathematics]]
* [[Department:Miscellaneous|Miscellaneous]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Recreational activities|Recreational activities]]
* [[Department:Science|Science]]
* [[Department:Social sciences|Social sciences]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Standard curricula|Standard curricula]]
* [[Department:Wikijunior|Wikijunior]]
* '''[[Wikibooks Stacks/Departments|All subjects]]'''
<!----></div>
</div>
<div style="clear:both; width:100%; margin: 25px 0 0 0; display: flex; flex-wrap: wrap; box-sizing: border-box; " id="mp-content">
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Featured}}</div>
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Wikijunior}}</div>
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Recipe}}</div><!--

1. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Games|cover=Chess_board_blank.svg|desc=[[Shelf:Games]] contains books on games, and includes the subsection shelves [[Shelf:Athletic games|athletic games]], [[Shelf:Board games|board games]], [[Shelf:Card games|card games]], [[Shelf:Electronic games|electronic games]], and [[Shelf:Game design|game design]].}}</div><!--

2. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Computer programming|cover=Openscad SVG.svg|desc=[[Shelf:Computer programming]] contains books on programming, such as [[LaTeX]], [[OpenSCAD User Manual]], [[Python Programming]], and [[Java Programming]].}}</div><!--

3. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Languages of Asia|cover=Taj Mahal in March 2004.jpg|desc=[[Shelf:Languages of Asia]] contains books on Asian languages, such as [[Marathi]], [[Bengali]], [[Kannada]], [[Hindi]], and [[Sanskrit]].}}</div>
</div>
{{:Main Page/Sisters}}
</div><!-- Closing div -->
[[Category:Main page| ]]

Got it. In that case we may want to do a simple HTML removal pass (via bs4 or whatever). Also, is it an artifact of the dataset viewer that there are no newlines?

Also, is it an artifact of the dataset viewer that there are no newlines?

Yeah it must be, the actual data has \ns in it, and everything looks right if you pass an example to print in python

Got it. In that case we may want to do a simple HTML removal pass (via bs4 or whatever)

I tried this out; it seems a bit non-trivial, and bs4 isn't well suited to removing html fragments mixed into text. The main issues are:

  • People talk about code and include examples, and these get stripped out
  • People write things that look almost like an html tag (a regex containing < for example), and that crashes the html parser (the text can be passed through, but it stops the removal of other html from the page)

It also doesn't seem very consistent. For example, the div in the example above gets removed, but this div in another example doesn't:

...
Yours, Keegan Peterzell Community Liaison, Wikimedia Foundation 23:06, 17 March 2015 (UTC)\n\nRenamed\n<div class=\"plainlinks mw-content-ltr\" lang=\"en\" dir=\"ltr\u201d> This account has been renamed as part of single-user login finalisation.
...

Not sure what the difference is; it isn't something like the divs getting closed in one example but not the other, since they are not closed in either example.

I'm less worried about the example code getting stripped out (a small price to pay to make the rest of the text much more "natural"). I'm surprised the parser is brittle to things that "look like" an HTML tag; that's too bad, though I would also guess that's parser-dependent? And it's bizarre that one of the divs is removed and not the other. I don't want you to go down a rabbit hole, so I will ask around a little.

Poking around a little more I'm mostly seeing unclosed <div> and <font> tags - I wonder if we can just strip those out via simple postprocessing?

Separately there are many very short pages (many of which are almost empty user talk pages). Probably worth doing some heuristic filtering to remove them (though this could be done before training, not to the dataset itself)
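One possible heuristic filter of that sort, as a sketch (the thresholds here are made up, not agreed upon):

```python
def keep_document(doc, min_chars=200, min_words=25):
    """Drop nearly empty pages (e.g. blank user-talk stubs)."""
    text = doc["text"].strip()
    return len(text) >= min_chars and len(text.split()) >= min_words
```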

Are most of these unclosed tags coming from any particular source/namespace/shard?

Poking around a little I see them on wikinews, wikibooks, wikiversity, etc... and apparently across shards.

To be conservative, I only stripped out div and font tags to start with (<(div|font).*?> instead of something like <.*?>).
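A minimal sketch of that pass; note this variant also drops closing </div> and </font> tags, which is an assumption on my part rather than exactly the regex quoted above:

```python
import re

# Only div/font tags are stripped; anything else that merely looks like HTML
# (e.g. code samples on talk pages) is left alone.
DIV_FONT_RE = re.compile(r"</?(?:div|font)\b[^>]*>", re.IGNORECASE)

def strip_div_font(text):
    return DIV_FONT_RE.sub("", text)
```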

There were 115,201 tags that got stripped from 00000_wikibooks.com.jsonl.gz

There didn't seem to be any false positives (i.e. real-looking text that gets removed). A few of the stripped tags have a title field, but only about 4, so I don't think it's worth trying to put the title text into the plain text. Examples:

      1 "<div style=\"font-size:20pt\" title=\"Welcome!\">"
      1 "<div style=\"font-size:20pt\" title=\"What are you like? (lit. 'How are you [as a person]?')\">"
      1 "<div style=\"font-size:20pt\" title=\"What's the problem?\">"
      1 "<div style=\"font-size:29pt\" title=\"How much/how many is there?\">"

I'll look through it later, but if you noticed any other tag types that were common, let me know. I won't be able to actually run the processing until later.

Thanks. If we want to be more thorough we could write a little script that searches for HTML tags and dumps out the most common of them (if there are 115k div/font tags in one shard, they are probably overrepresented compared to tags that would appear otherwise in example code).
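A sketch of that script, assuming dolma-style *.jsonl.gz shards with the text in a "text" field (shard paths passed on the command line):

```python
import collections
import gzip
import json
import re
import sys

# Matches opening and closing tags and captures the tag name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b")

def main(paths):
    counts = collections.Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                counts.update(m.group(1).lower() for m in TAG_RE.finditer(doc["text"]))
    for tag, count in counts.most_common(25):
        print(f"{count}\t{tag}")

if __name__ == "__main__":
    main(sys.argv[1:])
```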