r-three/common-pile

Wiki Data

Opened this issue · 23 comments

This issue tracks all wiki processing now that the different sources have been unified. Will be closing #7 and #1.

There are three sources of wiki data we will be using:

  • Internet Archive dumps (from the wikiteam)
  • Official wiki dumps (wikimedia, i.e. wikipedia and its sister projects; fandom; etc.)
  • v2: Wiki scrapes using the wikiteam tools

Sources to create a list of wikis:

  • Enumeration of the wikimedia sources (wikipedia, wikibooks, wikinews, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary)
  • Search on the internet archive
  • v2: List of wikis scraped from wikiindex (checked against the internet archive search)

Data Processing Pipeline:

Data Collection

Archive Sources:

  • Collect wiki metadata from the Internet Archive via search
    • Resulted in ~325,000 wikis based on looking for CC-BY, CC-BY-SA, and publicdomain
      • This is restricted to wikis that are in the wikiteam collection. When the wikicollections tag is allowed too, it jumps to ~4 million wikis.
    • Takes ~40 hours to download the metadata serially using the Internet Archive python library. This can be sped up by merging the results of mutually exclusive queries.
  • Download wikis from IA based on their metadata
    • Download of 4.4TB, extracted to 13TB, reduced to 8.8TB by removing unneeded things like images.
    • Dumps have multiple formats, multiple compression methods, and multiple places where the real data lives.
    • Can be parallelized by providing --worker_id and --num_workers
    • During downloading and similar steps, the metadata.identifier field is used as the key, but our actual "source" field will use the domain name.
    • Dispatch on special cases
      • When a wiki is dead, download the IA dump (most recent) -> wiki/archive
      • v2: When a wiki is live, if it has an official dump (like fandom), then download that dump -> wiki/dump
      • v2: When a wiki is live, if the IA dump is older than ...days, then rescrape with wikiteam tools -> wiki/scrape
      • Currently all data is downloaded from the IA.
  • In Progress: Converting the target directory naming from just ${target_dir}/${id} to ${target_dir}/${id}[:n]/${id}[n:2n]/.../${id}. Things like `os.path.exists`, used to check for data that has already been processed, are currently really slow (~40 minutes to check whether all ~325,000 dumps exist on disk); this should help speed things up (see the sketch after this list).
  • In Progress: Convert each dump to the dolma format such that the "text" field has wikitext in it.
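A minimal sketch of the sharded layout described in the in-progress item above, assuming 2-character shards and 2 levels (the real values of n and the depth may differ):

```python
import os

def sharded_path(target_dir, identifier, width=2, depth=2):
    """Map an identifier to ${target_dir}/${id}[:n]/${id}[n:2n]/.../${id} so
    that no single directory has to hold all ~325,000 dumps."""
    shards = [identifier[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(target_dir, *shards, identifier)

# e.g. sharded_path("dumps", "wikiexample") -> "dumps/wi/ki/wikiexample"
```

Existence checks then only have to look inside a couple of small directories instead of scanning one directory with ~325,000 entries.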

Dump Sources:

  • Download the enumeration of wikimedia sources
  • Extract the wikimedia dumps
  • Convert each dump to the dolma format such that the "text" field has wikitext in it (see the sketch after this list).
    • This is pretty fast; running a single thread on a CPU from 2017, it took <2 hours to extract all the pages from the wikipedia dump.
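For reference, a sketch of what one converted record could look like. The "text" and "source" fields are the ones mentioned above; the other field names and values are assumptions for illustration, not taken from the actual pipeline:

```python
import json

# One line of a .jsonl.gz shard: "text" holds raw wikitext at this stage and
# is converted to plain text later in the pipeline.
record = {
    "id": "enwiki-Main_Page",        # hypothetical identifier
    "source": "en.wikipedia.org",    # domain name, per the note above
    "text": "'''Wikipedia''' is a free [[encyclopedia]] that ...",
    "metadata": {"dump": "wikimedia"},  # assumed extra bookkeeping
}
print(json.dumps(record))
```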

It is currently unclear whether a dump grabbed based on IA metadata should be saved into the same area as the wikimedia dumps, where processing is based on what lives in the directories, or whether it should stay in the IA area, where the metadata file dictates what is processed.

Scrape Sources:

  • v2: Scrape wikis listed in wikiindex that aren't included in the IA metadata using the wikiteam3 tools (which are written in python3). Make sure to include the talk pages in the scrape.
  • v2: Scrape wikis from our manual list that aren't included in the IA metadata. One difficulty will be entity linking: the ids used in IA are not very consistent; some wikis have multiple dumps (from different times) available under one entity, while others have dumps from different points in time listed as separate entities with unique ids (which have the date in them).
  • v2: Include wikis from the IA that are found with the wikicollection tag in addition to the wikiteam tag

WikiText parsing:

Now that everything is in a unified sharded format, actual parsing of wikitext is essentially infinitely horizontally scalable.

  • Create a dolma parallel processor that converts the wikitext in the "text" field to plain text.
    • This currently uses wtf_wikipedia behind a simple nodejs server.
    • Dolma handles parallelization over the available worker cores (each worker processes one shard at a time)
    • The parallel processor makes calls to the nodejs server (see the sketch after this list).
    • Multiple instances of the nodejs server can be run behind a load balancer like HAProxy
    • My 4-core laptop can process ~700 documents/second with this setup (4 dolma workers and 4 server instances)
  • Modify wtf_wikipedia parsing to do what we want wrt math
    • Currently we are editing the math tags and templates in python in such a way that they aren't touched by wtf_wikipedia.
    • Conversion from wikimath to latex is mostly done; just 0.6% of templates are not handled (they just get removed)
    • In Progress v2?: Convert processing to a real grammar and convert to latex by munging the AST.
  • Spot check wtf_wikipedia output and add python post processing (for example, the headers for things like "external links" seem to be left in the pages).
  • Run wikitext conversion for all the wiki dumps
    • MediaWiki
    • WikiTeam3
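A minimal sketch of the worker-to-server pattern described above. The /parse endpoint, port, and response shape are assumptions for illustration; the real processor is built on dolma's parallel-processing machinery, which runs something like this per shard:

```python
import gzip
import json

import requests

# Hypothetical endpoint for the wtf_wikipedia nodejs server (or an HAProxy
# front-end when several server instances are running).
WTF_SERVER = "http://localhost:3000/parse"

def process_shard(in_path, out_path):
    """Send each document's wikitext to the server and write back plain text."""
    with gzip.open(in_path, "rt", encoding="utf-8") as fin, \
         gzip.open(out_path, "wt", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            resp = requests.post(WTF_SERVER, json={"wikitext": doc["text"]})
            resp.raise_for_status()
            doc["text"] = resp.json()["text"]  # assumed response shape
            fout.write(json.dumps(doc) + "\n")
```

Running one such worker per core, each on its own shard, is what gives the horizontal scaling mentioned above.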

The wikiarchive wikis are chugging along; I think the main bottleneck is I/O contention, so there isn't much we can do to speed it up.

The wikiarchive wikis don't tend to include the talk pages, which is another reason we will want to re-scrape in v2

The wikimedia data I have processed (wikipedia, wiktionary, etc.) didn't have talk pages either. I'm updating the code, redownloading, and reprocessing to include the talk pages (will post some stats about how much more data that is when done).

wikitext parsing updates

wtf_wikipedia seems to work well. One place it leaves weird artifacts is images with really long captions (possibly caused by newlines): it leaves the [[ in there. It also removes section names, but I have an easy fix to get them back.

I've also been working on handling math parsing. wtf_wikipedia is pretty inconsistent in how it handles math. It handles simple stuff well but strips out anything more complex. When processing 1 shard of the wikipedia dataset, 49% of the <math> tags got stripped out (of those, 91% of the more complex <math display="block"> sections were stripped). It does a bit better with the math template syntax that wikitext supports (but that might be because people tend to use it for less complex stuff). Only ~5% of those are totally stripped, but its processing of the ones that aren't removed is still wrong; the clearest sign is that it removes the <sup> and <sub> tags.

My current approach is to:

  1. Convert all the <math> tags to appropriate $ and $$. This converts them to latex delimiters and seems to cause wtf_wikipedia to leave them alone (see the sketch after this list)
  2. detect the {{math|...}} templates in the wikitext and either:
    a) remove them from the wikitext, leaving a simple marker; convert the template into latex; use wtf_wikipedia to convert the rest of the page; and substitute the converted templates back in at the correct places. An issue with this is that the templates can contain arbitrary wikitext like the link syntax, so we'd need to more or less reimplement some wikitext parsing in the math template -> latex conversion code.
    b) Do rewrites in the templates for things like <sub>/<sup> into a more latex-like format and then pass things into wtf_wikipedia. It is unclear how many of these rewrites will be needed to avoid stripping these templates (for example wtf_wikipedia can handle a template like {{math|1=F=ma}} but will strip a template like {{math|F {{=}} ma}}). We would also want to add $ around these parts.
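A minimal sketch of step 1, assuming the block/inline split is carried by the display attribute (the real preprocessing may handle more attribute variants):

```python
import re

def convert_math_tags(wikitext):
    """Rewrite <math> tags as $...$ / $$...$$ so wtf_wikipedia passes the
    contents through untouched."""
    # <math display="block">...</math> -> $$...$$
    wikitext = re.sub(
        r'<math\s+display\s*=\s*"block"[^>]*>(.*?)</math>',
        r'$$\1$$', wikitext, flags=re.DOTALL)
    # any remaining <math ...>...</math> -> $...$
    wikitext = re.sub(
        r'<math[^>]*>(.*?)</math>',
        r'$\1$', wikitext, flags=re.DOTALL)
    return wikitext
```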

Some math-like templates (for example {{mvar|X}}) can appear outside of the {{math|...}} template, so my plan is to find the {{math|...}} ones first, which will contain most of the other templates, and then pick up the remaining ones later.

An open question is the handling of unicode symbols. Lots of math articles have symbols like π or θ directly in the text. In a latex conversion, would we want to convert these to \pi and \theta? Similarly, there are fraction templates like {{sfrac|1|2}} that wtf_wikipedia converts to 1/2 but which might make more sense as \frac{1}{2}.

I might convert all the dumps with just wtf_wikipedia as a v0 to get approximate token counts while I work on refining the math handling.

The end of a wikipedia page is generally a long list of references, links to related pages, etc. wtf_wikipedia converts these into a more plaintext format, but it is still pretty long and unnatural. I'm thinking of looking into removing these final sections.

Nice!

I dug out my old code and notes on wtf_wikipedia processing. Apart from the math stuff you raised, I also found that it would dump out stuff like
File:V Train AC.JPG|Class 310
when there was an image, and that it dumped out wikitext tables (which we probably want to omit in the context of modeling natural text). There also was one example where there were a bunch of instances of

In:
Out:

Maybe that stuff has changed by now though.

Regarding standalone symbols, I think it's appropriate to leave them as unicode rather than try to convert them to LaTeX.

I think when I was messing with wtf_wikipedia before I also just had code that manually stripped out the references section. If it's not natural text, let's just remove it.

I've seen some of the image thing and am looking into how to fix it. I haven't seen the In:/Out: thing yet.

I've updated stuff to remove the references sections (also things like external links, see also, etc.)

I found an issue that comes up in how editors tend to style math: basically the indentation is handled wrong, and text from below an equation can end up before it. I opened an issue, spencermountain/wtf_wikipedia#577, but I don't know JS/wtf_wikipedia well enough to solve it myself right now. My current workaround is to add newlines between :indent lines and the following text lines, which seems to produce results in the correct order (see the sketch below).
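A minimal sketch of that workaround; the exact handling in the real preprocessing may differ:

```python
import re

def pad_indented_math(wikitext):
    """Insert a blank line after a ":"-indented (equation) line that is
    followed directly by normal text, so the text below the equation is not
    reordered in front of it."""
    return re.sub(r"(?m)^(:.*)\n(?=[^\s:])", r"\1\n\n", wikitext)
```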

hey! just stumbled on this, and am happy to help with any issues - it's really useful to find examples where things break.

There is a max-length for images - happy to increase this, if there are known issues.

I've been blessed to never learn latex, and this is probably the cause of the shy parsing. Happy to tweak a regex so more things pass through. I always assumed .text() would just kill that template, but it doesn't have to. Amazing what these new models can do.
cheers

Hi Spencer, I think the issue with images is just that in our use case we don't want the "File:V Train AC.JPG|Class 310" text. I can see why that would be reasonable to include otherwise though.

I have some math template munging that seems good enough for now. The general approach is to use a regex to find where to start editing, then iterate forward in the text to find the end of the scope; then you loop over a bunch of these edits (a sketch of the scope-finding scan follows the examples below). Only a few cases support nesting of the same template (mset and abs are the main ones that needed it), but it does support nesting of different templates (you can have an overline template inside of a strong template, for example).

Examples:

{{math|{{abs|''ε''(''x'')}} < 7.5·10<sup>−8</sup>}} -> |ε(x)| < 7.5·10^{−8}
{{math|''v''<sub>p</sub>&nbsp;{{=}}&nbsp;''ω''/''k''}} -> v_{p} = ω/k
{{math|''x''<sup>p<sub>i</sub></sup><sub>j</sub> {{ = }} {{Fraction|{{delta}}|{{abs|1 - {{abs|y}}}}}}}} -> x^{p_{i}}_{j} = \delta⁄|1 - |y||
{{math|x<sup>2<sup>4</sup></sup>}} -> x^{2^{4}}
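A minimal sketch of the scope-finding scan described above; the real code then dispatches to per-template rewrite rules once the span is known:

```python
def find_template_end(text, start):
    """Given the index of the "{{" that opens a template, return the index
    just past its matching "}}", tracking nested templates along the way."""
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # unbalanced template: drop it and keep processing the rest
```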

I extracted all the {{math|...}} templates from wikipedia and ran my fixes over them. There are 161,459 math templates, and only 971 of them now result in an empty string (0.6%). A few template types make up most of these errors. What they have in common is that they require parsing the template into parameters and doing different things based on those. For example, {{overset|~|a}} puts a tilde over the a. If we want to handle these, the approach I used will get really brittle and we probably need to switch over to an actual grammar (I'm going to try my hand at this; a help page claims wikitext isn't context-sensitive, it just has really complex error handling, but I'm not sure: having things like an escape that inserts a | which doesn't delimit parameters seems pretty context sensitive to me). The mis-handled templates include:

  • {{vec|...}}
  • {{font color|...}}
  • {{color|...}}
  • {{pars|...}}
  • {{nuclide|...}}
  • {{Subatomic Particle|...}}
  • {{music|...}}
  • {{thinspace|...}}
  • {{#if|....}}
  • {{gap|...}}
  • {{su|...}}
  • {{SubSup|...}}
  • {{braket|...}}
  • {{bra-ket|...}}
  • {{overset|...}}

There are also a few cases where the output for {{mset|...}} is a bit mangled, as you can use the different arguments to create sets that look like {x | x \in S}. There are also a lot of templates defined in wikitext that don't appear in wikipedia, so I have no code that handles them. Not sure if they appear in other wikis.

There are some cases that I don't handle atm; for example, one template is {{math|<big><big>(</big></big>...}}. This becomes just ( after running wtf_wikipedia. It would be more correct to translate it to some latex like \bigg(, but currently none of the code exposes a nesting depth count. I figured this was good enough.

There are also a few symbols that get stripped out of {{math|...}} templates that seem like they shouldn't. I noticed :, #, * when they are by themselves and () when there isn't anything between them.

This seems reasonable for v1!

I've been fixing a bug where a malformed template stops processing of all the other templates later in the document. Now the malformed one is removed and the rest are processed correctly.

Working on running the preprocessor on all the data next

I've finished processing the MediaWiki wikis and am working on the wikiteam archives next (delayed by having to move data around). The data so far is here: https://huggingface.co/datasets/blester125/wiki

The MediaWikis (which include talk pages) have ~14 billion tokens.

Some Stats:

  • On my desktop (Ryzen 9 7950X) it took ~4 hours 15 minutes to process the 64 million documents.
  • 36 documents caused issues with wtf_wikipedia (mostly crashes from trying to look up an attribute on undefined; I assume a regex didn't match and that case isn't handled. 1 crashed from "Invalid String Length"; it seems like this is from a document being too long for the maximum string length that v8 allows lol)
  • 12 documents caused OoM issues (I've tried to avoid these by tweaking things like heap sizes, but it still happens)
  • 83 documents took so long (>3 minutes) to do wtf_wikipedia parsing that our server setup timed them out
  • ~1000 documents ended up blank after wtf_wikipedia parsing; I haven't looked into the quality of these documents before they went in (i.e. should they be saved by special-case handling)

Nice. It seems like there's a lot of markup in the text field though, is that expected? e.g.

[[Image:CormacportraitBig.jpg|thumb|100px|right|Greetings!]] Hi, my real name is Cormac Lawler. I've been closely involved with setting up and developing Wikiversity. This process has been the focus of my ongoing PhD (which began in September 2006, and which I will be submitting early 2011): to participate in the definition of what this space is, what it means in the context of the wider world of education, and what its opportunities and limitations are. *Real name: Cormac Lawler *Email: cormaggio (at) gmail (dot) com *Cormaggio on [[w:user:Cormaggio|Wikipedia]] | [[m:User:Cormaggio|Meta]] | [[commons:User:Cormaggio|Commons]] | [[b:User:Cormaggio|Wikibooks]] and many other places :-) (It's an old nickname that I resurrected for an email address, which has subsequently stuck as an online moniker) *[http://www.cormaggio.org/wiki/index.php?title=User:Cormac more detailed bio and interests] *[http://cormaggio.org blog] {{#babel:en|es-1}} ==Research== * [[Wikiversity:Developing Wikiversity through action research|Developing Wikiversity through action research]] - main research page I'm using a [http://cormaggio.org/ blog] and a [http://www.cormaggio.org/wiki/index.php?title=Main_Page wiki] to help document and facilitate my research, but I also intend to carry out much (if not most) of this on Wikiversity itself, particularly as we have developed this project with the potential for actually ''doing'' [[Wikiversity:Research|research]] (though these [http://beta.wikiversity.org/wiki/Wikiversity:Scope_of_research/En guidelines] still need to be developed). ==Courses== * I participated for a while in [[Action research/AREOL25|AREOL25]] (course home page [http://www.scu.edu.au/schools/gcm/ar/areol/areolhome.html here]), until I lost track of a few weeks' emails, and gave up. :-( * Hoping to participate in David Wiley's [http://opencontent.org/wiki/index.php?title=Intro_Open_Ed_Syllabus Open Ed Intro course] - but didn't :-( * [[Composing free and open online educational resources]] - beginning March 3rd 2008 ==Conferences== * [[Learning and learning about learning in Wikiversity]] (Paper at [[Wikimania/2007|Wikimania 2007]]) * [[User:Cormaggio/ALT-C|ALT-C resources]] ([http://www.alt.ac.uk/altc2007/timetable/abstract.php?abstract_id=1318 details]) * I'll be participating in a [http://wiki.cetis.ac.uk/ECSIG_May08_meeting ECSIG meeting] on "openness", in which I'll be outlining Wikiversity and its implications for education (see [http://wiki.cetis.ac.uk/Creating_open_content_for_Wikiversity notes]) * [[Collaborative research on Wikiversity]] (paper for Wikimania 2008) ==Statistics== At '''{{CURRENTTIME}}, [[{{CURRENTMONTHABBREV}} {{CURRENTDAY2}}]], [[{{CURRENTYEAR}}]]''' ([[UTC]]): {{NUMBEROFPAGES}} pages, {{NUMBEROFARTICLES}} articles, {{NUMBEROFFILES}} files, {{NUMBEROFUSERS}} users & {{NUMBEROFADMINS}} admins; version {{CURRENTVERSION}} * At 22:27, Jan 23, 2007 (UTC): 20,606 pages, 1,628 articles, 892 files, 5,725 users & 14 admins; version 1.10alpha (r19597) ==Bureaucrat== I was made an English Wikiversity bureaucrat as a temporary measure on the project's first day, and held that lonesome position for quite a while (unnecessarily long as I subsequently found out), since when I've been joined by [[User:Sebmol|Sebmol]]. If you need any assistance, you can '''[http://en.wikiversity.org/w/index.php?title=User_talk:Cormaggio&action=edit&section=new leave me a message]''' on my [[User_talk:Cormaggio|talk page]]. 
If it's a general custodian (ie "admin") query, you might be better off with [[Wikiversity:Request custodian action]], as this is monitored by all custodians, not just me. ==Some things I'm interested in doing== *Write research-related guidelines (eg. [[Wikiversity:Research]] and [http://beta.wikiversity.org/wiki/Wikiversity:Scope_of_research/En on Beta]) *Help out on providing a comprehensive and easy-to-follow tutorial on '''what Wikiversity is, and how you can contribute and/or benefit from it'''. *Help in providing a framework for building Wikiversity, which I don't really think exists yet (so far, it's a bit piece-meal, and particular learning materials are connected with particular pedagogies, without making that explicit) **Develop [[School:Education]], [[Wikiversity:Learning]], [[Wikiversity:Disclosures]] *Work on [[Media literacy]] (and all my other interests), hopefully learning about what will work on Wikiversity through that process **[[Making an educational video]] *Metadata - what's happening on that front, I wonder..? *Participate in [[Portal:Audio engineering|Topic:Audio Engineering]] - learn about making music ==[[Portal:Learning Projects|Projects]]== *[[Learning to learn a wiki way]] *[[Wikiversity:Wiki as a tool for learning]] *[[Wikiversity the Movie]] ==Interests== *[[Portal:Media]] *[[Wikiversity:Research|Research]] *[[School:Education|Education]] ==Workspaces== *[[User:Cormaggio/My? research]] *[[User:Cormaggio/Visions of Wikiversity]] *[[User:Cormaggio/Issues for Wikiversity]] *[[User:Cormaggio/Ongoing discussions]] ==Contacts== *[http://www.teachforward.org/ Teachforward] and [http://wikimania2006.wikimedia.org/wiki/Proceedings_talk:KD1 Rob and Kevin's Wikimania talk talkpage] *[http://wikigogy.org/Main_Page Wikigogy] (yet to contact) *[http://www.cnx.org/ cnx.org] Very interesting resource - see [[Wikiversity and Connexions collaboration]] *[http://www.teachingforthefuture.com/ Teaching for the future] (blog on technology and education) *[http://academia.wikia.com/wiki/User:MartinY MartinY at academia wikia (Yorkshireman)] [[de:Benutzer:Cormaggio]] [[es:Usuario:Cormaggio]] [[fr:Utilisateur:Cormaggio]] [[it:Utente:Cormaggio]]

🤔 which shard/example is this?

It looks like huggingface was picking up the .../raw/documents/*.jsonl.gz files to display in the dataset preview.

I uploaded a metadata file that restricts the dataset viewer to the `.../v0/documents/*.jsonl.gz` files; now it seems to be showing the cleaned versions.

It doesn't seem to be updated for me. Here's the first entry in the dataset viewer:

Main Page Featured books • Wikijunior • Cookbook • Browse books • Help • Forum • Using Wikibooks Welcome to Wikibooks, the open-content textbooks collection that anyone can edit. books with pages. <div style="flex: 1 0 50%; width:50%; min-width:10em; float: right; box-sizing: border-box; font-size:95%; display: flex; flex-wrap: wrap;"> <div style="clear:both; width:100%; margin: 25px 0 0 0; display: flex; flex-wrap: wrap; box-sizing: border-box; " id="mp-content"> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> <div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; "> * Computing * Engineering * Humanities * Languages * Mathematics * Miscellaneous * Recreational activities * Science * Social sciences * Standard curricula * Wikijunior * All subjects

It looks like wtf_wikipedia only removed some of the inline html/css. The original version has a lot more markup:

<div id="mf-main" style="width:100%; margin:0px; padding:0px 5px 5px 5px; "><!-- Opening div -->
<div id="mf-help" style="clear:both; text-align:center; font-size:95%; margin:0em 1em 0em 1em;">

[[Wikibooks:Featured books|Featured books]] •
[[Wikijunior|Wikijunior]] •
[[Cookbook:Table of Contents|Cookbook]] •
[[Wikibooks:Card Catalog Office|Browse books]] •
[[Help:Contents|Help]] •
[[Wikibooks:Reading room|Forum]] •
[[Using Wikibooks]]
</div>
<div style="width:40%; clear:both; float:left; text-align:center; padding-bottom: 1em; ">
<!----><div style="font-size:162%; padding:0.1em;">[[Wikibooks:Welcome|Welcome]] to [[Wikibooks:What is Wikibooks|Wikibooks]],</div>
<!----><div style="font-size:95%; padding-top:0.2em;">the open-content textbooks collection that [[Help:Contributing|anyone can edit]].</div>
<!----><div id="pagecount" style="font-size:85%;">[[Wikibooks Stacks/Departments|{{NUMBEROFBOOKS}} books]] with [[Special:Allpages|{{NUMBEROFARTICLES}} pages]].<!-- above div --></div>
</div>
<div style="flex: 1 0 50%; width:50%; min-width:10em; float: right; box-sizing: border-box; font-size:95%; display: flex; flex-wrap: wrap;">
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Computing|Computing]]
* [[Department:Engineering|Engineering]]
* [[Department:Humanities|Humanities]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Languages|Languages]]
* [[Department:Mathematics|Mathematics]]
* [[Department:Miscellaneous|Miscellaneous]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Recreational activities|Recreational activities]]
* [[Department:Science|Science]]
* [[Department:Social sciences|Social sciences]]
<!----></div>
<!----><div style="float:left; width:25%; flex: 1 0 25%; min-width: 12em;">
* [[Department:Standard curricula|Standard curricula]]
* [[Department:Wikijunior|Wikijunior]]
* '''[[Wikibooks Stacks/Departments|All subjects]]'''
<!----></div>
</div>
<div style="clear:both; width:100%; margin: 25px 0 0 0; display: flex; flex-wrap: wrap; box-sizing: border-box; " id="mp-content">
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Featured}}</div>
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Wikijunior}}</div>
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{:Main Page/Recipe}}</div><!--

1. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Games|cover=Chess_board_blank.svg|desc=[[Shelf:Games]] contains books on games, and includes the subsection shelves [[Shelf:Athletic games|athletic games]], [[Shelf:Board games|board games]], [[Shelf:Card games|card games]], [[Shelf:Electronic games|electronic games]], and [[Shelf:Game design|game design]].}}</div><!--

2. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Computer programming|cover=Openscad SVG.svg|desc=[[Shelf:Computer programming]] contains books on programming, such as [[LaTeX]], [[OpenSCAD User Manual]], [[Python Programming]], and [[Java Programming]].}}</div><!--

3. Promoted shelf

-->
<!----><div style="box-sizing: border-box; float:left; flex: 1 0 33%; width:33%; border: 0.2em solid #FAFAFA; padding: 1em; background-color:#F3F3F3; min-width: 20em; overflow:hidden; ">{{Promoted shelf|title=Shelf:Languages of Asia|cover=Taj Mahal in March 2004.jpg|desc=[[Shelf:Languages of Asia]] contains books on Asian languages, such as [[Marathi]], [[Bengali]], [[Kannada]], [[Hindi]], and [[Sanskrit]].}}</div>
</div>
{{:Main Page/Sisters}}
</div><!-- Closing div -->
[[Category:Main page| ]]

Got it. In that case we may want to do a simple HTML removal pass (via bs4 or whatever). Also, is it an artifact of the dataset viewer that there are no newlines?

Also, is it an artifact of the dataset viewer that there are no newlines?

Yeah it must be, the actual data has \ns in it, and everything looks right if you pass an example to print in python

Got it. In that case we may want to do a simple HTML removal pass (via bs4 or whatever)

I tried this out; it seems a bit non-trivial, and bs4 isn't well suited to removing html fragments mixed into text. The main issues are:

  • People talk about code and include examples, and these get stripped out
  • People write things that look almost like an html tag (a regex containing < for example), and that crashes the html parser (the text can be passed through, but it stops the removal of other html from the page)

It also doesn't seem very consistent. For example, the div in the example above gets removed, but this div in another example doesn't:

...
Yours, Keegan Peterzell Community Liaison, Wikimedia Foundation 23:06, 17 March 2015 (UTC)\n\nRenamed\n<div class=\"plainlinks mw-content-ltr\" lang=\"en\" dir=\"ltr\u201d> This account has been renamed as part of single-user login finalisation.
...

Not sure what the difference is; it isn't something like the divs getting closed in one example but not the other, since they are not closed in either example.

I'm less worried about the example code getting stripped out (a small price to pay to make the rest of the text much more "natural"). I'm surprised the parser is brittle to things that "look like" an HTML tag; that's too bad, though I would also guess that's parser-dependent? And it's bizarre that one of the divs is removed and not the other. I don't want you to go down a rabbit hole, so I will ask around a little.

Poking around a little more I'm mostly seeing unclosed <div> and <font> tags - I wonder if we can just strip those out via simple postprocessing?

Separately there are many very short pages (many of which are almost empty user talk pages). Probably worth doing some heuristic filtering to remove them (though this could be done before training, not to the dataset itself)
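One possible heuristic filter of that sort, as a sketch (the thresholds here are made up, not agreed upon):

```python
def keep_document(doc, min_chars=200, min_words=25):
    """Drop nearly empty pages (e.g. blank user-talk stubs)."""
    text = doc["text"].strip()
    return len(text) >= min_chars and len(text.split()) >= min_words
```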

Are most of these unclosed tags coming from any particular source/namespace/shard?

Poking around a little I see them on wikinews, wikibooks, wikiversity, etc... and apparently across shards.

To be conservative, I only stripped out div and font tags to start with (<(div|font).*?> instead of something like <.*?>).
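A minimal sketch of that pass; note this variant also drops closing </div> and </font> tags, which is an assumption on my part rather than exactly the regex quoted above:

```python
import re

# Only div/font tags are stripped; anything else that merely looks like HTML
# (e.g. code samples on talk pages) is left alone.
DIV_FONT_RE = re.compile(r"</?(?:div|font)\b[^>]*>", re.IGNORECASE)

def strip_div_font(text):
    return DIV_FONT_RE.sub("", text)
```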

There were 115,201 tags that got stripped from 00000_wikibooks.com.jsonl.gz

There didn't seem to be any false positives (i.e. real-looking text that gets removed). A few of the stripped tags have a title field, but only about 4, so I don't think it's worth trying to put the title text into the plain text. Examples:

      1 "<div style=\"font-size:20pt\" title=\"Welcome!\">"
      1 "<div style=\"font-size:20pt\" title=\"What are you like? (lit. 'How are you [as a person]?')\">"
      1 "<div style=\"font-size:20pt\" title=\"What's the problem?\">"
      1 "<div style=\"font-size:29pt\" title=\"How much/how many is there?\">"

I'll look through it later, but if you noticed any other tag types that were common, let me know. I won't be able to actually run the processing until later.

Thanks. If we want to be more thorough we could write a little script that searches for HTML tags and dumps out the most common of them (if there are 115k div/font tags in one shard, they are probably overrepresented compared to tags that would appear otherwise in example code).
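A sketch of that script, assuming dolma-style *.jsonl.gz shards with the text in a "text" field (shard paths passed on the command line):

```python
import collections
import gzip
import json
import re
import sys

# Matches opening and closing tags and captures the tag name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b")

def main(paths):
    counts = collections.Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                counts.update(m.group(1).lower() for m in TAG_RE.finditer(doc["text"]))
    for tag, count in counts.most_common(25):
        print(f"{count}\t{tag}")

if __name__ == "__main__":
    main(sys.argv[1:])
```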