Preprocessing language+mathematics corpora for pretraining
dginev opened this issue · 5 comments
A good 2020 use of llamapun would be to use it as a unified preprocessing step for a variety of HTML corpora which also include math syntax by one trick or another. The goal would be to do the legwork on a variety of HTML dialects so that we get clean and maximally denoised data as a plaintext target, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.
What are we looking for?
- primary textual sources (rather than remixed/curated train sets from other experiments)
- openly available for download & research
- processable math syntax that we can reliably normalize and lexematize
- interesting exceptions: e.g. consider synthetic datasets that offer diverse examples of math syntax use, and/or posing problems
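As a toy illustration of what "lexematize" could mean for math syntax, here is a minimal sketch (my own simplification, not llamapun's actual tokenizer) that flattens a LaTeX formula into a lexeme stream:

```python
import re

# Toy lexer (illustrative sketch only): split a LaTeX math string into
# control sequences, single letters, digit runs, and punctuation/grouping.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|[^\sA-Za-z0-9]")

def lexematize(formula: str) -> list[str]:
    return TOKEN_RE.findall(formula)

print(lexematize(r"\frac{1}{2} x^{2}"))
# → ['\\frac', '{', '1', '}', '{', '2', '}', 'x', '^', '{', '2', '}']
```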
I will re-edit this description to include corpora I think I can include for the current pass.
Decided to include (data has been obtained locally, checked when preprocessing is completed):
- our own arXiv as HTML5, as usual
- PubMed Central textmining resources
  - oa subset, 3.2 million docs, 260k of which with marked math
  - manuscript subset
- Kiwix packaged (fantastic archival effort)
  - Wikipedia subset is the union of:
    - subset of "all articles" with math syntax (`alttext` attr)
    - kiwix-selected subjects of {astronomy, chemistry, climate change, computer science, geography, history, mathematics, medicine, molcell, physics, sociology}
    - wikipedia simple articles (all, or only those with math syntax?)
  - wiktionary and wiktionary simple
  - wikiversity
  - wikibooks
  - wikiquote
- StackExchange subsets of {ai, astronomy, bioinformatics, codereview, cs, cseducators, cstheory, datascience, earthscience, engineering, math, matheducators, mathoverflow, physics, robotics, space, stats} ?
  - non-Math sets also included: {academia, chess, ebooks, english, history, law, linguistics, literature, money, patents, philosophy, writers}
- rational wiki
- proofwiki - the kiwix distribution is easier to preprocess than the official `latest.xml`, as it is already standard HTML
- art of problem solving wiki - downloaded separately, as kiwix was missing the `alt=` attributes in math images, needed to lemmatize
- PlanetMath wiki entries
- WikiHow
- UBC course materials
- Stanford encyclopedia of philosophy
- etymonline
- Ancient EU
- deepmind mathematics problems
- deepmind AQuA word problems
- caltech neural PDE datasets
- Stacks project by directly normalizing their HTML dialect, typeset via plastex
- AIMath approved textbooks, typeset via PreText
- ncatlab
- encyclopedia of math
- encyclopedia Britannica
- math.libretexts
- math history
- math programming glossary
- mathworld
To vet:
- LearningQ
- Berkeley MATH + AMPS dataset
- openstax
- wikia math & physics problems
- science blogs: sciencealert, sciencebuddies, symmetry magazine, ...
- educational texts - Introduction to Proofs
Vetted, but currently excluded:
- the pile preliminary
- see the Twitter thread for details on why books1 and books3 are too broken for math syntax
- their arxiv conversion is via pandoc and has significantly more breakage than our own, which I've reported back to bmk of EleutherAI
- s2orc - very tempting to just use directly their curated set of 12.7M full-text papers. However, they are all uniformly obtained via PDF scraping (with grobid), so the mathematical markup is badly broken. I'll attach a data sample in the comments below, but PDFs will PDF...
- Open Library Data dumps - just metadata, no content
- PubMedCentral historical OCR has very rocky quality, and no math syntax. So probably better excluded.
- dictionaries such as wordnet are a bit too artificial to fit in
- Project Gutenberg OR wikisource -- adequate preprocessing is rather expensive, and while they are at least partially relevant, will likely defer to a later date.
- wikispecies - a bit too synthetic, great taxonomic language, but little actual sentences fleshing it out.
- vikidia - surprisingly the data quality is a bit poor here, and since STEM is a bit of a minor subset, skipping.
- Elsevier OA CC-BY Corpus
  - 40,000 entries, but no traces of entire equations. Small pieces of syntax are traceable though, e.g. 23,250 documents have an equal sign `=` in the texts, and 16,000 have a `+`. So may be worth including as an extra source for light inline math.
  - Sadly, a closer look revealed intentional breakage of documents with formulas when producing the JSON - the math syntax is completely missing from the provided data. Since I also find the format rather unpleasant to piece together, I've outright given up on this corpus for now.
- 800+ textbooks from Open Textbooks - mostly available as PDFs, but also some online variants with MathML, and some LaTeX variants. Would take a while even to download, let alone preprocess, so postponing for the next pass.
Maybe add: proofwiki?
Thanks @holtzermann17, interesting suggestion. I will also "audit" that resource. I've added the link to their latest XML dump in the issue description and am downloading it now. I'll be doing gradual step-by-step auditing, preprocessing and packaging into pretrain-friendly versions in the coming weeks.
So please feel free (and anyone else reading here!) to link me to any other large-ish resources with math syntax that feel similar to the list above. And thanks!
Here is an unfortunate example of how s2orc deals with mathematical formulas, a fundamental limitation of scraping PDFs. You'll see scripts silently put on the baseline, badly tokenized paragraphs when there is display math, and of course missing markup. A non-starter in my eyes.
Obtained from their sample.jsonl file.
```json
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"The choice of the function S(x, t) depends on the weights and sup x\u2208\u2126 u 0 (x).",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"Proof. Let S = p + q with p and q such that",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"where r = |x| = x 2 1 + x 2 2 1/2 . The positive constants A, a and \u03ba will be chosen later. Then p and q satisfy",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"If a satisfies ",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"Then, by the choice of \u03ba, k and a, we see that S = p + q satisfies",
"cite_spans":[],"ref_spans":[]},
{
"section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
"text":"and by the inequality (2.6), we obtain for z \u2208 \u2202\u2126,",
"cite_spans":[],"ref_spans":[]},
{
"section":"S(z, t) = p(z, t) + q(z, t) =",
"text":"A e \u03bat + e \u03bat e \u22122a > Ae \u03bat > f (z, \u00b7) 2 ( p(\u00b7, t) 2 + q(\u00b7, t) 2 ) \u2265 \u2126 f(z, y)(p + q) dy = \u2126 f (z, y",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"Note that above inequalities hold for arbitrary positive constant A. After choosing a and \u03ba, let A satisfy 2Ae \u2212a > sup x\u2208\u2126 u 0 (x). Then one has",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"Hence S(x, t) is a supersolution to (1.1), and thus inequality (2.3) holds by Theorem 1.1.",
"cite_spans":[],"ref_spans":[]},
{
"section":")S(y, t) dy, 0 < t < T.",
"text":"By Theorem 2.1, we have a supersolution for any T > 0. Hence the local solution u on D T from Theorem 1.2 is bounded in D T for arbitrary T > 0, and thus u can be extended to the whole time domain.",
"cite_spans":[],"ref_spans":[]},
{
"section":"Decreasing property of boundary values.",
"text":"In this section, boundary behavior of the solution to Problem (1.1) is studied. The difference of largest and smallest boundary values grows exponentially (inequality (3.4) ). When the weights are identically zero on some part of boundary, it is shown that the difference can be nonincreasing in Theorem 3.2.",
"cite_spans":[],"ref_spans":[]}
```
Preprocessing fine-print for approaching the arXMLiv 2020 and Stackexchange (kiwix 2020) sets. The usual path has been:
`HTML -> plaintext -> tfrecords`
Pretraining an LM is famously robust to noise, and there shouldn't be much need for overthinking the details here. Getting a sane basic setup that boosts the core textual English + math syntax signal should be a priority for our purposes. A range of other content can get dropped.
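As a rough sketch of the final step (my own simplification; a real run would use a subword tokenizer and write actual TFRecords), packing plaintext into fixed-size windows looks like:

```python
# Toy version of the plaintext -> training-examples step (assumed shape,
# not the actual pipeline): pack whitespace tokens into fixed windows,
# since most encoder models cap inputs at ~512 tokens.
def window_tokens(text: str, size: int = 512) -> list[list[str]]:
    toks = text.split()
    return [toks[i:i + size] for i in range(0, len(toks), size)]

windows = window_tokens("word " * 1000)
print([len(w) for w in windows])  # → [512, 488]
```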
Mostly asking the questions:
- Some decisions on how to group non-article genres need to be made. Is an Answer to be embedded in a vacuum, or with some context, e.g. preceded by its title? Embedding the entire content of a SE page is not realistic, as most autoencoder models tend to use 512 input tokens.
- Unicode should be allowed without alterations, at most sanitized for control chars
- as in 2019, keep refs and citations in, but normalize them to the fixed keywords `REF` and `CITE`
- drop images, tables, footers, headers, any and all metadata, focus on textual content
- preserve punctuation and letter case
- ? include font treatments for text as we do in math? E.g. `𝔄` to `fraktur_U`, and a bold word to `bold_word`
  - example from arXiv cs/0003030: `<span class="ltx_text ltx_font_typewriter">total_capacity(T)</span>` means that ...
- ? include marks for links? I'm definitely dropping the href, but say `<a href="wiki...">abelian group</a>` could be normalized to `start_LINK abelian group end_LINK`.
- ? include a special word in the vocabulary for paragraph breaks (just use `\n\n` in plaintext), as an extra cue? Helpful in SE
- ? include markers for list items? Maybe a unicode bullet •, to also align with the spurious bullets we see in arXiv?
- ? include markers for code blocks (`start_CODE` and `end_CODE`?)
- ? include markers for quotations (`start_BLOCKQUOTE` and `end_BLOCKQUOTE`?)
- ? are we handling footnotes correctly?
- ? SELFIES-like lexeme refactor
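Several of the decisions above can be sketched as a regex-based normalizer. This is a hypothetical illustration of the idea, not llamapun's actual implementation, and the HTML tag names it matches are assumptions:

```python
import re

# Hypothetical normalizer for a few of the proposed decisions:
# citations to a CITE keyword, links to start_LINK .. end_LINK,
# inline code to start_CODE .. end_CODE, paragraph breaks as "\n\n".
def normalize(html: str) -> str:
    html = re.sub(r"<cite[^>]*>.*?</cite>", " CITE ", html, flags=re.S)
    html = re.sub(r"<a[^>]*>(.*?)</a>", r" start_LINK \1 end_LINK ", html, flags=re.S)
    html = re.sub(r"<code[^>]*>(.*?)</code>", r" start_CODE \1 end_CODE ", html, flags=re.S)
    html = re.sub(r"</p>\s*<p[^>]*>", "\n\n", html)  # paragraph break cue
    html = re.sub(r"<[^>]+>", " ", html)             # drop remaining tags
    return re.sub(r"[ \t]+", " ", html).strip()

print(normalize('<p>See <a href="wiki">abelian group</a>.</p>'))
# → See start_LINK abelian group end_LINK .
```

A production pass would of course use a real HTML parser rather than regexes, but the target plaintext shape is the same.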
Will keep adding here as other bits stream in as I study the data/processing...
As I'm looking into the proofwiki data, cataloging some related links with further resources that I stumbled on:
Lists:
- https://proofwiki.org/wiki/ProofWiki:Community_Portal
- https://docs.mathjax.org/en/v2.7-latest/misc/mathjax-in-use.html
- https://en.wikipedia.org/wiki/List_of_online_encyclopedias
Resources:
- https://statproofbook.github.io/I/Table_of_Contents
- http://www.sklogwiki.org/SklogWiki/index.php/WikiNode
- https://mathhelpboards.com/forums/pre-algebra-and-algebra.2/
- http://www.scholarpedia.org/article/Main_Page
- https://asone.ai/polymath/index.php?title=Main_Page
- https://oeis.org/wiki/Main_Page
- https://yutsumura.com/
- https://www.theoretical-physics.net/dev/index.html
- https://www.physicsforums.com/
- https://www.quantumcalculus.org/
- https://annals.math.princeton.edu/
- http://www.randomservices.org/random/index.html
- http://www.fightfinance.com/
- https://www.intmath.com/
... there's more to expand from here. I won't track them all down now, but it's encouraging to see there is such a wide variety of resources to compile data from, rich in math syntax.
Blog Roll
- http://blogs.brandeis.edu
- http://www.astro.gla.ac.uk/users/eduard/cesra/
- https://blog.csiro.au/
- https://blogs.agu.org/
- https://blogs.ei.columbia.edu/
- https://blogs.esa.int/rocketscience/
- https://cosmosmagazine.com/
- https://eos.org/
- https://mosaicscience.com/
- https://phys.org/archive/
- https://scienceblogs.com/
- https://scitechdaily.com/
- https://theness.com/neurologicablog/
- https://theplanets.org/
- https://wattsupwiththat.com/
- https://www.eurekalert.org/
- https://www.futurity.org/
- https://www.iflscience.com/
- https://www.improbable.com/
- https://www.livescience.com/
- https://www.ox.ac.uk/news/science-blog
- https://www.realclearscience.com
- https://www.sciencedaily.com/
- https://www.sciencenewsforstudents.org
- https://www.smithsonianmag.com/
- https://www.theopennotebook.com/
- https://www.universetoday.com/
- https://blog.nature.org
- https://www.zmescience.com/
- https://www.hakaimagazine.com