KWARC/llamapun

Preprocessing language+mathematics corpora for pretraining

dginev opened this issue

A good 2020 use of llamapun would be as a unified preprocessing step for a variety of HTML corpora which also include math syntax in one form or another. The goal is to do the legwork on a variety of HTML dialects so that we get clean, maximally denoised data as a plaintext target, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.

What are we looking for?

  • primary textual sources (rather than remixed/curated train sets from other experiments)
  • openly available for download & research
  • processable math syntax that we can reliably normalize and lexematize (see the toy sketch after this list)
  • interesting exceptions: e.g. consider synthetic datasets that offer diverse examples of math syntax use and/or of problem posing
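
To make the "normalize and lexematize" criterion concrete, here is a toy Python sketch that maps raw math tokens onto a small, stable lexeme vocabulary. The lexeme names are illustrative only and not llamapun's actual output vocabulary:

LEXEMES = {
    "=": "relation_equals",
    "+": "operator_plus",
    "𝔄": "fraktur_A",
}

def lexematize(tokens):
    # single-character math tokens map to a fixed lexeme; unknown single
    # identifiers default to italic_*; multi-character tokens pass through
    return [LEXEMES.get(t, "italic_" + t) if len(t) == 1 else t
            for t in tokens]

print(lexematize(["x", "+", "y", "=", "z"]))
# ['italic_x', 'operator_plus', 'italic_y', 'relation_equals', 'italic_z']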

I will keep re-editing this description with the corpora I think can make it into the current pass.

Decided to include (data has been obtained locally; items get checked off when preprocessing is completed):


To vet:


Vetted, but currently excluded:

  • the Pile (preliminary)
    • see the twitter thread for details on why books1 and books3 are too broken for math syntax
    • their arXiv conversion is done via pandoc and has significantly more breakage than our own, which I've reported back to bmk of EleutherAI
  • s2orc - very tempting to just use their curated set of 12.7M full-text papers directly. However, they are all uniformly obtained via PDF scraping (with grobid), so the mathematical markup is badly broken. I'll attach a data sample in the comments below, but PDFs will PDF...
  • Open Library Data dumps - just metadata, no content
  • PubMedCentral historical OCR has very rocky quality, and no math syntax. So probably better excluded.
  • dictionaries such as WordNet are a bit too artificial to fit in
  • Project Gutenberg OR wikisource -- adequate preprocessing is rather expensive, and while they are at least partially relevant, I will likely defer them to a later date.
  • wikispecies - a bit too synthetic; great taxonomic language, but few actual sentences fleshing it out.
  • vikidia - surprisingly, the data quality is rather poor here, and since STEM is only a minor subset, I am skipping it.
  • Elsevier OA CC-BY Corpus
    • 40,000 entries, but no traces of entire equations. Small pieces of syntax are traceable though, e.g. 23,250 documents have an equal sign = in their texts and 16,000 have a +. So it may be worth including as an extra source for light inline math (a quick counting sketch follows after this list).
    • Sadly, a closer look revealed intentional breakage of documents with formulas when producing the JSON - the math syntax is completely missing from the provided data. Since I also find the format rather unpleasant to piece together, I've outright given up on this corpus for now.
  • 800+ textbooks from Open Textbooks - mostly available as PDFs, but also some online variants with MathML, and some latex variants. Would take a while even to download, let alone preprocess, so postponing to the next pass.
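
For reference, the Elsevier symbol counts above came from a quick vetting pass along these lines. A minimal Python sketch; the per-document JSON layout and the body_text field name are assumptions, so adjust to the actual corpus schema:

import json
from pathlib import Path

# Count how many documents in a directory of per-document JSON files
# contain a given math symbol anywhere in their (flattened) body text.
def count_symbol(corpus_dir, symbol):
    hits = 0
    for path in Path(corpus_dir).glob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # flatten whatever nested structure the body text has into one string
        text = json.dumps(doc.get("body_text", ""))
        if symbol in text:
            hits += 1
    return hits

print(count_symbol("elsevier-oa", "="))  # 23,250 of 40,000 in my pass
print(count_symbol("elsevier-oa", "+"))  # 16,000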

Maybe add: proofwiki?

Thanks @holtzermann17, interesting suggestion. I will also "audit" that resource. I've added the link to their latest XML dump in the issue description and am downloading it now. I'll be doing gradual, step-by-step auditing, preprocessing and packaging into pretrain-friendly versions over the coming weeks.

So please feel free (and anyone else reading here!) to link me to any other large-ish resources with math syntax that feel similar to the list above. And thanks!

Here is an unfortunate example of how s2orc deals with mathematical formulas, a fundamental limitation of scraping PDFs. You'll see subscripts and superscripts silently placed on the baseline, badly tokenized paragraphs wherever there is display math, and of course missing markup. A non-starter in my eyes.

Obtained from their sample.jsonl file.

{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"The choice of the function S(x, t) depends on the weights and sup x\u2208\u2126 u 0 (x).",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Proof. Let S = p + q with p and q such that",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"where r = |x| = x 2 1 + x 2 2 1/2 . The positive constants A, a and \u03ba will be chosen later. Then p and q satisfy",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"If a satisfies ",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Then, by the choice of \u03ba, k and a, we see that S = p + q satisfies",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"and by the inequality (2.6), we obtain for z \u2208 \u2202\u2126,",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"S(z, t) = p(z, t) + q(z, t) =",
   "text":"A e \u03bat + e \u03bat e \u22122a > Ae \u03bat > f (z, \u00b7) 2 ( p(\u00b7, t) 2 + q(\u00b7, t) 2 ) \u2265 \u2126 f(z, y)(p + q) dy = \u2126 f (z, y",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Note that above inequalities hold for arbitrary positive constant A. After choosing a and \u03ba, let A satisfy 2Ae \u2212a > sup x\u2208\u2126 u 0 (x). Then one has",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Hence S(x, t) is a supersolution to (1.1), and thus inequality (2.3) holds by Theorem 1.1.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"By Theorem 2.1, we have a supersolution for any T > 0. Hence the local solution u on D T from Theorem 1.2 is bounded in D T for arbitrary T > 0, and thus u can be extended to the whole time domain.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Decreasing property of boundary values.",
   "text":"In this section, boundary behavior of the solution to Problem (1.1) is studied. The difference of largest and smallest boundary values grows exponentially (inequality (3.4) ). When the weights are identically zero on some part of boundary, it is shown that the difference can be nonincreasing in Theorem 3.2.",
   "cite_spans":[],"ref_spans":[]}

Preprocessing fine-print for approaching the arXMLiv 2020 and Stackexchange (kiwix 2020) sets. The usual path has been:

HTML -> plaintext -> tfrecords

Pretraining an LM is famously robust to noise, and there shouldn't be much need for overthinking the details here. Getting a sane basic setup that boosts the core textual English + math syntax signal should be the priority for our purposes. A range of other content can be dropped.
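
For the plaintext -> tfrecords leg, the serialization itself is straightforward; here is a minimal sketch using TensorFlow's standard tf.train.Example container (the "text" feature name and shard filename are placeholders):

import tensorflow as tf

# Wrap one plaintext chunk as a tf.train.Example with a single bytes feature.
def to_example(text):
    feature = {
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])
        )
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Stream the normalized plaintext chunks into a tfrecord shard.
with tf.io.TFRecordWriter("corpus-00000.tfrecord") as writer:
    for chunk in ["first plaintext paragraph ...", "second paragraph ..."]:
        writer.write(to_example(chunk).SerializeToString())

Chunking to the ~512-token model inputs can happen either here or downstream, when the records are read back.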

Mostly I'm asking these questions:

  • Some decisions on how to group non-article genres need to be made. Is an Answer to be embedded in a vacuum, or with some context, e.g. preceded by its title? Embedding the entire content of a SE page is not realistic, as most autoencoder models tend to use 512 input tokens.
  • Unicode should be allowed without alterations, at most sanitized for control chars
  • as in 2019, keep refs and citations in, but normalize them to the fixed keywords REF and CITE
  • drop images, tables, footers, headers, any and all metadata, focus on textual content
  • preserve punctuation and letter case
  • ? include font treatments for text, as we do in math? E.g. 𝔄 to fraktur_A, and word to bold_word
    • example from arXiv cs/0003030:
      <span class="ltx_text ltx_font_typewriter">total_capacity(T)</span> means that ...
  • ? include marks for links? I'm definitely dropping the href, but say <a href="wiki...">abelian group</a> could be normalized to start_LINK abelian group end_LINK.
  • ? include a special word in the vocabulary for paragraph breaks (\n\n in plaintext), as an extra cue? Helpful in SE. Decision: just use \n\n.
  • ? include markers for list items? Maybe a unicode bullet •, to also align with the spurious bullets we see in arXiv?
  • ? include markers for code blocks (start_CODE and end_CODE ?)
  • ? include markers for quotations (start_BLOCKQUOTE and end_BLOCKQUOTE ?)
  • ? are we handling footnotes correctly?
  • ? SELFIES-like lexeme refactor
    I will keep adding here as other bits stream in while I study the data/processing... A sketch of the marker scheme above follows below.
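
To make the marker questions above concrete, here is a rough Python sketch of what such a normalization pass could look like. BeautifulSoup stands in for the real llamapun traversal, purely for illustration, and the marker names simply follow the checklist and are not final:

from bs4 import BeautifulSoup

def html_to_plaintext(html):
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        # drop the href, keep the anchor text between link markers
        a.replace_with("start_LINK " + a.get_text(strip=True) + " end_LINK")
    for cite in soup.find_all("cite"):
        cite.replace_with("CITE")  # as in 2019: fixed keywords CITE / REF
    for span in soup.find_all("span", class_="ltx_font_typewriter"):
        # font treatment as a word prefix, mirroring the bold_word idea
        span.replace_with("typewriter_" + span.get_text(strip=True))
    for pre in soup.find_all("pre"):
        pre.replace_with("start_CODE " + pre.get_text() + " end_CODE")
    for bq in soup.find_all("blockquote"):
        bq.replace_with("start_BLOCKQUOTE " + bq.get_text(strip=True) + " end_BLOCKQUOTE")
    for li in soup.find_all("li"):
        li.insert(0, "• ")  # unicode bullet, aligning with arXiv's spurious ones
    # one block per paragraph/list item, joined by the \n\n paragraph cue
    blocks = [b.get_text(" ", strip=True) for b in soup.find_all(["p", "li"])]
    return "\n\n".join(blocks)

print(html_to_plaintext('<p>See <a href="wiki/Abelian_group">abelian group</a>.</p>'))
# See start_LINK abelian group end_LINK .

Each open question above then amounts to toggling, renaming, or dropping one of these rules.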

As I'm looking into the proofwiki data, I'm cataloging some related links with further resources that I stumbled upon:

Lists:

  1. https://proofwiki.org/wiki/ProofWiki:Community_Portal
  2. https://docs.mathjax.org/en/v2.7-latest/misc/mathjax-in-use.html
  3. https://en.wikipedia.org/wiki/List_of_online_encyclopedias

Resources:

  1. https://statproofbook.github.io/I/Table_of_Contents
  2. http://www.sklogwiki.org/SklogWiki/index.php/WikiNode
  3. https://mathhelpboards.com/forums/pre-algebra-and-algebra.2/
  4. http://www.scholarpedia.org/article/Main_Page
  5. https://asone.ai/polymath/index.php?title=Main_Page
  6. https://oeis.org/wiki/Main_Page
  7. https://yutsumura.com/
  8. https://www.theoretical-physics.net/dev/index.html
  9. https://www.physicsforums.com/
  10. https://www.quantumcalculus.org/
  11. https://annals.math.princeton.edu/
  12. http://www.randomservices.org/random/index.html
  13. http://www.fightfinance.com/
  14. https://www.intmath.com/

... there's more to expand from here. I won't track them all down now, but it's encouraging to see such a wide variety of math-syntax-rich resources to compile data from.

Blog Roll

  1. http://blogs.brandeis.edu
  2. http://www.astro.gla.ac.uk/users/eduard/cesra/
  3. https://blog.csiro.au/
  4. https://blogs.agu.org/
  5. https://blogs.ei.columbia.edu/
  6. https://blogs.esa.int/rocketscience/
  7. https://cosmosmagazine.com/
  8. https://eos.org/
  9. https://mosaicscience.com/
  10. https://phys.org/archive/
  11. https://scienceblogs.com/
  12. https://scitechdaily.com/
  13. https://theness.com/neurologicablog/
  14. https://theplanets.org/
  15. https://wattsupwiththat.com/
  16. https://www.eurekalert.org/
  17. https://www.futurity.org/
  18. https://www.iflscience.com/
  19. https://www.improbable.com/
  20. https://www.livescience.com/
  21. https://www.ox.ac.uk/news/science-blog
  22. https://www.realclearscience.com
  23. https://www.sciencedaily.com/
  24. https://www.sciencenewsforstudents.org
  25. https://www.smithsonianmag.com/
  26. https://www.theopennotebook.com/
  27. https://www.universetoday.com/
  28. https://blog.nature.org
  29. https://www.zmescience.com/
  30. https://www.hakaimagazine.com