Preprocessing language+mathematics corpora for pretraining

Question

Preprocessing language+mathematics corpora for pretraining

dginev opened this issue 4 years ago · 5 comments

Answer 1 · 2020-11-28T19:41:45.000Z

Answer 2 · 2020-11-28T19:57:03.000Z

Thanks @holtzermann17 , interesting suggestion. I will also "audit" that resource. I've added the link to their latest XML dump in the issue description and downloading now. Doing a gradual step-by-step auditing, preprocessing and packaging into pretrain-friendly versions in the coming weeks.

So please feel free (and anyone else reading here!) to link me to any other large-ish resources with math syntax that feel similar to the list above. And thanks!

Answer 3 · 2020-11-29T02:59:51.000Z

Here is an unfortunate example of how s2orc deals with mathematical formulas, a fundamental limitation of scraping PDFs. You'll see scripts silently put on the baseline, badly tokenized paragraphs when there is display math, and of course missing markup. A non-starter in my eyes.

Obtained from their sample.jsonl file.

{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"The choice of the function S(x, t) depends on the weights and sup x\u2208\u2126 u 0 (x).",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Proof. Let S = p + q with p and q such that",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"where r = |x| = x 2 1 + x 2 2 1/2 . The positive constants A, a and \u03ba will be chosen later. Then p and q satisfy",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"If a satisfies ",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"Then, by the choice of \u03ba, k and a, we see that S = p + q satisfies",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Theorem 2.1. For the solution u to Problem (1.1) with nonnegative weights f , there is a sufficiently smooth function S(x, t) such that u(x, t) < S(x, t), t > 0. (2.3)",
   "text":"and by the inequality (2.6), we obtain for z \u2208 \u2202\u2126,",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"S(z, t) = p(z, t) + q(z, t) =",
   "text":"A e \u03bat + e \u03bat e \u22122a > Ae \u03bat > f (z, \u00b7) 2 ( p(\u00b7, t) 2 + q(\u00b7, t) 2 ) \u2265 \u2126 f(z, y)(p + q) dy = \u2126 f (z, y",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Note that above inequalities hold for arbitrary positive constant A. After choosing a and \u03ba, let A satisfy 2Ae \u2212a > sup x\u2208\u2126 u 0 (x). Then one has",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"Hence S(x, t) is a supersolution to (1.1), and thus inequality (2.3) holds by Theorem 1.1.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":")S(y, t) dy, 0 < t < T.",
   "text":"By Theorem 2.1, we have a supersolution for any T > 0. Hence the local solution u on D T from Theorem 1.2 is bounded in D T for arbitrary T > 0, and thus u can be extended to the whole time domain.",
   "cite_spans":[],"ref_spans":[]},
{
   "section":"Decreasing property of boundary values.",
   "text":"In this section, boundary behavior of the solution to Problem (1.1) is studied. The difference of largest and smallest boundary values grows exponentially (inequality (3.4) ). When the weights are identically zero on some part of boundary, it is shown that the difference can be nonincreasing in Theorem 3.2.",
   "cite_spans":[],"ref_spans":[]}

Answer 4 · 2020-12-20T01:26:10.000Z

Preprocessing fine-print for approaching the arXMLiv 2020 and Stackexchange (kiwix 2020) sets. The usual path has been:

HTML -> plaintext -> tfrecords

Pretraining an LM is proudly robust to noise and there shouldn't be much need for overthinking the details here. Getting a sane basic setup, that boosts the core textual English + math syntax signal, should be a priority for our purposes. A range of other content can get dropped.

Mostly asking the questions:

Some decisions on how to group non-article genres need to be made. Is an Answer to be embedded in vacuum or with some context e.g. preceded by title? Embedding the entire content of a SE page is not realistic, as most autoencoder models tend to use 512 input tokens.
Unicode should be allowed without alterations, at most sanitized for control chars
as in 2019, keep in refs and citations, but normalize to a fixed keyword REF and CITE
drop images, tables, footers, headers, any and all metadata, focus on textual content
preserve punctuation and letter case
? include font treatments as we do in math for text? E.g. 𝔄 to fraktur_U, and word to bold_word
- example from arXiv cs/0003030:
```
<span class="ltx_text ltx_font_typewriter">total_capacity(T)</span> means that ...
```
? include marks for links? I'm definitely dropping the href but say <a href="wiki...">abelian group</a> could be normalized to start_LINK abelian group end_LINK.
~~? include a special word in the vocabulary for paragraph breaks (\n\n in plaintext), as an extra cue? Helpful in SE~~ Just use \n\n
? include markers for list items? Maybe a unicode bullet •, to also align with the spurious bullets we see in arXiv?
? include markers for code blocks (start_CODE and end_CODE ?)
? include markers for quotations (start_BLOCKQUOTE and end_BLOCKQUOTE ?)
? are we handling footnotes correctly?
? SELFIES-like lexeme refactor
Will keep adding here as other bits stream in as I study the data/processing...

Answer 5 · 2021-03-01T23:11:37.000Z

As I'm looking into the proofwiki data, cataloging some related links with further resources that I stumbled on:

Lists:

Resources:

... there's more to expand from here. I won't track them all down now, but it's encouraging to see there is such a wide variety of resources to compile data from, rich in math syntax.

Blog Roll