allenai/cord19

LaTeX is mixed into some PMC JATS XML

Some PMC articles use LaTeX to represent equations, and these LaTeX snippets are interspersed with the XML. PMC PubReader renders the LaTeX, but it should be stripped from the full-text extract.

Around 500 articles in CORD with PMC XML have LaTeX interspersed. These come from a variety of publishers, including BMJ, PeerJ, PLOS, NAR, and others. Some have one LaTeX snippet, but many have hundreds.
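A rough way to flag affected articles is to count the snippets in each extracted text. This is only a sketch: it assumes every snippet is wrapped in `\documentclass ... \end{document}`, as in the PMC3792191 sample quoted in this issue, and `count_latex_snippets` is a hypothetical helper name, not part of any CORD tooling.

```python
import re

# Assumption: each embedded snippet starts with \documentclass and ends
# with \end{document}. Other wrappers may exist and would be missed.
LATEX_SNIPPET = re.compile(r"\\documentclass.*?\\end\{document\}", re.DOTALL)


def count_latex_snippets(text: str) -> int:
    """Return the number of wrapped LaTeX snippets found in the text."""
    return len(LATEX_SNIPPET.findall(text))
```

Articles with a count above zero would be candidates for cleaning.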

Here are some PMC ids with LaTeX in the XML:

PMC3792191
PMC3898311
PMC3946006
PMC4169344
PMC4207625
PMC4205154
PMC4206702
PMC4234456
PMC4300606
PMC4330382

For example, here is one paragraph of text extracted from the article corresponding to PMC3792191:

'In this study the benchmark dataset was derived from the S-nitrosylated database (version 1.0) (Chen et al., 2010) at http://dbsno.mbc.nctu.edu.tw/, from which 1,530 proteins in human and mouse species and their SNO sites were downloaded. The corresponding peptide fragments for these SNO sites were derived from UniProt database (release 2012_08). To facilitate description later, let us adopt Chou’s formulation for peptides here that was used for studying signal peptide cleavage sites (Chou, 2001c; Chou, 2001d). According to the formulation, a peptide with cysteine located at its center (Fig. 1) can be written as (1)\documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}\begin{eqnarray*} \displaystyle \mathbf{P}={\mathrm{R}}{-\xi }{\mathrm{R}}{-(\xi -1)}\ldots {\mathrm{R}}{-2}{\mathrm{R}}{-1}\mathbf{C}~{\mathrm{R}}{+1}{\mathrm{R}}{+2}\ldots {\mathrm{R}}{+(\xi -1)}{\mathrm{R}}{+\xi }&&\displaystyle \end{eqnarray*}\end{document}P=R−ξR−(ξ−1)…R−2R−1CR+1R+2…R+(ξ−1)R+ξ where the subscript ξ is an integer, R−ξ represents the ξ-th downstream amino acid residue from cysteine (C), Rξ the ξ-th upstream amino acid residue, and so forth (Fig. 2). Peptides with the profile of Eq. 
(1) can be further classified into the following two categories: (1) SNO peptide if its center is a SNO site; (2) non-SNO peptide if its center is a non-SNO site, as can be formulated by (2)\documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}\begin{eqnarray*} \displaystyle \mathbf{P}\in \left\{\begin{array}{@{}l@{}} \displaystyle \text{SNO peptide},\quad \text{if C is a SNO site }\\ \displaystyle \text{non-SNO peptide},\quad \text{otherwise } \end{array}\right.&&\displaystyle \end{eqnarray*}\end{document}P∈SNO peptide,if C is a SNO site non-SNO peptide,otherwise where ∈ represents “a member of” in the set theory. After some preliminary trials and also considering the practice of previous investigators (Li et al., 2011; Li et al., 2012; Xue et al., 2010; Xu et al., 2013), we choose ξ = 10 to construct the benchmark dataset for P of Eq. (1). If the upstream or downstream in a protein was less than 10, the lacking residues were filled with the dummy code Z. The peptides thus obtained are subject to a screening procedure to winnow those that have ≥40% sequence identity to any other. Finally, we obtained 2,381 SNO peptides and 11,755 non-SNO peptides. 
Now let us construct the training or learning dataset 𝕊L as defined by (3)\documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}\begin{eqnarray*} \displaystyle {\mathbb{S}}{\mathrm{L}}={\mathbb{S}}{\mathrm{L}}^{+}\cup {\mathbb{S}}{\mathrm{L}}^{-}&&\displaystyle \end{eqnarray*}\end{document}SL=SL+∪SL− where ∪ represents the “union” in the set theory, \documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}${\mathbb{S}}{\mathrm{L}}^{+}$\end{document}SL+ contains 2,300 samples randomly picked from the aforementioned 2,381 SNO peptides, while \documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}${\mathbb{S}}{\mathrm{L}}^{-}$\end{document}SL− 2,300 samples randomly picked from the 11,755 non-SNO peptides. 
For readers’ convenience, the 2,300 peptide sequences in the positive learning dataset \documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}${\mathbb{S}}{\mathrm{L}}^{+}$\end{document}SL+ and 2,300 peptide sequences in the negative learning dataset \documentclass[12pt]{minimal}\n\usepackage{amsmath}\n\usepackage{wasysym} \n\usepackage{amsfonts} \n\usepackage{amssymb} \n\usepackage{amsbsy}\n\usepackage{upgreek}\n\usepackage{mathrsfs}\n\setlength{\oddsidemargin}{-69pt}\n\begin{document}\n}{}${\mathbb{S}}_{\mathrm{L}}^{-}$\end{document}SL−, along with their sequence positions (sites) in the parent proteins coded in “UniProt IDs”, are given in Supplemental Information S1.'
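One possible cleanup, sketched under the assumption that every snippet follows the `\documentclass ... \end{document}` pattern seen above, is to delete each wrapped span and keep the plain-text rendering that follows it. The function name `strip_latex_snippets` is hypothetical, and the regex has not been checked against the other PMC ids listed in this issue.

```python
import re

# Assumption: a snippet runs from \documentclass to the nearest
# \end{document}; the non-greedy match stops at the first closing tag.
LATEX_SNIPPET = re.compile(r"\\documentclass.*?\\end\{document\}", re.DOTALL)


def strip_latex_snippets(text: str) -> str:
    # Drop the LaTeX source and keep the surrounding prose, including
    # the Unicode rendering of the formula that follows each snippet.
    return LATEX_SNIPPET.sub("", text)


sample = (
    r"can be written as (1)\documentclass[12pt]{minimal}\n"
    r"\begin{document}\n}{}$\mathbf{P}$\end{document}P=R where the subscript"
)
print(strip_latex_snippets(sample))
```

This keeps the flattened rendering (for example `SL+`), which has lost its subscripts and superscripts, so whether to keep or drop that text as well is a separate decision.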