petermr/CEVOpen

create footers for tables

Opened this issue · 11 comments

Most of the HTML tables from EuropePMC have headers (<thead> or <th>) but no explicit footers. However, there is usually a visible transition in content at the point where a footer would begin.

Typical example:

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://www.w3.org/1999/xhtml">
 <caption class="caption">
  <label class="label">Table 1</label>
  <p class="p" xmlns="">Chemical composition of thyme EO</p>
 </caption>
 <tbody class="tbody">
  <tr class="tr" xmlns="">
   <th align="center" rowspan="1" colspan="1" class="th">No.</th>
   <th align="center" rowspan="1" colspan="1" class="th">RT (min)</th>
   <th align="center" rowspan="1" colspan="1" class="th">Area % of total</th>
   <th align="center" rowspan="1" colspan="1" class="th">Constituents*</th>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">1</td>
   <td align="center" rowspan="1" colspan="1" class="td">5.39</td>
   <td align="center" rowspan="1" colspan="1" class="td">1.06</td>
   <td align="center" rowspan="1" colspan="1" class="td">alpha-Thujene</td>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">2</td>
   <td align="center" rowspan="1" colspan="1" class="td">5.63</td>
   <td align="center" rowspan="1" colspan="1" class="td">1.07</td>
   <td align="center" rowspan="1" colspan="1" class="td">alpha-Pinene</td>
  </tr>
...
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">15</td>
   <td align="center" rowspan="1" colspan="1" class="td">19.03</td>
   <td align="center" rowspan="1" colspan="1" class="td">0.78</td>
   <td align="center" rowspan="1" colspan="1" class="td">Cyclohexene, 1-methyl-4-(5-methyl-1-methylene-4-hexenyl)</td>
  </tr>
  <tr class="tr" xmlns="">
   <th align="center" rowspan="1" colspan="1" class="th">Total</th>
   <td align="center" rowspan="1" colspan="1" class="td"/>
   <th align="center" rowspan="1" colspan="1" class="th">99.91%</th>
   <td align="center" rowspan="1" colspan="1" class="td"/>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">*Constituents presented in the order of elution from the VF 35 MS column.</td>
  </tr>
 </tbody>
</table>

Empirical rule: column1 contains "Total"

Task: determine empirical rules for when footer starts.

split tables into body and footer

Determine the start of the footer and transfer it and all subsequent rows to tfoot
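
A minimal sketch of this step in Python, assuming the table XML looks like the example above (rows sit directly under `tbody`, and the footer starts at the first row whose first cell contains "Total"). The function name `split_footer` and the default pattern are illustrative assumptions, not part of AMI. Note that a header row inside `tbody` counts as a body row here.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def split_footer(table, pattern=r"\btotal\b"):
    """Move the first row whose first cell matches `pattern`, and every
    later row, from tbody into a new tfoot appended to the table.
    Returns (rows_left_in_tbody, rows_moved_to_tfoot)."""
    tbody = next(el for el in table if local_name(el.tag) == "tbody")
    rows = [el for el in tbody if local_name(el.tag) == "tr"]
    regex = re.compile(pattern, re.IGNORECASE)

    start = len(rows)  # default: no footer found
    for i, row in enumerate(rows):
        cells = list(row)
        first_cell = "".join(cells[0].itertext()).strip() if cells else ""
        if regex.search(first_cell):
            start = i
            break

    if start < len(rows):
        # Reuse tbody's namespace so tfoot ends up in the same one.
        tfoot_tag = tbody.tag[: -len("tbody")] + "tfoot"
        tfoot = ET.SubElement(table, tfoot_tag)
        for row in rows[start:]:
            tbody.remove(row)
            tfoot.append(row)
    return start, len(rows) - start


# Cut-down version of the example table (hypothetical test data).
SAMPLE = """<table xmlns="http://www.w3.org/1999/xhtml"><tbody>
<tr xmlns=""><th>No.</th><th>Constituents*</th></tr>
<tr xmlns=""><td>1</td><td>alpha-Thujene</td></tr>
<tr xmlns=""><td>2</td><td>alpha-Pinene</td></tr>
<tr xmlns=""><th>Total</th><th>99.91%</th></tr>
<tr xmlns=""><td>*Constituents presented in the order of elution.</td></tr>
</tbody></table>"""

table = ET.fromstring(SAMPLE)
print(split_footer(table))
```

Only the first cell of each row is tested, so a header such as "Area % of total" in another column does not trigger the split.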

I have created a first pass at this. The footer rule for the compound column looks like:

		<column name="compound" case="insensitive" id="comp.col.comp">
		    <title id="comp.col.comp.tit">
			    <query id="comp.col.comp.tit.q">
				    constituent OR
				    compound OR
				    component
					NOT class
			    </query>
		    </title>
			<cell id="comp.col.comp.cell">
	  		  <query id="comp.col.comp.cell.q1">@CHEMICAL@</query>
<!--	  		  <query id="comp.col.comp.cell.q2" mode="lookup">@COMPOUND_DICT@</query> -->
			</cell>
			<footer>
				<query>total OR yield OR terpene</query>
			</footer>

This splits the table at the row BEFORE the first match. Typical results are:

AMITableTool cTree: PMC4391421
  table: Table 1Chemical composition of thyme EO
      column: compound => Constituents*; 64.7
      column: percentage => Area % of total; 100.0
AMITableTool cTree: PMC5080681
  table: Table 1Chemical composition, concentrations (%) and calculated retention indices, ofT. boveiessential oil as characterized by GC/MS analysis
      column: compound => Constituents; 97.1
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 27
215 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[Trans-geraniol (Lemonol), α-citral (Trans-citral), β-citral (Cis-citral), Cis-geraniol (nerol), 3-octanol, DL-camphor, Eucalyptol(1,8) cineole, 3-octanone, Thymol, β-linalool, β-farnesene, Geranylisobutyrate, L-borneol, Isocaryophyllene, Camphene, Bergamiol, Dihydrocarveol acetate, α-cyclocitral, β-ocimene, Geranyl propionate, β-myrcene, α-terpineol, α-limonene, Nerolidol, α-terpinene, α-phellandrene, β-pinene], [Total, Yield (w/w) %, Number of constituents, Hydrocarbon monoterpenoid, Oxygenated monoterpenoid, Sesquiterpenoid hydrocarbon, Oxygenated sesquiterpenoid, Others]]
      column: percentage => %; 100.0
AMITableTool cTree: PMC5132230
AMITableTool cTree: PMC5203915
  table: Table 1Percentage of composition of essential oils fromRhaponticum carthamoidesroots of soil-grown plants (SGR) and hairy roots (HR).
      column: compound => Constituent; 92.6
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 62
271 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[α-Pinene, Oct-1-en-3-ol, 2-Pentylfuran, α-Phellandrene, p-Cymene, β-Phellandrene, Limonene, (E)-Oct-2-enal, p-Cymenene, (E)-Non-2-enal, p-Cymen-9-ol, Thymol, Carvacrol, (E,E)-Deca-2,4-dienal, Cyprotene, 13-Norcypera-1(5),11(12)-diene, α-Longipinene, Cyperadiene, Cyclosativene, α-Copaene, α-Funebrene, Petasitene, β-Elemene, Thymol methyl ether, Cyperene, Dehydroisolongifolene, α-Cedrene, β-Caryophyllene, trans-α-Bergamotene, Sesquisabinene A, β-Helmiscapene, α-Helmiscapene, (Z)-β-Farnesene, α-Humulene, β-Santalene, Selina-3,7-diene, Rotundene, α-Acoradiene, γ-Gurjunene, Selina-4,11-diene, Dauca-4(11),8-diene, Nardosina-1(10),11-diene, β-Selinene, Pentadec-1-ene, α-Muurolene, Isorotundene, β-Bisabolene, (Z)-γ-Bisabolene, Premnaspirodiene, δ-Cadinene, Cyperene oxide, α-Calacorene, (E)-Nerolidol, β-Caryophyllene oxide, α-Corocalene, Longifolene aldehyde, 2,5,8-Trimethyl-1-naphthol, β-Himachalol, Cadalene, Aplotaxene, Cyperotundone, Palmitic acid], [Total identified,  ,  ,  ,  ,  ]]
      column: percentage => SGR [%]; 100.0
      column: percentage => HR [%]; 92.6
AMITableTool cTree: PMC5237462
  table: Table 1Major constituents of the essential oils ofM. piperita.
      column: compound => Components; 82.4
300 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 12
300 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
301 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[Thuja-2,4(10)-diene, Verbenene, β-Pinene, Mentha-2,8-diene, β-Ocimene, Linalool, Epizonarene, Epoxyocimene, Sesquiphellandrene, Cadinene, Germacrene B, null], [Monoterpene hydrocarbons, Oxygenated monoterpenes, Sesquiterpene hydrocarbons, null,  ]]
      column: percentage => Peak Area (%); 100.0

This works well when all non-chemical names are at the bottom.
It's enough for us to extract enough "good" compounds to see whether we need to extend the dictionary.

(One positive aspect is that the names are presumably contained in the mass spec lookup tables, so they are probably "reasonably well" standardised.)
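
The split-before-first-match behaviour can be sketched in Python as follows. This is an illustration only: the real query semantics in AMI are not shown in this thread, so the sketch assumes that `OR` and `NOT` are the only operators and that terms are matched as case-insensitive substrings (which appears consistent with "Monoterpene hydrocarbons" starting a footer in the log above). The names `parse_query`, `matches` and `split_before_first_match` are hypothetical.

```python
import re


def parse_query(text):
    """Parse 'a OR b NOT c' into (include_terms, exclude_terms), lower-cased.
    Assumption: OR/NOT are upper-case keywords; anything else is a term."""
    include_part, *exclude_parts = re.split(r"\bNOT\b", text)
    include = [t.strip().lower()
               for t in re.split(r"\bOR\b", include_part) if t.strip()]
    exclude = [t.strip().lower()
               for part in exclude_parts
               for t in re.split(r"\bOR\b", part) if t.strip()]
    return include, exclude


def matches(value, query):
    """True if the cell contains any include term and no exclude term."""
    include, exclude = query
    v = (value or "").lower()  # empty/missing cells never match
    return any(t in v for t in include) and not any(t in v for t in exclude)


def split_before_first_match(cells, query_text):
    """Return (body, footer): footer starts at the first matching cell."""
    query = parse_query(query_text)
    for i, cell in enumerate(cells):
        if matches(cell, query):
            return cells[:i], cells[i:]
    return cells, []


column = ["Thymol", "β-pinene", "Total", "Yield (w/w) %", "Others"]
body, footer = split_before_first_match(column, "total OR yield OR terpene")
print(len(body), len(footer))
```

Because everything after the first match goes into the footer, a summary row in the middle of a table would take all later compounds with it. That is the case the "all non-chemical names are at the bottom" caveat covers.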

check extracted body and footer

for each true composition table there is:

These may be easier to analyse in the browser. The extracted body is cyan and footer is yellow.

The original table has an implied body of 27 terpenes and an implied footer of 8 summary rows starting at Total. Check that the number of rows in each is identical between the original and the extracted table, and record any discrepancies:

create new columns

  • original composition
  • extracted composition

original should contain:

BODY 27
FOOTER 8

and extracted should be identical

If the extracted count disagrees, indicate this with an asterisk in the extracted column, e.g.
BODY 26 *

If the body or footer is missing, write
BODY 0 *
and/or
FOOTER 0 *
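
The marking convention above could be applied mechanically with something like the following Python sketch. The function name `extracted_entry` and the dict layout are assumptions for illustration. It also assumes that an asterisk is only added when the extracted count differs from the original, so `FOOTER 0` against an original `FOOTER 0` stays unmarked.

```python
def extracted_entry(original, extracted):
    """Build the lines for the 'extracted composition' cell.

    `original` and `extracted` are dicts such as {"BODY": 27, "FOOTER": 8}.
    A part missing from `extracted` is reported as 0.
    A trailing ' *' marks any count that differs from the original.
    """
    lines = []
    for part in ("BODY", "FOOTER"):
        got = extracted.get(part, 0)
        flag = "" if got == original.get(part, 0) else " *"
        lines.append(f"{part} {got}{flag}")
    return lines


# One body row lost during extraction.
print(extracted_entry({"BODY": 27, "FOOTER": 8}, {"BODY": 26, "FOOTER": 8}))
# Footer not extracted at all.
print(extracted_entry({"BODY": 27, "FOOTER": 8}, {"BODY": 27}))
```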

This is the analysis of the first 10 articles, with the added columns:


PMCID | raw_table_number | raw_filename | raw_table_title | extracted_subtable_name | matches20191121 | matches20191121_notes | matches20191121_compound | matches20191121_percent | original composition | extracted composition | graphic_table | compound_col_name | percent_col_name | additional_percent_col_names | notes | FN | FP |

PMC4391421 | Table 1 | table_1.xml | Chemical composition of thyme EO | composition_extracted_1.html | BODY 15 FOOTER 1 | BODY 12 FOOTER 0 |  | Constituents* | Area % of total |

PMC5080681 | Table 1 | table_1.xml | Chemical composition, concentrations (%) and calculated retention indices, of T. bovei essential oil as characterized by GC/MS analysis | composition_extracted_1.html | BODY 27 FOOTER 8 | BODY 27 FOOTER 8 | Constituents | % |  |

PMC5132230 | Table 1 | table_1.xml | Chemical composition of the Aeollanthus suaveolens essential oil. | composition_extracted_1.html | BODY 19 FOOTER 5 | Not extracted. | Compounds | Relative Percentage (%) |

PMC5203915 | Table 1 | table_1.xml | Percentage of composition of essential oils from Rhaponticum carthamoides roots of soil-grown plants (SGR) and hairy roots (HR). | composition_extracted_1.html | BODY 62 FOOTER 6 | BODY 62 FOOTER 6 | Constituent ; Class of compound | SGR [%] ; HR [%] | Two EO profiles. |

PMC5237462 | Table 1 | table_1.xml | Major constituents of the essential oils of M. piperita. | composition_extracted_1.html | FN | BODY 11 FOOTER 4 | BODY 11 FOOTER 4 | Components | Peak Area (%) |

PMC5248495 | Table 1 | table_1.xml | Chemical composition of essential oils of Ocimum basilicum var.purpureum, Ocium basilicum var. thyrsiflora, Ocimum citriodorum | composition_extracted_1.html | FN |  | BODY 33 FOOTER 0 | BODY 33 FOOTER 0 | Chemical components |  | O. basilicumvar.purpureum,%b ; O. basilicumvar.thyrsiflora,% ; O. xcitriodorum, | Three EO profile. |

PMC5282690 | TN | BODY 0 FOOTER 0 | Not extracted. |

PMC5307246 | TN | BODY 0 FOOTER 0 | Not extracted. |

PMC5307902 | Table 3 | table_3.xml | Percentage chemical composition of the essential oil from leaves of P. amboinicus by gas chromatography-mass spectrometry. | FN | FN |  | FN | FN | BODY 19 FOOTER 0 | Not extracted. | YES | FN | FN |  | No EO composition is extracted. | Compounds; Area (%) |  |

PMC5324201 | Table 9 | table_9.xml | Compound composition (% w/w) in the essential oil and water ... | composition_extracted_1.html |  | FP - table:  Proximate composition of Anethum sowa L. Root ; table:  Fatty acid composition of Anethum sowa L. root extract (cold and hot extracts) by GC |  | FN | BODY 24 FOOTER 1 | BODY 25 FOOTER 0 |  | Name of Compounds | FN |  | Not regular title. Multiple column headers are there. | Essential oil - Conc. (%); Water extract part - Conc. (%) |  |

Test sheet with added columns original composition and extracted composition - testsheetCompositionAnalysis20191126.tsv.
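
For checking a sheet like this in bulk, the count cells can be parsed and compared with a short Python helper. This is a sketch under assumptions: the cell text is taken to have the form `BODY n FOOTER m` (with or without a space between the two parts), and anything else, such as `Not extracted.`, is treated as "no counts available". The names `parse_counts` and `agrees` are hypothetical.

```python
import re

# Tolerates both "BODY 15 FOOTER 1" and the fused form "BODY 15FOOTER 1".
COUNTS = re.compile(r"BODY\s*(\d+)\s*FOOTER\s*(\d+)")


def parse_counts(cell):
    """Return (body, footer) as ints, or None if the cell has no counts."""
    m = COUNTS.search(cell or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def agrees(original_cell, extracted_cell):
    """True only when both cells carry counts and the counts are equal."""
    original = parse_counts(original_cell)
    extracted = parse_counts(extracted_cell)
    return original is not None and original == extracted


print(agrees("BODY 27 FOOTER 8", "BODY 27 FOOTER 8"))
print(agrees("BODY 15 FOOTER 1", "BODY 12 FOOTER 0"))
print(agrees("BODY 19 FOOTER 5", "Not extracted."))
```

Rows where nothing was expected (`BODY 0 FOOTER 0` against `Not extracted.`) are reported as disagreeing by this helper and would need to be handled separately.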

Sir, please go through the updated sheet for composition extraction - compositionAnalysis20191119.tsv.

Added columns - Original composition, Extracted composition and error*.

OK sir.

Sir, please go through these articles - PMC5590060, PMC5603114, PMC5933692. compositionAnalysis20191119.tsv.

Previously the composition was extracted, but this time it is a FN.

Also, should I verify compound_col_name and percent_col_name? Have any changes been made to them?


PMC5590060 | Table 1 | table_1.xml | Composition of E. foetidum essential oils. | FN | FN | BODY 34 FOOTER 5 | **Not extracted.** |  |  | **Compounds** | **%**

PMC5603114 | Table 1 | table_1.xml | Chemical composition of resin essential oil of P. heptaphyll ... | FN | BODY 23 FOOTER 0 | **Not extracted.** | **Constituents** |  | **Area (%) EOPh  Com. resins ; Area (%) EOPh  Nat. resins** | Two EO profiles. |

PMC5933692 | Table 1 | table_1.xml | Essential oil composition of G. rosmarinifolia. Compounds be ... |  | BODY 34 FOOTER 1 | **Not extracted.** |  | **Compound** | **Relative amount (%)** |

 

Sir, Please go through the revised composition extraction sheet - compositionAnalysis20191119.tsv

I have corrected the composition file names and marked the affected tables as FPs.

compound_col_name and percent_col_name are the same as before.