USCDataScience/parser-indexer-py

Update LPSC PDF parsing to remove the LPSC page header

Closed this issue · 2 comments

wkiri commented

For some LPSC PDFs, the extracted text content accidentally includes the page header, which ends up spliced into the middle of sentences.

From PHX sentences.csv:

2009_1671|2009_1671-3|The ice surface excavated at Snow White is smooth and looks dark because it is mixed with finely distrib40th Lunar and Planetary Science Conference (2009) uted dust.
2009_2097|2009_2097-7|This can be seen in the tailings of the eastern sections of the trench as well as in the 40th Lunar and Planetary Science Conference (2009) dump pile called Bee Tree, please see Figures 3 and 4.

There are some cases of this in the MPF sentences.csv as well:

1998_1378|1998_1378-42|It is of roughly ovoidal shape so we call it Lunar and Planetary Science XXIX MARS PATHFINDER ROCK MORPHOLOGY: A.
2000_1422|2000_1422-4|Results: Figure 2 shows two spectral samples from different rocks (Frog in green and Moe in red) that Lunar and Planetary Science XXXI SPECTRAL ROCK TYPES AT MPF: R.
2000_1846|2000_1846-1|The far field data, which encompasses a much larger area (and so is less subject to the problem of having one large rock, such as Yogi, in a small counting area) shows that the cumulative number of Lunar and Planetary Science XXXI MARS PATHFINDER ROCKS: M.
2000_1952|2000_1952-0|Figure 1: Atmospheric Contribution to Alpha Mode 12x10 3 10 86 42 0In te ns it y ( Co un ts /1 00 ,0 00 s ec )16014012010080604020 Channel Raw Spectrum for Barnacle Bill Barnacle Bill Modeled Atmospheric Peaks Barnacle Bill minus Atmospheric Peaks Raw Spectrum for AGV1 in Martian Conditions AGV1 minus Atmospheric Peaks AGV1 in Vacuum Example of endpoint shift Vacuum vs CO2 oxygen endpoint (~5 channel shift left due to lower incoming alpha energy) Lunar and Planetary Science XXXI Pathfinder APXS Calibration: C.
2001_1293|2001_1293-0|Sample Na Mg Al Si P S Cl K Ca Ti Cr Mn Fe3+ Soils A-4, soil 0.7 6.0 4.4 19.9 0.8 3.0 0.57 0.50 4.3 0.6 0.1 0.6 13.7 A-5, soil 0.8 5.6 4.6 19.1 0.7 2.6 0.55 0.43 4.7 0.5 0.3 0.3 16.1 A-10, soil 1.0 4.9 3.9 19.5 0.4 2.8 0.53 0.37 4.9 0.6 0.2 0.4 16.5 A.15, soil 0.7 4.5 4.0 20.5 0.4 2.4 0.54 0.72 4.7 0.7 0.2 0.4 16.1 Mean Soil 0.8 5.2 4.2 19.8 0.4 2.7 0.55 0.50 4.7 0.6 0.2 0.4 15.6 Cemented soil A-8, Scooby Doo 1.2 4.4 4.8 21.3 0.3 2.5 0.55 0.65 5.8 0.7 -- 0.4 13.1 Rocks Fe2+ A-3, Barnacle Bill 1.3 1.9 5.8 25.2 0.6 1.1 0.41 1.07 4.3 0.6 0.1 -- 12.6 A-7, Yogi 0.9 4.0 5.1 23.3 0.4 2.0 0.50 0.72 5.3 0.5 -- 0.4 13.0 A-16, Wedge 1.7 2.8 5.4 22.7 0.4 1.3 0.41 0.79 5.8 0.6 -- 0.5 14.7 A-17, Shark 1.5 2.1 5.3 25.8 0.4 0.8 0.38 0.94 6.3 0.4 0.03 0.4 11.5 A-18, Half Dome 1.3 2.4 5.8 24.2 0.4 1.2 0.37 0.91 4.7 0.5 -- 0.4 14.1 Average Error [%] 40 10 7 10 20 20 15 10 10 20 50 25 5 Calculated Rock Soil-free rock 1.8 0.90 5.80 26.5 0.4 0.3 0.32 1.12 5.7 0.4 0.4 12.1 Lunar and Planetary Science XXXII (2001) REVISED DATA OF MARS PATHFINDER APXS, Brückner et al.
2002_1771|2002_1771-7|Using 3D models deduced from a stereo pair of Yogi acquired Lunar and Planetary Science XXXIII (2002) THE TRUE COLOR OF YOGI C.

# And "Lunar and Planetary Science Conference (201x)"
content_ann = re.sub(r'([0-9][0-9].. Lunar and Planetary Science Conference \(201[0-9]\))',
                     '', content_ann,
                     flags=re.IGNORECASE)

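The regular expression above only matches conference years of the form 201x, so headers such as "40th Lunar and Planetary Science Conference (2009)" are not removed. To fix these problems, we should expand the years that the regular expression accepts.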
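As a rough illustration of expanding the years, here is a minimal sketch that widens the year part of the pattern from 201x to any 19xx or 20xx year. The names `LPSC_CONF_HEADER` and `strip_lpsc_conf_header` are hypothetical and not taken from the repository, and the accepted year range is an assumption. The fix actually committed may differ.

```python
import re

# Hypothetical generalization of the 201x-only pattern: accept any year
# written as 19xx or 20xx. The two-digit ordinal prefix (e.g. "40th") is
# kept exactly as in the existing expression.
LPSC_CONF_HEADER = re.compile(
    r'[0-9][0-9].. Lunar and Planetary Science Conference \((?:19|20)[0-9]{2}\)',
    flags=re.IGNORECASE)


def strip_lpsc_conf_header(text):
    """Remove 'NNth Lunar and Planetary Science Conference (YYYY)' headers."""
    return LPSC_CONF_HEADER.sub('', text)


if __name__ == '__main__':
    sample = ('it is mixed with finely distrib40th Lunar and Planetary '
              'Science Conference (2009) uted dust.')
    print(strip_lpsc_conf_header(sample))
    # prints: it is mixed with finely distrib uted dust.
```

This sketch only removes the header text. It leaves the surrounding whitespace in place, so a word split by the header ("distrib uted") is not rejoined. It also does not cover the Roman-numeral form seen in the MPF examples ("Lunar and Planetary Science XXIX" followed by a running title), which would need a separate pattern.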

wkiri commented

I have confirmed that these code updates solve the header issue and UTF-8 degree symbol issue, and that parse_all.py runs with our current CoreNLP 4.2.0.