ncss-tech/SoilKnowledgeBase

Changes to OSD formatting

dylanbeaudette opened this issue · 4 comments

Several changes to the OSD formatting standards (NSSH) may cause further inconsistency among OSD formatting styles encountered within the entire collection.

  1. Conversion of doubled hyphen-minus delimiters (--) to em dash (—) in all sections. This is most likely to affect parsing of the TYPICAL PEDON section. See .extractHzData().

  2. Section headers will now use title case: "TYPICAL PEDON:" → "Typical Pedon:". This may affect all parsing related to finding section headers, and the downstream use of list element names if those are changed to match.

  3. It is not clear whether the encoding of the text or HTML files will change. Will the new files be Unicode?

The TYPICAL PEDON section is also modified such that the short narrative is on its own line:

Typical Pedon:
Gamma silt loam with a north-facing, linear, 1 percent slope in an alfalfa field at an elevation of 210 meters. (Colors are for dry soil unless otherwise noted.) 

Ap—0 to 15 centimeters; grayish brown (10YR 5/2) silt loam, very dark grayish brown (10YR 3/2) moist; weak fine granular structure; slightly hard, friable; neutral (pH 6.7 in 1:1 water); abrupt smooth boundary. (10 to 23 centimeters thick)

C—15 to 33 centimeters; stratified grayish brown (10YR 5/2) and light brownish gray (10YR 6/2) silt loam, very dark grayish brown (10YR 3/2) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.7 in 1:1 water); abrupt smooth boundary. (15 to 30 centimeters thick)

Cg1—33 to 48 centimeters; stratified dark gray (10YR 4/1) and grayish brown (10YR 5/2) silt loam, very dark gray (10YR 3/1) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.8 in 1:1 water); abrupt smooth boundary. (10 to 25 centimeters thick)

Cg2—48 to 81 centimeters; stratified grayish brown (10YR 5/2) and light brownish gray (10YR 6/2) silt loam, very dark grayish brown (10YR 3/2) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.9 in 1:1 water); abrupt smooth boundary. (25 to 51 centimeters thick)

Agb1—81 to 112 centimeters; dark gray (10YR 4/1) silt loam, very dark gray (10YR 3/1) moist; massive; hard, friable; neutral (pH 6.8 in 1:1 water); gradual wavy boundary. (0 to 38 centimeters thick)

Agb2—112 to 153 centimeters; dark gray (N 4/) silt loam, black (N 2.5/) moist; massive; hard, friable; neutral (pH 6.8 in 1:1 water).

Ideas on checking encoding of text files. I have no idea if this will change, or how the download process modifies (?) the encoding.

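# list the OSD text files for one letter of the collection
f <- list.files(path = "e:/working_copies/OSDRegistry/OSD/D/", full.names = TRUE)

x <- lapply(f, function(i) {
  # guess_encoding() may return several candidates; keep only the first (highest-confidence) row
  .e <- readr::guess_encoding(i, n_max = 1000)[1, ]
  # strip the literal ".txt" extension (escaped and anchored, so "." is not a regex wildcard)
  .osd <- gsub('\\.txt$', '', basename(i))
  .res <- data.frame(osd = .osd, encoding = .e$encoding, confidence = .e$confidence)

  return(.res)
})

x <- do.call('rbind', x)

# tabulate the guessed encodings across all files
table(x$encoding)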

Thanks for spelling this out here. I was aware of the changes to the Part 614 OSD section but hadn't realized they included these minor formatting changes.

Item 1 should be easily fixable right now by adding \\u2014 to the set of allowed separator characters.
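As a rough illustration of that idea, here is a minimal sketch of a horizon-line pattern that accepts either separator. This is not the actual .extractHzData() code: `hz_sep`, `hz_pattern`, and `parse_hz_line()` are hypothetical names, and the pattern only covers the simple "designation, separator, top to bottom, units" layout shown in the example above.

```r
# Assumed separators: doubled hyphen-minus ("--") or em dash (U+2014).
hz_sep <- "(--|\u2014)"

# Hypothetical pattern: designation, separator, top depth, "to", bottom depth, units.
hz_pattern <- paste0(
  "^\\s*(\\S+?)\\s*", hz_sep,
  "\\s*([0-9.]+)\\s+to\\s+([0-9.]+)\\s+(centimeters|cm|inches)"
)

# Parse one horizon line; returns NULL when the line is not a horizon line.
parse_hz_line <- function(x) {
  m <- regmatches(x, regexec(hz_pattern, x, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)
  # m[1] is the full match, m[3] is the separator itself
  data.frame(
    name = m[2],
    top = as.numeric(m[4]),
    bottom = as.numeric(m[5]),
    units = m[6],
    stringsAsFactors = FALSE
  )
}

# old-style and new-style lines should parse to the same values
old_style <- parse_hz_line("Ap--0 to 15 centimeters; grayish brown (10YR 5/2) silt loam")
new_style <- parse_hz_line("Ap\u20140 to 15 centimeters; grayish brown (10YR 5/2) silt loam")
```

Writing the em dash as the escape "\u2014" rather than as a literal character keeps the pattern independent of the encoding of the source file.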

  2. Section headers will now use title case: "TYPICAL PEDON:" → "Typical Pedon:". This may affect all parsing related to finding section headers, and the downstream use of list element names if those are changed to match.

Seriously? We handle this in some instances (for Typical Pedon specifically). But this very well could break tons of things with little to no benefit.
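If it does come to that, one option is to normalize case before matching, so that downstream list element names stay upper case. A minimal sketch, assuming a fixed list of known headers: `section_headers` is an illustrative subset and `find_section_header()` is a hypothetical helper, not a function in the current codebase.

```r
# Illustrative subset of section headers; the parser's real list may differ.
section_headers <- c("TYPICAL PEDON", "TYPE LOCATION", "RANGE IN CHARACTERISTICS")

# Return the canonical (upper-case) header for a line, or NA if it is not a header.
find_section_header <- function(line, headers = section_headers) {
  # keep the text before the first colon, trim it, and upper-case it
  key <- toupper(trimws(sub(":.*$", "", line)))
  idx <- match(key, headers)
  if (is.na(idx)) NA_character_ else headers[idx]
}

find_section_header("TYPICAL PEDON:")
find_section_header("Typical Pedon:")
```

Both calls return the same canonical name, so code that indexes results by "TYPICAL PEDON" would not need to change.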

I want to hold off on any changes to the codebase until we actually see these changes coming in via OSDRegistry. No need to change anything unless it is causing parsing problems.

  3. It is not clear whether the encoding of the text or HTML files will change. Will the new files be Unicode?

Yes, these changes, if implemented, will change the encoding (or inferred encoding) of the files.

Currently the HTML has no declared encoding, but the W3C validator detects it as windows-1252, e.g. https://validator.w3.org/nu/?doc=https%3A%2F%2Fsoilseries.sc.egov.usda.gov%2FOSD_Docs%2Fb%2FBOOMER.html

Oops, my mistake: if the encoding is indeed intended to be "windows-1252", then the em dash is included in that character set, so its presence alone would not change the encoding.
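A quick way to confirm this from R is a small sketch using base iconv(). It assumes the local iconv build recognizes the name "CP1252" (a common alias for windows-1252); accepted encoding names are platform dependent. In windows-1252 the em dash is the single byte 0x97.

```r
# em dash written as a Unicode escape, so the script itself stays ASCII
em_dash <- "\u2014"

# convert to windows-1252 and inspect the raw bytes
cp1252_bytes <- iconv(em_dash, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
print(cp1252_bytes)

# round trip: the single byte 0x97, read as windows-1252, is the em dash again
round_trip <- iconv(rawToChar(as.raw(0x97)), from = "CP1252", to = "UTF-8")
print(round_trip == em_dash)
```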

Closing this issue as there have been no significant systematic changes to OSD formatting. We can address specific problems if/when they trickle in.