ncss-tech/SoilKnowledgeBase

consider leaving section header out of parsed OSD content

dylanbeaudette opened this issue · 2 comments

I can't remember if we have talked about this or not, but I'd like to weigh the pros and cons of removing the section title from the parsed OSD content. We could do this automatically, so that the JSON files are "clean", or as a post-processing step. Either way, we need a way to exclude the headers for pattern-matching searches or the pending "dump" into new NASIS tables. Are there any reasons to leave the section headers in the text?

```r
library(soilDB)
g <- get_OSD("Marshall")
g$TYPE.LOCATION
```

```
TYPE LOCATION: Major Land Resource Area (MLRA) 107B-Iowa and Missouri Deep Loess Hills, Cass County, Iowa subset; about 3 miles northwest of Atlantic; located about 1,227 feet west and 245 feet south of the northeast corner of section 34, T. 77 N., R. 37 W.; USGS Atlantic topographic quadrangle; lat. 41 degrees 25 minutes 55 seconds N. and long. 95 degrees 05 minutes 03 seconds W., NAD 83.
```

The only reason to include them was that there is a fair amount of non-standard formatting and non-standard section headers, and sometimes sections are split apart and combined. For instance, "USE:" can be separated from "VEGETATION:" in the "USE AND VEGETATION" section.

To detect those issues, it is helpful to have the header content included in the text. We discussed this in #25 and decided to "keep as-is until collapsing and reordering sections into groups is removed; the only way to reliably deparse combined sections is if their headers are included".

There isn't anything doing QC on the generalized standard section groups vs. what is actually in the OSD at this point, but that was always my intention. This is something we talked about, and I would be happy to find a way to remove the headers, but it might need to be done as post-processing, or the way that split sections are handled would need to change.
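As a rough illustration of the kind of QC check described above, here is a minimal sketch that lists header-like tokens found inside a parsed section, so that split sections (for example "USE:" and "VEGETATION:" in one block) stand out. The function name `find_headers`, the regular expression, and the two sample strings are all made up for illustration; none of this is existing SoilKnowledgeBase or soilDB code, and real OSD headers may not all fit the pattern.

```r
# Hypothetical helper (not part of SoilKnowledgeBase or soilDB):
# list header-like tokens in each element of a character vector.
find_headers <- function(x) {
  # Assumed header shape: one capital letter, then capitals or spaces, then a colon.
  # perl = TRUE keeps [A-Z] restricted to ASCII capitals regardless of locale.
  m <- gregexpr("[A-Z][A-Z ]+:", x, perl = TRUE)
  regmatches(x, m)
}

# Made-up section text, not taken from a real OSD
txt <- c(
  "USE AND VEGETATION: Mostly cultivated. Native vegetation is tall grass prairie.",
  "USE: Mostly cultivated. VEGETATION: Tall grass prairie."
)

h <- find_headers(txt)

# More than one header-like token in a section suggests it was split apart
n_headers <- lengths(h)
flagged <- n_headers > 1
```

Here `flagged` would mark only the second string. A check like this only works while the headers are still present in the text, which is the argument for removing them afterwards rather than at parse time.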

Ah right, thanks for the reminder. Post-processing is totally fine.
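For reference, the post-processing step could look something like the sketch below. It assumes that each parsed section starts with a header made of capital letters and spaces followed by a colon, as in the TYPE LOCATION example above; `strip_section_header` is a hypothetical name, not an existing function. Only the leading header is removed, so headers embedded later in the text from combined sections are left in place.

```r
# Hypothetical post-processing helper (not part of SoilKnowledgeBase or soilDB):
# remove a leading "SECTION NAME:" header from parsed OSD section text.
strip_section_header <- function(x) {
  # Assumed header shape: optional whitespace, capitals and spaces, a colon,
  # then optional whitespace. Anchored at the start, so only the first header goes.
  sub("^\\s*[A-Z][A-Z ]*:\\s*", "", x, perl = TRUE)
}

# Truncated copy of the TYPE LOCATION text shown earlier in this issue
x <- "TYPE LOCATION: Major Land Resource Area (MLRA) 107B-Iowa and Missouri Deep Loess Hills, Cass County, Iowa subset"
clean <- strip_section_header(x)

# Text without a leading header, and missing values, pass through unchanged
unchanged <- strip_section_header(c("No header here.", NA))
```

Applied to a column such as `g$TYPE.LOCATION`, this would leave the JSON files as they are and produce "clean" text only where it is needed, for example before pattern matching.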