parsed TYPICAL PEDON output in JSON may be missing columns

It seem that jsonLite::toJSON() automatically drops columns which are all NA. This is news to me and the obvious solutions (null and na arguments) don't work. Setting na = 'string' causes NA to be encoded as character NA.

Example from an OSD that has no coarse fraction modifiers (cf_class is all NA).

structure(list(name = c("A", "Bw1", "Bw2", "Cr"), top = c(0, 
15, 28, 48), bottom = c(15, 28, 48, 56), dry_hue = c("10YR", 
"10YR", "10YR", "2.5Y"), dry_value = c(7, 7, 7, 8), dry_chroma = c(2, 
2, 2, 2), moist_hue = c("10YR", "10YR", "10YR", "2.5Y"), moist_value = c(4, 
5, 5, 7), moist_chroma = c(2, 3, 3, 4), texture_class = structure(c(13L, 
13L, 13L, NA), .Label = c("coarse sand", "sand", "fine sand", 
"very fine sand", "loamy coarse sand", "loamy sand", "loamy fine sand", 
"loamy very fine sand", "coarse sandy loam", "sandy loam", "fine sandy loam", 
"very fine sandy loam", "loam", "silt loam", "silt", "sandy clay loam", 
"clay loam", "silty clay loam", "sandy clay", "silty clay", "clay"
), class = c("ordered", "factor")), cf_class = c(NA_character_, 
NA_character_, NA_character_, NA_character_), pH = c(5.1, 5.1, 
5, 4.5), pH_class = structure(c(4L, 4L, 3L, 3L), .Label = c("ultra acid", 
"extremely acid", "very strongly acid", "strongly acid", "moderately acid", 
"slightly acid", "neutral", "slightly alkaline", "mildly alkaline", 
"moderately alkaline", "strongly alkaline", "very strongly alkaline"
), class = c("ordered", "factor")), distinctness = c("clear", 
"clear", "abrupt", NA), topography = c("wavy", "wavy", "wavy", 
NA), narrative = c("A--0 to 15 cm; light gray (10YR 7/2) loam, dark grayish brown (10YR 4/2) moist; moderate coarse subangular blocky structure; slightly hard, friable, slightly sticky and slightly plastic; many very fine roots; many very fine and few fine tubular and many very fine interstitial pores; 10 percent gravel; strongly acid (pH 5.1); clear wavy boundary. (3 to 20 cm thick)", 
"Bw1--15 to 28 cm; light gray (10YR 7/2) loam, brown (10YR 5/3) moist; weak medium subangular blocky structure; slightly hard, friable, slightly sticky and slightly plastic; common very fine roots; many very fine and common fine tubular and many very fine interstitial pores; 5 percent gravel; strongly acid (pH 5.1); clear wavy boundary.", 
"Bw2--28 to 48 cm; light gray (10YR 7/2) loam, brown (10YR 5/3) moist; weak medium subangular blocky structure; slightly hard, friable, slightly sticky and slightly plastic; many very fine and common fine tubular and many very fine interstitial pores; few thin colloidal stains on mineral grains, 10 percent gravel mostly at the base of horizon; very strongly acid (pH 5.0); abrupt wavy boundary. (Combined thickness of the Bw horizons is 23 to 41 cm)", 
"Cr--48 to 56 cm; pale yellow (2.5Y 8/2) weakly consolidated rhyolitic tuffaceous sediments, pale yellow (2.5Y 7/4) moist; very strongly acid (pH 4.5)."
)), class = "data.frame", row.names = c(NA, -4L))


jsonlite::toJSON(s, pretty = TRUE, auto_unbox = TRUE)

This isn't a show-stopping bug, but it does mean that 1-at-a-time processing of the current JSON files may give slightly different structured output at each iteration; especially when there are typos or formatting errors.

Possibly related, missing "OSD site" data result in a very strange data.frame when read-back via soilDB::get_OSD('ACADEMY', result = 'json')

I think the missing OSD site data could be relatively easily fixed here...

SoilKnowledgeBase/R/parseOSD_functions.R

Lines 209 to 215 in d7a5506

    
           l <- list() 
        
           l[['site-data']] <- .extractSiteData(x) 
        
           tp <- strsplit(as.character(x$`TYPICAL PEDON`$content), "\n") 
        
           if (length(tp) > 0) 
        
             l[['hz-data']] <- .extractHzData(tp[[1]]) 
        
           else 
        
             l[['hz-data']] <- data.frame(name = NA)

by using similar handlign of "site-data" as is done for "hz-data" in case when there are no columns in result

As far as the JSON itself, could we not store empty strings as placeholders for columns that had no parseable data? This would make many of the JSON files larger but should at least preserve the column names.

Reopening b/c only addressed the site-level stuff so far.

As far as the JSON itself, could we not store empty strings as placeholders for columns that had no parseable data? This would make many of the JSON files larger but should at least preserve the column names.

A good idea. Either way (NA as string | all-NA columns stripped) we will have to use some extra steps when processing the JSON files in bulk. Encoding NA as strings seems to work fine with mixed data (numeric, character, logical)--I wasn't sure if all types would be converted to character in the process. Might be simpler than attempting to fix structural problems with data.table::rbindlist() or similar.

I'd like to test the effects of `toJSON(..., na = 'string') as applied to the horizon data a little more.

My preference would be to store all variables in the JSON as strings. Then get_OSD() can handle type conversion and NA-handling. It is vectorized and with all OSD JSON having a complete structure the "fill" argument to rbindlist() should not be needed.

I tested na = 'string' #37 and it will work fine once get_OSD is set up to handle the resulting data.

Any reason to store numeric / logical as strings? It seems like the encoding / decoding of NA by to/fromJSON does this for us.

Wasn't sure, so had to try.

library(jsonlite)

x <- data.frame(
  a = c(1:10, NA), 
  b = c(NA, letters[1:10]),
  c = c(rep(TRUE, 5), NA, rep(FALSE, 5))
)

toJSON(x, pretty = TRUE)

fromJSON(toJSON(x, pretty = TRUE))

z <- fromJSON(toJSON(x, pretty = TRUE, na = 'string'))

str(z)
summary(z)

I looked at this... not quite right. Am I missing something?

x <- data.frame(
  a = c(1:10, NA), 
  b = c(NA, letters[1:10]),
  c = c(rep(TRUE, 5), NA, rep(FALSE, 5))
)

x2 <- jsonlite::fromJSON(jsonlite::toJSON(x, na = 'string', pretty = TRUE))

all.equal(x, x2)
#> [1] "Component \"b\": 'is.NA' value mismatch: 0 in current 1 in target"

That is expected: we only have to manually decode NA in character class columns. NA is correctly encoded/decoded for free. News to me.

I guess my concern was about cases where columns that would be numeric (say, pH) come back as all NA. This would be interpreted as logical, not numeric? Can we ensure that types will be stable when get_OSD() is combining results from multiple JSON files that may have varying patterns of NA?

Ah, I see. Good point. There is no way to suggest the kind of NA.

We would have to try the entire pile to know for sure, but the following suggests that real data take precedence over NA of unspecified type:

library(jsonlite)

x <- data.frame(
  a = c(1:10, NA), 
  b = c(NA, letters[1:10]),
  c = c(rep(TRUE, 5), NA, rep(FALSE, 5)),
  d = NA
)

y <- data.frame(
  a = c(1:10, NA), 
  b = NA,
  c = c(rep(TRUE, 5), NA, rep(FALSE, 5)),
  d = 1
)

z1 <- fromJSON(toJSON(x, pretty = TRUE, na = 'string'))
z2 <- fromJSON(toJSON(y, pretty = TRUE, na = 'string'))

z <- rbind(z1, z2)
str(z)
summary(z)

# order doesn't seem to matter
all.equal(str(z1, z2), str(z2, z1))

Looks good to me, and it seems that data.table::rbindlist() has same behavior as rbind(). I am not sure why the action didn't run on #37 before, but I manually triggered so we should see results of na = 'string' on whole set soon https://github.com/ncss-tech/SoilKnowledgeBase/actions/runs/1308803220

I believe with the updates to use na = 'string' and handling of "NA" and types in soilDB::get_OSD() this issue is now resolved.

library(soilDB)

x <- get_OSD(c("AABAB", "ACADEMY", "ARPATUTU", "CANISROCKS", "CONEJO", "MUSICK", "STANISLAUS"))
#> Loading required namespace: jsonlite
do.call('rbind', x$HORIZONS) |> str()
#> 'data.frame':    44 obs. of  16 variables:
#>  $ name         : chr  "Oi" "A1" "A2" "Bw1" ...
#>  $ top          : int  0 3 10 30 48 56 112 0 15 30 ...
#>  $ bottom       : int  3 10 30 48 56 112 155 15 30 51 ...
#>  $ dry_hue      : chr  NA "10YR" "10YR" "10YR" ...
#>  $ dry_value    : int  NA 5 6 6 7 7 7 4 3 4 ...
#>  $ dry_chroma   : int  NA 4 4 4 2 3 2 3 3 3 ...
#>  $ moist_hue    : chr  NA "10YR" "10YR" "10YR" ...
#>  $ moist_value  : int  NA 3 4 4 5 5 5 3 3 3 ...
#>  $ moist_chroma : int  NA 4 4 4 3 4 1 2 2 4 ...
#>  $ texture_class: chr  NA "silt loam" "silt loam" "silty clay loam" ...
#>  $ cf_class     : chr  NA NA NA NA ...
#>  $ pH           : num  NA NA 5.4 5.4 5.6 5.4 5.6 NA NA NA ...
#>  $ pH_class     : chr  NA "moderately acid" "strongly acid" "strongly acid" ...
#>  $ distinctness : chr  "abrupt" "clear" "clear" "abrupt" ...
#>  $ topography   : chr  "smooth" "smooth" "wavy" "wavy" ...
#>  $ narrative    : chr  "Oi--0 to 1 inch; slightly decomposed deciduous leaves and twigs; abrupt smooth boundary." "A1--1 to 4 inches; dark yellowish brown (10YR 3/4) medial silt loam, yellowish brown (10YR 5/4) dry; moderate m"| __truncated__ "A2--4 to 12 inches; dark yellowish brown (10YR 4/4) medial silt loam, light yellowish brown (10YR 6/4) dry; mod"| __truncated__ "Bw1--12 to 19 inches; dark yellowish brown (10YR 4/4) silty clay loam, light yellowish brown (10YR 6/4) dry; fe"| __truncated__ ...

I did a bunch of testing yesterday afternoon--all good. Thanks.

	l <- list()
	l[['site-data']] <- .extractSiteData(x)
	tp <- strsplit(as.character(x$`TYPICAL PEDON`$content), "\n")
	if (length(tp) > 0)
	l[['hz-data']] <- .extractHzData(tp[[1]])
	else
	l[['hz-data']] <- data.frame(name = NA)