JorisChau/rrapply

Handling repeated list elements with the same name

siddharthprabhu opened this issue · 7 comments

@JorisChau First of all, let me say thank you for writing this awesome package! In my organization, I need to work with deeply nested XMLs and rrapply seems like just the thing I need to unnest everything into a wide data frame for easier analysis.

I've gotten quite far and have a working prototype. However, I've run into a roadblock; I'm unable to figure out how to handle repeated list elements with the same name. The reprex below shows a simplified version of my data structure.

library(rrapply)

example <-
  list(Body = list(
    Cust = list(
      Name = list("ABC"),
      Ctct = structure(list("abc@acme.com"), Mode = "EML"),
      Ctct = structure(list("18001234567"), Mode = "TEL")
    ),
    Cust = list(
      Name = list("DEF"),
      Ctct = structure(list("def@acme.com"), Mode = "EML"),
      Ctct = structure(list("18007654321"), Mode = "TEL")
    )
  ))

rrapply(example, how = "bind", options = list(coldepth = 1))
#>   Body.Cust.Name.1 Body.Cust.Ctct.1
#> 1              DEF      18007654321

Using coldepth = 1 gives me the desired output format i.e. one row per input XML/list. However, it looks like rrapply only retains the last value for repeating data elements (such as Name and Ctct).

If I set coldepth = 0, I get all the values but then do not get the names which makes it hard to know which data element the value pertains to.

rrapply(example, how = "bind", options = list(coldepth = 0))
#>              1
#> 1          ABC
#> 2 abc@acme.com
#> 3  18001234567
#> 4          DEF
#> 5 def@acme.com
#> 6  18007654321

Created on 2022-09-04 with reprex v2.0.2

I'm very new to rrapply, so could use your help in understanding if I'm missing something fundamental in how to accomplish this task.

P.S. From the documentation, I understand that list attributes get dropped when using how = "bind" since the output is no longer a list. I need to retain list attributes for my application, so I have written a small helper function which converts attributes into data elements before running rrapply. This works just fine (although it is a little slow). However, if you can suggest a better way to go about this, that would be a great help.

@siddharthprabhu: as you've already experienced, the option how = "bind" is not ideally suited for dealing with duplicated column names in the replicated observations.

For context, when unnesting replicated observations, rrapply() expects all column names to be unique, since duplicated column names should not normally occur and make it ambiguous which column a value belongs to. Internally, rrapply() first allocates the data.frame and then populates it by matching the observations by name, not by position. This choice was made because the names across observations are not always identical and do not always occur in the same order, a case that matching by name handles correctly. Duplicated names, on the other hand, are generally not expected and may be handled incorrectly. Your example illustrates this second case: the first Ctct values are overwritten by the second Ctct values instead of being assigned to a separate column.
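To make the overwriting concrete, here is a minimal base-R sketch of what assignment by name does with a duplicated name. It only illustrates the behaviour described above and is not rrapply()'s actual internal code; the objects obs and filled are made up for this example.

```r
## one observation with a duplicated name, as in the example data
obs <- list(Name = "ABC", Ctct = "abc@acme.com", Ctct = "18001234567")

## populate a result by matching on names rather than positions
filled <- list()
for (i in seq_along(obs)) {
  filled[[names(obs)[i]]] <- obs[[i]]
}

## only one Ctct entry survives, holding the value assigned last
names(filled)
#> [1] "Name" "Ctct"
filled$Ctct
#> [1] "18001234567"
```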

First, including namecols = TRUE as an option with how = "bind" includes all parent names in the wide data.frame, so this should address the question about including the names when coldepth = 0:

## include parent names
ex <- example  ## the list from the question
rrapply(
  ex, 
  how = "bind",
  options = list(coldepth = 0, namecols = TRUE)
)
#>     L1   L2   L3            1
#> 1 Body Cust Name          ABC
#> 2 Body Cust Ctct abc@acme.com
#> 3 Body Cust Ctct  18001234567
#> 4 Body Cust Name          DEF
#> 5 Body Cust Ctct def@acme.com
#> 6 Body Cust Ctct  18007654321

Rather than the code above, it is better to use how = "melt", which is meant for exactly this purpose:

## include parent names (how = "melt")
rrapply(ex, how = "melt")
#>     L1   L2   L3 L4        value
#> 1 Body Cust Name  1          ABC
#> 2 Body Cust Ctct  1 abc@acme.com
#> 3 Body Cust Ctct  1  18001234567
#> 4 Body Cust Name  1          DEF
#> 5 Body Cust Ctct  1 def@acme.com
#> 6 Body Cust Ctct  1  18007654321

To address the issue with the non-unique column names, my advice would be to first ensure uniqueness of the resulting column names (rrapply() purposefully does not modify column names to avoid any unexpected behavior). In the given example, the most obvious approach seems to be creating separate lists Ctct_EML and Ctct_TEL:

## make unique names
ex_mode <- rrapply(
  ex,
  condition = \(x, .xname) .xname == "Ctct",
  f = \(x) paste("Ctct", attr(x, "Mode"), sep = "_"),
  how = "names"
)
str(ex_mode)
#> List of 1
#>  $ Body:List of 2
#>   ..$ Cust:List of 3
#>   .. ..$ Name    :List of 1
#>   .. .. ..$ : chr "ABC"
#>   .. ..$ Ctct_EML:List of 1
#>   .. .. ..$ : chr "abc@acme.com"
#>   .. .. ..- attr(*, "Mode")= chr "EML"
#>   .. ..$ Ctct_TEL:List of 1
#>   .. .. ..$ : chr "18001234567"
#>   .. .. ..- attr(*, "Mode")= chr "TEL"
#>   ..$ Cust:List of 3
#>   .. ..$ Name    :List of 1
#>   .. .. ..$ : chr "DEF"
#>   .. ..$ Ctct_EML:List of 1
#>   .. .. ..$ : chr "def@acme.com"
#>   .. .. ..- attr(*, "Mode")= chr "EML"
#>   .. ..$ Ctct_TEL:List of 1
#>   .. .. ..$ : chr "18007654321"
#>   .. .. ..- attr(*, "Mode")= chr "TEL"

At this point we can unnest the list to a wide data.frame without any conflicts in the column names, using e.g.:

## unnest list
rrapply(
  ex_mode, 
  how = "bind", 
  options = list(coldepth = 3, namecols = TRUE)
)
#>     L1   L2 Name.    Ctct_EML.   Ctct_TEL.
#> 1 Body Cust   ABC abc@acme.com 18001234567
#> 2 Body Cust   DEF def@acme.com 18007654321

More generally, R's make.unique() function may be useful to ensure uniqueness of the column names.
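As a small sketch of how make.unique() behaves on duplicated names (the vector nms and the list cust are made up for illustration and only mimic one level of the example data):

```r
## duplicated names as they occur within one Cust element
nms <- c("Name", "Ctct", "Ctct")

## default separator is "."
make.unique(nms)
#> [1] "Name"   "Ctct"   "Ctct.1"

## a different separator can be chosen
make.unique(nms, sep = "_")
#> [1] "Name"   "Ctct"   "Ctct_1"

## applied to the names of a single list level
cust <- list(Name = "ABC", Ctct = "abc@acme.com", Ctct = "18001234567")
names(cust) <- make.unique(names(cust), sep = "_")
names(cust)
#> [1] "Name"   "Ctct"   "Ctct_1"
```

Note that the suffixes only encode the order of appearance, so names derived from the Mode attribute, as above, are more informative whenever such an attribute is available.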

Another option is to name the (unnamed) bottom-level list elements based on their positions, e.g.:

## make unique names (by position)
ex_named <- rrapply(
  ex,
  condition = \(x) is.character(x),
  f = \(x, .xpos) {
    switch(.xpos[length(.xpos) - 1],
           `1` = "name",
           `2` = "ctct_eml",
           `3` = "ctct_tel"
    )
  },
  how = "names"
)
str(ex_named)
#> List of 1
#>  $ Body:List of 2
#>   ..$ Cust:List of 3
#>   .. ..$ Name:List of 1
#>   .. .. ..$ name: chr "ABC"
#>   .. ..$ Ctct:List of 1
#>   .. .. ..$ ctct_eml: chr "abc@acme.com"
#>   .. .. ..- attr(*, "Mode")= chr "EML"
#>   .. ..$ Ctct:List of 1
#>   .. .. ..$ ctct_tel: chr "18001234567"
#>   .. .. ..- attr(*, "Mode")= chr "TEL"
#>   ..$ Cust:List of 3
#>   .. ..$ Name:List of 1
#>   .. .. ..$ name: chr "DEF"
#>   .. ..$ Ctct:List of 1
#>   .. .. ..$ ctct_eml: chr "def@acme.com"
#>   .. .. ..- attr(*, "Mode")= chr "EML"
#>   .. ..$ Ctct:List of 1
#>   .. .. ..$ ctct_tel: chr "18007654321"
#>   .. .. ..- attr(*, "Mode")= chr "TEL"

And we can again unnest the list to a wide data.frame without any conflicts in the column names:

rrapply(ex_named, how = "bind", options = list(coldepth = 3))
#>   Name.name Ctct.ctct_eml Ctct.ctct_tel
#> 1       ABC  abc@acme.com   18001234567
#> 2       DEF  def@acme.com   18007654321

Edit: this has been resolved in v1.2.6.

@siddharthprabhu: this has been addressed in release v1.2.6.

Duplicate list names are now identified correctly in how = "bind" and no longer get overwritten. Note: the returned column names are made unique with base R's make.unique(). To illustrate using the example data above:

## data
ex <- list(Body = list(
    Cust = list(
      Name = list("ABC"),
      Ctct = structure(list("abc@acme.com"), Mode = "EML"),
      Ctct = structure(list("18001234567"), Mode = "TEL")
    ),
    Cust = list(
      Name = list("DEF"),
      Ctct = structure(list("def@acme.com"), Mode = "EML"),
      Ctct = structure(list("18007654321"), Mode = "TEL")
    )
  ))

## bind with name duplicates
rrapply(ex, how = "bind", options = list(coldepth = 3))
#>   Name.1       Ctct.1    Ctct.1.1
#> 1    ABC abc@acme.com 18001234567
#> 2    DEF def@acme.com 18007654321

And also including the parent names as individual columns:

## bind with name duplicates and parent names
rrapply(ex, how = "bind", options = list(namecols = TRUE, coldepth = 3))
#>     L1   L2 Name.1       Ctct.1    Ctct.1.1
#> 1 Body Cust    ABC abc@acme.com 18001234567
#> 2 Body Cust    DEF def@acme.com 18007654321
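If more descriptive column names than the make.unique() suffixes are preferred, one possible approach is to derive them from the Mode attributes of the original list and assign them afterwards. The following is only a base-R sketch under some assumptions: every record has its elements in the same order, the data.frame wide is built by hand here to stand in for the bind result without parent-name columns, and the names cust, modes, new_names and wide are made up for illustration.

```r
## a single customer record from the example data
cust <- list(
  Name = list("ABC"),
  Ctct = structure(list("abc@acme.com"), Mode = "EML"),
  Ctct = structure(list("18001234567"), Mode = "TEL")
)

## Mode attribute of each element, NA where it is absent
modes <- vapply(
  cust,
  function(el) {
    m <- attr(el, "Mode")
    if (is.null(m)) NA_character_ else m
  },
  character(1)
)

## append the Mode to the element name where one exists
new_names <- unname(
  ifelse(is.na(modes), names(cust), paste(names(cust), modes, sep = "_"))
)
new_names
#> [1] "Name"     "Ctct_EML" "Ctct_TEL"

## hand-built stand-in for the wide data.frame shown above
wide <- data.frame(
  Name.1 = c("ABC", "DEF"),
  Ctct.1 = c("abc@acme.com", "def@acme.com"),
  Ctct.1.1 = c("18001234567", "18007654321")
)
names(wide) <- new_names
```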

@JorisChau Awesome! Thank you so much for adding this feature. I was struggling with incorporating make.unique into my parsing function. You've saved me a ton of work. 👍

Any idea when the new version's Windows binaries will be available on CRAN?

That is surprising; it usually takes only a few days before the Windows binaries are up to date on CRAN. In the meantime, perhaps install the source version of the package?

Yes, I've done that on my local machine but I don't have Rtools on my work machine and can only install binaries. I was just curious if there was anything that devs need to do to get the binaries published on CRAN but I guess not. No worries, I'll just wait a bit longer. Thanks!

I see. No, indeed: R package binaries are compiled and published automatically by CRAN.