ianmcook/implyr

Implement access to complex columns (ARRAY, MAP, STRUCT)

henningsway opened this issue · 13 comments

As a beginner it is not immediatly clear to me, how to best use implyr to access Impala-complex types (especially maps, e.g. pull out a couple of columns and join them with the existing data-frame).

The link in the Readme is helpful (to create dbGetQuery()-requests), but a short example - possibly showing dplyr-logic - would be really cool as well. :)

Agreed, this would be useful, and I hope to do this soon. In most aspects the Impala SQL syntax is compatible with the HiveQL syntax used by Hive and Spark SQL, but Impala's syntax diverges when working with maps and arrays—Impala uses join syntax instead of lateral views. I think this does warrant an explanation and some examples in the readme, or possibly in a separate vignette about complex types.

Thank you for your answer and this cool package, which definately eases my transition towards bigger data sets. :)

Do I understand correctly, that the currently suggested way to access complex types would be to execute SQL-queries (via dbGetQuery()) and then joining the result from this query (which extracted the complex type) with the result from a executed dplyr statement?

(I was hoping, that there would be some pipeable dplyr-frontend to extract some parts of a complex type, similar to map(df, "column") or tidyr::unnest or so.)

Currently for array and map columns, yes, I think that's right. I'm doing some experimental work right now to implement access to array and map columns in a dplyr-friendly, tidyr-style way. I'll let you know when I have something ready for you to try.

Sounds interesting and I will surely take that idea for a testdrive. Would be nice to stay within one system, even though the Impala SQL Syntax seems surprisingly accessible as well.

I implemented support for complex columns in the development version of implyr on GitHub. This is currently an experimental feature; the implementation is somewhat hacky and there are some limitations as described below.

@henningsway can you try this out on your data? First install the latest implyr from GitHub:
devtools::install_github("ianmcook/implyr")
Then use the impala_unnest() function, like this:
my_tbl_impala %>% impala_unnest(complex_column)

Current limitations:

  • You can use dplyr verbs after impala_unnest() but not before.
  • ARRAY, MAP, and STRUCT columns are all supported, but currently it only works with one complex column per table.
  • It currently only works with complex columns that contain scalar types, not complex types nested within complex types (like an ARRAY of STRUCT).
  • There are no examples showing how to use this function in the README yet.

This sounds exciting. I will try it soon and let you know.
Unfortunately, I already know, that I will encounter recurring nested columns (usually two levels) regularly. Let me see, what's already possible with the current tooling.
Out of curiosity, what kind of complexity is introduced by multiple levels of complex columns?

I plan to implement support for arbitrarily many nested levels of complex types, but I wanted to solve the simpler case first. There are some complications and design choices I need to consider more; for example, should a single call to the impala_unnest() function perform recursive unnesting, or should the user need to call the function multiple times?

This is an interesting question. First I thought, that repeated calling of imapala_unnest() would be my preference, as it seemed clearer to me. But being able to extract a repeatedly nested "layer" in one line of code also seems attractive!

I've tried to get the unnesting working for me, but I haven't yet been successful.

This is the kind of code I tried

sometable_tbl <- tbl(impala, in_schema("my_schema", "sometable"))
sometable_tbl %>% 
  impala_unnest(nameofcomplexcolumn)

This is the kindo of error I got

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000:
[Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] :
AnalysisException: Could not resolve table reference: 'my_schema.sometable'

I noticed, that the complex column is not represented in sometable_tbl, when I preview it in Rstudio or collect the first 1000 lines or so.

PS: I think it may have to do with the complex column having a complex column nested within itsself. I will try to find another example to work on soon and get back to you.

Thanks for testing this! There was a bug in how column names were quoted. I just fixed this in 21a8557. This should resolve the error you observed. But I have not yet implemented support for multiple levels of nesting (complex columns within complex columns). Hoping to do that soon.

Just started using sparklyr a little more. The functionality from https://github.com/mitre/sparklyr.nested seem quite interesting and possibly relevant for implyr as well?

TBD: Look at tidyverse/dbplyr#158 and test with Impala

TBD: Investigate if this can be redesigned for consistency with tidyr's new unnest_wider() and unnest_longer() functions