STRUCT support in Verdict for scrambles

Question

Opened this issue 6 years ago · 5 comments

Hi guys,

One of our tables has recently started receiving data in the form of a struct (array / row).

For example:

{city=Jackson, state=WY, zip=83001, county=Teton, msa=null, country=US} 

{city=Cheyenne, state=WY, zip=82001, county=Laramie, msa=null, country=US}

{city=Gillette, state=WY, zip=82718, county=Campbell, msa=null, country=US}

I was wondering how Verdict builds its scrambles from this kind of data. Is this a data structure you actively support? Can each of the nested fields be aggregated quickly?

For example:

SELECT count(distinct(Location.city)) from table

Our scramble performance has dropped significantly, but we aren't sure whether this is related to the new struct column.

Answer 1 · 2019-03-14T03:51:04.000Z

VerdictDB should just work with this data. One possible reason for the slowdown is that columnar formats may not be very efficient for such data types.

If you can load sample data into the cluster, we may be able to test it.

Answer 2 · 2019-03-14T04:00:07.000Z

@dongyoungy Can you ask someone to investigate this by comparing different storage formats for our scramble tables? Maybe we can try different formats (e.g., ORC or Parquet) with different compression schemes.

Answer 3 · 2019-03-14T05:24:36.000Z

I'm unsure about the internals, but yes, I agree that structs in a columnar format are probably not ideal. They seem to be the preferred way in BigQuery (where this data originated). We are considering flattening them out as a last resort, but we would prefer to get some information on exactly how Verdict handles this before we do anything drastic :)
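
For reference, the flattening mentioned above could look roughly like this. It is only a sketch in Python, with an in-memory SQLite table standing in for the real warehouse. The `flatten` helper, the `flat_table` name, and the `Location_city` column naming are hypothetical and not part of VerdictDB. The zip codes are assumed to be strings.

```python
import sqlite3

# Sample rows copied from the question; "Location" is the struct column.
# Assumption: zip codes are stored as strings.
rows = [
    {"Location": {"city": "Jackson", "state": "WY", "zip": "83001",
                  "county": "Teton", "msa": None, "country": "US"}},
    {"Location": {"city": "Cheyenne", "state": "WY", "zip": "82001",
                  "county": "Laramie", "msa": None, "country": "US"}},
    {"Location": {"city": "Gillette", "state": "WY", "zip": "82718",
                  "county": "Campbell", "msa": None, "country": "US"}},
]


def flatten(record, parent="", sep="_"):
    """Turn nested dicts into one flat dict with prefixed keys (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the struct and prefix its fields with the parent name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


flat_rows = [flatten(r) for r in rows]
columns = list(flat_rows[0])  # Location_city, Location_state, ...

# Load the flattened rows into a plain table with one column per struct field.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE flat_table ({', '.join(columns)})")
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    f"INSERT INTO flat_table VALUES ({placeholders})",
    [tuple(r[c] for c in columns) for r in flat_rows],
)

# The query from the question, rewritten against the flattened column.
distinct_cities = conn.execute(
    "SELECT count(DISTINCT Location_city) FROM flat_table"
).fetchone()[0]
print(distinct_cities)
```

With flat columns like these, the scramble would only contain primitive types, which is the case we would be falling back to.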

Answer 4 · 2019-04-01T15:55:49.000Z

@dongyoungy Can you ask @Beastjoe to investigate this issue? I see two related problems:

1. Performance when the table contains an array or struct
2. Possible performance degradation when samples keep being appended

Answer 5 · 2019-04-01T22:16:11.000Z

Just an FYI, we are refactoring our tables away from this due to performance issues with these data structures. In BigQuery, however, these are the preferred structures (and fairly efficient), so it might be something you want to look at for that side of things :)
