verdict-project/verdict

STRUCT support in Verdict for scrambles

Opened this issue · 5 comments

Hi guys,

One of our tables has recently started receiving data in the form of a struct (array / row).

For example:

{city=Jackson, state=WY, zip=83001, county=Teton, msa=null, country=US} 

{city=Cheyenne, state=WY, zip=82001, county=Laramie, msa=null, country=US}

{city=Gillette, state=WY, zip=82718, county=Campbell, msa=null, country=US}

I was wondering how Verdict builds its scrambles from this kind of data. Is this a data structure you actively support? Can each of the nested fields be aggregated quickly?

For example:

SELECT count(distinct(Location.city)) from table

Our scramble performance has dropped significantly, but we aren't sure whether this is related.

VerdictDB should just work. One possible reason for the slowdown is that columnar formats may not be very efficient for such data types.

If you can load sample data into the cluster, we may be able to test them.

@dongyoungy Can you ask someone to investigate this by comparing different compression formats for our scramble tables? Maybe we can try different file formats (e.g., ORC or Parquet) with different compression schemes.

I'm unsure about the internals, but yes, I agree that structs in a columnar format are probably not ideal. They seem to be the preferred way in BigQuery (where this data originated). We are considering flattening them out as a last resort, but we would prefer to get some information on exactly how Verdict handles this before we do anything drastic :)
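
For reference, here is a minimal sketch of what "flattening them out" could look like. It is written in plain Python so it does not depend on any particular engine. The `Location` column name comes from the query above; the `flatten_struct` helper and the `_` separator are assumptions made for illustration. It only shows the reshaping and says nothing about how Verdict handles structs internally.

```python
from typing import Any, Dict


def flatten_struct(row: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested struct fields into top-level columns (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for key, value in row.items():
        # Prefix nested field names with the parent column name.
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_struct(value, name, sep))
        else:
            flat[name] = value
    return flat


# The three sample records from this issue, wrapped in a `Location` struct column.
rows = [
    {"Location": {"city": "Jackson", "state": "WY", "zip": "83001",
                  "county": "Teton", "msa": None, "country": "US"}},
    {"Location": {"city": "Cheyenne", "state": "WY", "zip": "82001",
                  "county": "Laramie", "msa": None, "country": "US"}},
    {"Location": {"city": "Gillette", "state": "WY", "zip": "82718",
                  "county": "Campbell", "msa": None, "country": "US"}},
]

flat_rows = [flatten_struct(r) for r in rows]

# Equivalent of count(distinct(Location.city)) on the flattened column.
distinct_cities = {r["Location_city"] for r in flat_rows}
```

After flattening, each nested field becomes an ordinary column such as `Location_city`, so the distinct count runs over a plain string column rather than a struct field.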

@dongyoungy Can you ask @Beastjoe to investigate this issue? I see two related problems:

  1. Performance when the table contains array or struct columns
  2. Possible performance degradation when samples keep being appended