airbnb/streamalert

[bug] Changing type(s) in a log schema will break historical search against data using old schema

chunyong-lin opened this issue · 0 comments

Background

If you have historical search enabled and file_format is set to parquet, bad news: changing the type(s) in a log schema breaks historical search. Older partitions keep the column types they were created with, so searching across all partitions of a table whose schema we changed fails with a HIVE_PARTITION_SCHEMA_MISMATCH error.

For example, if we change the type of the following timestamp field from float to string, the existing partitions of the carbonblack_alert_watchlist_hit_feedsearch_bin table will no longer be searchable:

"timestamp": "float",

If we never changed a schema, happy life! Unfortunately, that is not the reality 😢
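A type change like the one above can be caught before it is deployed by diffing the old and new schema definitions. The following is only a minimal sketch, assuming schemas are plain dicts that map field names to type names (with nested dicts for nested fields). The helper name find_type_changes and the host field are hypothetical and not part of StreamAlert.

```python
def find_type_changes(old_schema, new_schema, prefix=""):
    """Return (field_path, old_type, new_type) for every field whose type changed."""
    changes = []
    for key, old_type in old_schema.items():
        if key not in new_schema:
            continue  # removed fields are out of scope for this check
        new_type = new_schema[key]
        path = f"{prefix}{key}"
        if isinstance(old_type, dict) and isinstance(new_type, dict):
            # Recurse into nested schemas, building a dotted field path
            changes.extend(find_type_changes(old_type, new_type, prefix=f"{path}."))
        elif old_type != new_type:
            changes.append((path, old_type, new_type))
    return changes


# "host" is a made-up field, included only to show an unchanged column
old = {"timestamp": "float", "host": "string"}
new = {"timestamp": "string", "host": "string"}
print(find_type_changes(old, new))  # [('timestamp', 'float', 'string')]
```

Any non-empty result means existing parquet partitions for that table would no longer match the table schema.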

Desired Change

A couple of things we can improve:

  1. Standardize everything on string
    String has a larger memory footprint, but it is the most permissive to future changes.

  2. Have a script that can fix this quickly
    The script should drop the target table(s), rebuild them using the new schemas, and recreate the partitions. It may also need to fix the underlying data (which might be hard).

  3. Or other solutions we haven't thought about.
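
As a rough illustration of options 1 and 2, the sketch below converts a flat schema to all-string types and generates the Athena statements a rebuild script might run. Everything here is an assumption for illustration: the function names, the type mapping, the dt partition column, and the S3 location are hypothetical placeholders, and nested schemas are not handled. Submitting the statements to Athena is left out.

```python
# Assumed mapping from schema type names to Athena column types (hypothetical)
ATHENA_TYPES = {
    "string": "string",
    "float": "double",
    "integer": "bigint",
    "boolean": "boolean",
}


def stringify_schema(schema):
    """Option 1: return a copy of a flat schema with every field typed as string."""
    return {name: "string" for name in schema}


def rebuild_statements(table, schema, location, partition_col="dt"):
    """Option 2: build the statements that drop, recreate, and repartition a table."""
    columns = ",\n  ".join(
        f"`{name}` {ATHENA_TYPES[type_]}" for name, type_ in schema.items()
    )
    create = (
        f"CREATE EXTERNAL TABLE {table} (\n  {columns}\n)\n"
        f"PARTITIONED BY (`{partition_col}` string)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )
    return [
        f"DROP TABLE IF EXISTS {table}",
        create,
        # Re-registers Hive-style partitions (key=value paths) found under the location
        f"MSCK REPAIR TABLE {table}",
    ]


schema = stringify_schema({"timestamp": "float"})
statements = rebuild_statements(
    "carbonblack_alert_watchlist_hit_feedsearch_bin",
    schema,
    "s3://example-bucket/parquet/carbonblack_alert_watchlist_hit_feedsearch_bin/",  # placeholder
)
for statement in statements:
    print(statement + ";\n")
```

This only regenerates table and partition metadata. Parquet files written under the old schema still contain the old types, so queries may keep failing until the underlying data is rewritten, which is the hard part mentioned in item 2.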