/hive-scd-examples

How to manage Slowly Changing Dimensions with Apache Hive


Managing Slowly Changing Dimensions (SCDs) with Apache Hive

This project provides sample datasets and scripts that demonstrate how to manage Slowly Changing Dimensions (SCDs) with Apache Hive's ACID MERGE capabilities. ACID MERGE applies all updates atomically, ensures that readers see either all of the updates or none of them, and handles failure scenarios, so application developers do not have to build these guarantees themselves.
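As a rough illustration of what a single atomic MERGE looks like, here is a minimal Type 1 (overwrite in place) sketch. The table names `contacts_target` and `contacts_update_stage` and the columns `id`, `name`, `email`, and `state` are assumptions made for this example and may not match the scripts in this repository. The target must be a transactional (ACID) table; older Hive releases may also require it to be bucketed.

```sql
-- Hypothetical target table; names and columns are assumptions for illustration.
-- MERGE requires a transactional ORC table (older Hive versions may also need
-- a CLUSTERED BY ... INTO n BUCKETS clause).
CREATE TABLE IF NOT EXISTS contacts_target (
  id    INT,
  name  STRING,
  email STRING,
  state STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Type 1: overwrite changed attributes, insert rows that are new.
-- contacts_update_stage is assumed to hold the later data dump, one row per id.
MERGE INTO contacts_target AS t
USING contacts_update_stage AS s
ON t.id = s.id
WHEN MATCHED AND (t.email <> s.email OR t.state <> s.state) THEN
  UPDATE SET email = s.email, state = s.state
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.name, s.email, s.state);
```

Because the update and the insert happen in one statement, a reader querying `contacts_target` sees the table either before or after the merge, never halfway through it.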

Also included is data that simulates a full data dump from a source system, followed by another data dump taken later.

The objective is to merge the two data dumps using different slowly changing dimension strategies.

These examples cover Type 1, Type 2 and Type 3 updates.
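To make the difference between the types concrete, the following is a hedged sketch of a Type 2 merge, which keeps history by closing the current row and inserting a new version. It assumes a hypothetical target table `contacts_target_t2` with columns `id`, `name`, `email`, `state`, `valid_from`, and `valid_to` (where `valid_to IS NULL` marks the current version) and a staging table `contacts_update_stage` with one row per `id`. The scripts in this repository may be organized differently.

```sql
-- Type 2 sketch under the assumptions stated above (all names are hypothetical).
MERGE INTO contacts_target_t2 AS t
USING (
  -- Every staged row, keyed by id so it can match the current target row.
  SELECT s.id AS join_key, s.id, s.name, s.email, s.state
  FROM contacts_update_stage s
  UNION ALL
  -- One extra copy of each changed row with a NULL key. A NULL key never
  -- matches, so this copy falls through to the INSERT branch and becomes
  -- the new current version.
  SELECT CAST(NULL AS INT) AS join_key, s.id, s.name, s.email, s.state
  FROM contacts_update_stage s
  JOIN contacts_target_t2 c ON s.id = c.id
  WHERE c.valid_to IS NULL
    AND (s.email <> c.email OR s.state <> c.state)
) sub
ON sub.join_key = t.id
-- Close out the current version of a row whose attributes changed.
WHEN MATCHED AND t.valid_to IS NULL
             AND (sub.email <> t.email OR sub.state <> t.state) THEN
  UPDATE SET valid_to = current_date()
-- Insert brand-new ids and the new versions of changed rows.
WHEN NOT MATCHED THEN
  INSERT VALUES (sub.id, sub.name, sub.email, sub.state, current_date(), NULL);
```

By comparison, a Type 1 merge would simply overwrite `email` and `state` in place, and a Type 3 merge would keep the prior value in an additional column on the same row rather than adding a new row.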

Procedure

SCD Strategies

Requirements

Instructions

  • Clone this repository onto your Hadoop cluster
  • Run load_data.sh to stage data into HDFS
  • From Hive CLI or beeline, run hive_type1_scd.sql, hive_type2_scd.sql and hive_type3_scd.sql