This repo includes examples to demonstrate how you can create an iceberg supported Athena table and interact with data in S3 via Athena ACID Transactions.
-
Create S3 buckets to host the incoming data. For the demo, we will be creating 2 buckets, ie "homedetailsin" and "homedetailsall" buckets. The homedetailsin bucket will have transactional data thats coming from a database for example. The homedetailsall bucket will hold all the changes for the incoming data is accumulated.
-
Create upload the sample csv file to homedetailsin bucket.
-
Create a database in Glue Data Catalog.
-
Create tables in Athena.
--Create Athena Tables
CREATE EXTERNAL TABLE homedetailsin
(
home_id bigint,
home_name string,
value bigint,
tour_date date,
status string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'S3://homedetailsin/'
--Create Athena Tables with Iceberg format
CREATE TABLE homedetailsall
(
home_id bigint,
home_name string,
value bigint,
tour_date date,
status string
)
PARTITIONED BY (day(tour_date))
LOCATION 's3://homedetailsall/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
- Run a SELECT query to retrive the data from S3 via Athena to ensure you can retrive the data
--query the data in S3 via Athena for homedetailsin
SELECT * FROM homedetailsin
- Insert into homedetailsall table from homedetailsin to populate some data
--populate data to homedetailsall from homedetailsin
INSERT INTO homedetailsall
SELECT * FROM homedetailsin
- Run a SELECT query to retrive the data from S3 via Athena to ensure you can retrive the data
--query the data in S3 via Athena for homedetailsall
SELECT * FROM homedetailsall
-
Explore the S3 bucket homedetailsall to see the partitions that were created and the file in parquet format
-
Update the table homedetailsall
UPDATE homedetailsall
SET VALUE=100000 WHERE home_id=1
--update the homedatsilsall table
UPDATE homedetailsall
SET VALUE=200000 WHERE home_id=1
- Time travel with snapshots Helps us see how data has changed over a period
SELECT * FROM "homedetailsall$history"
The result of the query shows how data has changed over a period of time using Snapshots(snapshot_id for ref)
If we query using a snapshot_id, we can see the data as of that snapshot_id
SELECT * FROM homedetailsall for VERSION AS of <snapshot_id here>
-
Update the csv file simulating changes from the upstream databases & upload the file to S3 homedetailsin bucket.
-
Merge the data to reflect the new updates from homedetailsin under homedetailsall
--MERGE
MERGE INTO homedetailsall hda USING homedetailsin hdi
ON (hda.home_id = hdi.home_id)
WHEN MATCHED
THEN UPDATE SET home_name = hdi.home_name, value = hdi.value, tour_date = hdi.tour_date, status = hdi.status
WHEN NOT MATCHED
THEN INSERT (home_id,home_name,value,tour_date,status)
VALUES(hdi.home_id,hdi.home_name,hdi.value,hdi.tour_date,hdi.status)