This project automates the handling of files that a source system uploads to an S3 bucket. It creates a temporary Delta table from each new file, upserts the data into a target Delta table, and sends a notification when the workflow completes.
- File Arrival: The source system uploads files to a specified S3 bucket.
- Trigger Workflow: An event trigger in Databricks detects the arrival of new files and initiates the workflow.
- Task 1:
- Creates a temporary Delta table from the new file data.
- Moves the processed file to an archive location in the S3 bucket.
- Task 2: Reads data from the temporary Delta table and performs an upsert operation into the target Delta table.
- Completion Notification: Once the workflow job is complete, an email notification is sent to relevant stakeholders.
- AWS Account
- Access to AWS S3
- Databricks Account
- Email service setup for notifications
- Create an S3 bucket for file uploads.
- Set appropriate permissions to allow Databricks to access the bucket.
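
A minimal sketch of the bucket setup with boto3 is shown below. The bucket name, region, and the Databricks IAM role ARN are placeholders; substitute the role that your Databricks workspace actually uses to access S3.

```python
import json
import boto3

bucket_name = "my-file-landing-bucket"          # placeholder bucket name
databricks_role_arn = "arn:aws:iam::123456789012:role/databricks-s3-access"  # placeholder role

s3 = boto3.client("s3", region_name="us-east-1")

# Create the landing bucket (in regions other than us-east-1, add a CreateBucketConfiguration).
s3.create_bucket(Bucket=bucket_name)

# Grant the Databricks role list access on the bucket and read/write access on its objects.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DatabricksListBucket",
            "Effect": "Allow",
            "Principal": {"AWS": databricks_role_arn},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket_name}",
        },
        {
            "Sid": "DatabricksObjectAccess",
            "Effect": "Allow",
            "Principal": {"AWS": databricks_role_arn},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        },
    ],
}
s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(bucket_policy))
```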
- Create a new Databricks workflow.
- Configure the file arrival trigger to initiate the workflow when new files are added to the S3 bucket.
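
One way to define the job and its file arrival trigger programmatically is with the Databricks Python SDK (`databricks-sdk`). The sketch below is an assumption-laden illustration: the notebook paths, cluster id, bucket URL, and job name are placeholders, and field names may vary slightly between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads host/token from the environment or ~/.databrickscfg

# Two-task job: Task 2 runs only after Task 1 succeeds.
created = w.jobs.create(
    name="s3-file-ingestion",
    tasks=[
        jobs.Task(
            task_key="stage_new_file",
            notebook_task=jobs.NotebookTask(notebook_path="/Workflows/task1_stage_file"),
            existing_cluster_id="1234-567890-abcde123",  # placeholder cluster id
        ),
        jobs.Task(
            task_key="upsert_to_target",
            depends_on=[jobs.TaskDependency(task_key="stage_new_file")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workflows/task2_upsert"),
            existing_cluster_id="1234-567890-abcde123",
        ),
    ],
    # File arrival trigger: start a run when new objects land under this prefix.
    trigger=jobs.TriggerSettings(
        file_arrival=jobs.FileArrivalTriggerConfiguration(
            url="s3://my-file-landing-bucket/incoming/"
        ),
        pause_status=jobs.PauseStatus.UNPAUSED,
    ),
)
print(f"Created job {created.job_id}")
```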
- In Task 1, implement the following steps:
- Create a temporary Delta table from the incoming file data.
- Move the original file to the archive location.
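
A minimal sketch of Task 1 as a Databricks notebook cell follows. It assumes the incoming files are CSV with a header row; the S3 paths and table name are placeholders, and `spark` and `dbutils` are provided by the Databricks notebook runtime.

```python
# Task 1: stage the newly arrived file into a temporary Delta table, then archive it.
incoming_path = "s3://my-file-landing-bucket/incoming/"   # placeholder landing prefix
archive_path = "s3://my-file-landing-bucket/archive/"     # placeholder archive prefix

# Read the new file(s); CSV with a header is an assumption about the source format.
df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(incoming_path)
)

# Write the data to a temporary (staging) Delta table, replacing any previous staging run.
df.write.format("delta").mode("overwrite").saveAsTable("staging.temp_incoming")

# Move the processed file(s) out of the landing prefix into the archive location.
for f in dbutils.fs.ls(incoming_path):
    dbutils.fs.mv(f.path, archive_path + f.name)
```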
- In Task 2, read the data from the temporary Delta table created in Task 1 and perform an upsert into the target Delta table. This keeps the final dataset up to date while preserving existing records.
- Read Data from Temporary Delta Table: Use Spark to load the data from the temporary Delta table.
- Perform Upsert Operation: Use the Delta Lake `MERGE` command to update records that already exist in the target table and insert records that are new, as sketched below.
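
The sketch below performs the upsert with the Delta Lake Python API, which is equivalent to a SQL `MERGE`. The table names and the `id` join key are placeholders for whatever key uniquely identifies a record in your data.

```python
from delta.tables import DeltaTable

# Task 2: upsert the staged rows into the target Delta table.
updates = spark.read.table("staging.temp_incoming")            # placeholder staging table
target = DeltaTable.forName(spark, "analytics.target_table")   # placeholder target table

(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")   # match on the business key (placeholder)
    .whenMatchedUpdateAll()                     # update rows that already exist
    .whenNotMatchedInsertAll()                  # insert rows that are new
    .execute()
)
```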
- After the workflow completes, notify stakeholders that the files were processed successfully. Databricks job email notifications can send this message automatically, as sketched below.
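
The sketch below attaches success and failure email notifications to the job via the Databricks Python SDK; the job id and recipient addresses are placeholders, and, as before, exact field names may differ across SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Add success/failure email notifications to the existing job (job_id is a placeholder).
w.jobs.update(
    job_id=123456789,
    new_settings=jobs.JobSettings(
        email_notifications=jobs.JobEmailNotifications(
            on_success=["data-team@example.com"],
            on_failure=["data-team@example.com", "oncall@example.com"],
        )
    ),
)
```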