spark-elt-jobs

Examples of Spark ELT jobs scheduled with Airflow

Primary language: Python · License: Apache-2.0

ELT Spark/Hadoop Jobs

The repository provides several generic jobs, built on Spark and the Hadoop Common libraries, that perform typical ELT tasks:

  • CheckFileExists.scala: checks that the specified files exist at a DFS location. It returns exit code 0 if the files exist and exit code 99 if they do not; the Airflow BashOperator then uses its skip exit code setting to mark the current task as skipped (see the DAG sketch below).
  • FileToFile.scala: copies files from one DFS location to another.
  • FileToDataset.scala: loads data from DFS files into a Spark-managed format (for example, Parquet) using the Spark batch API.
  • FileStreamToDataset.scala: does the same as the previous job, but uses the Spark Streaming API.
  • CheckDataRecieved.scala: checks that all required data exists at the specified locations.

See execution examples in the Airflow DAGs folder.
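Below is a minimal sketch of what such a DAG could look like. The jar path, job arguments, and task names are illustrative assumptions, not the repository's actual configuration:

```python
# Hypothetical Airflow DAG sketch: schedule the Scala jobs via BashOperator.
# The jar location, class arguments, and HDFS paths are assumptions for
# illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

SPARK_SUBMIT = "spark-submit --master yarn --deploy-mode cluster"
JOBS_JAR = "/opt/jobs/spark-elt-jobs.jar"  # hypothetical artifact location

with DAG(
    dag_id="file_to_dataset_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Exit code 99 from CheckFileExists marks this task as skipped instead of
    # failed. Recent Airflow versions expose this as skip_on_exit_code
    # (older releases call the parameter skip_exit_code).
    check_files = BashOperator(
        task_id="check_file_exists",
        bash_command=(
            f"{SPARK_SUBMIT} --class CheckFileExists {JOBS_JAR} "
            "--path hdfs:///landing/events/{{ ds }}/"
        ),
        skip_on_exit_code=99,
    )

    # Load the checked files into a Spark-managed dataset (e.g. Parquet).
    load_dataset = BashOperator(
        task_id="file_to_dataset",
        bash_command=(
            f"{SPARK_SUBMIT} --class FileToDataset {JOBS_JAR} "
            "--input hdfs:///landing/events/{{ ds }}/ "
            "--output hdfs:///warehouse/events/ --format parquet"
        ),
    )

    check_files >> load_dataset
```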

Supported input/output formats:

  • CSV, JSON, Parquet
  • Delta Lake
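
The jobs themselves are written in Scala, but as a rough illustration of how these formats map onto the Spark API, here is a minimal PySpark sketch. The paths and options are hypothetical, and Delta Lake output additionally requires the delta-spark dependency:

```python
# Minimal PySpark sketch of the supported input/output formats.
# Paths are hypothetical; Delta Lake needs the delta-spark package and the
# Delta SQL extensions configured on the Spark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("format-example")
    # Delta Lake configuration (requires the delta-spark dependency)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Read CSV and JSON inputs from DFS
csv_df = spark.read.option("header", "true").csv("hdfs:///landing/events_csv/")
json_df = spark.read.json("hdfs:///landing/events_json/")

# Write Parquet and Delta Lake outputs
csv_df.write.mode("overwrite").parquet("hdfs:///warehouse/events_parquet/")
json_df.write.mode("overwrite").format("delta").save("hdfs:///warehouse/events_delta/")
```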