DataEngineeringProjects

This repo consists of data engineering projects.

Primary Language: Python

Airflow DAG: podcast_summary2

(Pipeline diagram)

This Airflow DAG, named podcast_summary2, extracts, transforms, and loads podcast episodes from the Marketplace podcast feed into a SQLite database. It also downloads each episode's audio file to a specified directory.

Requirements

  • Airflow
  • requests library
  • xmltodict library
  • SQLite

DAG Configuration

  • dag_id: podcast_summary2
  • description: podcasts
  • start_date: March 15, 2023
  • schedule_interval: Daily
  • catchup: False

Tasks

create_table_sqlite

This task creates a table named episodes in the SQLite database, with columns link, title, filename, published, and description.
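The table creation can be sketched with the standard library's sqlite3 module; column names are taken from the description above, and the primary-key choice of link reflects how load_episodes later detects duplicates (an in-memory database stands in for the real file):

```python
import sqlite3

# Schema matching the columns described above; `link` is used later
# to check whether an episode is already stored.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS episodes (
    link TEXT PRIMARY KEY,
    title TEXT,
    filename TEXT,
    published TEXT,
    description TEXT
);
"""

conn = sqlite3.connect(":memory:")  # the DAG uses episodes.db on disk
conn.execute(CREATE_SQL)

# Inspect the resulting schema.
cols = [row[1] for row in conn.execute("PRAGMA table_info(episodes)")]
```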

get_episodes

This task retrieves the XML from the Marketplace podcast feed and parses it to extract the podcast episodes, which are returned as a list.
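A minimal sketch of the parsing step. The DAG itself fetches the feed with requests and parses it with xmltodict; the same extraction logic with the standard library's ElementTree, applied to a small stand-in feed, looks like:

```python
import xml.etree.ElementTree as ET

def parse_episodes(xml_text):
    """Extract <item> entries from an RSS feed and return them as dicts."""
    root = ET.fromstring(xml_text)
    episodes = []
    for item in root.iter("item"):
        episodes.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "description": item.findtext("description"),
        })
    return episodes

# Tiny stand-in feed; the real DAG downloads the Marketplace feed.
sample = """<rss><channel>
<item><title>Ep 1</title><link>https://example.com/ep1</link>
<pubDate>Wed, 15 Mar 2023 00:00:00 +0000</pubDate>
<description>First</description></item>
</channel></rss>"""

episodes = parse_episodes(sample)
```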

load_episodes

This task loads new episodes into the SQLite database. It checks whether an episode already exists by comparing its link and only inserts episodes it has not seen before. It also generates a filename for each episode and stores it in the filename column.
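Outside Airflow, the insert-only-new logic might look like the sketch below. The filename scheme shown (last path segment of the link plus ".mp3") is an assumption for illustration, not necessarily what the DAG does:

```python
import sqlite3

def load_episodes(conn, episodes):
    """Insert episodes whose link is not yet stored; return the new ones."""
    stored = {row[0] for row in conn.execute("SELECT link FROM episodes")}
    new = []
    for ep in episodes:
        if ep["link"] in stored:
            continue  # already in the database, skip
        # Hypothetical filename scheme: last segment of the link + ".mp3"
        filename = ep["link"].rstrip("/").split("/")[-1] + ".mp3"
        conn.execute(
            "INSERT INTO episodes (link, title, filename, published, description) "
            "VALUES (?, ?, ?, ?, ?)",
            (ep["link"], ep["title"], filename, ep["pubDate"], ep["description"]),
        )
        new.append(ep)
    conn.commit()
    return new

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (link TEXT PRIMARY KEY, title TEXT, "
             "filename TEXT, published TEXT, description TEXT)")
ep = {"link": "https://example.com/ep1", "title": "Ep 1",
      "pubDate": "Wed, 15 Mar 2023", "description": "First"}
first = load_episodes(conn, [ep])   # inserted
second = load_episodes(conn, [ep])  # skipped: link already present
```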

download_episodes

This task downloads the audio files of the podcast episodes to the specified directory. It iterates over the list of episodes, and downloads the audio file for each episode if it does not already exist in the specified directory.
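The skip-if-present check per episode can be sketched as below, using the stdlib urllib in place of requests; the audio URL argument is an assumption (in a real RSS feed it would come from the episode's enclosure tag):

```python
import os
import tempfile
import urllib.request

def download_episode(audio_url, dest_path):
    """Download the audio file unless it already exists; return True if fetched."""
    if os.path.exists(dest_path):
        return False  # already downloaded, skip
    with urllib.request.urlopen(audio_url) as resp, open(dest_path, "wb") as f:
        f.write(resp.read())
    return True

# Demonstrate only the skip branch, without touching the network.
path = os.path.join(tempfile.gettempdir(), "ep1.mp3")
open(path, "wb").close()  # pretend this episode was downloaded earlier
fetched = download_episode("https://example.com/ep1.mp3", path)
```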

Note

This DAG assumes that an Airflow connection named podcasts has been created for the SQLite database, and that the database file lives at ~/airflow/dags/episodes.db. Downloaded audio files are stored in ~/airflow/dags/episodes/. You may need to adjust these paths in the code to match your environment.