Apache-Airflow-Beam-TensorFlow-Examples

Various examples for TensorFlow Extended using Apache Beam and Airflow to create End2End Pipelines for Machine Learning.


Preface

The goal of this repository is to show what is possible with End2End Machine Learning pipelines for Computer Vision problems.

This repository explores the possibilities of TensorFlow Extended in combination with Apache Airflow and Apache Beam. To do that, several pipelines will be created and presented, which will run as different DAGs (Directed Acyclic Graphs). While a basic example with CSV data is shown, the focus lies on Computer Vision tasks. Therefore at least an Image Classification and a Semantic Segmentation pipeline will be presented.

Since I only have a Windows computer, I will run everything on the Ubuntu 20.04 WSL (Windows Subsystem for Linux) and will write a guide for a proper setup. With the right setup, every example and code file should be executable on any Windows (and of course Linux) computer.

Main Library Features

  • Introduction on how to set up Apache Beam and Airflow on Windows
  • Multiple end2end pipelines built on top of TensorFlow, Apache Beam and Apache Airflow

Table of Contents

Examples

All current example DAGs are listed here. This list will be updated over time, while the installation and setup process should stay the same.

For more details about the examples, look at this README.md.

  • Classification DAG: an Image Classification example for classifying 6 different classes. The Utils files can be found here.
  • Segmentation DAG: a Semantic Segmentation example for segmenting up to 12 different classes. The Utils files can be found here.

Installation and Setup

To get the repository running, just make sure the following requirements are met.

Requirements

  1. Python 3.8
  2. tensorflow >= 2.3.0
  3. tfx == 0.24.0
  4. apache-beam == 2.24.0
  5. apache-airflow[celery] == 1.10.12
  6. psycopg2 == 2.8.6
  7. tensorflow_advanced_segmentation_models
  8. albumentations
  9. numpy
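For convenience, the version pins above can be collected into a requirements.txt. This is a sketch assembled from the list, not a file shipped by the repository; in particular, the exact PyPI package names (e.g. tensorflow-advanced-segmentation-models) should be verified before installing.

```shell
# Hypothetical requirements.txt assembled from the version pins above;
# verify the package names against PyPI before installing.
cat > requirements.txt <<'EOF'
tensorflow>=2.3.0
tfx==0.24.0
apache-beam==2.24.0
apache-airflow[celery]==1.10.12
psycopg2==2.8.6
tensorflow-advanced-segmentation-models
albumentations
numpy
EOF
# Then, inside a Python 3.8 environment:
# pip install -r requirements.txt
```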

Then execute the following command to clone the git repository.

Clone Repository

$ git clone https://github.com/JanMarcelKezmann/Apache-Airflow-Beam-TensorFlow-Examples.git

Setup Ubuntu and configure Airflow

Take a look at the Markdown file to get a detailed setup tutorial for Ubuntu on Windows and for the correct configuration of Airflow.

The setup process for Ubuntu and Airflow is heavily based on the Medium article written by Ryan Roline. Its main differences are the Python version and the installation of apache-airflow including the Celery package. I therefore recommend reading the full article if problems occur with the steps mentioned below, but be sure to use the correct versions and Ubuntu instance for the setup. The URL reference can be found at the end of the README.

Once you are finished setting up Airflow and its dependencies, you can run a pipeline as explained below: open your browser and go to localhost:8080 in a new tab. A local page showing the current DAGs should load. All of your DAGs located in the "dags_folder" configured above should appear there (as long as the DAG pipeline code contains no bugs).

Run a Pipeline

This is the example procedure to run one of the DAGs in the repository's dags folder.

Everything should run, but in order to make it work, either convert the data in the Image Classification directory by running the following commands:

cd /mnt/c/dags/classification_pipeline/
python3 convert_data_to_tfrecord.py

This transforms the small sample of the original dataset into a TFRecord file. Alternatively, add image data yourself to that directory and convert it to a TFRecord by applying small changes to the "convert_data_to_tfrecord.py" file.
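If you adapt the conversion to your own images, the core of writing a TFRecord looks roughly like the sketch below. The feature keys ("image_raw", "label") and function names are assumptions for illustration, not necessarily what the repository's "convert_data_to_tfrecord.py" uses.

```python
# Hypothetical sketch of an image-to-TFRecord conversion; the feature keys
# "image_raw" and "label" are assumptions, not taken from the repository.
import tensorflow as tf

def _bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value: int) -> tf.train.Feature:
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def image_to_example(image_bytes: bytes, label: int) -> tf.train.Example:
    """Wrap one encoded image and its integer class label in a tf.train.Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image_raw": _bytes_feature(image_bytes),
        "label": _int64_feature(label),
    }))

def write_tfrecord(samples, out_path: str) -> None:
    """Write an iterable of (image_bytes, label) pairs to one TFRecord file."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for image_bytes, label in samples:
            writer.write(image_to_example(image_bytes, label).SerializeToString())
```

The written file can then be consumed downstream with tf.data.TFRecordDataset, which is the format the TFX pipeline components expect as input.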

Once the data is set up, you can continue with the following steps:

Steps:

  1. Go to the directory where you cloned the repository
  2. Copy the file "classification_dag.py" to the dags folder you configured above
  3. Copy the folder "classification_pipeline" in to the dags folder you configured above
  4. Open a Ubuntu CLI (Command Line Interface) and run the following two commands:

     airflow initdb
     airflow webserver -p 8080

  5. Open another Ubuntu CLI and run:

     airflow scheduler

  6. Open another Ubuntu CLI and run:

     airflow worker

  7. Run the DAG:
    1. First Method: Using a web browser
      1. Go into your web browser of choice and enter: localhost:8080
      2. Click on the Play button next to the DAG under Links
      3. Click Trigger
    2. Second Method: Using the CLI
      1. Open another CLI
      2. Run:

         airflow trigger_dag classification_dag
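The CLI commands from the steps above can also be bundled into a small helper script. This is a convenience sketch, assuming Airflow 1.10.x is on your PATH; the webserver, scheduler and worker are backgrounded here instead of using separate CLIs.

```shell
# Sketch: collect the Airflow 1.10 commands from the steps above in one script.
cat > run_classification_dag.sh <<'EOF'
#!/bin/bash
set -e
airflow initdb                  # initialize the Airflow metadata database
airflow webserver -p 8080 &     # serve the UI on localhost:8080
airflow scheduler &             # schedule DAG runs
airflow worker &                # Celery worker that executes the tasks
sleep 30                        # give the services time to come up
airflow trigger_dag classification_dag
EOF
chmod +x run_classification_dag.sh
```

Running separate terminals as described above remains the easier option for debugging, since each service's log output stays in its own window.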
        

Finished!

The DAG is now running. You can take a closer look by clicking on the name of your DAG to see the details.

Citing

@misc{Kezmann:2020,
  Author = {Jan-Marcel Kezmann},
  Title = {Apache Airflow Beam TensorFlow Examples},
  Year = {2020},
  Publisher = {GitHub},
  Journal = {GitHub repository},
  Howpublished = {\url{https://github.com/JanMarcelKezmann/Apache-Airflow-Beam-TensorFlow-Examples}}
}

License

Project is distributed under MIT License.

References