
ETL for Bayesian Networks


Description

An ETL process for Bayesian Networks using Apache Airflow, Apache Kafka, and HDFS.

Apache Airflow is integrated in order to take advantage of its scheduler, so that the following steps run on a schedule (a sketch of one possible DAG wiring follows the list):

1. Load the dataset from the local inputPath directory
2. Apply some transformations to the dataset (cleansing, adding a partition_id column)
3. Write some statistics about the dataset to the HDFS directory
4. Load the dataset from the HDFS directory and write it to Apache Kafka
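
As an illustration, the four steps above could be wired as one Airflow DAG with four chained tasks. This is a minimal sketch only: the dag_id etl_bn, the task ids, and the empty task bodies are assumptions for illustration, not the repository's actual code.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def load_dataset(**context):
        """Step 1: read the raw dataset from the local inputPath directory."""
        # The path comes from the DAG run configuration (see Configuration below).
        input_path = context["dag_run"].conf["inputPath"]


    def transform_dataset(**context):
        """Step 2: cleanse the data and add a partition_id column."""


    def write_stats_to_hdfs(**context):
        """Step 3: compute dataset statistics and write them to HDFS."""


    def hdfs_to_kafka(**context):
        """Step 4: read the dataset back from HDFS and publish it to Apache Kafka."""


    with DAG(
        dag_id="etl_bn",  # assumed DAG id, for illustration only
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        steps = [
            PythonOperator(task_id=fn.__name__, python_callable=fn)
            for fn in (load_dataset, transform_dataset, write_stats_to_hdfs, hdfs_to_kafka)
        ]
        # Chain the four tasks so they always run in order.
        steps[0] >> steps[1] >> steps[2] >> steps[3]

Airflow passes the runtime context (including dag_run.conf) into each callable, which is how the tasks would pick up settings such as inputPath.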

Architecture

[Architecture diagram]

Configuration

The configuration of a DAG run can be passed either via the --conf argument of the Airflow CLI or through the Airflow UI:

Hadoop => host, port
Kafka => topic, servers, numPartitions
Local => inputPath

    {
        "inputPath": "path_to_local_dir",
        "host": "host_url",
        "port": "port_number",
        "topic": "topic_name",
        "servers": "bootstrap_servers",
        "numPartitions": "partitions"
    }

Requirements