airflow-ml : ML and Pre-Processing Automation

This is a workflow manager that uses Airflow to automate all the Machine Learning and Data Pre-processing pipelines in our system. Some of the most important steps in our pipeline are:

  • Fetching and indexing raw data and entities stored in the bucket
  • Economic News Detection (to be done)
  • Text Extraction - hitting the news URL and fetching the article text
  • Cleaning, indexing and preprocessing - probably storing the result in a relational database
  • Sentiment Analysis
  • Fuzzy Matching
  • Entity Extraction
  • Custom Scoring Model (BERT)

Development

Airflow needs persistent storage for the details of all executions and pipelines. Right now we run the postgres Docker image as a temporary solution; this needs to be replaced with a production-ready database.
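As a rough sketch, pointing Airflow at an external Postgres only requires overriding the metastore connection string (host, credentials and database name below are placeholders, not our actual setup):

    # Point the metastore at a production Postgres instead of the temporary container
    # (connection values are placeholders)
    export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow_user:secret@db.example.internal:5432/airflow
    airflow initdb   # run the migrations against the new database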

When setting up Airflow for the first time, we need to run the migrations defined in docker-compose so that the tables can be set up (see the sketch after the list below).

  • Set $AIRFLOW_HOME to the repository root: export AIRFLOW_HOME=$(pwd)
  • airflow.cfg shouldn't be copied
  • Remove the default airflow.cfg while setting up locally
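A minimal first-time setup sequence, assuming you are in the repository root and docker-compose defines the postgres and initdb services mentioned below:

    # First-time local setup (sketch)
    export AIRFLOW_HOME=$(pwd)        # keep Airflow state inside the repo
    rm -f airflow.cfg unittests.cfg   # drop any default config Airflow generated
    docker-compose up -d postgres     # start the temporary metastore
    docker-compose up initdb          # run the migrations / create the tables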

Testing Locally

  • airflow webserver
  • airflow backfill indexing -s 2019-12-30

  • airflow test indexing test_entities 2020-01-01
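A quick local iteration loop looks roughly like this (the indexing DAG and test_entities task come from the commands above; the rest are standard Airflow 1.x CLI calls):

    airflow list_dags                                        # confirm the DAG is picked up
    airflow list_tasks indexing --tree                       # inspect task dependencies
    airflow test indexing test_entities 2020-01-01           # run a single task without the scheduler
    airflow backfill indexing -s 2019-12-30 -e 2019-12-31    # replay a date range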

Migrations: Run only the first time for setting up tables

docker-compose up initdb

docker-compose up --build

  • Delete .data and run docker-compose up --build postgres
  • docker-compose up --build initdb
  • Make sure there is no local airflow.cfg or unittests.cfg
  • User Auth - enable RBAC and point Airflow at the metastore via environment variables:

        AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
        AIRFLOW__CORE__LOAD_EXAMPLES=False
        AIRFLOW__WEBSERVER__RBAC=True

  • Disable auto-start: docker update --restart=no rey_worker_1 rey_scheduler_1 rey_webserver_1 rey_flower_1 rey_redis_1 rey_postgres_1
  • If you mess up the db

    • airflow resetdb
    • docker exec -it rey_scheduler_1 /entrypoint.sh bash
  • Creating a New User

    airflow create_user -r Admin -u dev -e adarsh@alrt.ai -f username -l s -p password

  • Configuring Fernet Key for Production

    Fetch the Fernet key and update initdb to make it work properly (using docker exec):

        export FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
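    A hedged sketch of getting the generated key into the running containers; the container names match the docker update command above, and AIRFLOW__CORE__FERNET_KEY is the standard environment override (the exact wiring into initdb may differ):

        # Generate a Fernet key and expose it to the Airflow containers (sketch)
        FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
        echo "AIRFLOW__CORE__FERNET_KEY=$FERNET_KEY" >> .env   # assuming docker-compose passes this env_file to the services
        docker exec rey_webserver_1 env | grep FERNET          # verify the key reached the container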

Distributing Workers

MetaStore: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
BrokerURL: redis://redis:6379/1

For running workers on different nodes, connect to the metastore db and the redis queue by configuring the docker-compose with POSTGRES_HOST and REDIS_HOST. Also remove the dependency on the local scheduler.
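A rough sketch of what a remote worker node needs, assuming the metastore and broker above are reachable from it (the host placeholders correspond to POSTGRES_HOST and REDIS_HOST; the env var names are the standard CeleryExecutor settings):

    # On the worker node: point the Celery worker at the shared metastore and broker
    export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@<POSTGRES_HOST>:5432/airflow
    export AIRFLOW__CELERY__BROKER_URL=redis://<REDIS_HOST>:6379/1
    export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@<POSTGRES_HOST>:5432/airflow
    airflow worker    # start a Celery worker that pulls tasks from the shared queue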

Nginx configuration

  • delete /etc/nginx/sites-enabled/default
  • create /etc/nginx/conf.d/reverse-proxies.conf
  • add an A record for airflow
  •       server {
                listen 80 default_server;
                listen [::]:80 default_server;
                server_name airflow.alrt.ai;

                location / {
                      proxy_pass http://localhost:8080/;
                      proxy_http_version 1.1;
                      proxy_set_header Upgrade $http_upgrade;
                      proxy_set_header Connection 'upgrade';
                      proxy_set_header Host $host;
                      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                      proxy_cache_bypass $http_upgrade;
                      proxy_set_header X-Forwarded-Proto $scheme;
                }
          }
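After dropping the server block into reverse-proxies.conf, validating and reloading nginx is the usual last step (assuming nginx runs under systemd on the host):

    sudo nginx -t                    # check the new config for syntax errors
    sudo systemctl reload nginx      # apply it without dropping connections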



## References

* [Configuring Airflow in docker-compose](https://medium.com/@xnuinside/quick-guide-how-to-run-apache-airflow-cluster-in-docker-compose-615eb8abd67a)
* [Best Practices](https://gtoonstra.github.io/etl-with-airflow/principles.html)
* [Config](https://github.com/kjam/data-pipelines-course/issues/1)
* [Fernet Key and Stuff](https://medium.com/@itunpredictable/apache-airflow-on-docker-for-complete-beginners-cf76cf7b2c9a)