Edge node Knowledge mining

Environment and DB Setup

cp .env.example .env

Install Python environment

  1. We recommend using pyenv to install Python 3.11 and pin it to the app's directory so it doesn't clash with other Python versions on your machine
    pyenv install 3.11.7 && pyenv local 3.11.7
  2. Once Python is available (check with python --version), create and activate a virtual environment to install the requirements into
    python -m venv .venv && source .venv/bin/activate
  3. Install Python requirements
    pip install -r requirements.txt
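
Before installing the requirements, it can help to confirm that the interpreter on your PATH is the one you expect. The snippet below is a minimal sketch, not part of the project: the function name check_environment is hypothetical, and it only checks the Python minor version and whether a virtual environment is active.

```python
import sys


def check_environment(required=(3, 11)):
    """Return a list of human-readable problems with the current interpreter."""
    problems = []

    # Compare only major.minor, so any 3.11.x patch release passes.
    current = tuple(sys.version_info[:2])
    if current != tuple(required):
        problems.append(
            f"expected Python {required[0]}.{required[1]}, "
            f"found {current[0]}.{current[1]}"
        )

    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base installation.
    if sys.prefix == sys.base_prefix:
        problems.append("no virtual environment is active")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running it from the activated .venv should print nothing; any warning suggests repeating the steps above.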

Apache Airflow setup

Airflow pipelines are part of the Knowledge mining service and are used to build automated data processing pipelines. Their main purpose is to create content for Knowledge assets from the input file.

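Generate default Airflow config

airflow config list --defaults

This command prints the default configuration to stdout. The Airflow config file itself is located at ~/airflow/airflow.cfg; if it does not exist yet, redirect the output there to create it.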

Change the following lines in the config:

load_examples = False
dags_folder = YOUR_PATH_TO/edge-node-knowledge-mining/dags
parallelism = 32
max_active_tasks_per_dag = 16
max_active_runs_per_dag = 16
enable_xcom_pickling = True
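
If you would rather script these changes than edit the file by hand, the following is a rough sketch using Python's standard configparser. It assumes all six options belong to the [core] section of airflow.cfg; the names CORE_OVERRIDES and apply_core_overrides are hypothetical. Note that configparser discards comments when it rewrites a config, so keep a backup of the original file.

```python
import configparser
import io

# Values copied from the list above; replace YOUR_PATH_TO with your checkout path.
CORE_OVERRIDES = {
    "load_examples": "False",
    "dags_folder": "YOUR_PATH_TO/edge-node-knowledge-mining/dags",
    "parallelism": "32",
    "max_active_tasks_per_dag": "16",
    "max_active_runs_per_dag": "16",
    "enable_xcom_pickling": "True",
}


def apply_core_overrides(cfg_text, overrides=CORE_OVERRIDES):
    """Return cfg_text with the [core] options replaced by the given overrides."""
    # interpolation=None leaves '%' characters in existing values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)

    if not parser.has_section("core"):
        parser.add_section("core")
    for key, value in overrides.items():
        parser.set("core", key, value)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

A possible way to use it: read ~/airflow/airflow.cfg with pathlib.Path.read_text(), pass the text to apply_core_overrides, and write the result back with write_text().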

Airflow db init

airflow db init

airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

Airflow scheduler

To get Airflow running, start the scheduler first. It picks up new DAGs and jobs:

airflow scheduler

Unpause DAGs

airflow dags unpause exampleDAG
airflow dags unpause pdf_to_jsonld
airflow dags unpause simple_json_to_jsonld

Airflow webserver

To monitor how your pipelines perform, start the Airflow webserver. The dashboard will be available at http://localhost:8080. Once everything is running, the pipelines should be listed as un-paused on http://localhost:8080/home.

Start the Airflow webserver

airflow webserver --port 8080
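
To confirm that the webserver and scheduler are both up, you can query the webserver's health endpoint. The sketch below assumes an Airflow 2.x webserver, which exposes /health and returns a JSON object with one entry per component; the helper names fetch_health and summarize_health are hypothetical.

```python
import json
from urllib.request import urlopen


def summarize_health(payload):
    """Map each component in a /health payload to its reported status string."""
    return {
        name: info.get("status", "unknown")
        for name, info in payload.items()
        if isinstance(info, dict)
    }


def fetch_health(base_url="http://localhost:8080"):
    """Fetch and decode the JSON health document from a running webserver."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for component, status in summarize_health(fetch_health()).items():
        print(f"{component}: {status}")
```

If the scheduler entry does not report a healthy status, check that the airflow scheduler process from the previous step is still running.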

Start server for Edge node Knowledge mining

python app.py

MySQL for logging

CREATE DATABASE `ka-mining-api-logging` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Examples (make sure to pass the session cookie returned by the auth service's /login method)

    curl -X POST http://localhost:5005/trigger_pipeline \
    -F "file=@test_pdfs/22pages_eng.pdf" \
    -F "pipelineId=pdf_to_jsonld" \
    -F "fileFormat=pdf" \
    -b "connect.sid=s%3A9XCAe7sos-iY4Z_jIjyVcQYjLaYHVi0H.UeghM8ZRS97nVkZPukbL8Zu%2F%2BbRZSAuOLpq3BMepiD0"

    curl -X POST http://localhost:5005/trigger_pipeline \
    -F "file=@test_jsons/entertainment_test.json" \
    -F "pipelineId=simple_json_to_jsonld" \
    -F "fileFormat=json" \
    -b "connect.sid=s%3Aw_26GwYGj1rLvXpGPBQW0M_mQxrfbVMW.jZazIh0iv01R7TiOxmF0WKFjlKTi7rWhZJe1M24E21E"

Trigger the vectorization DAG via POST request

curl -X POST http://localhost:5005/trigger_pipeline \
     -F "file=@test_jsonlds/vectorize_test.json" \
     -F "pipelineId=vectorize_ka" \
     -b "connect.sid=s%3AjLYArFLH7IadiB4dkEDrppgEEQJEqNss.35WzNEW3PySPRIxrDpL5tsRZ%2F%2B%2FNo%2BnZgRPDoRz0y7g"
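
The same requests can be sent from Python. The following is a minimal sketch using only the standard library: it builds the multipart/form-data body by hand and attaches the connect.sid cookie. The function name build_trigger_request and its parameters are hypothetical; the URL and the field names (file, pipelineId, fileFormat) are taken from the curl examples above, and the response format is not specified here.

```python
import uuid
from urllib.request import Request


def build_trigger_request(filename, file_bytes, pipeline_id, session_cookie,
                          file_format=None,
                          url="http://localhost:5005/trigger_pipeline"):
    """Build a POST request equivalent to the curl -F / -b examples."""
    boundary = uuid.uuid4().hex

    # Plain form fields; fileFormat is optional, as in the vectorize_ka example.
    fields = {"pipelineId": pipeline_id}
    if file_format:
        fields["fileFormat"] = file_format

    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )

    # The uploaded file goes in the "file" field, matching -F "file=@...".
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        # Pass the cookie value exactly as returned by the auth service.
        "Cookie": f"connect.sid={session_cookie}",
    }
    return Request(url, data=b"".join(parts), headers=headers, method="POST")
```

To send it, read the input file with pathlib.Path("test_jsonlds/vectorize_test.json").read_bytes(), build the request with pipeline_id="vectorize_ka", and pass it to urllib.request.urlopen.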