TogetherCrew/airflow-dags

BUG: sqlalchemy conflict between airflow and llama-index

While updating the llama-index library to use its newest features (pipelines, docstore, etc.), we're hitting the following error:

Error!!!: Too old airflow version.

This error is raised because Docker cannot run the gosu command that retrieves the Airflow version. Looking at the logs, the underlying cause appears to be a sqlalchemy version conflict: Apache Airflow requires sqlalchemy <=1.4.49, while the newest llama-index needs a version greater than 2.0. As a result, the airflow service cannot come up and reports this error.
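To make the conflict concrete, here is a minimal sketch that checks whether any sqlalchemy version can satisfy both requirements. The two bounds are the ones quoted in this issue; the helper names (`parse_version`, `satisfies_both`) are hypothetical, and the llama-index bound is treated as inclusive purely as an assumption for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '1.4.49' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Upper bound for Airflow, as quoted in this issue.
AIRFLOW_MAX_SQLALCHEMY = parse_version("1.4.49")
# Lower bound for the newest llama-index, as described in this issue
# (assumption: treated as inclusive here).
LLAMA_INDEX_MIN_SQLALCHEMY = parse_version("2.0")


def satisfies_both(candidate: str) -> bool:
    """Return True only if the candidate version fits both constraints."""
    version = parse_version(candidate)
    return LLAMA_INDEX_MIN_SQLALCHEMY <= version <= AIRFLOW_MAX_SQLALCHEMY


# The lower bound is above the upper bound, so the allowed range is empty.
ranges_overlap = LLAMA_INDEX_MIN_SQLALCHEMY <= AIRFLOW_MAX_SQLALCHEMY
```

Because the required minimum sits above the allowed maximum, no single sqlalchemy install can serve both packages in the same environment.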

To resolve this error, we need to migrate to another vector database that does not tie us to a specific sqlalchemy version.

After researching the options, we found that our best alternative is the Qdrant database, which supports async + metadata filtering (ref: QDrant features).

To update our systems to use the Qdrant database, we have the following tasks:

  • Update docker-compose.yaml and docker-compose.test.yaml to use a stable version of qdrant database instead of pgvector
  • Update CustomIngestionPipeline to use qdrant database for vector-stores and docstore
  • Update discord-etl to assign message id to each llama-index Document
  • Update discord-etl to use the CustomIngestionPipeline
  • Update discord-summary-etl to assign a unique value to each summary document *
  • Update discord-summary-etl to use the CustomIngestionPipeline
  • Update discourse_vector_store ETL to assign a unique value to each document *
  • Update discourse_summary_vector_store ETL to assign a unique value to each document *
  • Update discourse_vector_store ETL to use the CustomIngestionPipeline
  • Update discourse_summary_vector_store ETL to use the CustomIngestionPipeline
  • Update github_vector_store ETL to assign a unique id to each document *
  • Update github_vector_store ETL to use the CustomIngestionPipeline
  • Update GDrive ETL to use the CustomIngestionPipeline

Note *: IDs must be the same across multiple runs, so that the docstore can detect duplicated or updated nodes.
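One way to get IDs that stay the same across runs is to derive them from a natural key of each item instead of generating random ones. The sketch below is only an illustration: the namespace URL, the `stable_doc_id` helper, and the example keys are hypothetical and not part of the current ETL code.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
ETL_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/etl-documents")


def stable_doc_id(source: str, natural_key: str) -> str:
    """Derive a deterministic UUID string from the source name and a natural key.

    The same (source, natural_key) pair yields the same ID on every run,
    while different sources or keys yield different IDs.
    """
    return str(uuid.uuid5(ETL_NAMESPACE, f"{source}:{natural_key}"))


# Example keys (assumptions): a summary date for discord summaries,
# a post id for discourse documents.
summary_id = stable_doc_id("discord-summary", "2024-01-01")
post_id = stable_doc_id("discourse", "post-12345")
```

The resulting string could then be passed as the document ID when building each llama-index Document, so a rerun of the ETL presents the docstore with the same ID for the same item.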

For now, we'll keep the old code that uses pgvector and gradually migrate from pgvector to Qdrant.