DB migrations are timing out after 1 second
Closed this issue · 2 comments
Checks
- I have checked for existing issues.
- This report is about the User-Community Airflow Helm Chart.
Chart Version
8.8.0
Kubernetes Version
Client Version: v1.29.0
Server Version: v1.28.6-eks-508b6b3
Helm Version
version.BuildInfo{Version:"v3.13.3", GitCommit:"c8b948945e52abba22ff885446a1486cb5fd3474", GitTreeState:"clean", GoVersion:"go1.21.5"}
Description
We're trying to upgrade Airflow from version 1.10.12 to 2.7.3.
Locally, on a Minikube cluster and a local PostgreSQL database, the upgrade works as expected.
However, when trying to deploy it in a remote K8s cluster, connected to an AWS RDS database (PostgreSQL 16.2), the deployment does not work as the database migrations are timing out after 1 second.
After taking a look at the code, we could see that the check_migrations timeout is set to 1 by default.
We find it odd that no one has raised this issue before, since the User-Community Airflow Chart does not allow us to configure this timeout value, unlike the official chart, where we can define images.migrationsWaitTimeout.
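To make the "1 second" behaviour concrete, here is a simplified model of a polling check of this kind. It is a sketch inferred from the log output in this report, not Airflow's actual source; wait_for_heads, get_db_heads and the injectable sleep parameter are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Set


def wait_for_heads(
    get_db_heads: Callable[[], Set[str]],
    source_heads: Set[str],
    timeout: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll once per second until the DB heads match the source heads."""
    db_heads: Set[str] = set()
    for _ in range(timeout):
        db_heads = get_db_heads()
        if db_heads == source_heads:
            return  # nothing left to apply
        sleep(1)  # the database session sits idle during this pause
    raise TimeoutError(
        f"There are still unapplied migrations after {timeout} seconds. "
        f"DB heads: {db_heads} | source heads: {source_heads}"
    )
```

With a timeout of 1, this model compares the heads once, pauses for one second, and then raises TimeoutError if they differ. In such a design the timeout acts as a "migrations are pending" signal rather than a failure in its own right.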
We've also tried configuring properties: "?sslmode=require" in the externalDatabase configs, but the same issue occurs.
The issue doesn't seem to be related to the database connection, as the check-db step runs correctly, and check_migrations correctly fetches the latest applied migration (da3f683c3a5a).
Can anyone help us understand this issue?
Relevant Logs
/home/airflow/.local/lib/python3.8/site-packages/airflow/config_templates/airflow_local_settings.py:193 DeprecationWarning: The remote_logging option in [core] has been moved to the remote_logging option in [logging] - the old setting has been used, but please update your config.
/home/airflow/.local/lib/python3.8/site-packages/airflow/config_templates/airflow_local_settings.py:206 DeprecationWarning: The remote_base_log_folder option in [core] has been moved to the remote_base_log_folder option in [logging] - the old setting has been used, but please update your config.
[2024-03-21T17:56:47.266+0000] {db.py:798} INFO - Waiting for migrations... 0 second(s)
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 822, in _configured_alembic_environment
yield env
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 799, in check_migrations
raise TimeoutError(
TimeoutError: There are still unapplied migrations after 1 seconds. MigrationHead(s) in DB: {'da3f683c3a5a'} | Migration Head(s) in Source Code: {'405de8318b3a'}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1062, in _rollback_impl
self.engine.dialect.do_rollback(self.connection)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
dbapi_connection.rollback()
psycopg2.OperationalError: SSL connection has been closed unexpectedly
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/scripts/db_migrations.py", line 78, in <module>
main(sync_forever=True)
File "/mnt/scripts/db_migrations.py", line 52, in main
if needs_db_migrations():
File "/mnt/scripts/db_migrations.py", line 34, in needs_db_migrations
check_migrations(1)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 799, in check_migrations
raise TimeoutError(
File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 822, in _configured_alembic_environment
yield env
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 219, in __exit__
self.close()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/future/engine.py", line 246, in close
super(Connection, self).close()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1238, in close
self._transaction.close()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2426, in close
self._do_close()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2649, in _do_close
self._close_impl()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2635, in _close_impl
self._connection_rollback_impl()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2627, in _connection_rollback_impl
self.connection._rollback_impl()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1064, in _rollback_impl
self._handle_dbapi_exception(e, None, None, None, None)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
util.raise_(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1062, in _rollback_impl
self.engine.dialect.do_rollback(self.connection)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
dbapi_connection.rollback()
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL connection has been closed unexpectedly
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Custom Helm Values
airflow:
  dbMigrations:
    enabled: true
externalDatabase:
  type: "postgres"
  host: "<our host>"
  port: "<our port>"
  database: "<our database>"
  user: "<our user>"
  passwordSecret: "<our password secret>"
  passwordSecretKey: "<our password secret key>"
@hpereira98 it's not timing out after 1 second; the relevant error is psycopg2.OperationalError: SSL connection has been closed unexpectedly.
This indicates that your RDS instance is closing the connection for some reason.
After looking online, it's probably related to a lack of resources on the RDS instance, or some other configuration error like what this person on Reddit found (related to an invalid init_query).
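The traceback shape (a TimeoutError, then "During handling of the above exception, another exception occurred") can be reproduced with a small standalone sketch. It assumes the caller treats TimeoutError as "migrations needed", which is an assumption about the script rather than something shown in the logs; session, ConnectionClosed and run_check are hypothetical names, not Airflow or SQLAlchemy APIs.

```python
from contextlib import contextmanager


class ConnectionClosed(Exception):
    """Stand-in for the driver error raised when the server drops the link."""


@contextmanager
def session(server_closed_connection: bool):
    try:
        yield
    finally:
        # Cleanup on exit: a rollback needs a live connection.
        if server_closed_connection:
            raise ConnectionClosed("SSL connection has been closed unexpectedly")


def run_check(server_closed_connection: bool) -> str:
    try:
        with session(server_closed_connection):
            raise TimeoutError("There are still unapplied migrations")
    except TimeoutError:
        return "needs migrations"  # the expected, harmless path
    except ConnectionClosed as exc:
        # The cleanup error replaces the TimeoutError, which survives
        # only as the chained context.
        return f"crashed: {exc} (context: {type(exc.__context__).__name__})"
```

When the connection is healthy, the TimeoutError is caught and handled. When the server has already closed the connection, the rollback during cleanup fails and a different exception type escapes, so the handler for TimeoutError never runs.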
Yeah, this was actually an issue with our RDS database, where we had a parameter group setting idle_in_transaction_session_timeout to a value under 1s.
Thanks for your help!
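For anyone hitting the same symptom, a minimal sketch for checking whether the setting is short enough to cut off a one-second idle pause. It takes the text returned by running SHOW idle_in_transaction_session_timeout; in psql. The helper names are hypothetical, and the one-second pause is an assumption based on the logs above.

```python
import re

# Milliseconds per unit; PostgreSQL treats a bare number as milliseconds
# for this setting, and 0 disables the timeout.
_UNIT_MS = {"us": 0.001, "ms": 1, "s": 1000, "min": 60_000,
            "h": 3_600_000, "d": 86_400_000}


def timeout_ms(show_value: str) -> float:
    """Convert a value such as '500ms', '1s' or '0' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(us|ms|s|min|h|d)?\s*", show_value)
    if match is None:
        raise ValueError(f"unrecognised duration: {show_value!r}")
    number, unit = int(match.group(1)), match.group(2) or "ms"
    return number * _UNIT_MS[unit]


def kills_idle_pause(show_value: str, pause_seconds: float = 1.0) -> bool:
    """True if the server may end a session that idles for pause_seconds."""
    limit = timeout_ms(show_value)
    # A limit equal to the pause is treated as risky, since it is a race.
    return 0 < limit <= pause_seconds * 1000
```

For example, kills_idle_pause("500ms") is True, while "0" (disabled) and "1min" are both False.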