WordPress/openverse-catalog

Make Airflow HTTP connection env vars more readable

AetherUnbound opened this issue ยท 6 comments

Problem

We are currently using environment variables to define Airflow connections for local development. This allows us to jettison the database readily while still retaining connection information.

This approach, however, can be quite confusing. HTTP connections require that the connection start with http:// and then URL encode the rest of the string. This can lead to weird situations1 like:

export AIRFLOW_CONN_SLACK='http://https%3a%2f%2fhooks.slack.com%2fservices%2f...'

or this snippet from our existing env file:

# pre-pended https://, e.g. "https://https%3A%2F%2Fhooks.slack.com%2Fservices%2everythingelse")

As @sarayourfriend pointed out in #397, this can be difficult and confusing to interpret initially.

Description

We should find a way to automatically munge the HTTP connection URLs before booting up Airflow. The munging would URL encode the provided URL, then tack https:// onto the front.

Since these variables need to exist in the environment before Airflow boots up, we'll likely need to create a shim in the Docker entrypoint for altering these variables before launching Airflow. We'll also need to be careful not to alter any other connection types, e.g. postgres:// or s3://.

Alternatives

This may be possible with an Airflow plugin, but we'll have to look into that a bit more.

Additional context

Implementation

  • ๐Ÿ™‹ I would be interested in implementing this feature.

Footnotes

  1. https://issues.apache.org/jira/browse/AIRFLOW-2910?focusedCommentId=16832468&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16832468 โ†ฉ

@AetherUnbound would this be a good-first-issue by any chance?

I don't think so, unfortunately. There are a lot of Airflow-specific pieces at play here, and it might be unclear for new contributors how to check that the connections exist (even if creating the munging piece of the entrypoint is straightforward).

I guess I could document that here! Let me write up some steps for that, then we can mark it as a good first issue ๐Ÿ™‚

Turns out it's actually pretty straightforward! For those looking at this issue, the requisite functionality could be added to the entrypoint before running any of the airflow commands: https://github.com/WordPress/openverse-catalog/blob/main/docker/airflow/entrypoint.sh

The way to test that Airflow has appropriately picked up a connection is run just ipython, then the following:

[ins] In [1]: from airflow.providers.http.hooks.http import HttpHook

[ins] In [2]: http = HttpHook(method="POST", http_conn_id="data_refresh")

[ins] In [3]: http.get_conn()
[2022-03-21 21:37:20,091] {base.py:70} INFO - Using connection to: id: data_refresh. Host: http://172.17.0.1:8001, Port: None, Schema: , Login: None, Password: None, extra: {}
Out[3]: <requests.sessions.Session at 0x7fa23d445b50>

[ins] In [4]: http.base_url
Out[4]: 'http://172.17.0.1:8001'

You should see the URL we want to use in the base_url property. Airflow will complain at the get_conn() step if the connection doesn't exist:

[ins] In [6]: http = HttpHook(method="POST", http_conn_id="does_not_exist")

[ins] In [7]: http.get_conn()
---------------------------------------------------------------------------
AirflowNotFoundException                  Traceback (most recent call last)
Input In [7], in <module>
----> 1 http.get_conn()

File ~/.local/lib/python3.9/site-packages/airflow/providers/http/hooks/http.py:68, in HttpHook.get_conn(self, headers)
     65 session = requests.Session()
     67 if self.http_conn_id:
---> 68     conn = self.get_connection(self.http_conn_id)
     70     if conn.host and "://" in conn.host:
     71         self.base_url = conn.host

File ~/.local/lib/python3.9/site-packages/airflow/hooks/base.py:68, in BaseHook.get_connection(cls, conn_id)
     60 """
     61 Get connection, given connection id.
     62 
     63 :param conn_id: connection id
     64 :return: connection
     65 """
     66 from airflow.models.connection import Connection
---> 68 conn = Connection.get_connection_from_secrets(conn_id)
     69 if conn.host:
     70     log.info(
     71         "Using connection to: id: %s. Host: %s, Port: %s, Schema: %s, Login: %s, Password: %s, "
     72         "extra: %s",
   (...)
     79         redact(conn.extra_dejson),
     80     )

File ~/.local/lib/python3.9/site-packages/airflow/models/connection.py:410, in Connection.get_connection_from_secrets(cls, conn_id)
    403     except Exception:  # pylint: disable=broad-except
    404         log.exception(
    405             'Unable to retrieve connection from secrets backend (%s). '
    406             'Checking subsequent secrets backend.',
    407             type(secrets_backend).__name__,
    408         )
--> 410 raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined")

AirflowNotFoundException: The conn_id `does_not_exist` isn't defined

I'd like to volunteer for this one!

Thanks for volunteering @rwidom, it's yours! ๐Ÿ˜„ Let us know if you run into any trouble!