astronomer/ask-astro

Airflow Doc Ingestion Task in Bulk Ingestion Fails Due to Incorrect Python Typing


Describe the bug
The extract_airflow_docs task in the bulk ingestion DAG fails with the following error:

[2024-01-24, 23:49:01 UTC] {html_utils.py:126} INFO - https://airflow.apache.org/docs/apache-airflow-providers-jenkins/stable/connections.html
[2024-01-24, 23:49:01 UTC] {taskinstance.py:2699} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/decorators/base.py", line 242, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dags/ingestion/ask-astro-load.py", line 207, in extract_airflow_docs
    df = airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)[0]
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'set' object is not subscriptable

To Reproduce

  1. Start from a clean local clone of the repo.
  2. Populate the .env file with the required env vars, such as ASK_ASTRO_ENV=local, AIRFLOW_CONN_GITHUB_RO, and AIRFLOW_CONN_WEAVIATE_LOCAL (see the sketch after this list).
  3. MAKE SURE no parquet file is present (otherwise previously scraped data is used and the failing code path is skipped).
  4. Run astro dev start.
  5. Start the bulk ingestion DAG.
  6. Check the task error log.
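
For step 2, a minimal .env could look like the sketch below; the connection values are placeholders, and the exact URI or JSON form depends on how the connections are configured in your deployment:

    ASK_ASTRO_ENV=local
    AIRFLOW_CONN_GITHUB_RO=<read-only GitHub connection URI or JSON>
    AIRFLOW_CONN_WEAVIATE_LOCAL=<local Weaviate connection URI or JSON>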

Expected behavior
Task should run successfully

Screenshots
(screenshot of the failed task in the Airflow UI)

Investigation conclusion

  • The issue can be traced back to PR #237, merged about two weeks ago.
  • The reason it wasn't caught during testing and the prod deploy is, I believe, that a parquet file from previous scrapes was already present, so this part of the code was skipped.
  • The root cause is simple (see the sketch after this list):
    • df = airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)[0] in the DAG indexes into the return value at index 0.
    • However, extract_airflow_docs() calls get_internal_links, which returns a set of URLs.
    • A set is not subscriptable, so the [0] lookup raises the TypeError shown above.
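
A minimal sketch of the failure mode (the function bodies below are simplified stand-ins, not the actual ask-astro code):

    # Simplified stand-ins that mimic the current behaviour.
    def get_internal_links(base_url: str) -> set[str]:
        # Returns a plain set of URL strings.
        return {f"{base_url}/docs/page-{i}.html" for i in range(3)}

    def extract_airflow_docs(docs_base_url: str) -> set[str]:
        # The set is handed straight back to the caller.
        return get_internal_links(docs_base_url)

    df = extract_airflow_docs(docs_base_url="https://airflow.apache.org")[0]
    # TypeError: 'set' object is not subscriptable -- sets are unordered,
    # so positional indexing with [0] is not supported.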

Followup

  • While type casting may make the error go away, it does not solve the underlying problem. The call site expects extract_airflow_docs to return a list of DataFrames, not a set of strings, and a large chunk of logic that goes from the set of URLs to DataFrame content appears to have been deleted.
  • The part of the code that converts the set of URL links into scraped docs as a list of pandas DataFrames is missing; it was most likely deleted by accident in PR #237. The lines in the screenshot below are the missing ones (a rough sketch of that kind of logic follows the screenshot).
    (screenshot of the lines removed in PR #237)
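
As a rough illustration of the kind of URL-set-to-DataFrame conversion that appears to be missing (the helper name, column names, and fetching approach below are assumptions, not the actual deleted lines):

    import pandas as pd
    import requests

    def urls_to_docs_df(links: set[str]) -> list[pd.DataFrame]:
        # Hypothetical reconstruction: fetch each doc page and collect the
        # results into one DataFrame, returned as a one-element list so that
        # the caller's trailing [0] yields a DataFrame again.
        rows = []
        for url in sorted(links):
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            rows.append({"docLink": url, "content": resp.text})
        return [pd.DataFrame(rows)]

Restoring something along these lines (or reverting the lines removed in PR #237) would make extract_airflow_docs hand the DAG task a list of DataFrames again instead of a bare set of URLs.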