Airflow Doc Ingestion Task in Bulk Ingestion Fails Due to Incorrect Python Typing
Closed this issue · 1 comment
davidgxue commented
Describe the bug
The `extract_airflow_docs` task in the bulk ingestion DAG fails with the following message:
[2024-01-24, 23:49:01 UTC] {html_utils.py:126} INFO - https://airflow.apache.org/docs/apache-airflow-providers-jenkins/stable/connections.html
[2024-01-24, 23:49:01 UTC] {taskinstance.py:2699} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
result = execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/airflow/decorators/base.py", line 242, in execute
return_value = super().execute(context)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
return_value = self.execute_callable()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dags/ingestion/ask-astro-load.py", line 207, in extract_airflow_docs
df = airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)[0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'set' object is not subscriptable
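The error itself is plain Python: the call site indexes the return value with `[0]`, but the function now returns a `set`, and sets do not support indexing. A two-line illustration:

```python
urls = {"https://airflow.apache.org/docs/a.html"}  # a set, like the one returned upstream
urls[0]  # raises TypeError: 'set' object is not subscriptable
```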
To Reproduce
- Start with a clean repo locally.
- Populate the `.env` file with the proper env vars, such as ASK_ASTRO_ENV=local, AIRFLOW_CONN_GITHUB_RO, and AIRFLOW_CONN_WEAVIATE_LOCAL (see the sample sketch after these steps).
- MAKE SURE no parquet file is present!!
- Run `astro dev start`.
- Start the bulk ingestion DAG.
- Check the task error log.
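For reference, a minimal `.env` might look like the sketch below. The connection values are placeholders, and the exact format depends on your Airflow version and local setup; Airflow 2.3+ accepts JSON-encoded connection env vars like these.

```
# Hedged sketch of a minimal .env; values are placeholders, not real credentials.
ASK_ASTRO_ENV=local
AIRFLOW_CONN_GITHUB_RO='{"conn_type": "github", "password": "<your_github_token>"}'
AIRFLOW_CONN_WEAVIATE_LOCAL='{"conn_type": "weaviate", "host": "<your_local_weaviate_url>"}'
```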
Expected behavior
The task should run successfully.
davidgxue commented
Investigation conclusion
- The issue can be traced back to PR #237 from 2 weeks ago.
- The reason it wasn't caught during testing and the prod deploy, I believe, is that we already had a parquet file present from previous scrapes, so this part of the code is skipped.
- Root cause is simple: per the traceback, `extract_airflow_docs` now returns a set of URL strings, and the calling task in `ask-astro-load.py` indexes that return value with `[0]`, which fails because sets are not subscriptable.
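To make this concrete, here is a sketch of the call site's behavior; `airflow_docs` and `airflow_docs_base_url` are the names from the repo's DAG file per the traceback, and the comments reflect the observed failure:

```python
# Per the traceback, this call now returns a set of URL strings.
links = airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)

df = links[0]        # current code: TypeError, sets are not subscriptable
df = list(links)[0]  # naive cast: no exception, but df is now an arbitrary URL
                     # string rather than the DataFrame the task expects
```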
Followup
- While type casting may silence the error, it does not solve the problem. It seems like the call to `extract_airflow_docs` should receive a list of DataFrames, not a set of strings. There seems to be a large chunk of deleted logic that goes from a set of URLs to DataFrame content.
- It seems the part of the code that converts a set of URL links into scraped docs as a list of pandas DataFrames is missing. It was likely deleted by accident in PR #237. The lines in the screenshot below are the ones missing; a rough sketch of that kind of conversion follows.
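For illustration only, here is a hypothetical sketch (not the repo's actual code) of the kind of URL-to-DataFrame conversion that appears to have been deleted; the function name, libraries, and column names are all assumptions:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def urls_to_dataframes(urls: set[str]) -> list[pd.DataFrame]:
    """Scrape each doc URL and return the results as a one-element list of DataFrames."""
    rows = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Illustrative columns only; the real ingest schema may differ.
        rows.append({"docLink": url, "content": soup.get_text(" ", strip=True)})
    return [pd.DataFrame(rows)]


# With something like this restored, the DAG task's indexing works again:
# df = urls_to_dataframes(url_set)[0]
```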