[Quality] Delay object creation until runtime
AetherUnbound opened this issue · 4 comments
Current Situation
Related to #199 - we have a number of objects (DelayedRequester
, MediaStore
, etc.) that are created at import time rather than runtime. This has a negative performance impact on DAG processing time. The DAG parsing does not need to have these object available, and initializing them every time the parsing is done can unnecessarily increase parsing time. This can be seen from the logging that's done during DAG parsing:
$ airflow dags list -v
[2021-10-07 21:37:40,207] {dagbag.py:496} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2021-10-07 21:37:40,242] {media.py:61} INFO - Initialized image MediaStore with provider finnishmuseums
[2021-10-07 21:37:40,242] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,243] {media.py:180} INFO - Output path: /tmp/finnishmuseums_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,248] {media.py:61} INFO - Initialized image MediaStore with provider smithsonian
[2021-10-07 21:37:40,248] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,248] {media.py:180} INFO - Output path: /tmp/smithsonian_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,257] {media.py:61} INFO - Initialized image MediaStore with provider europeana
[2021-10-07 21:37:40,257] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,257] {media.py:180} INFO - Output path: /tmp/europeana_image_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized audio MediaStore with provider wikimedia_audio
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,316] {media.py:180} INFO - Output path: /tmp/wikimedia_audio_audio_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized image MediaStore with provider wikimedia
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,317] {media.py:180} INFO - Output path: /tmp/wikimedia_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,327] {media.py:61} INFO - Initialized image MediaStore with provider rawpixel
[2021-10-07 21:37:40,327] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,327] {media.py:180} INFO - Output path: /tmp/rawpixel_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,329] {media.py:61} INFO - Initialized image MediaStore with provider flickr
[2021-10-07 21:37:40,329] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,329] {media.py:180} INFO - Output path: /tmp/flickr_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,330] {media.py:61} INFO - Initialized image MediaStore with provider museumsvictoria
[2021-10-07 21:37:40,330] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,331] {media.py:180} INFO - Output path: /tmp/museumsvictoria_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,351] {media.py:61} INFO - Initialized image MediaStore with provider stocksnap
[2021-10-07 21:37:40,351] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,352] {media.py:180} INFO - Output path: /tmp/stocksnap_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,494] {media.py:61} INFO - Initialized image MediaStore with provider met
[2021-10-07 21:37:40,494] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,495] {media.py:180} INFO - Output path: /tmp/met_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,508] {media.py:61} INFO - Initialized image MediaStore with provider statensmuseum
[2021-10-07 21:37:40,508] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,509] {media.py:180} INFO - Output path: /tmp/statensmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,519] {media.py:61} INFO - Initialized image MediaStore with provider nypl
[2021-10-07 21:37:40,519] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,520] {media.py:180} INFO - Output path: /tmp/nypl_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,522] {media.py:61} INFO - Initialized image MediaStore with provider phylopic
[2021-10-07 21:37:40,522] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,522] {media.py:180} INFO - Output path: /tmp/phylopic_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,526] {media.py:61} INFO - Initialized image MediaStore with provider sciencemuseum
[2021-10-07 21:37:40,526] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,526] {media.py:180} INFO - Output path: /tmp/sciencemuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,529] {media.py:61} INFO - Initialized audio MediaStore with provider jamendo
[2021-10-07 21:37:40,529] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,529] {media.py:180} INFO - Output path: /tmp/jamendo_audio_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,531] {media.py:61} INFO - Initialized image MediaStore with provider clevelandmuseum
[2021-10-07 21:37:40,531] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,531] {media.py:180} INFO - Output path: /tmp/clevelandmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 15, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,536] {media.py:61} INFO - Initialized image MediaStore with provider waltersartmuseum
[2021-10-07 21:37:40,536] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,536] {media.py:180} INFO - Output path: /tmp/waltersartmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 27, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,543] {media.py:61} INFO - Initialized image MediaStore with provider brooklynmuseum
[2021-10-07 21:37:40,543] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,543] {media.py:180} INFO - Output path: /tmp/brooklynmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
dag_id | filepath | owner | paused
==========================================+==============================================+================+=======
airflow_log_cleanup | airflow_log_cleanup_workflow.py | data-eng-admin | True
brooklyn_museum_workflow | brooklyn_museum_workflow.py | data-eng-admin | True
check_new_smithsonian_unit_codes_workflow | check_new_smithsonian_unit_codes_workflow.py | data-eng-admin | True
cleveland_museum_workflow | cleveland_museum_workflow.py | data-eng-admin | True
europeana_ingestion_workflow | europeana_ingestion_workflow.py | data-eng-admin | True
europeana_sub_provider_update_workflow | europeana_sub_provider_update_workflow.py | data-eng-admin | True
europeana_workflow | europeana_workflow.py | data-eng-admin | True
finnish_museums_workflow | finnish_museums_workflow.py | data-eng-admin | True
flickr_ingestion_workflow | flickr_ingestion_workflow.py | data-eng-admin | True
flickr_sub_provider_update_workflow | flickr_sub_provider_update_workflow.py | data-eng-admin | True
flickr_workflow | flickr_workflow.py | data-eng-admin | True
healthcheck_workflow | healthcheck_workflow.py | data-eng-admin | True
image_expiration_workflow | image_expiration_workflow.py | data-eng-admin | True
jamendo_workflow | jamendo_workflow.py | data-eng-admin | True
metropolitan_museum_workflow | metropolitan_museum_workflow.py | data-eng-admin | True
museum_victoria_workflow | museum_victoria_workflow.py | data-eng-admin | True
nypl_workflow | nypl_workflow.py | data-eng-admin | True
phylopic_workflow | phylopic_workflow.py | data-eng-admin | True
postgres_image_cleaner | cleaner_workflow.py | data-eng-admin | True
rawpixel_workflow | rawpixel_workflow.py | data-eng-admin | True
recreate_audio_popularity_calculation | recreate_audio_popularity_calculation.py | data-eng-admin | True
recreate_image_popularity_calculation | recreate_image_popularity_calculation.py | data-eng-admin | True
refresh_all_audio_popularity_data | refresh_all_audio_popularity_data.py | data-eng-admin | True
refresh_all_image_popularity_data | refresh_all_image_popularity_data.py | data-eng-admin | True
refresh_audio_view_data | refresh_audio_view_data.py | data-eng-admin | True
refresh_image_view_data | refresh_image_view_data.py | data-eng-admin | True
science_museum_workflow | science_museum_workflow.py | data-eng-admin | True
smithsonian_sub_provider_update_workflow | smithsonian_sub_provider_update_workflow.py | data-eng-admin | True
smithsonian_workflow | smithsonian_workflow.py | data-eng-admin | True
staten_museum_workflow | statens_museum_workflow.py | data-eng-admin | True
stocksnap_workflow | stocksnap_workflow.py | data-eng-admin | True
tsv_to_postgres_loader | loader_workflow.py | data-eng-admin | True
tsv_to_postgres_loader_overwrite | loader_workflow.py | data-eng-admin | True
walters_workflow | walters_workflow.py | data-eng-admin | True
wikimedia_commons_workflow | wikimedia_workflow.py | data-eng-admin | True
wikimedia_ingestion_workflow | wikimedia_ingestion_workflow.py | data-eng-admin | True
This is happening at every DAG parse interval, which by default is 30 seconds.
Suggested Improvement
Initialize these objects in the main
functions of each provider, and pass them into other functions as necessary.
Benefit
Decreased DAG processing time, CPU & memory usage.
Additional context
Implementation
- 🙋 I would be interested in implementing this feature.
I'm thinking about how to tackle this one today because it affects WordPress/openverse#1703. After taking a look at some of the code, I think we have 3 approaches:
1) Use a deferred init
function
With this approach, we would still initialize the DelayedRequester
and <Media>Store
at the top of the file, but none of the actual initialization would occur at that point. We would then call a .init()
function on each within the main
function in the provider_api_scripts
module. That would allow us to continue to reference the requester & store as module-level variables within the API script while still deferring any heavy lifting until runtime. Tests would be affected by this, so we may need to add a fixture to the tests which automatically init
s the requester & store.
2) Create the objects in main
Here we would move the instantiation into the main
function and out of the module scope. This would require significant changes to the function signatures (as they now have to pass along structures which were previously accessible at the module level) and the tests along with them. This approach seems the most "conventional".
3) Convert the API scripts into objects rather than collections of functions
Almost every one of our API scripts initializes a DelayedRequester
and a <Media>Store
(if not several). There's also a fair amount of similar machinery among the API scripts that could be collapsed/refactored as part of this effort. All of the shared/common pieces would go into a ProviderAPIRetreiver
(or something) class definition that would be sub-classed and overridden as necessary per-provider. This would be the largest undertaking of the three. It also has the risk of obfuscating some of the code - while code reduction is generally considered good, accessibility to community contributions is still a high priority for this piece of the project. By moving some pieces into a common class definition, we might make it harder for contributors to understand what's going on and thus how to contribute a new provider.
The reason I bring this up is that we're likely going to be adding a new Requester
class as part of WordPress/openverse#1703 that has even more overhead on initialization, which we'd want to avoid doing at DAG parse time.
Of the three, which do you think is best? Have I missed any considerations as part of this?
I think I'm inclined to actually go the object route - for something like WordPress/openverse#1697, it would significantly reduce the number of changes we needed to make if all providers had shared functionality inherited from a base class.
Sorry I meant to reply earlier, but lost this issue.
We have discussed converting the provider API scripts into objects: #163 (comment)
I think it would be a great solution for this problem, and, provided that we give clear instructions and a template, would not make it significantly harder for new contributors to understand.
Incidentally, a solution is also described here: https://make.wordpress.org/openverse/handbook/openverse-handbook/openverse-catalog/community-involvement/#technical-changes-making-contribution-easier