WordPress/openverse-catalog

[Quality] Delay object creation until runtime

AetherUnbound opened this issue · 4 comments

Current Situation

Related to #199 - we have a number of objects (DelayedRequester, MediaStore, etc.) that are created at import time rather than runtime. This has a negative performance impact on DAG processing time. The DAG parsing does not need to have these object available, and initializing them every time the parsing is done can unnecessarily increase parsing time. This can be seen from the logging that's done during DAG parsing:

$ airflow dags list -v
[2021-10-07 21:37:40,207] {dagbag.py:496} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2021-10-07 21:37:40,242] {media.py:61} INFO - Initialized image MediaStore with provider finnishmuseums
[2021-10-07 21:37:40,242] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,243] {media.py:180} INFO - Output path: /tmp/finnishmuseums_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,248] {media.py:61} INFO - Initialized image MediaStore with provider smithsonian
[2021-10-07 21:37:40,248] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,248] {media.py:180} INFO - Output path: /tmp/smithsonian_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,257] {media.py:61} INFO - Initialized image MediaStore with provider europeana
[2021-10-07 21:37:40,257] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,257] {media.py:180} INFO - Output path: /tmp/europeana_image_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized audio MediaStore with provider wikimedia_audio
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,316] {media.py:180} INFO - Output path: /tmp/wikimedia_audio_audio_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized image MediaStore with provider wikimedia
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,317] {media.py:180} INFO - Output path: /tmp/wikimedia_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,327] {media.py:61} INFO - Initialized image MediaStore with provider rawpixel
[2021-10-07 21:37:40,327] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,327] {media.py:180} INFO - Output path: /tmp/rawpixel_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,329] {media.py:61} INFO - Initialized image MediaStore with provider flickr
[2021-10-07 21:37:40,329] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,329] {media.py:180} INFO - Output path: /tmp/flickr_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,330] {media.py:61} INFO - Initialized image MediaStore with provider museumsvictoria
[2021-10-07 21:37:40,330] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,331] {media.py:180} INFO - Output path: /tmp/museumsvictoria_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,351] {media.py:61} INFO - Initialized image MediaStore with provider stocksnap
[2021-10-07 21:37:40,351] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,352] {media.py:180} INFO - Output path: /tmp/stocksnap_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,494] {media.py:61} INFO - Initialized image MediaStore with provider met
[2021-10-07 21:37:40,494] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,495] {media.py:180} INFO - Output path: /tmp/met_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,508] {media.py:61} INFO - Initialized image MediaStore with provider statensmuseum
[2021-10-07 21:37:40,508] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,509] {media.py:180} INFO - Output path: /tmp/statensmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,519] {media.py:61} INFO - Initialized image MediaStore with provider nypl
[2021-10-07 21:37:40,519] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,520] {media.py:180} INFO - Output path: /tmp/nypl_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,522] {media.py:61} INFO - Initialized image MediaStore with provider phylopic
[2021-10-07 21:37:40,522] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,522] {media.py:180} INFO - Output path: /tmp/phylopic_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,526] {media.py:61} INFO - Initialized image MediaStore with provider sciencemuseum
[2021-10-07 21:37:40,526] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,526] {media.py:180} INFO - Output path: /tmp/sciencemuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,529] {media.py:61} INFO - Initialized audio MediaStore with provider jamendo
[2021-10-07 21:37:40,529] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,529] {media.py:180} INFO - Output path: /tmp/jamendo_audio_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,531] {media.py:61} INFO - Initialized image MediaStore with provider clevelandmuseum
[2021-10-07 21:37:40,531] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,531] {media.py:180} INFO - Output path: /tmp/clevelandmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 15, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,536] {media.py:61} INFO - Initialized image MediaStore with provider waltersartmuseum
[2021-10-07 21:37:40,536] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,536] {media.py:180} INFO - Output path: /tmp/waltersartmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 27, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,543] {media.py:61} INFO - Initialized image MediaStore with provider brooklynmuseum
[2021-10-07 21:37:40,543] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,543] {media.py:180} INFO - Output path: /tmp/brooklynmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
dag_id                                    | filepath                                     | owner          | paused
==========================================+==============================================+================+=======
airflow_log_cleanup                       | airflow_log_cleanup_workflow.py              | data-eng-admin | True  
brooklyn_museum_workflow                  | brooklyn_museum_workflow.py                  | data-eng-admin | True  
check_new_smithsonian_unit_codes_workflow | check_new_smithsonian_unit_codes_workflow.py | data-eng-admin | True  
cleveland_museum_workflow                 | cleveland_museum_workflow.py                 | data-eng-admin | True  
europeana_ingestion_workflow              | europeana_ingestion_workflow.py              | data-eng-admin | True  
europeana_sub_provider_update_workflow    | europeana_sub_provider_update_workflow.py    | data-eng-admin | True  
europeana_workflow                        | europeana_workflow.py                        | data-eng-admin | True  
finnish_museums_workflow                  | finnish_museums_workflow.py                  | data-eng-admin | True  
flickr_ingestion_workflow                 | flickr_ingestion_workflow.py                 | data-eng-admin | True  
flickr_sub_provider_update_workflow       | flickr_sub_provider_update_workflow.py       | data-eng-admin | True  
flickr_workflow                           | flickr_workflow.py                           | data-eng-admin | True  
healthcheck_workflow                      | healthcheck_workflow.py                      | data-eng-admin | True  
image_expiration_workflow                 | image_expiration_workflow.py                 | data-eng-admin | True  
jamendo_workflow                          | jamendo_workflow.py                          | data-eng-admin | True  
metropolitan_museum_workflow              | metropolitan_museum_workflow.py              | data-eng-admin | True  
museum_victoria_workflow                  | museum_victoria_workflow.py                  | data-eng-admin | True  
nypl_workflow                             | nypl_workflow.py                             | data-eng-admin | True  
phylopic_workflow                         | phylopic_workflow.py                         | data-eng-admin | True  
postgres_image_cleaner                    | cleaner_workflow.py                          | data-eng-admin | True  
rawpixel_workflow                         | rawpixel_workflow.py                         | data-eng-admin | True  
recreate_audio_popularity_calculation     | recreate_audio_popularity_calculation.py     | data-eng-admin | True  
recreate_image_popularity_calculation     | recreate_image_popularity_calculation.py     | data-eng-admin | True  
refresh_all_audio_popularity_data         | refresh_all_audio_popularity_data.py         | data-eng-admin | True  
refresh_all_image_popularity_data         | refresh_all_image_popularity_data.py         | data-eng-admin | True  
refresh_audio_view_data                   | refresh_audio_view_data.py                   | data-eng-admin | True  
refresh_image_view_data                   | refresh_image_view_data.py                   | data-eng-admin | True  
science_museum_workflow                   | science_museum_workflow.py                   | data-eng-admin | True  
smithsonian_sub_provider_update_workflow  | smithsonian_sub_provider_update_workflow.py  | data-eng-admin | True  
smithsonian_workflow                      | smithsonian_workflow.py                      | data-eng-admin | True  
staten_museum_workflow                    | statens_museum_workflow.py                   | data-eng-admin | True  
stocksnap_workflow                        | stocksnap_workflow.py                        | data-eng-admin | True  
tsv_to_postgres_loader                    | loader_workflow.py                           | data-eng-admin | True  
tsv_to_postgres_loader_overwrite          | loader_workflow.py                           | data-eng-admin | True  
walters_workflow                          | walters_workflow.py                          | data-eng-admin | True  
wikimedia_commons_workflow                | wikimedia_workflow.py                        | data-eng-admin | True  
wikimedia_ingestion_workflow              | wikimedia_ingestion_workflow.py              | data-eng-admin | True  

This is happening at every DAG parse interval, which by default is 30 seconds.

Suggested Improvement

Initialize these objects in the main functions of each provider, and pass them into other functions as necessary.

Benefit

Decreased DAG processing time, CPU & memory usage.

Additional context

Implementation

  • 🙋 I would be interested in implementing this feature.

I'm thinking about how to tackle this one today because it affects WordPress/openverse#1703. After taking a look at some of the code, I think we have 3 approaches:

1) Use a deferred init function

With this approach, we would still initialize the DelayedRequester and <Media>Store at the top of the file, but none of the actual initialization would occur at that point. We would then call a .init() function on each within the main function in the provider_api_scripts module. That would allow us to continue to reference the requester & store as module-level variables within the API script while still deferring any heavy lifting until runtime. Tests would be affected by this, so we may need to add a fixture to the tests which automatically inits the requester & store.

2) Create the objects in main

Here we would move the instantiation into the main function and out of the module scope. This would require significant changes to the function signatures (as they now have to pass along structures which were previously accessible at the module level) and the tests along with them. This approach seems the most "conventional".

3) Convert the API scripts into objects rather than collections of functions

Almost every one of our API scripts initializes a DelayedRequester and a <Media>Store (if not several). There's also a fair amount of similar machinery among the API scripts that could be collapsed/refactored as part of this effort. All of the shared/common pieces would go into a ProviderAPIRetreiver (or something) class definition that would be sub-classed and overridden as necessary per-provider. This would be the largest undertaking of the three. It also has the risk of obfuscating some of the code - while code reduction is generally considered good, accessibility to community contributions is still a high priority for this piece of the project. By moving some pieces into a common class definition, we might make it harder for contributors to understand what's going on and thus how to contribute a new provider.


The reason I bring this up is that we're likely going to be adding a new Requester class as part of WordPress/openverse#1703 that has even more overhead on initialization, which we'd want to avoid doing at DAG parse time.

Of the three, which do you think is best? Have I missed any considerations as part of this?

CC @zackkrida @obulat @krysal

I think I'm inclined to actually go the object route - for something like WordPress/openverse#1697, it would significantly reduce the number of changes we needed to make if all providers had shared functionality inherited from a base class.

Sorry I meant to reply earlier, but lost this issue.
We have discussed converting the provider API scripts into objects: #163 (comment)
I think it would be a great solution for this problem, and, provided that we give clear instructions and a template, would not make it significantly harder for new contributors to understand.