teamclairvoyant/airflow-maintenance-dags

Airflow runs out of memory on large MySQL databases


When running the cleanup scripts for the first time on a large DB, they can consume tens or hundreds of gigabytes of memory. The culprit is this line:

entries_to_delete = query.all()

Python's MySQL driver buffers the entire result set in memory by default. On a large DB, that can be hundreds of gigabytes. An unbuffered, server-side cursor (SSCursor) should be used for MySQL:

https://pymysql.readthedocs.io/en/latest/modules/cursors.html#pymysql.cursors.SSCursor
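One way this could be scoped to the script would be to give it its own engine whose connections use SSCursor, rather than touching Airflow's global connection. This is only a sketch under that assumption — the connection URL is a placeholder, and it hasn't been run against a live server:

```python
# Sketch: a dedicated engine for the cleanup script, with an unbuffered cursor.
# The URL below is a placeholder; the real script would reuse Airflow's
# sql_alchemy_conn value.
import pymysql.cursors
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:pass@host/airflow",
    # connect_args is passed straight through to pymysql.connect(), so the
    # server-side cursor applies only to connections from this engine.
    connect_args={"cursorclass": pymysql.cursors.SSCursor},
)

# stream_results asks SQLAlchemy to iterate rows instead of fetching them all,
# so memory stays bounded regardless of result-set size.
with engine.connect() as conn:
    streaming_conn = conn.execution_options(stream_results=True)
```

This keeps the rest of Airflow on its default buffered cursor.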

However, setting SSCursor at the Airflow connection level is pretty invasive, since it affects all of Airflow. It would be nice to be able to set this cursor class for just the cleanup script itself.

Alternatively, executing the query/delete in smaller batches in a loop would be preferable, since it bounds memory use without changing cursor configuration.
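The batching idea can be illustrated with a minimal, self-contained sketch. SQLite stands in for the Airflow metadata DB here, and the table/column names are made up — the point is just the loop shape: fetch one batch of primary keys, delete them, repeat until nothing matches.

```python
import sqlite3

BATCH_SIZE = 2  # tiny for illustration; a real run would use e.g. 10_000

# In-memory stand-in for the metadata DB (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, execution_date TEXT)")
conn.executemany(
    "INSERT INTO task_instance (execution_date) VALUES (?)",
    [("2020-01-0%d" % (i + 1),) for i in range(7)],
)

deleted = 0
while True:
    # Fetch only one batch of ids instead of materializing every row at once.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM task_instance WHERE execution_date < ? LIMIT ?",
        ("2020-01-09", BATCH_SIZE),
    )]
    if not ids:
        break
    conn.execute(
        "DELETE FROM task_instance WHERE id IN (%s)" % ",".join("?" * len(ids)),
        ids,
    )
    conn.commit()
    deleted += len(ids)
```

Peak memory is now proportional to BATCH_SIZE, not to the number of stale rows.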

Should probably also add a print_deletes param, so the query.all() call can be skipped entirely when we don't need to log the rows being deleted.
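A sketch of what that flag could look like — FakeQuery and delete_entries are hypothetical names used to exercise the logic; in the real script the query object would be SQLAlchemy's:

```python
class FakeQuery:
    """Stand-in for a SQLAlchemy query, just to exercise the flag logic."""

    def __init__(self, rows):
        self._rows = rows
        self.fetched_all = False  # records whether the expensive path ran

    def all(self):
        self.fetched_all = True   # the memory-hungry call we want to avoid
        return list(self._rows)

    def delete(self, synchronize_session=None):
        n = len(self._rows)
        self._rows = []
        return n


def delete_entries(query, print_deletes=False):
    # Only materialize rows in memory when the caller explicitly asks to see them.
    if print_deletes:
        for entry in query.all():
            print("Deleting:", entry)
    # The delete itself runs server-side without fetching rows first.
    return query.delete(synchronize_session=False)


q = FakeQuery(["ti-1", "ti-2", "ti-3"])
n = delete_entries(q)  # default: query.all() is never called
```

With print_deletes defaulting to False, the default path never loads the rows at all.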