Airflow runs out of memory on large MySQL databases
When running cleanup scripts for the first time on large DBs, the script can consume tens or hundreds of gigabytes of memory. The culprit is this line:
entries_to_delete = query.all()
Python's MySQL client buffers the entire result set in memory by default, which on large DBs can be hundreds of gigabytes. For MySQL, an unbuffered server-side cursor (SSCursor) should be used instead:
https://pymysql.readthedocs.io/en/latest/modules/cursors.html#pymysql.cursors.SSCursor
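For illustration, this is roughly what selecting SSCursor looks like, both for a raw pymysql connection and through SQLAlchemy's `connect_args` (host, credentials, and database names below are placeholders, and the engine URL is only a sketch):

```python
import pymysql
import pymysql.cursors
from sqlalchemy import create_engine

# Raw pymysql: ask for a server-side (unbuffered) cursor so rows
# stream from the server instead of being buffered client-side.
conn = pymysql.connect(
    host="mysql.example.com",  # placeholder
    user="airflow",
    password="...",
    database="airflow",
    cursorclass=pymysql.cursors.SSCursor,
)

# Through SQLAlchemy (as Airflow uses): the same option can be passed
# via connect_args -- but set here it applies to every connection the
# engine makes, which is the invasiveness concern described above.
engine = create_engine(
    "mysql+pymysql://airflow:...@mysql.example.com/airflow",
    connect_args={"cursorclass": pymysql.cursors.SSCursor},
)
```

Note that SSCursor keeps the connection busy until all rows are consumed, which is part of why flipping it on globally is invasive.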
However, setting SSCursor at the Airflow connection level is pretty invasive, as it affects all of Airflow. It would be nice to be able to set this configuration for just the script itself.
Alternatively, executing the query/delete in smaller batches in a loop would be preferable.
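The batched approach could be sketched like this (using sqlite3 here as a stdlib stand-in for the Airflow metadata DB; the table and column names are illustrative, not Airflow's actual schema):

```python
import sqlite3

def delete_in_batches(conn, cutoff, batch_size=1000):
    """Delete matching rows a batch at a time, so no single statement
    ever materializes the full result set in memory."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM task_instance WHERE rowid IN ("
            "  SELECT rowid FROM task_instance"
            "  WHERE end_date < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left to delete
            break
        total += cur.rowcount
    return total

# Usage sketch: 5000 rows, delete everything older than 4000
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (end_date INTEGER)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?)", [(d,) for d in range(5000)]
)
deleted = delete_in_batches(conn, cutoff=4000, batch_size=500)
print(deleted)  # 4000 rows removed, 500 at a time
```

Each iteration touches at most `batch_size` rows, so peak memory stays bounded regardless of DB size, and each commit keeps transactions short.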
We should probably also add a print_deletes param, so we can skip the query.all() call entirely when we don't need to log what's being deleted.
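A hypothetical shape for that param (again using sqlite3 as a stand-in; the flag name and schema are illustrative):

```python
import sqlite3

def cleanup(conn, cutoff, print_deletes=False):
    """Only materialize the matching rows (the query.all() equivalent)
    when the caller actually asks to see what will be deleted."""
    if print_deletes:
        # expensive path: pulls every matching row for logging
        for (row_id,) in conn.execute(
            "SELECT rowid FROM task_instance WHERE end_date < ?", (cutoff,)
        ):
            print("deleting row", row_id)
    cur = conn.execute(
        "DELETE FROM task_instance WHERE end_date < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

With the flag off (the default), the delete is issued directly and no result set is ever fetched into memory.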