Temporary files are not always removed
Describe the bug
datapusher-plus leaves temporary files in /tmp, which accumulate until available space is depleted
To Reproduce
Steps to reproduce the behavior:
- Process various files with datapusher-plus
- Observe /tmp
Expected behavior
After analysis is complete, /tmp contains no temporary data files.
Additional context
I suspect this happens because the NamedTemporaryFiles are used mostly as temporary file name generators instead of as file objects, so the automatic deletion does not work.
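To illustrate the suspected pattern (a minimal sketch, not the actual datapusher-plus code): if a NamedTemporaryFile is closed immediately and only its name is reused, automatic deletion runs before the real data is written, so cleanup then depends entirely on explicit os.unlink calls.

```python
import os
import tempfile

# Suspected anti-pattern: create a NamedTemporaryFile only to obtain a path.
# With delete=False the file is NOT removed on close, so the name stays usable,
# but nothing deletes it automatically afterwards.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
path = tmp.name

# Later processing writes real data to that path...
with open(path, "w") as f:
    f.write("col1,col2\n1,2\n")

# Unless os.unlink(path) runs on every code path (including errors),
# the file lingers in /tmp.
print(os.path.exists(path))  # True — the file survives
```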
I propose using a TemporaryDirectory within a with statement instead, which ensures it is deleted afterwards. The most ergonomic way to implement this is to split push_to_datastore into two functions: one serving as the @job.asynchronous entry point and context manager, and another that actually processes the data, like so:
```python
@job.asynchronous
def push_to_datastore(task_id, input, dry_run=False):
    """Download and parse a resource, pushing its data into CKAN's DataStore.

    An asynchronous job that gets a resource from CKAN, downloads the
    resource's data file and, if the data file has changed since last time,
    parses the data and posts it into CKAN's DataStore.

    :param dry_run: Fetch and parse the data file but don't actually post the
        data to the DataStore, instead return the data headers and rows that
        would have been posted.
    :type dry_run: boolean
    """
    # Ensure temporary files are removed after the run
    with tempfile.TemporaryDirectory() as temp_dir:
        return _push_to_datastore(task_id, input, dry_run=dry_run, temp_dir=temp_dir)


def _push_to_datastore(task_id, input, dry_run=False, temp_dir=None):
    handler = util.StoringHandler(task_id, input)
    ...
```
Then all the temporary files can be created within that directory with sensible names like qsv_dedup.csv, and all the os.unlink calls can be omitted. Whatever happens during processing, the with statement will remove the temporary files afterwards.
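As a standalone sketch of that guarantee (the file name qsv_dedup.csv follows the naming convention suggested above; the simulated failure is illustrative): TemporaryDirectory removes the directory and everything inside it even when processing raises.

```python
import os
import tempfile

saved_path = None
try:
    with tempfile.TemporaryDirectory() as temp_dir:
        # Intermediate file with a sensible name, no unlink bookkeeping needed
        dedup_path = os.path.join(temp_dir, "qsv_dedup.csv")
        with open(dedup_path, "w") as f:
            f.write("col1,col2\n1,2\n")
        saved_path = dedup_path
        # Simulate a processing failure mid-run
        raise RuntimeError("simulated processing failure")
except RuntimeError:
    pass

# The with statement cleaned up despite the exception
print(os.path.exists(saved_path))  # False — directory and files are gone
```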
Thanks @bzar for the suggestion!
Your proposed approach is certainly much better and simplifies the code.