dathere/datapusher-plus

Temporary files are not always removed

Closed this issue · 1 comments

bzar commented

Describe the bug
datapusher-plus leaves temporary files in /tmp, which accumulate until available space is depleted

To Reproduce
Steps to reproduce the behavior:

  1. Process various files with datapusher-plus
  2. Observe /tmp

Expected behavior
After analysis is complete /tmp contains no temporary data files

Additional context
I suspect this happens because the NamedTemporaryFiles are used mostly as temp file name generators instead of file objects, so the automatic deletion does not work.

I propose using a TemporaryDirectory instead within a with statement which ensures it is deleted afterwards. The most ergonomic way of implementing this is splitting push_to_datastore into two functions: one to work as a @job.asynchronous and context manager, the other to actually process the data like so:

@job.asynchronous
def push_to_datastore(task_id, input, dry_run=False):
    """Download and parse a resource push its data into CKAN's DataStore.

    An asynchronous job that gets a resource from CKAN, downloads the
    resource's data file and, if the data file has changed since last time,
    parses the data and posts it into CKAN's DataStore.

    :param dry_run: Fetch and parse the data file but don't actually post the
        data to the DataStore, instead return the data headers and rows that
        would have been posted.
    :type dry_run: boolean

    """

    # Ensure temporary files are removed after run
    with tempfile.TemporaryDirectory() as temp_dir:
        return _push_to_datastore(task_id, input, dry_run=dry_run, temp_dir=temp_dir)

    
def _push_to_datastore(task_id, input, dry_run=False, temp_dir=None):
    handler = util.StoringHandler(task_id, input)
    ...

Then all the temporary files can be created within that directory with sensible names like qsv_dedup.csv and omit all the os.unlink stuff. Whatever happens in the processing the with statement will remove the temporary files afterwards.

Thanks @bzar for the suggestion!

Your proposed approach is certainly much better and simplifies the code.