Illumina/paragraph

Handling of temporary files by multigrmpy.py

Closed this issue · 2 comments

I have noted a few problems with the way temporary files are handled by multigrmpy.py:

1- The .vcf.gz files are still written to /tmp or /scratch even when --scratch-dir is explicitly set to another directory (the .json files are written to that directory, but the .vcf.gz files are not)
2- The index files of the .vcf.gz files (the .vcf.gz.csi files) are not removed from the temporary directories, even when multigrmpy.py exits successfully
3- The .json files are also not removed from the temporary directory after multigrmpy.py finishes

I assume this is not the intended behavior. In my case, I need to clean up the temporary directories after each run. That prevents me from running several multigrmpy.py instances in parallel, because cleaning up after one run could delete files that another instance is still using.
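One possible workaround for the parallel-run problem is to give every run its own scratch subdirectory and remove only that subdirectory afterwards. The following is a minimal sketch, not part of paragraph: the function name run_with_private_scratch and the "grmpy_" prefix are made up for illustration, and the only multigrmpy.py option it relies on is --scratch-dir. It does not help with item 1, since the .vcf.gz files would still land in /tmp or /scratch.

```python
import shutil
import subprocess
import tempfile


def run_with_private_scratch(command, scratch_base):
    """Run one command with its own scratch subdirectory, then remove it.

    `command` is the argument list for a single run, without --scratch-dir.
    Hypothetical helper for illustration; not part of paragraph.
    """
    # mkdtemp creates a uniquely named directory, so parallel runs
    # under the same base never share a scratch directory.
    scratch = tempfile.mkdtemp(prefix="grmpy_", dir=scratch_base)
    try:
        result = subprocess.run(command + ["--scratch-dir", scratch])
        return result.returncode
    finally:
        # Only this run's directory is deleted; other runs are untouched.
        shutil.rmtree(scratch, ignore_errors=True)
```

Each parallel job would call this with its own multigrmpy.py argument list and the same scratch_base.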

I saw that an issue on this topic was raised in the past and has since been closed, but the behavior of the program has not changed.

Thanks for finding this out!
For item 1, that's a problem. I'll leave it open until we fix it.
For 2 & 3, I believe the temporary directory, if not specified by --scratch-dir, will be completely removed upon completion. Do you mean they're not cleaned when you have --scratch-dir specified?

Hi Sai,

Thanks for your reply. Indeed, when --scratch-dir is set explicitly, neither the temporary directory nor its contents are removed after the run.

Regarding item 2, I have fixed it for my purposes by adding the line 'os.remove(indexed_vcf.name + ".csi")' after line 106 in the __init__.py file of vcf2paragraph. However, I am not an experienced Python coder, so I am not sure whether this is the proper way to do it.
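A slightly more defensive variant of that one-line fix would tolerate a missing index file instead of raising. This is only a sketch: remove_index_files is a hypothetical helper, not a function in vcf2paragraph, and including ".tbi" alongside ".csi" is an assumption in case a tabix-style index is produced instead.

```python
import os


def remove_index_files(vcf_path, suffixes=(".csi", ".tbi")):
    """Delete index files that sit next to vcf_path; return the paths removed.

    Hypothetical helper; the ".tbi" suffix is an assumption.
    """
    removed = []
    for suffix in suffixes:
        index_path = vcf_path + suffix
        try:
            os.remove(index_path)
        except FileNotFoundError:
            # No index with this suffix was written; nothing to clean up.
            continue
        removed.append(index_path)
    return removed
```

With this helper, the added line would become remove_index_files(indexed_vcf.name).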

I have also added some functionality for deleting the temporary .json files in the --scratch-dir after the run, but I assume some users might not want this, since they may need the files for debugging or for closer inspection of the output. As for me, I do not need them because I only really care about the genotypes. I prefer to specify --scratch-dir explicitly because it is easier to manage than the common /scratch directory, which is shared with other users on the system.
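To keep the files available for users who want them, the deletion could be made optional. The sketch below shows the idea only; clean_json_files and its keep parameter are hypothetical names, not existing multigrmpy.py options, and it assumes the scratch directory is used by a single run.

```python
from pathlib import Path


def clean_json_files(scratch_dir, keep=False):
    """Delete *.json files directly inside scratch_dir unless keep is set.

    Returns the list of removed paths. Hypothetical helper for illustration.
    """
    if keep:
        # Leave everything in place for debugging or inspection.
        return []
    removed = []
    for path in sorted(Path(scratch_dir).glob("*.json")):
        path.unlink()
        removed.append(path)
    return removed
```

Defaulting keep to False matches my use case; a default of True would be the safer choice if other users rely on the files.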