option for not having unique identifier
Closed this issue · 8 comments
Example on support site
https://support.bioconductor.org/p/9144615/
In certain situations, adding a unique identifier to the cached file name is not desirable. The original intent of the identifier was to allow multiple versions of the same file to be cached. Create a user option to disable it.
@mtmorgan thoughts on whether the default should be to add a unique identifier or not? To stay consistent with current behavior, the default would be to add one.
I think the current behavior is the desired default.
I have a need to cache files where the exact file name matters. The folder does not. I would benefit from an option where the unique ID is in the folder name, while the file name itself does not get mangled in any way. This is for software that uses file names for outputs (e.g. columns in an Excel spreadsheet), so mangling a file name is not desirable.
I absolutely agree this should NOT be the default, but I hope the implementation would be reasonably easy: the place that determines the file name as CHECKSUM_file.name would simply create CHECKSUM/file.name. I will look into the possibility of doing a pull request.
Yes, I will work on this enhancement and try to have it in the branch shortly.
I've started a branch to work on this and expect it to be implemented shortly; I am still working on adequate testing.
https://github.com/Bioconductor/BiocFileCache/pulls Once a few members of the team test and review it, it will be merged. Cheers
I went through the new code to see how it works, and I have some feedback:
- this is great, and it fixes the problem we had with "mangled" file names
- it is not great if (as mentioned in the documentation) a user adds two different files with the same name. This happens very rarely, as one can imagine, but when it does, it causes a major problem, as the cache cannot store such a file twice!
Can this be resolved by utilizing the unique ID, but instead of:
3549133bddd_file.txt
Produce a folder with unique ID, containing one file like so:
3549133bddd/file.txt
If you went the subfolder route, it would also later naturally support situations where people add 100s of thousands of files into the cache, and you would want to store them in separate subfolders so as not to kill the filesystem performance...
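To make the two layouts concrete, here is a minimal sketch in Python rather than in the package's own code. The function names `add_prefixed` and `add_in_subfolder` and the 11-character random ID are assumptions made purely for illustration; this is not how BiocFileCache builds its paths.

```python
from pathlib import Path
import shutil
import uuid


def add_prefixed(cache_dir: Path, source: Path) -> Path:
    """Copy `source` into the cache as <uid>_<name> (flat layout)."""
    uid = uuid.uuid4().hex[:11]  # hypothetical ID scheme, illustration only
    target = cache_dir / f"{uid}_{source.name}"
    shutil.copy2(source, target)
    return target


def add_in_subfolder(cache_dir: Path, source: Path) -> Path:
    """Copy `source` into the cache as <uid>/<name> (one folder per entry)."""
    uid = uuid.uuid4().hex[:11]  # hypothetical ID scheme, illustration only
    target_dir = cache_dir / uid
    # No exist_ok: an ID clash raises instead of overwriting an entry.
    target_dir.mkdir(parents=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target
```

In the second layout the cached file keeps its original base name, and two different files with the same name can coexist because each lives in its own folder. The cost is one extra directory per cached file.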
If this solution is unacceptable (it doubles the number of filesystem entries), maybe one could implement an algorithm where the cache has numbered subfolders. When a file is added and a file with the identical name already exists, create a new subfolder and put the file in there. This way no subfolders are created in the cache unless name collisions actually happen.
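The numbered-subfolder idea could look roughly like the following Python sketch. The function name `add_with_numbered_fallback` is hypothetical, and the sketch assumes a single writer and that no cached file is itself named like one of the numbered folders.

```python
from pathlib import Path
import shutil


def add_with_numbered_fallback(cache_dir: Path, source: Path) -> Path:
    """Store `source` under its exact name, using subfolders 1, 2, ... on clashes."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    candidate = cache_dir / source.name
    n = 0
    # Walk cache/<name>, cache/1/<name>, cache/2/<name>, ... until a slot is free.
    while candidate.exists():
        n += 1
        candidate = cache_dir / str(n) / source.name
    candidate.parent.mkdir(parents=True, exist_ok=True)
    # Not safe against concurrent writers: another process could claim the
    # slot between the exists() check and the copy.
    shutil.copy2(source, candidate)
    return candidate
```

With this layout the first file of a given name sits directly in the cache folder, and subfolders only appear once a collision has occurred. On a shared multi-user cache the check-then-copy step would need locking or an atomic create.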
We can consider option two as a larger enhancement, but I am not in favor of having an individual subfolder for each file added to the cache, as it wastes space and efficiency. For the 100s of thousands of files, if there are groups of files that should be cached together, a user can always define their own unique cache location for projects.
If a single user wants to cache a large number of tiny files, filesystem performance degrades when too many files are stored inside a single folder. A cache is a natural bottleneck, because all your files end up in it.
A standard approach is to split huge folders into smaller ones... but I can see how this would mess with people who want to cache .bam next to .bai files, which inspired the "switch off name mangling" feature.
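As a rough illustration of splitting one huge folder into smaller ones, the Python sketch below places each file in a subfolder named after the first characters of a hash. The function `sharded_path` and its `key` and `width` parameters are hypothetical. Hashing on a shared key instead of the full file name is only one possible way to keep companion files such as .bam and .bai together, and it assumes the caller knows which files belong together.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sharded_path(cache_dir: Path, name: str,
                 key: Optional[str] = None, width: int = 2) -> Path:
    """Return cache_dir/<shard>/<name>, with the shard taken from a hash prefix.

    `key` selects what gets hashed; it defaults to the file name itself.
    Files hashed on the same key always land in the same shard folder.
    """
    basis = name if key is None else key
    digest = hashlib.sha1(basis.encode("utf-8")).hexdigest()
    # With width=2 there are at most 256 shard folders (00 .. ff).
    return cache_dir / digest[:width] / name
```

Called as `sharded_path(cache, "sample.bam")`, the shard depends on the full file name, so the matching index file may end up in a different folder. Passing the same `key` for both files, for example `key="sample"`, puts them side by side.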
Maybe I am trying to use your package in ways that were not intended. I want to build a big "data cruncher" on a server shared by multiple people that caches files locally for fast access. That leads to storing a large number of cached files, multi-user access, and possible name collisions. This package might not be intended for such an intense use case.