New Functionality: Input Keys from a File to Download
rajivchodisetti opened this issue · 4 comments
Hi,
Can we have additional functionality where the list of keys to be downloaded is provided via a file, passed through an additional argument? For simplicity, this input file could itself be another blob.
Use case: we have millions of small files (images) that are needed for training on an as-needed basis, so this would be very handy.
Right now I am relying on Python multiprocessing for this, but I think Go would be much faster, based on my experience using your module.
Thanks
Or just guide me on how to do it, and I will try to hack it.
Currently you can download multiple files based on the prefix. If this does not work for your scenario, more details on why not would be helpful.
Nevertheless, there's already an enhancement request to make the inputs of the -n and -f options available via a file. The current thinking on this (feedback is welcome) is to support something like this: -f @myfile. To implement this functionality, a new parsing-validation rule would need to be implemented here. The validation rule would detect the case when a file is provided, read and validate its content, and derive the pipeline parameters from it.
Sorry for the delay in replying. Regarding your first question on why prefix-based download doesn't work: it doesn't work for my use case because I maintain an index (database) of blob keys. Depending on the search criteria, a query against the database emits a set of keys, and the data for exactly those keys has to be downloaded.
For example, we crawl millions of images, and each image has multiple other images associated with it, such as one where the entire background is removed and only the apparel is visible, and one that is a thumbnail generated from the original image. These assets are stored in Blob storage, and I maintain an index of their keys in the database. A search query on my database might look like "give me all the blob keys for which thumbnails were generated in the last 2 days", and the output of the query is simply a list of blob keys whose assets have to be downloaded.
Any chance of this getting picked up?