Azure/azure-storage-fuse

Can azure blobfuse be used to sync blobstore to local disk persistent volume

Closed this issue · 2 comments

First of all, thank you for supporting this great project!

My application deployed on AKS requires fast access to a few large files (~200 files, each between 500 MB and 10 GB). The files change infrequently (perhaps once a month) and are stored on Azure Data Lake Storage Gen2 (by another service).

I have tried to give my app access to the files on the data lake by mounting it via a persistent volume using the azureblob-fuse-premium storage class; here is my minimal example. However, I made a few observations that make this approach not quite suitable:

  • the files only become available on the persistent volume once they are requested from the blob store. A file that is shown in the mounted directory of my application is not there physically; it first has to be downloaded
  • the PV is more of a transient storage: a file from the blob store gets re-requested after its cached copy on the PV has expired
  • the download operation blocks the application

So it seems this blobfuse use case is geared more towards many small files that change frequently (and hence need frequent cache invalidation).

What I am looking for is a kind of background process that periodically syncs files between the PV (which is read-only) and the blob store. The PV should be "close" to the application, with fast read access (i.e. local to the Kubernetes node). Is this a use case that blobfuse somehow supports, or do I need to write this functionality myself?
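For reference, one way to build such a sync yourself (outside blobfuse) is a scheduled `azcopy sync` into a node-local path. The account, container, paths, and schedule below are illustrative placeholders, not details from this issue:

```shell
# Sketch of a periodic pull-sync into a node-local directory.
# <account>, <container>, the SAS token, and paths are all assumptions.
azcopy sync \
  "https://<account>.blob.core.windows.net/<container>?<sas-token>" \
  "/mnt/local-cache" \
  --delete-destination=true   # mirror deletions from the blob store

# Since the files change roughly monthly, a cron entry could drive it, e.g.:
# 0 3 1 * * /usr/local/bin/sync-blobs.sh
```

The application would then read from `/mnt/local-cache` directly, so reads are always local and never block on a download.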

Thank you very much

Thanks for your feedback. A couple of points based on your observations:

  • Blobfuse2 acts as a file-system driver and, as of now, responds only to kernel calls.
  • Files are downloaded only when the application tries to open them, and they are evicted from the local cache when the timeout is hit.
  • Files are not refreshed in the background; a file is re-downloaded only when the application opens it again and blobfuse finds that the timeout has expired.
  • Listing is served from the container, while reads/writes go through the local cache. Hence your observation is correct: the listing shows files that do not physically exist anywhere on your local system until the application opens them (or reopens them after a timeout).
  • It is also true that the application has to wait until the file download is complete: the open call triggers the download, and the 'open' system call is blocked for that period.
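For context, the download-on-open and eviction behaviour above is governed by the file-cache component's timeout. A minimal blobfuse2 config sketch, with assumed path and timeout values:

```yaml
# Sketch of a blobfuse2 file-cache configuration; values are illustrative.
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /tmp/blobfuse2-cache   # local disk location for cached files (assumed path)
  timeout-sec: 120             # a cached file is evicted after this idle period
```

With a large `timeout-sec`, rarely changing files stay cached longer, but the first open of each file still blocks on the full download.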

Way forward:

  • If you are using file-cache, you can switch to block-cache, where files are not cached locally on disk but are kept in memory only. One advantage of this model is that on open we do not download the entire file; instead, it is prefetched in blocks (pieces of the file) based on the application's reads and kept in local memory. This should improve performance for you by reducing the time the application has to wait for the download.
  • For applications that need access to all of the data in the container, we are already planning a solution where blobfuse starts prefetching data on mount instead of waiting for the application to request a given file. This will help applications that use the data for training/modelling purposes.
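A sketch of what switching to block-cache might look like in the blobfuse2 config (the sizes and counts are illustrative assumptions, not recommendations):

```yaml
# Sketch: replace file_cache with block_cache; values are illustrative.
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 16    # size of each block fetched from the blob store
  mem-size-mb: 4096    # total memory budget for cached blocks
  prefetch: 12         # number of blocks to prefetch ahead of the read position
  parallelism: 8       # concurrent block downloads
```

For large files read sequentially, this means `open` returns quickly and reads stream in block by block instead of waiting for a multi-gigabyte download.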

Do let me know if this answers your concerns, and provide feedback if you have something else in mind.

Closing this as there is no update. Feel free to reopen if you have any further queries on this.