# Converting tar archives into a reference filesystem
Zarr datasets can challenge the metadata servers of HPC systems because they may consist of millions of files. One way to circumvent this challenge is to collect all files in a container, e.g. a tar archive, and to create a look-up table of the byte ranges at which the content of each file is stored within the container. Tar-ing zarr datasets also makes it easy to store and reuse data on tape archives.

`tar_referencer` creates these look-up tables, which can be used with the `preffs` package.
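To illustrate the idea, the following sketch resolves a single file by hand from such a look-up table. It assumes the parquet reference file holds one row per archived file with `path`, `offset` and `size` columns and that the index keys are the original file paths; the column names and the example key are assumptions for illustration, not guaranteed by `preffs`:

```python
import pandas as pd

refs = pd.read_parquet("file_index.preffs")
entry = refs.loc["dataset.zarr/var1/0.0.0"]  # hypothetical key

# read the file's content directly out of the tar, without unpacking it
with open(entry["path"], "rb") as tar:
    tar.seek(entry["offset"])
    content = tar.read(entry["size"])
```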
## Usage
The package can be installed with

```sh
pip install git+https://github.com/observingClouds/tar_referencer.git
```

The look-up files (parquet reference files) are created with

```sh
tar_referencer -t file.*.tar -p file_index.preffs
```
If zarr files have been packed into tars and indexed with `tar_referencer`, the tars can be opened with:

```python
import xarray as xr

# tell preffs where the tar parts are located
storage_options = {"preffs": {"prefix": "/path/to/tar/files/"}}
ds = xr.open_zarr("preffs::file_index.preffs", storage_options=storage_options)
```
## Creating tar files
Technically, all sorts of tar files can be referenced. However, `tar_referencer` currently only supports tar files that are split at the file level. Tar files that are split within a header or data block are not supported.

> **Warning**
> This does not work:

```sh
tar -cvf - big.tar | split --bytes=32000m --suffix-length=3 --numeric-suffix - part%03d.tar
```
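Whether a given set of parts is split at the file level can be sanity-checked by looking for the tar magic string at the start of each part; parts produced by a byte-wise split typically begin in the middle of a block. A minimal sketch (the glob pattern is an assumption, and this check is not part of `tar_referencer`):

```python
import glob

for part in sorted(glob.glob("file.*.tar")):
    with open(part, "rb") as f:
        header = f.read(512)  # a tar member header occupies one 512-byte block
    # POSIX/GNU tar headers carry the magic string "ustar" at byte offset 257
    if header[257:262] == b"ustar":
        print(f"{part}: starts with a valid tar header")
    else:
        print(f"{part}: no tar header at byte 0 -- split mid-block?")
```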
To generate compatible tar files from zarr files or other directory structures, `tar_referencer` provides `tar_creator`:

```sh
tar_creator -i dataset.zarr -t dataset_part{:03d}.tar -s MAX_SIZE_BYTES
```

where `MAX_SIZE_BYTES` is the maximum size of a tar file before further output is written to an additional archive.
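For example, `-s 32000000000` would start a new archive once a part reaches roughly 32 GB.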
To split already existing tar files, `splitar` has been tested successfully:

```sh
splitar -S 32000m big.tar part.tar-
```
## Tips and tricks
For very big zarr datasets, especially those that contain several variables, it can be advisable to pack each variable subfolder of the zarr file into its own set of tars. The benefit of this approach is that only those tars that actually contain the variable of interest need to be downloaded or retrieved. For each of these sets, a separate look-up table can be generated; the individual tables can then be merged into an overarching look-up table covering the entire dataset, as sketched below.
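A per-variable packing could look like the following sketch, assuming that `tar_creator` accepts a zarr subfolder as input; `coords`, `var1` and `var2` are placeholder names:

```sh
for var in coords var1 var2; do
    tar_creator -i dataset.zarr/$var -t "dataset.${var}_part{:03d}.tar" -s MAX_SIZE_BYTES
    tar_referencer -t dataset.${var}_part*.tar -p file_index.$var.preffs
done
```

The per-variable reference files can then be merged with pandas: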
```python
import pandas as pd

# load the per-variable reference files
df_coords = pd.read_parquet("file_index.coords.preffs")
df_var1 = pd.read_parquet("file_index.var1.preffs")
df_var2 = pd.read_parquet("file_index.var2.preffs")

# concatenate and sort so the merged table covers the entire dataset
df_entire_dataset = pd.concat([df_coords, df_var1, df_var2]).sort_index()
df_entire_dataset.to_parquet("entire_dataset.preffs")
```
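The merged `entire_dataset.preffs` can then be opened like any other reference file, as shown in the Usage section above.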