ldeo-glaciology/xapres

[Minor] New winter data filenames cut off

Closed this issue · 9 comments

It seems like the uploaded zarrs with the Greenland 2022-23 winter data don't have the filenames fully saved. For reference, with the old data, the command: print(ds_101.filename.isel(time=100).values) returns ldeo-glaciology/GL_apres_2022/A101/CardA/DIR2022-05-26-1536/DATA2022-05-26-1536.DAT as the filename, which makes sense. If I do this same command with the newly saved winter data, I get ldeo-glaciology/GL_apres_2022/A101/winter22_23/DIR2022-09-25-1533/DATA2022-.

It's weird that it suddenly gets cut off. I think this is an error somewhere in the ApRESDefs. Will need to investigate when there is spare time, as the filename isn't too important for analysis in my opinion. In the meantime I will still use this uploaded data for generating figures + developing the velocity calculations as these tasks don't require the filename.

A good first step would be to check if the filenames stored in full in the xarray (or xarrays) you create before writing to zarr

The filenames are stored fine in the .DAT files, as well as in the individual zarrs. If I run xr.open_mfdataset(f'gs://ldeo-glaciology/apres/greenland/2022/A104/individual_zarrs_prechunked_winter22_23/dat_*', chunks = {}, engine = 'zarr', consolidated = False, parallel = True), the resulting ds.filenames are complete.

Running the steps in the dat_to_zarr notebook up to the line that calls to_zarr, results in the dataset having proper filenames still.

So something in the to_zarr conversion is cutting off the filenames. So the good news is that our code is fine (I think). Bad news is that this may be an issue with xarrray's to_zarr or with the way we call it. Why this is the case, I am still trying to figure out.

I figured out the problem, and it is kind of on our end. The issue is the survey data (the attended bursts) is listed first when converting to zarr. Since it's in a different folder and has a different filename, xarray.to_zarr automatically applies that string length to the remaining files when writing the filename. We can see the results below. In the summer data, the files are 83 characters long:
image
In our winter data, we have instead a character length of 75, which is the first entry (an attended file) length in the array, but not long enough for the unattended data file names:
image

So to fix, what I have to do is delete the "single" zarrs from the bucket and redo the last step of converting all the individual zarrs into single zarrs, but this time exclude the data stored in the "Survey" folder.

When converting the summer data to zarr, did you exclude the survey data as well?

Yeah it looks like the survey data just wasn't included in the individual or combined zarrs. I kept consistent with that and removed the individual survey zarrs, which resolved the filename issue. The updated zarrs have been uploaded.

Good work.

One feature that would be useful to add to help with this is an option to choose the search string to use in the line of code that selects the dats file to download. An ongoing PR I'm working on adds this so that we can easily choose dat files from measurements with different antenna orientations, but that feature could also be used for this I think.

Yeah, I think that would be useful. Though given the pipeline of .DAT → individual zarrs → 1 combined zarr, I think it's important to maybe specify what the recommended format of data storage should be with this package. If we plan on maintaining these 3 sets of data, then I think we might need multiple loading functions. The one that loads .DAT files (xa.load_all) sounds like the one you're modifying in your PR. I think if we want the 1 combined zarr to be created with a specific search string in mind, we also need to build a wrapper function around xr.open_mfdataset function that is used when converting individual zarrs to 1 combined zarr. Or, should we assume that in the future, all filtering happens at the initial .DAT→individual zarr conversion, so we don't need to worry about filtering out files between the individual zarr→ 1 combined zarr step?