philiprt/GeslaDataset

'filenames' order


Hi,
thanks for providing this code to read GESLA data!
I just wanted to add a minor comment on line 89 of gesla.py:
idx = [s.Index for s in self.meta.itertuples() if s.filename in filenames]

It seems that if 'filenames' is not sorted in exactly the same order as the names in self.meta, the meta information will not be in the same order as the concatenated xr.Dataset. I assume the entries in meta are sorted alphabetically, so the list of 'filenames' could be sorted the same way when it is read in.

Thanks!
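
The ordering mismatch described above can be reproduced without any GESLA files. The following is only a minimal sketch: the `Row` tuples are a toy stand-in for what `self.meta.itertuples()` yields, and the filenames are made up for illustration.

```python
from collections import namedtuple

# Toy stand-in for the rows yielded by self.meta.itertuples().
# The filenames are hypothetical, not real GESLA records.
Row = namedtuple("Row", ["Index", "filename"])
meta_rows = [Row(0, "file_a"), Row(1, "file_b"), Row(2, "file_c")]

# User-supplied order differs from the metadata order.
filenames = ["file_c", "file_a"]

# Same selection pattern as line 89 of gesla.py: the loop runs over the
# metadata, so the result follows the metadata order, not `filenames`.
idx = [s.Index for s in meta_rows if s.filename in filenames]
meta_order = [meta_rows[i].filename for i in idx]

print(filenames)   # order in which the data would be concatenated
print(meta_order)  # order in which the metadata would be attached
```

The two printed lists differ, which is why station 0 in the data can end up paired with the metadata of a different file.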

Hey, I encountered the same issue. For me it was not just a minor issue: the filename reference and the corresponding data actually get mixed up if the initial filenames are in an arbitrary order.
For me this led to a wrong analysis of the percentiles of the corresponding stations.

I adjusted the files_to_xarray method in the class as follows to solve the issue. Now the filenames can be passed in arbitrary order when loading the data.

def files_to_xarray(self, filenames):
        """Read a list of GESLA filenames into a xarray.Dataset object. The
        dataset includes variables containing metadata for each record.

        Args:
            filenames (list): list of filename strings.

        Returns:
            xarray.Dataset: data, flags, and metadata for each record.
        """
        def sort_filenames(filenames):
            """Auxiliary function that sorts the filenames.
            This ensures that data can be loaded independent of the sorting
            given by the user input.

            Args:
                filenames (list): list of filename strings.

            Returns:
                list: list of sorted filenames.
            Author:
                Kai Bellinghausen
            """
            # Get the indices of filenames in the metadata dataframe
            indices = [self.meta[self.meta['filename'] == filename].index[0] for filename in filenames]

            # Sort filenames based on the indices
            sorted_filenames = [filename for _, filename in sorted(zip(indices, filenames))]
            
            return sorted_filenames
        
        filenames = sort_filenames(filenames)

        data = xr.concat(
            [
                self.file_to_pandas(f, return_meta=False).to_xarray()
                for f in filenames
            ],
            dim="station",
        )

        idx = [
            s.Index for s in self.meta.itertuples() if s.filename in filenames
        ]
        meta = self.meta.loc[idx]
        meta.index = range(meta.index.size)
        meta.index.name = "station"
        data = data.assign({c: meta[c] for c in meta.columns})

        return data
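
A possible alternative to sorting the input is to reorder the metadata so that it follows the order of 'filenames'. The sketch below uses a toy DataFrame in place of self.meta (the column `label` and all values are made up) and assumes every filename occurs exactly once in the metadata; it is not taken from gesla.py.

```python
import pandas as pd

# Toy stand-in for self.meta; column names other than "filename" and all
# values are hypothetical.
meta = pd.DataFrame(
    {"filename": ["file_a", "file_b", "file_c"], "label": ["A", "B", "C"]}
)

# User-supplied order, deliberately different from the metadata order.
filenames = ["file_c", "file_a"]

# Look the rows up by filename, in the order given by the user.
# .loc with a list of labels returns rows in list order and raises a
# KeyError if a filename is missing from the metadata.
ordered = meta.set_index("filename").loc[filenames].reset_index()
ordered.index.name = "station"

print(ordered)
```

With this approach the station dimension keeps the order the user asked for, at the cost of relying on unique filenames in the metadata table.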