Merge competing Dataset approaches
Opened this issue · 11 comments
Typhon currently has two competing Dataset approaches. One lives in the typhon.datasets package and its subpackages, another in typhon.spareice.datasets. The two have overlapping aims but diverge in their implementation and approach. This duplicates maintenance effort and in the long term, it is desirable to merge them. This requires a major effort.
To achieve this goal, it would be needed to make a full inventory of the features within each approach, where they overlap and where they differ. I suspect the typhon.datasets package has many features not currently available in typhon.spareice.datasets. It would affect readers implemented in either way.
Limitations of the typhon.datasets package:
- Limited documentation.
- Lacking tests.
- Overuse of multiple inheritance.
- Some methods defined in wrong place in class hierarchy.
- Hard to use. This is related to the limited documentation and the overuse of multiple inheritance, but that is not the only problem.
I don't expect to be able to do much on this any time soon, nor do I expect others to do so, but I'm keeping this issue here so we keep it in mind long term.
We should probably have a wiki page on this or something.
Personally, I would like to see something fully built around xarray.DataArray and xarray.Dataset (it is unfortunate that our class is also called Dataset). I wasn't aware of xarray when I first started the typhon.datasets package. The typhon.spareice.datasets approach uses the ArrayGroup class which appears very similar to an xarray.Dataset, whereas the Array appears very similar to xarray.DataArray, yet they aren't built around it.
If I understand generate_filename
correctly, the typhon.spareice.datasets approach assumes that the filename can be calculated using only the placeholders in the template. This is not the case for most real datasets. For example, many filenames contain orbit numbers or the string of a downlink station. That means it is necessary to include a regular expression. I'm not sure this is possible with the typhon.spareice.datasets approach but if it isn't, that would be a major limitation.
Thanks @gerritholl for opening this issue.
xarray support
I started to build spareice.datasets on xarray but I ran into some trouble hence I wrote my own Array and ArrayGroup implementations. Maybe you have some suggestions for me how to solve them. Here are my thoughts about xarray:
- xarray.Dataset does not have a native group support as we know it from netCDF4 and there is no development on that topic at the moment (afaik from this thread: pydata/xarray#1092). This is a huge drawback for me since I use the group support extensively especially in the collocations tool. It makes it easier to combine data coming from different datasets. This was the main reason for implementing ArrayGroup which allows grouping by using unix-like paths.
- I really like xarray for working with labeled data but when doing elementwise computations it is starting to become a mess due to the label alignment of the underlying pandas (https://pandas.pydata.org/pandas-docs/stable/dsintro.html#vectorized-operations-and-label-alignment-with-series). Hence I preferred to use plain but simple numpy.arrays that are very straightforward to handle.
- Extending xarray should be done via accessors: the developer discourage to subclass xarray directly and suggest writing accessors (http://xarray.pydata.org/en/stable/internals.html#extending-xarray) which follow the composition-over-inheritance principle (well, in general that is a good thing :-) ). But it hold me off to dig deeper into extending xarrays.
Nevertheless xarray has a great dask support which makes it preferable for big data applications and it seems to have a bright future. So I totally agree with you that future specific datasets implementations should support xarray objects. I therefore try to make ArrayGroup compatible to xarray objects. But in general, I think the actual Dataset base class should be independent from its file content - therefore it should not care about xarrays, ArrayGroups or whatever. This makes it more powerful and also usable for datasets of other data types (e.g. text based or images).
RegEx support
You are right. So far generate_filename
only uses temporal placeholders. I thought about implementing user-defined placeholders but I have not had the time to do it. What do you need them for? Do you want to keep the information from the original filenames and create new filenames with it? A kind of filename conversion? Could you give me a more detailed example?
I like your idea of stating the limitations and features of the packages.
typhon.spareice.datasets
Limitations
- No RegEx support for parsing additional information from filenames
- No RegEx support for generating new files
- So far only file handlers for the HDF versions of MHS and CloudSat files are available
- No testing for more complex datasets
Features
- Focus on usability and maintainability
- Should work with any kind of temporal datasets
- Extendable by using customized file handlers
- Mapping functions with multiprocessing support
find_files
method can return files in bundles (either by size or temporal bins)
I moved this discussion into a new project (https://github.com/atmtools/typhon/projects/1?). Are you okay with that?
I have no experience with "projects" in github but I suppose it is a good idea.
As a regex example, an example of a HIRS filename is 'NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz'. I describe that with the regex r"(L?\d*\.)?NSS.HIR[XS].(?P<satcode>.{2})\.D(?P<year>\d{2})(?P<doy>\d{3})\.S(?P<hour>\d{2})(?P<minute>\d{2})\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz"
. Out of those, the parts B
and station
are present in the filename but not predictable from the starting time. In the case of FCDR_HIRS
, I am either reading or writing data and I have both the re
approach, and a template based approach:
stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
"{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
"{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")
write_subdir = "{fcdr_type:s}/{satname:s}/{year:04d}/{month:02d}/{day:02d}"
stored_re = (r"FIDUCEO_FCDR_L1C_HIRS(?P<version>[2-4])_"
r"(?P<satname>.{6})_"
r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_"
r"(?P<year_end>\d{4})(?P<month_end>\d{2})(?P<day_end>\d{2})"
r"(?P<hour_end>\d{2})(?P<minute_end>\d{2})(?P<second_end>\d{2})_"
r"(?P<fcdr_type>[a-zA-Z]*)_"
r"v(?P<data_version>.+)_"
r"fv(?P<format_version>.+)\.nc")
My file-finder uses the regular expression, but the writing part uses the template. There is a duplication here, ideally one should only need one.
Dear @gerritholl ,
I had a deeper look into your modules (especially tovs.py and models.py) and I think it might be surprisingly easy to use them with my Dataset class. Theoretically, it should already work if the reader classes (HIRS, HIRSPOD, IASIEPS, IASISub, HIASI, HIRSHIRS, and ERAInterim) inherit from
typhon.spareice.FileHandler and rename their _read() method to read(). The same would apply for NetCDFDataset. The spareice.Dataset class should provide the same features as your Dataset class now (as far as I can see), except from the combine() method. Since combine() is data-dependent
(it changes depending on whether you use xarray, ndarrays or any other object), I would not like to implement it directly into the Dataset class. Maybe an extra composition class named DataHandler (analog to FileHandler) could be an idea. It could also provide a concatenate and merge function for
the data.
I suppose so! I would love to try it out but right now I really don't have the time.