icesat2py/icepyx

Allow generating variables wanted from a list of h5 paths

Opened this issue · 4 comments

I'm trying to subset ATL03 granules for a list of variables that I have as a list of H5 paths

['gt1l/heights/delta_time',
 'gt1l/heights/dist_ph_across',
 'gt1l/heights/dist_ph_along',
 'gt1l/heights/h_ph',
 'gt1l/heights/lat_ph',
 'gt1l/heights/lon_ph',
 'gt1l/heights/pce_mframe_cnt',
 'gt1l/heights/ph_id_channel',
 'gt1l/heights/ph_id_pulse',
 'gt1l/heights/quality_ph',
 'gt1l/heights/signal_conf_ph',
 'gt1l/geolocation/delta_time',
 'gt1l/geolocation/ph_index_beg',
 'gt1l/geolocation/podppd_flag',
 'gt1l/geolocation/segment_dist_x',
 'gt1l/geolocation/segment_id',
 'gt1l/geolocation/segment_length',
 'gt1l/geolocation/segment_ph_cnt',
 'gt1l/geolocation/sigma_across',
 'gt1l/geolocation/sigma_along',
 'gt1l/geolocation/sigma_h',
 'gt1l/geolocation/surf_type',
 'gt1l/geolocation/velocity_sc']

However, I can not find an option in Query.order_vars.append to do this. I would expect to be able to do something like;

region_a.order_vars.append(var_list=['gt1l/heights/delta_time', 'gt1l/heights/dist_ph_across', 'gt1l/heights/dist_ph_along',...])

Am I missing something?

I've tried

region_a.order_vars.append(region_a.order_vars.parse_var_list(variable_list))

But this creates a long list of variables that I did not specify.

Hey @andypbarrett!

Am I missing something?

Short answer: No.

Now for the long answer. When we built it, the goal of the variables module was for it to construct a list like your example one of h5 paths, such that the user could have some control over selecting beams/groups/variables without actually needing to know/use h5py, have a granule to start with, or know anything about how data was grouped in hdf5 files. Our goal was/is to eliminate the hard coded lists we see in code examples (so I'm curious how you generated your list of h5 paths - did you manually pick them out? I know we have a HW example where it's based on number of layers in the path. More on this below.)

That said, I think you present a fairly common use case where the user might already have a list they want to use to subset. The hack approach to doing what you're asking (presuming you're trying to use it to order subsetted data) would be:

vardict, _ = region_a.order_vars.parse_var_list(variable_list)
region_a.order_vars.wanted = vardict

This will blow up at some point (likely during order processing) if you have a typo or submit an invalid h5 path, because it completely bypasses all of the built in checks for valid h5 paths.

A slightly longer approach that would go through the validation checks (if it doesn't get caught at .parse_var_list) is:

_, pathlists = region_a.order_vars.parse_var_list(variable_list, tiered_vars=True)
beam, keyword, varlist = pathlists[:]
region_a.order_vars.append(beam_list=list(set(beam)), keyword_list=list(set(keyword)), var_list=varlist)

It goes without saying neither of these is probably a great direction to send users. Which gets us back to my earlier question about how you generated the list in the first place and then to what the typical use cases are we should add some new functionality around. Two options I can think of right away would be (1) allow the user to submit their own list of h5 paths, essentially providing a wrapper for the aforementioned "hacks"; (2) find a way to give the user more options for controlling their list during the "build" process (.append) and editing it once it's generated so they don't need to submit a custom list. Either way I'm thinking this provides motivation for giving .wanted a getter+setter.

I know you've spent a lot of time creating h5 path lists for various products/examples, so I'm excited to hear your input and ideas for how to make the variables module more versatile and user-friendly.

I generated that list by hand based on common ATL03 variables used in producing ATL06 and ATL07. We are trying to produce a stripped-down h5 file for testing. But I think a list is a common enough use case.

I wanted to avoid having to play around with var_list, beam_list, and keyword_list to come up with the logic to generate that list.

The "hack" you present looks like what I want to do. So I propose adding a path_list or variable_path_list or dataset_list keyword arg along with logic to check that only dataset_list can be set. And then add

        if path_list:
            _, (beams, keywords, variables) = self.parse_var_list(path_list, tiered_vars=True)
            var_list = list(set(variables))
            beam_list = list(set(beams))
            keyword_list = list(set(beams))

These then get passed to check_valid_lists in append.

This works but I get some extra variables returned. There also does not seem to be a default list for ATL03 - which is understandable. I don't know how I would generate that.

I also suggest changing the assert statement at the beginning of append to if <conditions>:. My understanding is that assert should not be used in this way because assert can be disabled and it returns an AssertionError rather than a meaningful exception, which should be a ValueError in this case.

PR in progress!

But I think a list is a common enough use case.

Agreed, and you shouldn't have to mess around with trying to bend the logic to your will.

This works but I get some extra variables returned.

Hmmm... which ones? I'd be curious to look into this.

There also does not seem to be a default list for ATL03 - which is understandable. I don't know how I would generate that.

There are not default lists for any products that someone didn't contribute a list for (or directly share a list with me). If you have an inkling that there is some default list we could create (with the acknowledgement it will not be perfect), I'd be happy to add it.

PR in progress!

Woohoo - thanks! All the other items you mentioned sound great - assert was the first "internal check" I learned how to do before learning how to raise errors, so happy to have a best-practice update on that.

The extra variables were a result of a typo. It is fixed now and a PR created.

For default vars for ATL03 delta_time, latitude, longitude and photon height from the Heights group seem like a reasonable start.

I might need to touch-base on some linting issues. I've added you and Rachel as reviewers.