FeatureTypes and Array Representations: Strategy for Tackling #845

This issue is meant as a discussion/inquiry.

Investigating recent issue #845 led me to look closely at the featureType discovery code in the Compliance Checker (CC). Currently, the checker tries to classify each variable as one of several fine-grained feature types:

compliance-checker/compliance_checker/cfutil.py

Lines 1715 to 1765 in e49265d

    
    def guess_feature_type(nc, variable):
        """
        Returns a string describing the feature type for this variable

        :param netCDF4.Dataset nc: An open netCDF dataset
        :param str variable: name of the variable to check
        """
        if is_point(nc, variable):
            return "point"
        if is_timeseries(nc, variable):
            return "timeseries"
        if is_multi_timeseries_orthogonal(nc, variable):
            return "multi-timeseries-orthogonal"
        if is_multi_timeseries_incomplete(nc, variable):
            return "multi-timeseries-incomplete"
        if is_cf_trajectory(nc, variable):
            return "cf-trajectory"
        if is_single_trajectory(nc, variable):
            return "single-trajectory"
        if is_profile_orthogonal(nc, variable):
            return "profile-orthogonal"
        if is_profile_incomplete(nc, variable):
            return "profile-incomplete"
        if is_timeseries_profile_single_station(nc, variable):
            return "timeseries-profile-single-station"
        if is_timeseries_profile_multi_station(nc, variable):
            return "timeseries-profile-multi-station"
        if is_timeseries_profile_single_ortho_time(nc, variable):
            return "timeseries-profile-single-ortho-time"
        if is_timeseries_profile_multi_ortho_time(nc, variable):
            return "timeseries-profile-multi-ortho-time"
        if is_timeseries_profile_ortho_depth(nc, variable):
            return "timeseries-profile-ortho-depth"
        if is_timeseries_profile_incomplete(nc, variable):
            return "timeseries-profile-incomplete"
        if is_trajectory_profile_orthogonal(nc, variable):
            return "trajectory-profile-orthogonal"
        if is_trajectory_profile_incomplete(nc, variable):
            return "trajectory-profile-incomplete"
        if is_2d_regular_grid(nc, variable):
            return "2d-regular-grid"
        if is_2d_static_grid(nc, variable):
            return "2d-static-grid"
        if is_3d_regular_grid(nc, variable):
            return "3d-regular-grid"
        if is_3d_static_grid(nc, variable):
            return "3d-static-grid"
        if is_mapped_grid(nc, variable):
            return "mapped-grid"
        if is_reduced_grid(nc, variable):
            return "reduced-grid"

However, since CF-1.6, there are only six feature types:

point
timeSeries
profile
timeSeriesProfile
trajectory
trajectoryProfile

It seems that the feature types have become entangled with the grid mappings and specifications of section 5.

My question: is it possible to pare down the featureType checks to only the six specified? While doing so, would that facilitate an easier way to deal with the actual array representation checks?
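
To make the question concrete, here is a minimal sketch of what collapsing the existing labels onto the six CF featureTypes might look like. The prefix-to-featureType mapping and the `collapse_label` helper are assumptions inferred purely from the label names above, not something taken from the checker, and the grid labels are treated as "not a featureType" by returning `None`.

```python
# The six featureTypes defined by CF since version 1.6.
CF_FEATURE_TYPES = (
    "point", "timeSeries", "profile",
    "timeSeriesProfile", "trajectory", "trajectoryProfile",
)

# Assumed mapping, inferred from the label names only (hypothetical).
# Longer prefixes come first so "timeseries-profile" wins over "timeseries".
_PREFIX_TO_FEATURE_TYPE = (
    ("timeseries-profile", "timeSeriesProfile"),
    ("trajectory-profile", "trajectoryProfile"),
    ("multi-timeseries", "timeSeries"),
    ("timeseries", "timeSeries"),
    ("cf-trajectory", "trajectory"),
    ("single-trajectory", "trajectory"),
    ("profile", "profile"),
    ("point", "point"),
)


def collapse_label(label):
    """Map a fine-grained checker label to a CF featureType, or None for grids."""
    for prefix, feature_type in _PREFIX_TO_FEATURE_TYPE:
        if label.startswith(prefix):
            return feature_type
    # Grid labels such as "2d-regular-grid" are not discrete sampling geometries.
    return None


if __name__ == "__main__":
    for label in ("multi-timeseries-incomplete", "cf-trajectory", "mapped-grid"):
        print(label, "->", collapse_label(label))
```

If a mapping like this holds up, the array representation (orthogonal, incomplete, ragged) could be reported by a separate routine instead of being baked into the label.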

Re-reading Appendix H, I found examples of how each featureType can be represented:

point: degenerate case of all four array representations
timeSeries: all four representations
profile: all four representations
trajectory: multidimensional -- if the number of trajectories is the same for each station this would then be "orthogonal multidimensional", otherwise "incomplete"; both ragged array representations also valid
timeSeriesProfile: orthogonal multidimensional only if same number of times for each feature and same number of elements per profile feature, otherwise incomplete multidimensional; contiguous ragged array for profiles and indexed ragged array for organizing profiles into time series
trajectoryProfile: orthogonal multidimensional if the same number of trajectories per station and same number of depths per profile, otherwise incomplete multidimensional; contiguous ragged array for profiles and indexed ragged array for organizing profiles along trajectories (the profile data is written all at once, and multiple trajectories are being streamed in one after the other)

Since all six featureType classes can be expressed as all four array representations (I think the featureType necessitates the type of representation, right? Not the other way around?) I believe it's possible to thoroughly disambiguate and disentangle the grid mappings, array representations, and featureType discovery and create new, independent routines for each.
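
As a rough illustration of what independent routines could look like, the sketch below keeps featureType discovery and array-representation discovery apart. It works on plain dicts rather than a `netCDF4.Dataset` so it stays self-contained; the function names and the dict-based inputs are hypothetical. It relies only on the CF `featureType` global attribute (case-insensitive per the conventions) and on the `sample_dimension` and `instance_dimension` attributes that CF uses to mark count and index variables in the ragged representations.

```python
# Canonical spellings of the six CF featureTypes, keyed by lowercase form
# because CF treats the attribute value as case-insensitive.
_CANONICAL = {
    name.lower(): name
    for name in (
        "point", "timeSeries", "profile",
        "timeSeriesProfile", "trajectory", "trajectoryProfile",
    )
}


def declared_feature_type(global_attrs):
    """Return the canonical featureType declared by the file, or None."""
    value = global_attrs.get("featureType")
    if not isinstance(value, str):
        return None
    return _CANONICAL.get(value.strip().lower())


def array_representation(variable_attrs):
    """Classify the array representation from variable attributes alone.

    variable_attrs maps variable name -> dict of attributes (a hypothetical
    stand-in for walking nc.variables and reading each variable's attributes).
    """
    attrs = list(variable_attrs.values())
    has_count = any("sample_dimension" in a for a in attrs)
    has_index = any("instance_dimension" in a for a in attrs)
    if has_count and has_index:
        return "contiguous+indexed ragged"
    if has_count:
        return "contiguous ragged"
    if has_index:
        return "indexed ragged"
    # Telling orthogonal from incomplete needs the coordinate variables'
    # dimensions, which this sketch deliberately leaves out.
    return "multidimensional"
```

Neither function knows about the other, and neither touches grid mappings, which is the kind of separation proposed here.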

Thoughts? @benjwadams

cc @mwengren

> I believe it's possible to thoroughly disambiguate and disentangle the grid mappings, array representations, and featureType discovery and create new, independent routines for each.

👍 Thanks @daltonkell This seems like a good idea to me!

Just a bit of confusion I noticed while trying to get my head around these feature types:

> trajectory: multidimensional -- if the number of trajectories is the same for each station this would then be "orthogonal multidimensional", otherwise "incomplete"

The concept of stations (inherently fixed points in space) isn't really relevant for trajectories. I think the only way you could use an "orthogonal" representation here is if a collection of trajectories were all sampled at the exact same timestamps (so the obs dimension can be called time instead - similar to the orthogonal rep for timeseries). I guess this would be rare, but technically possible.

> The concept of stations (inherently fixed points in space) isn't really relevant for trajectories. I think the only way you could use an "orthogonal" representation here is if a collection of trajectories were all sampled at the exact same timestamps (so the obs dimension can be called time instead - similar to the orthogonal rep for timeseries). I guess this would be rare, but technically possible.

You're right @mhidas, I think I duplicated timeseries when writing this. From the spec: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#_multidimensional_array_representation_of_trajectories

> When storing multiple trajectories in the same file, and the number of elements in each trajectory is the same, one can use the multidimensional array representation. This representation also allows one to have a variable number of elements in different trajectories, at the cost of some wasted space. In that case, any unused elements of the data and auxiliary coordinate variables must contain missing data values (section 9.6).
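
The padding described in that passage can be sketched in a few lines. This is only an illustration under stated assumptions: trajectories are plain Python lists, `None` stands in for the file's missing data value, and both helper names are hypothetical.

```python
def pad_trajectories(trajectories, fill=None):
    """Pad variable-length trajectories into a rectangular list of lists.

    Unused trailing elements are set to `fill`, mirroring the missing data
    values the multidimensional representation requires.
    """
    longest = max((len(t) for t in trajectories), default=0)
    return [list(t) + [fill] * (longest - len(t)) for t in trajectories]


def needs_padding(trajectories):
    """Return True when the trajectories do not all have the same length."""
    return len({len(t) for t in trajectories}) > 1


if __name__ == "__main__":
    tracks = [[10.0, 10.5, 11.0], [20.0]]
    print(needs_padding(tracks))     # the two tracks differ in length
    print(pad_trajectories(tracks))  # the second track is padded with None
```

When `needs_padding` is false the array is already rectangular with no wasted space; otherwise the padded form is the "variable number of elements" case from the quoted text.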

CC @mylesmc123

If I had to make an educated guess, I would say that the remainder of the feature types probably came from the NOAA NCEI templates here: https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/ .

Closing after merge of #858.
