ioos/gsoc

Subsetting tool

Opened this issue · 31 comments

Project Description

This project aims to develop use case-specific enhancements to the xarray-subset-grid library to improve functionality and support for STOFS data subsetting (https://registry.opendata.aws/noaa-gestofs/), including:

  • Test and evaluate the performance of xarray-subset-grid against NOAA/NODD STOFS data. Develop demonstration notebooks and other documentation illustrating patterns for subsetting and accessing STOFS and other publicly available NOS ocean model data.
  • Create a service that operates via email to run subsetting jobs on demand and return STOFS model data that is much easier to access with low bandwidth.
  • Create a portal to view, subset, and download STOFS data similar to the COOPS portal but using the new xarray-based package to work against the NODD STOFS kerchunked data.
  • Integrate xarray-subset-grid with existing packages in use by the ocean modeling community (Thalassa, etc).

This project provides an opportunity to enhance accessibility to ocean water level forecast data, which is crucial across different sectors. Here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data.

Expected Outcomes

Improved open-source code for efficiently subsetting STOFS model output with a demonstrated use-case example (STOFS data).

Skills required

Python; libraries: Xarray, Dask, Zarr; cloud storage

Mentor(s)

Atieh Alipour (@AtiehAlipour-NOAA), Chris Barker (@ChrisBarker-NOAA), Soroosh Mani (@SorooshMani-NOAA)

Expected Project Size

350

What is the difficulty of the project?

Intermediate

Thanks for the project idea contribution @SorooshMani-NOAA!

From my naive viewpoint, this sounds similar to an existing NOAA HPCC-funded project being led by OR&R/Chris Barker and IOOS' RPS partners that is currently underway.

@ChrisBarker-NOAA @mpiannucci @jonmjoyce How much overlap do you see between this proposed package and the work you're doing? If there is, can we consolidate efforts if one of you would be willing to mentor a student for this work (in addition to @SorooshMani-NOAA and @AtiehAlipour-NOAA) during this year's GSoC?

If so, the project could be scoped more narrowly to a particular piece of functionality you could use help with.

Another thing that would help in accepting this project for GSoC is an existing code base to cite and build off of. I think the RPS folks already have this code if my impression of the similarity is correct, but I don't know where to point to.

@mwengren thanks so much for the feedback. I think this project is linked to #42. We are working on developing a package for subsetting STOFS model output. At this point, the code is not ready to be shared. I am in the process of testing different packages, and we were thinking that with GSoC, we can have some help for code improvement. Do you think defining the project more narrowly for STOFS model output would help? Or do you think with such overlaps with other projects, it is possible to increase that project size so that we can cover subsetting STOFS model outputs as well? Thanks!

@mwengren Thanks for tagging. The package we are developing is scoped to allow subsetting in space for UGRID and SGRID datasets to start. STOFS, I believe would fall within this scope as an unstructured model as long as the metadata is cf compliant for the mesh topology.

Beyond that, we will also be deploying a cloud service to subset the data in the cloud directly from NODD.

There is certainly a lot of overlap between the two ideas

@mpiannucci Is there any public component of the code you're working on we could link to here?

That would give @AtiehAlipour-NOAA a reference to look at to understand if that could be leveraged for their proposed work or not.

I don't want to say no to this idea, but I've already set a precedent that projects need to have existing public code to start from in order to be included. So I think we'll need some sort of even initial code to be used as a reference for this idea to go forward.

The easiest way to meet that would be for this project to build off of the HPC project code, extending it to the STOFS use case, if appropriate.

Or, we need to go back and accept some proposals I've declined previously to be fair, which we could do.

Open to suggestions from our GSoC community on that. Thanks!

@mwengren can you point us to the other project? We could merge the two if possible. Is it #42 or #45? Or are you suggesting that we point to the repo of a relevant effort that is not in GSoC, so that you can accept our proposed project?

Do you think defining the project more narrowly for STOFS model output would help?

I think that's the opposite of the right thing to do -- there have been a LOT of false starts and one-off codes in this space, and here we are now talking about at least two independent efforts that are duplicating each other and previous codes.

Or do you think with such overlaps with other projects, it is possible to increase that project size so that we can cover subsetting STOFS model outputs as well?

First -- yes. But I don't think that's the right way to frame it -- rather, the goal is a framework that could be used with any gridded model results that conform to existing standards: CF, UGRID, SGRID. So getting it to work with STOFS should be trivial, once the framework is in place.

The package we are developing is scoped to allow subsetting in space for UGRID and SGRID datasets to start. STOFS, I believe would fall within this scope as an unstructured model as long as the metadata is cf compliant for the mesh topology.

And if it's not compliant (it probably isn't), the API should have a way to massage it to be usable. That's pretty key, actually. As far as I've seen, NONE of the operational models provide fully standards-compliant output.

That being said, nothing is known to work until it's been tested -- so "getting the existing code to work with STOFS" is a fine goal -- it could be trivial, or maybe the code will need to be extended or refactored a bit, which could make a good GSoC project.
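To make "massage it to be usable" concrete, here's a rough sketch (the variable and attribute names are made up for illustration, not taken from STOFS) of injecting a minimal UGRID mesh-topology description into an otherwise plain xarray dataset, so that tools looking for `cf_role="mesh_topology"` can recognize the grid:

```python
import numpy as np
import xarray as xr

# A toy unstructured dataset with no UGRID metadata: 4 nodes, 2 triangles.
ds = xr.Dataset(
    {
        "zeta": (("time", "node"), np.zeros((3, 4))),  # water elevation
        "element": (("nele", "nvertex"), np.array([[0, 1, 2], [1, 3, 2]])),
    },
    coords={
        "lon": ("node", np.array([0.0, 1.0, 0.0, 1.0])),
        "lat": ("node", np.array([0.0, 0.0, 1.0, 1.0])),
    },
)

# Inject a UGRID-style mesh topology variable so tools that look for
# cf_role="mesh_topology" can recognize the grid.
ds["mesh"] = xr.DataArray(
    0,
    attrs={
        "cf_role": "mesh_topology",
        "topology_dimension": 2,
        "node_coordinates": "lon lat",
        "face_node_connectivity": "element",
    },
)
ds["element"].attrs["start_index"] = 0  # connectivity is zero-based

mesh_vars = [v for v in ds.data_vars
             if ds[v].attrs.get("cf_role") == "mesh_topology"]
print(mesh_vars)  # ['mesh']
```

The point is that this fix-up can be a few attribute assignments applied at open time, rather than rewriting the files themselves.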

@AtiehAlipour-NOAA wrote:

We are working on developing a package for subsetting STOFS model output. At this point, the code is not ready to be shared.

I don't think there's such a thing as "not ready to be shared" -- if this is going to be a community project, which I hope it will be, getting feedback early is better than later.

And there's the effort @mpiannucci referred to - it's silly to have these as independent efforts.

This part is the key btw:

And if it's not compliant (it probably isn't), the API should have a way to massage it to be usable. That's pretty key, actually. As far as I've seen, NONE of the operational models provide fully standards-compliant output.

We are injecting the cf compliant ugrid metadata into NOS models with kerchunk and pushing them to NODD to facilitate this workflow

Not the place to discuss this, but:

We are injecting the cf compliant ugrid metadata into NOS models with kerchunk and pushing them to NODD to facilitate this workflow

Nice! and certainly applicable to STOFS.

Though it's a bit less useful for using the same code outside the NODD (or similar Cloud systems) -- I'd love to have a library I can point at a pile of netcdf files on my machine without having to massage them first. (though perhaps having to provide some declarative data about how to interpret them)

-CHB

Sorry didn't mean to cross wires, it is applicable to STOFS so thought I would mention.

@mwengren can you point us to the other project? We could merge the two if possible. Is it #42 or #45? Or are you suggesting that we point to the repo of a relevant effort that is not in GSoC, so that you can accept our proposed project?

@SorooshMani-NOAA No the code isn't part of an existing GSoC project, it's being developed as part of a separate effort at NOS (that @ChrisBarker-NOAA and @mpiannucci are involved with). We'll post that reference here as soon as we have it.

I think this would be a useful complement to that. On the assumption that the existing code will be up soon and that this project can be based on it, I think we should go ahead and accept this so that potential students have time to review and apply before April 4. They will need code to base their applications on, so if for some reason we can't make that available, we may have to pull this project again.

In any case, I see a lot of ideas for modifying the initial project in the comments above. Please update the initial issue comment with whatever changes you decide to make to the scope, so it's clear to applicants what the expectation is in one location. If you want to change the project size to something other than 175 hours, that is ok as well.

Also, @ChrisBarker-NOAA mentioned he might be interested in being a co-mentor as well, I think. Please add any additional mentor(s) you want to include in the first comment.

@AtiehAlipour-NOAA and I will update the description and add @ChrisBarker-NOAA as co-mentor. Please add the link as soon as you have it, thanks!

I don't think there's such a thing as "not ready to be shared" -- if this is going to be a community project, which I hope it will be, getting feedback early is better than later.

And there's the effort @mpiannucci referred to - it's silly to have these as independent efforts.

@ChrisBarker-NOAA as far as I understand, we're currently still in the exploration phase for this subsetting effort, i.e. we're trying out different ways of subsetting, or different tools we know about, to see how fast they are, etc. I'm not sure there's an actual packaged code from our efforts yet, and that's what @AtiehAlipour-NOAA meant, I believe.

There's probably a single script file that we can share, but I don't believe it's in a shape or form to have a repo of its own. @AtiehAlipour-NOAA please correct me if I'm wrong.

With that being said, we'd be happy to share anything we've tried so far with you or the contributors as a starting point. It's great that you are already developing a package, and even if nothing comes out of this project, we at least know about your effort and can learn from you and contribute to your code base.

@SorooshMani-NOAA, thank you for the response. That is absolutely correct. @ChrisBarker-NOAA, thank you for the comments and feedback; we are not developing a new subsetting tool; instead, we want to use publicly available packages to subset STOFS model output. So far, we have been using different available packages and tested their performance as the subsetting tool in different scripts. Since there is no conclusion yet and it involves testing different packages, it is not in a state to have its own repository, as @SorooshMani-NOAA mentioned. However, we would be happy to share those codes with you and the future contributors. Thanks again for your feedback and help. We are very excited to learn that you are already developing a package, and we look forward to learning more from you in the future.

Hello mentors,
I am Omkar, a Computer Science undergraduate from India. I am interested in contributing to this project. I had a few inquiries.

  1. I've read the comments above and have understood that you have been testing different packages and their performance. Is it possible for you to share the base codes as a starting point?

  2. In the model output (https://nomads.ncep.noaa.gov/pub/data/nccf/com/stofs/prod/), there are 4 types - 2d Global, 2d Atlantic, 3d Global and 3d Atlantic. Which type is this project involved with?

  3. Does this project involve the use of the libraries csdllib or autoval? If so, is there any documentation for it?
    If not, I think it requires using open-source libraries like xarray, netcdf4 or pynio, right?

Edit :
In my application, I want to be transparent about my current familiarity. While I don't have experience with oceanographic data specifically, I do have experience with data modelling, data visualization, and Python and its libraries in general. I am genuinely eager to learn and contribute to this project.
I am sharing my Resume for your reference.

Thank you,
Omkar

@AtiehAlipour-NOAA wrote:

So far, we have been using different available packages and tested their performance as the subsetting tool in different scripts. Since there is no conclusion yet and it involves testing different packages, it is not in a state to have its repository,

Got it, thanks. However, even if there's no new code, the information about what's been tried, and how it's worked, could be really helpful to all -- and particularly to this project -- it would be great to put it somewhere it can be shared.

Hello mentors, I am Omkar, a Computer Science undergraduate from India. I am interested in contributing to this project. I had a few inquiries.

  1. I've read the comments above and have understood that you have been testing different packages and their performance. Is it possible for you to share the base codes as a starting point?
  2. In the model output (https://nomads.ncep.noaa.gov/pub/data/nccf/com/stofs/prod/), there are 4 types - 2d Global, 2d Atlantic, 3d Global and 3d Atlantic. Which type is this project involved with?
  3. Does this project involve the use of the libraries csdllib or autoval? If so, is there any documentation for it?
    If not, I think it requires using open-source libraries like xarray, netcdf4 or pynio, right?

Edit: In my application, I want to be transparent about my current familiarity. While I don't have experience with oceanographic data specifically, I do have experience with data modelling, data visualization, and Python and its libraries in general. I am genuinely eager to learn and contribute to this project. I am sharing my Resume for your reference.

Thank you, Omkar

Hello Omkar,

Thank you for expressing interest in our project. We're glad you're excited to contribute. Below are the answers to your inquiries:

Base Codes for Starting Point: Certainly, here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data. The codes use XUgrid and Thalassa packages, designed to work with 2D unstructured grids. We are currently working on enhancing the code with various formats and tools like ZARR and DASK. We are also exploring options such as transposing datasets and using Kerchunk.

Project Focus: At the moment, our project focuses on STOFS-2D-Global. However, we have plans to expand our framework to include other types in the future.

Libraries and Documentation: We do not currently use csdllib or autoval packages in our project. These packages are typically used for post-processing and model evaluation within the STOFS framework. For our development, we rely on open-source libraries like xarray and netcdf4.

Regarding your application, it's great to hear about your background in data modeling, visualization, Python, and its libraries. We encourage you to highlight your skills and experiences in your application.

Please let us know if you have any further questions or need additional information.

@AtiehAlipour-NOAA wrote:

So far, we have been using different available packages and tested their performance as the subsetting tool in different scripts. Since there is no conclusion yet and it involves testing different packages, it is not in a state to have its repository,

Got it, thanks. However, even if there's no new code, the information about what's been tried, and how it's worked, could be really helpful to all -- and particularly to this project -- it would be great to put it somewhere it can be shared.

@ChrisBarker-NOAA, You're absolutely right. Here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data. These examples use the XUgrid and Thalassa packages. Please feel free to reach out if you have any questions. Thank you once again for your valuable contribution to this project.

@AtiehAlipour-NOAA
Thank you for your response.
I've tried out the base codes of subsetting using Thalassa and Xugrid.

  1. I noticed that the code is heavily dependent on the hardware used.
    The Fields dataset was too big so I tried with the max elevation dataset.
    On my laptop CPU, it took 1145.45sec to run the whole file, whereas on Colab GPU, it took 185.22 seconds, almost 6x faster.

  2. I've gone through the Thalassa API docs and it says that 'nvel' variable is dropped automatically. So is there a reason for specifying it again, or is it just a safety check?

  3. On what basis is the subset box chosen for subsetting the dataset?

  • Create a service that operates via email to run subsetting jobs on demand and return STOFS model data that is much easier to access with low bandwidth.

  4. Would web frameworks like Flask or FastAPI suffice for the service?

  • Integrate xarray-subset-grid with existing packages in use by the ocean modeling community (Thalassa, etc).

I'll be going through the xarray-subset-grid notebooks and trying it out myself, maybe use it with STOFS. I'll update my progress soon. Thank you!

@omkar-334, great work!

1- That's a great observation. We are glad you tested it on Colab and noticed the significant difference in performance.

2- Good catch. You can disregard that part; there's no need to drop it for the Thalassa package.

3- You're free to define any subset box; the example provided is just a starting point. In the future, we aim to use any polygon/shapefile for data subsetting.

4- Let's discuss this further as a group, based on the specific features we want to implement.

I'll be going through the xarray-subset-grid notebooks and trying it out myself, maybe use it with STOFS. I'll update my progress soon. Thank you!

Wonderful! If you have any questions, feel free to ask.

Thank you once again for your dedication and hard work.

@AtiehAlipour-NOAA Thank you for the feedback!

  1. I've looked into the implementation of Thalassa and api.open_dataset and I've run benchmarks of the base implementation versus your tests. The base implementation is almost 2x faster than Thalassa test 1 and almost 5x faster than Xugrid test 1.
    Colab Link STOFS Benchmarks

  2. Also, thalassa's api.open_dataset method is just a wrapper around the xarray method. So, using the xarray method, dropping variables and then normalizing the dataset comes out to be faster than thalassa's method.

  3. I've read that Xarray.open_dataset can sometimes crash when loading HDF5 files and a workaround can be loading the dataset through H5py. Has this ever happened to you?
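For reference, a minimal sketch of that plain-xarray path from point 2, using a toy in-memory dataset with hypothetical variable names (a real file would be opened with `xr.open_dataset(path, drop_variables=[...])` to the same effect):

```python
import numpy as np
import xarray as xr

# Toy stand-in for a STOFS output file already opened with xarray
# (variable names here are made up for illustration).
ds = xr.Dataset(
    {
        "zeta": ("node", np.random.rand(10)),
        "nvel": ("node", np.random.rand(10)),   # variables we don't need
        "uwind": ("node", np.random.rand(10)),
    }
)

# Plain xarray: drop unused variables up front instead of going through
# a wrapper that normalizes the whole dataset first.
slim = ds.drop_vars(["nvel", "uwind"], errors="ignore")
print(list(slim.data_vars))  # ['zeta']
```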

3- You're free to define any subset box; the example provided is just a starting point. In the future, we aim to use any polygon/shapefile for data subsetting.

  1. Is the project going to involve subsetting/selecting over other dimensions or variables (for example, depth, elevation, etc.), and plotting too?

  2. I had tried to set up Xarray-subset-grid, but it had a minor error, which has now been fixed. Since the package is a work in progress, I assume this project will include contributing to it, right?

  3. In the example notebooks, only 'zarr' datasets have been implemented. Are there plans of migrating from HDF5/netCDF4 to zarr? I've read up online and the consensus seems to be divided over which is faster and better. What is your opinion?

  4. I've tried the method given in the Xarray-subset-grid example notebooks to subset our original data.
    When I mention ds along with any parameter like x or y -> ds_subset = sgrid.subset_polygon(ds[['x','y']], polygon)
    It gives me this error - Dataset.cf does not understand the key 'mesh_topology'

When I mention only ds without any parameter -> ds_subset = sgrid.subset_polygon(ds, polygon)
It gives me this error - AttributeError: 'DataArray' object has no attribute 'face_face_connectivity'

I've noticed that 'dataset.cf' returns Grid Mapping and Bounds both as N/A.
Is this the reason why this method doesn't work on our dataset?

Thank you very much for your patience and for answering my questions.

@omkar-334 it's really encouraging to see that you are already spending time on figuring out how to improve the sample codes provided. However I'd suggest that you don't share all of your findings on this ticket and instead email it to Atieh or me (Atieh.Alipour@noaa.gov, Soroosh.Mani@noaa.gov).

Another thing I'd like to point out is that the xarray-subset-grid is still a work in progress, which will be more in shape by the time we get to the start date of GSoC contributions. I encourage that you note down all your findings for yourself (feel free to email us and ask questions too) but use this GitHub ticket to ask about more generic aspects of the project (like in your previous comments)

To answer a couple of your questions:

I've looked into the implementation of Thalassa and api.open_dataset and I've run benchmarks of the base implementation versus your tests. The base implementation is almost 2x faster than Thalassa test 1 and almost 5x faster than Xugrid test 1.
Colab Link STOFS Benchmarks
Also, thalassa's api.open_dataset method is just a wrapper around the xarray method. So, using the xarray method, dropping variables and then normalizing the dataset comes out to be faster than thalassa's method.
I've read that Xarray.open_dataset can sometimes crash when loading HDF5 files and a workaround can be loading the dataset through H5py. Has this ever happened to you?

  • These are some specifics that I'd encourage you to take a note of, but ask about in email

Is the project going to involve subsetting/selecting over other dimensions or variables (for example, depth, elevation, etc.), and plotting too?

  • The subsetting only happens in location and time (lon, lat, time)

In the example notebooks, only 'zarr' datasets have been implemented. Are there plans of migrating from HDF5/netCDF4 to zarr? I've read up online and the consensus seems to be divided over which is faster and better. What is your opinion?

  • The idea is to move towards the Zarr format, either by saving new data to Zarr, or by using virtual Zarr files built from the HDF/NC metadata (see kerchunk)
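For context, a kerchunk reference file is just JSON that maps virtual Zarr chunk keys to byte ranges in the original HDF5/netCDF file. A hand-written, purely illustrative example (the variable name, path, offset, and length below are all made up):

```json
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "zeta/.zarray": "{\"chunks\": [1, 100], \"compressor\": null, \"dtype\": \"<f4\", \"fill_value\": \"NaN\", \"filters\": null, \"order\": \"C\", \"shape\": [24, 100], \"zarr_format\": 2}",
    "zeta/0.0": ["s3://example-bucket/stofs_output.nc", 20480, 400]
  }
}
```

A reference like this can then be opened lazily through fsspec's reference filesystem with xarray's zarr engine, so only the byte ranges a subset actually touches get fetched, without copying or rewriting the original file.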

@SorooshMani-NOAA ,
Thank you for the feedback.
Noted, I will ask about specific details through email.

I'm running into a few errors while trying to select / subset the dataset.

  1. I've tried the method given in the Xarray-subset-grid example notebooks to subset our original data.
    When I mention ds along with any parameter like x or y -> ds_subset = sgrid.subset_polygon(ds[['x','y']], polygon)
    It gives me this error - Dataset.cf does not understand the key 'mesh_topology'

    When I mention only ds without any parameter -> ds_subset = sgrid.subset_polygon(ds, polygon) It gives me this error - AttributeError: 'DataArray' object has no attribute 'face_face_connectivity'
    I've noticed that 'dataset.cf' returns Grid Mapping and Bounds both as N/A. Is this the reason why this method doesn't work on our dataset?

  2. Another error - when I try to use the Dataset.sel method - ds.sel(x=90, method='nearest'), I'm getting this error:
    KeyError: "no index found for coordinate 'x'"
    This method works for time, though.
    Furthermore, I found these discussions in the xarray repo, which were similar, but confusing as well:
    pydata/xarray#4825 , pydata/xarray#6229
    Could you help me with this?

  3. The standard names for the coordinates - latitude and longitude - are given as x and y. The standard names for the axes X, Y, Z are given as None.
    How does this structure work?

Thank you!

@omkar-334 as I mentioned earlier, this package is still in development. The main developers have informed me that in its current state they don't expect it to just work out of the box, but by the start of GSoC it will be in a better state. Anything that you notice now that is not fixed by the start date will be part of your contribution for the GSoC project, and I and the other mentors will be more than happy to help the contributors fix it at that point! You can include what you find as bugs as part of your proposal.
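That said, for the `KeyError: "no index found for coordinate 'x'"` specifically: that is standard xarray behavior when a coordinate has no index built for it. A possible workaround, sketched on a toy dataset (this assumes a recent xarray that has `set_xindex`; I have not tried it against the real STOFS files):

```python
import numpy as np
import xarray as xr

# Toy unstructured-style dataset: 'x' is a coordinate on dim 'node',
# but not an indexed dimension coordinate, so ds.sel(x=...) fails.
ds = xr.Dataset(
    {"zeta": ("node", np.array([0.1, 0.2, 0.3]))},
    coords={"x": ("node", np.array([10.0, 50.0, 90.0]))},
)

try:
    ds.sel(x=90, method="nearest")
except KeyError as err:
    print(err)  # "no index found for coordinate 'x'"

# Building an explicit index over 'x' makes label-based selection work.
indexed = ds.set_xindex("x")
print(float(indexed.sel(x=88, method="nearest")["zeta"]))  # 0.3
```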

Hi @SorooshMani-NOAA
1.a The past few days I've delved into kerchunk and gone through #42 as well. I've used the SingleHdf5ToZarr method for virtually loading the NetCDF file as a Zarr dataset, and this got me wondering if we would need to combine multiple datasets into one zarr dataset.

1.b In the subsetting service mentioned in the project Description, would the zarr dataset references be generated each time or would we store them in case two users need to subset the same dataset? You've mentioned earlier that the idea is to move towards saving new data in Zarr format, so I think we could do with using virtual Zarr files of old data.

  2. Could you provide basic details about the subsetting-email-service and the portal, like the API service or the tech stack?

  3. I've also tried converting the xarray dataset into a UGrid dataset and this made some ds.sel operations easier.
    While plotting the dataset,
    ds.depth.plot() returns a plot of depth vs node whereas
    ds.depth.ugrid.plot() returns a plot of depth vs x , y .
    How does each node correspond to a latitude, longitude pair here?

  4. I've tried the Ugrid.subset_polygon method from xarray-subset-grid but I'm running into different errors with xarray, zarr and Ugrid datasets. I'll note them down and include them in my proposal.

  5. Regarding my proposal, are you able to see the submitted proposal before the deadline ends or should I just share it in a Google Docs format for your feedback?

Thank you!

@omkar-334

1.a ... if we would need to combine multiple datasets into one zarr dataset.

Please note that while #42 is mentioned in this ticket and is related to this project, it’s a whole separate GSoC project.
I’m not sure if I fully understand the context of your question. Are you asking about a specific set of netCDF files to be combined or are you asking in general?

One idea behind developing this tool is to make the model results more accessible by for example making it easier to download for those who have low internet bandwidth.

Sometimes model data is divided over multiple netCDF files. Suppose that someone needs a multi-day timeseries of modeled water elevation; usually this means the data is spread across multiple netCDF files produced by different simulation cycles. The result of this subsetting is then to combine the data from all those datasets into a single one (ideally Zarr). I hope that answers your question.
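Sketching that combine step with toy in-memory datasets standing in for two overlapping simulation cycles (all names and sizes are made up):

```python
import numpy as np
import xarray as xr

def cycle(start_hour):
    """Toy stand-in for one simulation cycle's output file: 6 hourly steps."""
    time = np.arange(start_hour, start_hour + 6)
    return xr.Dataset(
        {"zeta": (("time", "node"), np.random.rand(6, 4))},
        coords={"time": time},
    )

# Two overlapping cycles, as operational runs often are: the second cycle
# re-forecasts hours 3-5 with newer data.
c0, c1 = cycle(0), cycle(3)

# Merge into one continuous timeseries, preferring the newer cycle where the
# two overlap (combine_first keeps c1's values and fills the gaps from c0).
combined = c1.combine_first(c0)
print(combined.sizes["time"])  # 9  (hours 0-8)
```

The combined result could then be written out once with something like `combined.to_zarr(...)` so the user downloads a single, small dataset.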

1b. ... would we store them in case two users need to subset the same dataset? You've mentioned earlier that the idea is to move towards saving new data in Zarr format, so I think we could do with using virtual Zarr files of old data.

The Zarr metadata for a given file needs to be stored, otherwise the whole file needs to be downloaded every time to be able to then chunk it, which defeats the purpose.

In an ideal world, we’d be able to just use kerchunk and look into old netCDF results like a Zarr file optimally. But in reality there are a few things that need to be considered. Most important of all, the underlying chunking (binary blobs) of the data in the file needs to align with what is optimal when retrieving either a long timeseries for a single location or a single time step of a large area. These two cases are two sides of the same coin, but both are very valid use cases for different users. With that in mind, one might need to rechunk the original data before using kerchunk to generate virtual Zarr files.
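A back-of-the-envelope illustration of why the chunk layout matters (the dimensions and chunk shapes below are invented, not actual STOFS numbers):

```python
import math

# A hypothetical field: 240 hourly steps x 1,000,000 mesh nodes of float32.
ITEM = 4                      # bytes per float32 value
NT, NN = 240, 1_000_000

def bytes_read(chunk_t, chunk_n, want_t, want_n):
    """Bytes transferred to satisfy a request of want_t steps x want_n nodes,
    given chunks of chunk_t x chunk_n (whole chunks must be fetched)."""
    chunks = math.ceil(want_t / chunk_t) * math.ceil(want_n / chunk_n)
    return chunks * chunk_t * chunk_n * ITEM

# Chunking A: one full field per chunk -- great for maps, awful for timeseries.
print(bytes_read(1, NN, want_t=NT, want_n=1))   # 960,000,000 bytes for a 1-node series
print(bytes_read(1, NN, want_t=1, want_n=NN))   # 4,000,000 bytes for one map

# Chunking B: long time blocks of few nodes -- the opposite tradeoff.
print(bytes_read(NT, 10_000, want_t=NT, want_n=1))   # 9,600,000 bytes
print(bytes_read(NT, 10_000, want_t=1, want_n=NN))   # 960,000,000 bytes
```

Neither layout wins both access patterns, which is exactly why a rechunking step may be needed before generating the virtual Zarr references.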

2. ... Could you provide basic details about the subsetting-email-service and the portal, like the API service or the tech stack?

There are no hard requirements.

This could be a very basic portal with tools to define a region (as simple as lat/lon values or more sophisticated polygon drawing on a map) and then the submitted job runs the subsetting and finally the location of that subsetted dataset is emailed to the user to download.

It could also be that you ignore this requirement and provide the dataset to the user as a service, through packages like xpublish (https://xpublish.readthedocs.io/en/latest/index.html)
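Just to make the shape of such a service concrete (everything below is a hypothetical sketch, not a committed design), the core of a subsetting job is only a few lines of masking; the portal or email front end would be a thin wrapper that runs this and sends the user a download link:

```python
import numpy as np

def run_subset_job(lon, lat, data, bbox):
    """Hypothetical core of a subsetting job: given node coordinates, node
    data, and a (lon_min, lon_max, lat_min, lat_max) box, return the subset.
    A real service would instead subset an xarray dataset and write the
    result to Zarr/netCDF, then email the user its location."""
    lon_min, lon_max, lat_min, lat_max = bbox
    keep = (lon >= lon_min) & (lon <= lon_max) & (lat >= lat_min) & (lat <= lat_max)
    return lon[keep], lat[keep], data[..., keep]

# Toy 10-node "mesh" along the US east coast (made-up coordinates).
lon = np.linspace(-75.0, -66.0, 10)
lat = np.linspace(35.0, 44.0, 10)
zeta = np.random.rand(3, 10)          # time x node

sub_lon, sub_lat, sub_zeta = run_subset_job(lon, lat, zeta, (-70.0, -60.0, 40.0, 50.0))
print(sub_lon.size, sub_zeta.shape)   # 5 (3, 5)
```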

3. ... How does each node correspond to a latitude, longitude pair here?

If you’re not already, I’d suggest that you familiarize yourself with how an unstructured mesh is defined (don’t worry about generating one!)

One common way to represent an unstructured mesh is with two tables: a coordinates table and an elements table. The coordinates table gives the lat/lon location of each mesh node; the elements table lists which nodes are connected to form each element.
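A tiny made-up example of those two tables:

```python
import numpy as np

# Coordinates table: lon/lat of each mesh node (4 nodes).
nodes = np.array([
    [0.0, 0.0],   # node 0
    [1.0, 0.0],   # node 1
    [0.0, 1.0],   # node 2
    [1.0, 1.0],   # node 3
])

# Elements table: each row lists the node indices forming one triangle.
elements = np.array([
    [0, 1, 2],
    [1, 3, 2],
])

# Recover the lon/lat corners of element 1 by indexing the coordinates table.
corners = nodes[elements[1]]
print(corners.tolist())  # [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

This is also why a node-based plot shows depth vs. node index while a UGRID-aware plot shows depth vs. x, y: the connectivity table is what ties each node back to a position on the map.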

4. ... I'm running into different errors ... I'll note them down and include them in my proposal.

Thank you for documenting those errors. We will address them once the project begins.

5. ... Regarding my proposal, are you able to see the submitted proposal before the deadline ends or should I just share it in a Google Docs format for your feedback?

I don’t see your proposal in the list; I think I only see fully submitted proposals. Please share your proposal in Google Docs for our review so that we can share our feedback before your final submission. Thank you for all your hard work on this project. Please let us know if there is anything else we can help with.

@omkar-334 just FYI the xarray-subset-grid package works with any xarray dataset, not only zarr. See here for an example of the usage with netcdf. There may still be issues with the grid logic in the package as it is in its infancy, but working netcdf files should not have any issues.

@omkar-334 just FYI the xarray-subset-grid package works with any xarray dataset, not only zarr. See here for an example of the usage with netcdf. There may still be issues with the grid logic in the package as it is in its infancy, but working netcdf files should not have any issues.

You're right, it could be the grid logic that's causing the issues.
Thanks for the example code.
In the STOFS dataset that I'm trying out this code with, there already exists a mesh dimension.
ds.subset_grid.grid.name returns ugrid
from xarray_subset_grid.grids.ugrid import UGrid
ugrid = UGrid()
ugrid.recognize(ds) also returns True.
So I think this means that the ugrid topology is correctly mentioned in the STOFS Dataset.

However, when I use the subset_polygon method,
box = [(-70, -60),(40, 50)]
ds = ugrid.subset_polygon(ds, box)
this returns an error - ValueError: zero-size array to reduction operation maximum which has no identity
Is this because of the polygon that I have specified? Maybe no coordinates exist within that specific subset, or something similar?
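To check that hypothesis in isolation: numpy raises exactly this error when a reduction such as max runs over an empty selection, which is what would happen if no mesh nodes fell inside the box (the longitudes below are made up to force an empty mask):

```python
import numpy as np

lon = np.array([-100.0, -95.0, -90.0])   # toy node longitudes, all outside the box
inside = (lon >= -70) & (lon <= -60)     # empty selection

try:
    lon[inside].max()
except ValueError as err:
    print(err)  # zero-size array to reduction operation maximum which has no identity
```

So an empty subset somewhere inside the polygon test would reproduce the error; it doesn't by itself prove that's what happens with the real dataset.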

Thank you!

Please note that while #42 is mentioned in this ticket and is related to this project, it’s a whole separate GSoC project. I’m not sure if I fully understand the context of your question. Are you asking about a specific set of netCDF files to be combined or are you asking in general?

Yes, I do realise that #42 is a separate project. I was just going through the technical details and libraries, since they are similar to that of this project.
My question was in general, just to know if we would need the functionality.

Sometimes model data is divided over multiple netCDF files. Suppose that someone needs a multi-day timeseries of modeled water elevation; usually this means the data is spread across multiple netCDF files produced by different simulation cycles. The result of this subsetting is then to combine the data from all those datasets into a single one (ideally Zarr). I hope that answers your question.

Yes, It does answer my question. Thank you.

If you’re not already, I’d suggest that you familiarize yourself with how an unstructured mesh is defined (don’t worry about generating one!)

One common way to represent an unstructured mesh is with two tables: a coordinates table and an elements table. The coordinates table gives the lat/lon location of each mesh node; the elements table lists which nodes are connected to form each element.

Ok, got it. I'll read up more about this.

I don’t see your proposal in the list; I think I only see fully submitted proposals. Please share your proposal in Google Docs for our review so that we can share our feedback before your final submission. Thank you for all your hard work on this project. Please let us know if there is anything else we can help with.

Noted, I will share my proposal in Google Docs for your review.
Thank you for your detailed feedback!

@omkar-334 please delete your comment and share the link through email! Your proposal is publicly shared through the link right now!