Mosaic multiple rasters
TomAugspurger opened this issue Β· 36 comments
I'm investigating the best way to mosiac multiple arrays into a single array.
https://desktop.arcgis.com/en/arcmap/10.3/manage-data/raster-and-images/what-is-a-mosaic.htm has a nice description. I'll follow up with what I learn. If anyone has additional resources I'd love to hear about them.
@TomAugspurger - to get you started the command line GDAL approach is typically to use https://gdal.org/programs/gdalbuildvrt.html (or rio merge https://rasterio.readthedocs.io/en/latest/api/rasterio.merge.html) , which have a few common built-in options for overlapping pixel selection rules (first, last, min, max)
If you want some sort of custom pixel selection scheme a python script is the way to go. I really like Xarray syntax for creating mosaics. See this notebook where I load a stack of many swath images into an xarray dataarray, then extract a monthly mean mosaic (https://github.com/scottyhq/sentinel1-rtc/blob/main/Sentinel1-RTC-example.ipynb), in particular:
mosaic_mean = da.where(da!=0).resample(time='1m').mean()
And ongoing discussion in intake-stac to make this process simpler intake/intake-stac#65
Thanks for sharing that @scottyhq, it's very helpful.
Question about building the vrt. I don't see the code for generating the file with the list of paths s3paths.txt
. Is it something roughly like this?
vsis3paths = [x.replace("s3://", "/vsis3/") for x in s3Paths]
with open("s3paths.txt", "w") as f:
f.write("\n".join(vsis3paths))
and then using the CLI like you did, or Python API?
import gdal
options, *_ = gdal.BuildVRTOptions(separate=True)
vrtName = f'stack{zone}{latLabel}{square}-3.vrt'
out = gdal.BuildVRT(vrtName, vsis3paths, separate=True)
Just using the keys s3://
didn't work, since gdal wants it to use its /vsis3
URL.
Thanks very much for all this work... state scale sentinel mosaics are what I am aiming for. Starting with a median I think.
So starting with a STAC of sentinel 2 cogs - get catalog info and a dataframe to get the datetimes
The GDAL environment variables are important? For metatadata/filesize etc. - e.g. 250MB for a sentinel 2A single band?
Something i have been wondering being new to this, does computing a list of datasets try to bring all that into memory?
@TomAugspurger - if trying to stack rasters (typically time - creating dims {x,y,t,band}) - I have used three approaches with STAC collections (alluded to in @scottyhq post above).
- If the STAC catalog has the "proj" extension (shape + transform) - then create vrts with the metadata in the catalog and then mosaic with gdal.BuildVrt. Gist here: https://gist.github.com/rmg55/1acf804ef1af0c7934b265b3a653a486
- If the STAC catalog does not have the "proj" extension (shape + transform) - then open/read the metadata directly from the file - this can be slow.
- If the STAC items have sidecar xml files, than read the xml file to acquire the needed metadata to create the vrt files. Then mosaic with gdal.BuildVrt. Gist here (work in progress fyi): https://gist.github.com/rmg55/3e30dbb1afeee80e8c1d68db40c3b56c
I think this approach would also work with spatial mosaics as well (possibly using pixel functions to mosaic overlapping rasters), but that is something I have not yet worked on.
I would be interested to hear more about the solution/workflow you develop for mosaicing/stacking. I think this is a common bottlekneck in working with EO data (Landsat, Sentinel, Aster, etc....). Adding this sort of functionality to intake-stac would be great, but I have not been able to carve out any time to work on that yet.
Thanks for sharing @rmg55. I'll be sure to update this thread once we get a design together.
I'm glad to hear about creating a VRT from the Catalog items. I've been pleased with building VRTs from the COGs themselves, but as you say this can be a bit slow.
Quick status update: I have a prototype for building a VRTs from a STAC query (thanks for the idea and code @rmg55). This VRT can then be fed into rioxarray.open_rasterio
.
This approach can be much faster than using gdal.BuildVRT
, since it doesn't have to open any datasets; it can do it all from the STAC metadata as long as the required fields are present (most of the fields in the proj
extension: https://github.com/radiantearth/stac-spec/blob/cc6c309d72c24cbaae802709ed58c2bf1e231f37/extensions/projection/README.md).
I'll share more soonish.
Here's the repo: https://github.com/stac-utils/stac-vrt and docs: https://stac-vrt.readthedocs.io/en/latest. Pretty rough right now (likely only works for NAIP STAC items / COGs in Azure), but happy to hack on it with people.
Wow - this looks great @TomAugspurger. Will dig into it deeper and see if/how i might be able to contribute. I think this has the potential to really improve a lot of existing workflows. Seems like getting existing/new catalogs to incorporating the STAC projection extension would really help this effort.
Hey folks! Haven't met many of you before, but I've been tinkering with this question as well, of how to go from STAC items to a ndarray-ish stack of imagery. Sorry for not getting in touch sooner!
So I came up with this library: https://github.com/gjoseph92/stackstac (docs at https://stackstac.readthedocs.io). The overall idea is to generate a dask array representing the whole stack of imageryβonly using the STAC metadata, to avoid the latency of opening datasets. But I've also added a bit of logic to:
- Manage GDAL's (lack of) concurrency model within Dask's threadpools, offering fully multi-threaded (no locks) reads on GeoTIFFs automatically, and falling back to thread-safe (but locked) reads with other drivers. The multi-threaded reads are kinda cheating, since we just open a new thread-local copy of the dataset for each thread, which does have some memory cost, but at least we keep the latency low by carefully ensuring that dataset-open requests remain in the
VSI_CACHE
. - Make the dask graph construction fast (see dask/dask#5913 for how repeated
da.stack
doesn't scale well) by ensuring the graph is an un-materialized Blockwise layer (which is basically constant-time to generate). The goal is to be able to create xarrays pointing to terabytes/petabytes of imagery in a few milliseconds, and (eventually) have slicing be just as fast, so you can work in the xarray data model the whole time without worrying about pre-filtering your data for performance. - Make warping data to a common grid straightforward, and in many cases automatic. Using only STAC metadata, we try to pick a common CRS/resolution, and all reads are warped to that grid through a VRT. If the metadata is missing, or conflicts, or you want something different, specifying the CRS and resolution of the output array is easy.
- Put as much STAC metadata as possible into xarray coordinates, so you can index and filter by
eo:cloud_fraction
, time, spatial coordinates, etc. - Be modular to new backends. Currently, rasterio is the only one, but I'd like to experiment with going no-GDAL when possible via
aiocogeo
orgeotiff
.
There's a ton of overlap here with @TomAugspurger's work on stac-vrt
, and @scottyhq + @rmg55's work on intake-stac (intake/intake-stac#75). I think we're all working on similar problems, coming from slightly different assumptions of what the problems are. I'd love to hear if you think what I've come up with is helpful for your use-cases (@RichardScottOZ, curious if this is helpful for your median composites), or if not, how we can use bits of it to build something that works for everyone.
Thanks, sounds interesting - I will take a look!
Thanks for sharing @gjoseph92! Making a "dask native" way of doing this was on my list too, as these VRTs can get somewhat large. I'll take a closer look later.
@gjoseph92 - awesome project and agreed there is a lot of overlap. Hopefully I will have some more time to explore the project over the next week. I had some quick questions that I was hoping you might be able to clarify:
- after browsing the project, it looks like there is a method to build the transform (affine) from the GSD and BBox - is that correct? I was wondering about that and if the accuracy in the bbox geographic coordinates would be sufficient to create the transform.
- Looks like this is focused on "stacking" items/assets/bands (which I think is the most common workflow). Any plans to incorporate the mosaic workflow like @TomAugspurger has put together in stac_vrt (or maybe this is already there and I just missed it)?
@rmg55 thanks for the questions!
it looks like there is a method to build the transform (affine) from the GSD and BBox
Not quiteβI don't use gsd
, but in some cases yes, I do use bbox
(if proj:bbox
is missing). Which indeed could be less accurate. (I should switch from reprojecting bbox
to reprojecting geometry
, then taking its bbox in the output CRS.)
But overall, it's actually okay if it's a little inaccurate, because we're just estimating a good output transform. When actually reading the data, GDAL will still use whatever geotrans is set on the dataset, then warp it to the output transform we picked. So it's never "inaccurate", per se (pixels will always be correctly georeferenced), but it's true that the output resolution might not always match the native resolution of the data by default. But since getting things to a common grid is the goal anyway, that's kind of the point. And if you have a specific resolution and CRS you want, you're always free to set that too.
Any plans to incorporate the mosaic workflow
I opened gjoseph92/stackstac#1 to discuss this, but the short answer is: you can instead do the mosaicking with some np.where
s on the dask array, which IMO has a few advantages over baking it into the VRT at the GDAL level.
Thanks for opening this up @gjoseph92, I had a brief look at stackstac (love the name by the way, and couldn't help thinking the palindrome stackcats would also be fun :) I'm really impressed by the docs and appreciate your links to the dask scaling issues. There is definitely a lot of overlap, but to be honest I think you're a fair ways ahead on the path that we've just been starting down! Will definitely be good to think about how to consolidate effort and collaborate where possible on this, I think it's a really exciting time to push these tools forward with STAC1.0 coming out and more and more COGs+STAC available to make the most of.
Anyone else interested in joining a call to hash out next steps? I feel like we've had a good amount of exploration in the area of STAC + xarray + Dask, but we might be due for some consolidation, or at least understanding where different projects have different visions or different approaches.
If you are interested, give a +1 to this and I'll send out an invite sometime later this week.
cc @snowman2 as well: #4 (comment), if you're interested in joining a call on this topic. #4 (comment) is a good summary.
I'll send out a doodle poll tomorrow to find a time for next week. I suspect we'll cover STAC + Dask + xarray first (and stac-vrt / stackstac specifically). But we'll inevitably spill into larger operations so it'd be good to have someone from rioxarray and xarray-spatial there. (and maybe @jorisvandenbossche or someone else from geopandas, if we get into vector / raster operations? Sounds like a full meeting π)
I sent the following email to people who indicated interest. Anyone is welcome to join.
Link to the doodle poll: https://doodle.com/poll/kpz9rtrqmk4tiqy2. I think we have a pretty wide range of timezones, but I gave options in hourly increments for the US working day for next week. I think we'll use an hour productively, and people of course are welcome to come and go.
Here's a link to a proposed agenda: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing. Feel free to add other items.
ping @dcherian and @JessicaS11 would be great if you have availability to join!
When folks converge on a solution for reading STAC rasters into xarray Datasets, I'd like to showcase that solution in this lesson: https://github.com/carpentries-incubator/geospatial-python
It's geared toward novice and intermediate programmers who do GIS/remote sensing. I'll attend mostly to listen.
π I'm a bit late here, but Mosaic (+ STAC) is something we have worked a lot in the last years. rio-tiler (which I'm one of the core dev) has native support for mosaic (and for STAC), you may found this example interesting https://cogeotiff.github.io/rio-tiler/examples/Using-rio-tiler-mosaic/
On the mosaic side, we have developed a simple spec: https://github.com/developmentseed/mosaicjson-spec. The spec was made to support dynamic tiling, but help created https://github.com/developmentseed/cogeo-mosaic which can in fact work for other purpose than map tiles but we recently introduced the concept of dynamic mosaic
which use API or database to store the COG geometry informations and use the cogeo-mosaic backend to create data array for the list of intersecting COG: https://developmentseed.org/cogeo-mosaic/examples/Create_a_Dynamic_StacBackend/
rio-tiler/cogeo-mosaic are just one solution for some specific problem and it's nice to see other tooling begin developed on this subject.
Looks like Monday the 22nd at 1:00 US Central time worked for the most people (reminder: the US just had a daylight savings time transition). I sent an invite to people whose email I had. If you're interested in joining send me an email at taugspurger@microsoft.com and I'll add you, or just show up then.
Agenda: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing
I updated #4 (comment) with a link to the video call.
Thanks to everyone who joined. The agenda doc has some detailed notes for those who missed: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing
Some takeaways:
- We're broadly in agreement that it'd be great if libraries doing I/O (rioxarray, a hypothetical library that has the I/O component of stackstac) exposed geospatial information consistently. Then downstream analysis / visualization libraries can rely on this information, regardless of how the data was loaded. https://corteva.github.io/rioxarray/stable/getting_started/crs_management.html has a note on how rioxarray handles this
- Both stackstac and stac-vrt are narrowly scoped. There's interest in a library that handles a wider set of functionality. It's not clear (to me) how to scope that library, but at a minimum things like stacking a time-series of images and making a mosaic of many images are in scope, which implies that the kind of I/O stackstac had to implement is likely in scope. What else?
Anything else to add?
Hi all, First time joining a github conversation (!) but just thought I'd jump in as a way to keep an eye on where this goes. Very interested as I've been banging my head against the wall trying to use these tools to mosaic rasters for a while now. Thanks for everyone's contributions to it!
Median composites...no median in dask, how is this working? A 'make me one of these' from whatever generic or user function mean, median, mode, hdstats once parallelized et al would be beneficial?
Improved performance...eg stackstac
Plotting of large outputs/dimensions
Thanks for the notes for the middle of the night people
Thanks, @TomNicholas to bring the issue here.
A little bit more context, for an institutional reason I'm not able to use the Synergise/AWS products. This brings me to deal with the overlapping issue that has been already "solved" (with an average ? I've no info about this)in the AWS product.
...this issue ...
Has been more accurately described here
Seems that the best option to solve this is, as suggested by @JessicaS11 , is to use rioxarray.merge but I'm still struggling to create a full mosaic over a large time period.
@pl-marasco see gjoseph92/stackstac#1 for a quick mosaic implementation (not quite what you're looking for, since it doesn't take coordinates into account, but may work).
I'm still struggling to create a full mosaic over a large time period
stackstac may be helpful for this task in general.
Yes, don't think I could have done the SA dataset without an AWS (GCS/Azure) or whatever machine with 300+ GB of RAM.
Hi, Just passing through looking for answer to something but thought I'd mention this in case it is relevant/of interest.
https://github.com/opendatacube/odc-stac/tree/odc-geo
Hi, Just passing through looking for answer to something but thought I'd mention this in case it is relevant/of interest. https://github.com/opendatacube/odc-stac/tree/odc-geo
Updated location: https://github.com/opendatacube/odc-geo