Parallelize geemap.download_ee_image_tiles
TmtStss opened this issue · 8 comments
Environment Information
Date: Fri Jun 09 07:49:34 2023 UTC
OS : Linux
CPU(s) : 4
Machine : x86_64
Architecture : 64bit
RAM : 15.3 GiB
Environment : IPython
Python 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
geemap : 0.17.1
ee : 0.1.343
ipyleaflet : 0.17.2
folium : 0.13.0
jupyterlab : 3.2.5
notebook : 6.1.5
ipyevents : 2.0.1
geopandas : 0.10.2
Description
Parallelize the download of image tiles for multiple ee.FeatureCollection objects using geemap.download_ee_image_tiles and joblib.Parallel.
What I Did
import ee
import geemap
import joblib

def download_data(country_name):
    # Get FeatureCollection for country of interest
    gaul = ee.FeatureCollection("FAO/GAUL/2015/level0")
    geometry = gaul.filter(ee.Filter.eq("ADM0_NAME", country_name)).geometry()
    fc = geemap.fishnet(geometry, h_interval=0.4, v_interval=0.2)
    # Download image tiles
    geemap.download_ee_image_tiles(
        image=ee.Image("projects/glad/GLCLU2020/Forest_height_2020"),
        features=fc,
        out_dir=".",
        prefix=country_name,
        crs="EPSG:4326",
        scale=30,
    )

if __name__ == "__main__":
    try:
        ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
        print("Google Earth Engine API initialized successfully.")
    except Exception as e:
        print("Error while initializing the Google Earth Engine API: {}".format(e))
    country_list = ["Gambia", "Qatar", "Lebanon", "Jamaica"]
    # Parallel (gives error)
    with joblib.Parallel(n_jobs=4) as parallel:
        parallel(joblib.delayed(download_data)(country_name) for country_name in country_list)
    # Sequential (gives no error)
    # for country_name in country_list:
    #     download_data(country_name)
Google Earth Engine API initialized successfully.
Traceback (most recent call last):
File "scripts/dummy.py", line 32, in <module>
parallel(joblib.delayed(download_data)(country_name) for country_name in country_list)
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
ee.ee_exception.EEException: Earth Engine client library not initialized. Run `ee.Initialize()`
Moving ee.Initialize() into the download_data function fixes the error, but I am not sure it is the proper solution. It looks like every worker needs its own initialization.
def download_data(country_name):
    # Initialize the GEE Python API inside the worker
    try:
        ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
        print("Google Earth Engine API initialized successfully.")
    except Exception as e:
        print("Error while initializing the Google Earth Engine API: {}".format(e))
    # Get FeatureCollection for country of interest
    gaul = ee.FeatureCollection("FAO/GAUL/2015/level0")
    geometry = gaul.filter(ee.Filter.eq("ADM0_NAME", country_name)).geometry()
    fc = geemap.fishnet(geometry, h_interval=0.4, v_interval=0.2)
    # Download image tiles
    geemap.download_ee_image_tiles(
        image=ee.Image("projects/glad/GLCLU2020/Forest_height_2020"),
        features=fc,
        out_dir=".",
        prefix=country_name,
        crs="EPSG:4326",
        scale=30,
    )
I compared the speed of execution as well: sequential (13min 52s) vs parallel (5min 21s).
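As a quick check on those timings, the speedup and per-worker efficiency can be worked out as below. The worker count of 4 is taken from n_jobs=4 in the script; the variable names are only illustrative.

```python
# Timings reported above, converted to seconds.
sequential_s = 13 * 60 + 52   # 13min 52s
parallel_s = 5 * 60 + 21      # 5min 21s
n_workers = 4                 # n_jobs=4 in the script

# Speedup: how many times faster the parallel run was.
speedup = sequential_s / parallel_s
# Efficiency: fraction of the ideal n_workers-fold speedup actually achieved.
efficiency = speedup / n_workers

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

So four workers gave roughly a 2.6x speedup, which suggests the run is not limited purely by local CPU count.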
Yet, is this the most efficient way to download image tiles at large scale? I came across the geetiles package, which geemap issue #1473 proposes to support, as well as Gorelick's Medium blog post on fast downloads through the GEE Python API using multiprocessing and ee.Image.getDownloadURL().
Thank you for sharing the solution. Try geemap.ee_initialize() rather than ee.Initialize(). It only initializes if needed.
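The "initialize only if needed" idea can also be sketched independently of geemap. The helper below is hypothetical (it is not geemap's implementation): it runs a caller-supplied initializer at most once per process, keyed on the process ID, so each joblib worker initializes itself on its first task and skips the call afterwards.

```python
import os

# PIDs of processes that have already run the initializer
# (state for this hypothetical helper, not part of ee or geemap).
_initialized_pids = set()


def ensure_initialized(init_fn):
    """Run init_fn at most once per process; return True if it ran now."""
    pid = os.getpid()
    if pid in _initialized_pids:
        return False
    init_fn()
    _initialized_pids.add(pid)
    return True


# Intended use at the top of download_data (assumes `import ee`):
# ensure_initialized(
#     lambda: ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
# )
```

Keying on the process ID means a worker process that inherits the parent's state still runs its own initialization once.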
Thank you very much for the quick implementation of support for downloading images in parallel. However, geemap.download_ee_image_tiles_parallel() throws the following error when run on Sepal, the cloud computing platform for geospatial analysis developed by the Food and Agriculture Organization of the United Nations (see website and GitHub).
From an earlier comment of @12rambau, a developer of the platform, I think the issue might be related to a Sepal-specific version of the earthengine-api package. Please note that the geemap version '0.22.1' is actually the latest version of the geemap master branch.
Environment Information
--------------------------------------------------------------------------------
Date: Thu Jun 15 21:49:18 2023 UTC
OS : Linux
CPU(s) : 4
Machine : x86_64
Architecture : 64bit
RAM : 15.3 GiB
Environment : IPython
Python 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
geemap : 0.22.1
ee : 0.1.343
ipyleaflet : 0.17.2
folium : 0.13.0
jupyterlab : 3.2.5
notebook : 6.1.5
ipyevents : 2.0.1
geopandas : 0.10.2
--------------------------------------------------------------------------------
Script
Adapted from geemap tutorial notebook 136_download_parallel.ipynb
import geemap
import ee
# # # works with geemap.download_ee_image_tiles() but throws the setUserAgent error with geemap.download_ee_image_tiles_parallel()
ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
# # # throws the setUserAgent error with both geemap.download_ee_image_tiles() and geemap.download_ee_image_tiles_parallel()
# geemap.ee_initialize()
image = ee.Image("LANDSAT/LE7_TOA_5YEAR/1999_2003").select(["B4", "B3", "B2"])
country_list = ["Gambia", "Qatar", "Lebanon", "Jamaica"]
fc = ee.FeatureCollection("FAO/GAUL/2015/level0").filter(ee.Filter.inList("ADM0_NAME", country_list))
print("Sequential")
geemap.download_ee_image_tiles(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")
print("Parallel")
geemap.download_ee_image_tiles_parallel(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")
Error
sepal-user@376d663ba688:~/foo$ python scripts/geemap_parallel_test.py
Sequential
Downloading 1/4: ./Lebanon.tif
Lebanon.tif: |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.0M/11.0M (raw) [100.0%] in 00:12 (eta: 00:00)
Downloading 2/4: ./Qatar.tif
Qatar.tif: |██████████████████████| 11.7M/11.7M (raw) [100.0%] in 00:13 (eta: 00:00)
Downloading 3/4: ./Gambia.tif
Gambia.tif: |█████████████████████| 8.70M/8.70M (raw) [100.0%] in 00:07 (eta: 00:00)
Downloading 4/4: ./Jamaica.tif
Jamaica.tif: |████████████████████| 7.00M/7.00M (raw) [100.0%] in 00:07 (eta: 00:00)
Downloaded 4 tiles in 54.581400871276855 seconds.
Parallel
Traceback (most recent call last):
File "scripts/inference/geemap_parallel.py", line 20, in <module>
geemap.download_ee_image_tiles_parallel(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")
File "/home/sepal-user/.local/lib/python3.8/site-packages/geemap/common.py", line 12831, in download_ee_image_tiles_parallel
parallel(joblib.delayed(download_data)(index) for index in range(count))
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
AttributeError: module 'ee.data' has no attribute 'setUserAgent'
Additional information for debugging
Running the following two lines of code throws a similar error.
import geemap
geemap.ee_initialize()
Traceback (most recent call last):
File "scripts/inference/geemap_parallel.py", line 8, in <module>
geemap.ee_initialize()
File "/home/sepal-user/.local/lib/python3.8/site-packages/geemap/common.py", line 245, in ee_initialize
ee.data.setUserAgent(user_agent)
AttributeError: module 'ee.data' has no attribute 'setUserAgent'
Your earthengine-api package is outdated. Updating it should resolve the issue.
Thank you for the quick feedback. I opened an issue on the sepal repository (openforis/sepal#286).
Alternatively, you can copy the function's source code and run it in your notebook. In that case, you don't need to update any packages:
https://geemap.org/common/#geemap.common.download_ee_image_tiles_parallel
Thank you for the suggestion, it works as a temporary fix. Still, given the large number of Sepal and geemap users, Sepal support for geemap.ee_initialize() is in my opinion a legitimate request.
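Until then, a caller could guard against the AttributeError shown above with a small fallback wrapper. This is only a sketch: initialize_with_fallback is a hypothetical helper, and it assumes that falling back to plain ee.Initialize() is acceptable in your environment.

```python
def initialize_with_fallback(primary, fallback):
    """Call primary(); if it raises AttributeError, call fallback() instead.

    Returns "primary" or "fallback" to indicate which initializer succeeded.
    """
    try:
        primary()
        return "primary"
    except AttributeError:
        # e.g. module 'ee.data' has no attribute 'setUserAgent'
        fallback()
        return "fallback"


# Intended use (assumes `import ee` and `import geemap`):
# initialize_with_fallback(
#     geemap.ee_initialize,
#     lambda: ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com"),
# )
```

Only AttributeError is caught, so authentication or network errors from either initializer still surface normally.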
I'm no longer working for the SEPAL team but I'm still following their repositories. Because SEPAL lets users connect to their Earth Engine account via the web interface, without writing code, we are forced to tune the EE API a bit. The whole platform relies on this fork: https://github.com/openforis/earthengine-api which is only updated when a problem arises.
DO NOT manually install the latest EE in your local environment (with a simple pip install) or you will break the applications that rely on the saved token to run. If you really need to do it, use a conda env.