gee-community/geemap

Parallelize geemap.download_ee_image_tiles

TmtStss opened this issue · 8 comments

Environment Information


Date: Fri Jun 09 07:49:34 2023 UTC

            OS : Linux
        CPU(s) : 4
       Machine : x86_64
  Architecture : 64bit
           RAM : 15.3 GiB
   Environment : IPython

Python 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]

        geemap : 0.17.1
            ee : 0.1.343
    ipyleaflet : 0.17.2
        folium : 0.13.0
    jupyterlab : 3.2.5
      notebook : 6.1.5
     ipyevents : 2.0.1
     geopandas : 0.10.2

Description

Parallelize the download of image tiles for multiple ee.FeatureCollection using geemap.download_ee_image_tiles and joblib.Parallel.

What I Did

import ee
import geemap
import joblib


def download_data(country_name):
    # Get FeatureCollection for country of interest
    gaul = ee.FeatureCollection("FAO/GAUL/2015/level0")
    geometry = gaul.filter(ee.Filter.eq("ADM0_NAME", country_name)).geometry()
    fc = geemap.fishnet(geometry, h_interval=0.4, v_interval=0.2)

    # Download image tiles
    geemap.download_ee_image_tiles(
        image=ee.Image("projects/glad/GLCLU2020/Forest_height_2020"),
        features=fc,
        out_dir=".",
        prefix=country_name,
        crs="EPSG:4326",
        scale=30,
    )


if __name__ == "__main__":
    try:
        ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
        print("Google Earth Engine API initialized successfully.")
    except Exception as e:
        print("Error while initializing the Google Earth Engine API: {}".format(e))

    country_list = ["Gambia", "Qatar", "Lebanon", "Jamaica"]

    # Parallel (gives error)
    with joblib.Parallel(n_jobs=4) as parallel:
        parallel(joblib.delayed(download_data)(country_name) for country_name in country_list)

    # Sequential (gives no error)
    # for country_name in country_list:
    #     download_data(country_name)

Google Earth Engine API initialized successfully.
Traceback (most recent call last):
  File "scripts/dummy.py", line 32, in <module>
    parallel(joblib.delayed(download_data)(country_name) for country_name in country_list)
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
ee.ee_exception.EEException: Earth Engine client library not initialized. Run `ee.Initialize()`

Moving ee.Initialize() into the download_data function fixes the error, but I am not sure it is the proper solution. It looks like every worker needs its own initialization.

def download_data(country_name):
    # Initialize GEE Python API
    try:
        ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")
        print("Google Earth Engine API initialized successfully.")
    except Exception as e:
        print("Error while initializing the Google Earth Engine API: {}".format(e))
    
    # Get FeatureCollection for country of interest
    gaul = ee.FeatureCollection("FAO/GAUL/2015/level0")
    geometry = gaul.filter(ee.Filter.eq("ADM0_NAME", country_name)).geometry()
    fc = geemap.fishnet(geometry, h_interval=0.4, v_interval=0.2)

    # Download image tiles
    geemap.download_ee_image_tiles(
        image=ee.Image("projects/glad/GLCLU2020/Forest_height_2020"),
        features=fc,
        out_dir=".",
        prefix=country_name,
        crs="EPSG:4326",
        scale=30,
    )
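
The per-worker initialization described above can also be expressed with a pool initializer, so the setup runs once per worker process instead of once per task. The following is only a minimal sketch under stated assumptions: it swaps joblib for the standard library's concurrent.futures.ProcessPoolExecutor, and init_worker, run_parallel, and the _initialized_pids set are hypothetical stand-ins (in the real script, init_worker is where ee.Initialize(...) would be called and download_data would call geemap).

```python
import os
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-in for the client library's per-process state.
_initialized_pids = set()


def init_worker():
    """Runs once in each worker process.

    In the real script this is where ee.Initialize(...) would go.
    """
    _initialized_pids.add(os.getpid())


def download_data(country_name):
    """Stand-in task that fails if its process was never initialized."""
    if os.getpid() not in _initialized_pids:
        raise RuntimeError("client library not initialized in this process")
    return f"{country_name}: ok"


def run_parallel(country_list, n_workers=4):
    """Map download_data over the list, initializing every worker first."""
    with ProcessPoolExecutor(max_workers=n_workers, initializer=init_worker) as pool:
        return list(pool.map(download_data, country_list))
```

Calling run_parallel(["Gambia", "Qatar", "Lebanon", "Jamaica"]) from under an if __name__ == "__main__": guard would then run init_worker in each worker before any task is scheduled there.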

I compared the speed of execution as well: sequential (13min 52s) vs parallel (5min 21s).

Yet, is this the most efficient way to download image tiles at large scale? I came across the package geetiles, for which geemap issue #1473 requests support, as well as Gorelick's Medium blog post on fast downloads through the GEE Python API using multiprocessing and ee.Image.getDownloadURL().

giswqs commented

Thank you for sharing the solution. Try geemap.ee_initialize() rather than ee.Initialize(). It only initializes if needed.

Thank you very much for the quick implementation of support for downloading images in parallel. However, geemap.download_ee_image_tiles_parallel() throws the following error when run on Sepal, the cloud computing platform for geospatial analysis developed by the Food and Agriculture Organization of the United Nations (see website and GitHub).

From an earlier comment of @12rambau, a developer of the platform, I think the issue might be related to a Sepal-specific version of the earthengine-api package. Please note that the geemap version '0.22.1' is actually the latest version of the geemap master branch.

Environment Information

--------------------------------------------------------------------------------
  Date: Thu Jun 15 21:49:18 2023 UTC

                OS : Linux
            CPU(s) : 4
           Machine : x86_64
      Architecture : 64bit
               RAM : 15.3 GiB
       Environment : IPython

  Python 3.8.10 (default, Mar 13 2023, 10:26:41)  [GCC 9.4.0]

            geemap : 0.22.1 
                ee : 0.1.343
        ipyleaflet : 0.17.2
            folium : 0.13.0
        jupyterlab : 3.2.5
          notebook : 6.1.5
         ipyevents : 2.0.1
         geopandas : 0.10.2
--------------------------------------------------------------------------------

Script

Adapted from geemap tutorial notebook 136_download_parallel.ipynb

import geemap
import ee

# # # works with geemap.download_ee_image_tiles() but throws the setUserAgent error with geemap.download_ee_image_tiles_parallel()
ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")

# # # throws the setUserAgent error with both geemap.download_ee_image_tiles() and geemap.download_ee_image_tiles_parallel()
# geemap.ee_initialize()


image = ee.Image("LANDSAT/LE7_TOA_5YEAR/1999_2003").select(["B4", "B3", "B2"])
country_list = ["Gambia", "Qatar", "Lebanon", "Jamaica"]
fc = ee.FeatureCollection("FAO/GAUL/2015/level0").filter(ee.Filter.inList("ADM0_NAME", country_list))


print("Sequential")
geemap.download_ee_image_tiles(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")

print("Parallel")
geemap.download_ee_image_tiles_parallel(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")

Error

sepal-user@376d663ba688:~/foo$ python scripts/geemap_parallel_test.py

Sequential
Downloading 1/4: ./Lebanon.tif
Lebanon.tif: |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.0M/11.0M (raw) [100.0%] in 00:12 (eta: 00:00)
Downloading 2/4: ./Qatar.tif
Qatar.tif: |██████████████████████| 11.7M/11.7M (raw) [100.0%] in 00:13 (eta: 00:00)
Downloading 3/4: ./Gambia.tif
Gambia.tif: |█████████████████████| 8.70M/8.70M (raw) [100.0%] in 00:07 (eta: 00:00)
Downloading 4/4: ./Jamaica.tif
Jamaica.tif: |████████████████████| 7.00M/7.00M (raw) [100.0%] in 00:07 (eta: 00:00)
Downloaded 4 tiles in 54.581400871276855 seconds.
Parallel
Traceback (most recent call last):
  File "scripts/inference/geemap_parallel.py", line 20, in <module>
    geemap.download_ee_image_tiles_parallel(image, fc, out_dir=".", scale=100, crs="EPSG:3857", column="ADM0_NAME")
  File "/home/sepal-user/.local/lib/python3.8/site-packages/geemap/common.py", line 12831, in download_ee_image_tiles_parallel
    parallel(joblib.delayed(download_data)(index) for index in range(count))
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/sepal-user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
AttributeError: module 'ee.data' has no attribute 'setUserAgent'

Additional information for debugging

Running the following two lines of code throws a similar error.

import geemap
geemap.ee_initialize()
Traceback (most recent call last):
  File "scripts/inference/geemap_parallel.py", line 8, in <module>
    geemap.ee_initialize()
  File "/home/sepal-user/.local/lib/python3.8/site-packages/geemap/common.py", line 245, in ee_initialize
    ee.data.setUserAgent(user_agent)
AttributeError: module 'ee.data' has no attribute 'setUserAgent'

giswqs commented

Your earthengine-api package is outdated. Updating it should resolve the issue.
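
For environments where the package cannot be updated, calling code can check for the attribute before using it. This is a hedged sketch only: set_user_agent_if_supported is a hypothetical helper, and old_api and new_api are stand-in objects that mimic an ee.data module without and with setUserAgent. It is not how geemap handles this.

```python
from types import SimpleNamespace


def set_user_agent_if_supported(data_module, user_agent):
    """Call data_module.setUserAgent(user_agent) only if that attribute exists.

    Returns True when the setter was called and False when it is missing.
    """
    setter = getattr(data_module, "setUserAgent", None)
    if callable(setter):
        setter(user_agent)
        return True
    return False


# Stand-ins for an older and a newer ee.data module (assumptions for illustration).
old_api = SimpleNamespace()
seen_agents = []
new_api = SimpleNamespace(setUserAgent=seen_agents.append)
```

In a real script the first argument would be ee.data, so an older earthengine-api build would skip the call instead of raising AttributeError.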

Thank you for the quick feedback. I opened an issue on the sepal repository (openforis/sepal#286).

giswqs commented

Alternatively, you can copy the function source code and run it in your notebook. In that case, you don't need to update any packages.

https://geemap.org/common/#geemap.common.download_ee_image_tiles_parallel

Thank you for the suggestion, it works as a temporary fix. Still, given the large number of Sepal users and geemap users, having Sepal support for geemap.ee_initialize() is in my opinion a legit request.

I'm no longer working for the SEPAL team, but I'm still following their repositories. Because SEPAL lets users connect to their Earth Engine account through the web interface without writing code, we are forced to tune the EE API a bit. The whole platform relies on this fork: https://github.com/openforis/earthengine-api, which is only updated when a problem arises.

DO NOT manually install the latest EE in your local environment (with a simple pip install), or you will break the applications that rely on the saved token to run. If you really need to do it, use a conda env.