This is not an officially supported Google product, though support will be provided on a best-effort basis.
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
segmentEverything is a simple Python-based Apache Beam pipeline, optimized for Google Cloud Dataflow, which lets users point it at a Google Cloud Storage bucket of GeoTIFF images exported from [Google Earth Engine](https://earthengine.google.com/) and produce a vector representation of the areas in the GeoTIFFs.
segmentEverything builds heavily upon the awesome work of the segment-geospatial project, which itself draws upon the segment-anything-eo project and, ultimately, Meta's Segment Anything model.
This repository includes both the pipeline script (segmentEverything.py), which launches the pipeline, and the Dockerfile / build.yaml required to build the custom container that Dataflow uses as its worker image.
The custom container pre-builds all of the required dependencies for segment-geospatial, including PyTorch, since it builds from a PyTorch-optimized base image. It also includes all of the NVIDIA CUDA drivers necessary to leverage a GPU attached to the worker, as well as the ~3GB Segment Anything Vision Transformer checkpoint (ViT-H). This means that newly created workers don't need to download that checkpoint or build the dependencies at run time.
gcloud builds submit --config build.yaml
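For reference, baking the ViT-H checkpoint into the image essentially comes down to a download step like the one below during the container build. This is a sketch only: the destination path is an assumption, and the provided Dockerfile may fetch the file differently.

```bash
# Sketch: pre-fetch the Segment Anything ViT-H checkpoint at build time so
# workers don't download it on startup. The destination path is an assumption.
curl -L --fail -o /opt/sam/sam_vit_h_4b8939.pth \
  https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```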
Note: Use tmux to ensure your pipeline caller session isn't terminated prematurely.
Choose a Google Cloud Platform project where you will launch this Dataflow pipeline.
You don't need much of a machine to launch the pipeline, as none of the processing is done locally.
- Start with an Ubuntu 20.04 Image
- Start a tmux session (this will ensure that your pipeline finishes correctly even if you lose connection to your VM).
tmux
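If your connection does drop, reattach to the same session after reconnecting to the VM:

```bash
# Reattach to the most recently used tmux session.
tmux attach
```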
- Update the package index.
sudo apt-get update
- Install pip3, along with the system libraries required by the Python dependencies
sudo apt-get install python3-pip gcc libglib2.0-0 libx11-6 libxext6 libgl1
- Clone this repository
git clone XXXXXX
- Change directories into the new "segmentEverything" folder:
cd segmentEverything
- Install the local Python 3 dependencies
pip3 install -r requirements.txt
- You'll need to authenticate so that the Python code that runs locally can use your credentials.
gcloud auth application-default login
You need to enable the Dataflow API for your project here.
Click "Enable"
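If you prefer the command line, the same API can be enabled from your VM:

```bash
# Enable the Dataflow API for the currently configured project.
gcloud services enable dataflow.googleapis.com
```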
python3 segmentEverything.py -h
Flag | Required | Default | Description |
---|---|---|---|
--project | True | | Specify the GCP project where the pipeline will run and usage will be billed. |
--region | True | us-central1 | The GCP region in which the pipeline will run. |
--worker_zone | False | us-central1-a | The GCP zone in which the pipeline will run. |
--machine_type | True | n1-highmem-2 | The Compute Engine machine type to use for each worker. |
--workerDiskType | True | | The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd to specify an SSD Persistent Disk. |
--max_num_workers | True | 4 | The maximum number of workers to scale to in the Dataflow pipeline. |
--source_bucket | True | | GCS Source Bucket ONLY (no folders), e.g. "bucket_name" |
--source_folder | True | | GCS Bucket folder(s), e.g. "folder_level1/folder_level2" |
--file_pattern | True | | File pattern to search, e.g. "*.tif" |
--output_bucket | True | | GCS Output Bucket ONLY (no folders), e.g. "bucket_name", for exports |
--output_folder | True | | GCS Output Bucket folder(s), e.g. "folder_level1/folder_level2", for outputs; can be blank for the root of the bucket |
--runner | True | DataflowRunner | Run the pipeline locally or via Dataflow |
--dataflow_service_options | True | worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver | Configure the GPU Requirement |
--sdk_location | True | container | Where to pull the Beam SDK from, you want it to come from the custom container |
--disk_size_gb | True | 100 | The size, in GB, of each worker's disk. |
--sdk_container_image | True | gcr.io/cloud-geographers-internal-gee/segment-everything-minimal | Custom Container Location |
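Before launching, it can help to verify that the source flags select the files you expect. Assuming the pipeline combines them as gs://SOURCE_BUCKET/SOURCE_FOLDER/FILE_PATTERN (which is how the flags are described above, not something verified against the script), you can preview the match with gsutil:

```bash
# Preview the GeoTIFFs the pipeline should pick up; the bucket and folder names
# are the placeholder examples from the table above.
gsutil ls "gs://bucket_name/folder_level1/folder_level2/*.tif"
```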
The Segment Anything model accepts only 8-bit RGB imagery, and running the model is GPU intensive.
When creating exports from Earth Engine, ensure you're exporting an 8-bit image. For example, if you have an ee.Image() named "rgb":
var eightbitRGB = rgb.unitScale(0, 1).multiply(255).toByte();
It is also recommended to export images with 1024x1024 pixel file dimensions. This provides enough pixels to cover a reasonable area in each image, but not so many that they overload the GPU's memory.
Export.image.toCloudStorage({
  image: eightbitRGB,
  description: "My Export Task",
  bucket: "my-bucket",
  fileNamePrefix: "my-folder/",
  region: iowa,
  scale: 3,
  crs: "EPSG:4326",
  maxPixels: 29073057936,
  fileDimensions: 1024
})
python3 segmentEverything.py \
  --project [my-project] \
  --region [my-region] \
  --machine_type n1-highmem-8 \
  --max_num_workers 4 \
  --workerDiskType compute.googleapis.com/projects/[my-project]/zones/[my-zone]/diskTypes/pd-ssd \
  --source_bucket [my-bucket] \
  --source_folder [my-folder] \
  --file_pattern "*.tif" \
  --output_bucket [my-output-bucket] \
  --output_folder [my-output-folder] \
  --runner DataflowRunner \
  --dataflow_service_options 'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver' \
  --experiments use_runner_v2,disable_worker_container_image_prepull \
  --number_of_worker_harness_threads 1 \
  --sdk_location container \
  --disk_size_gb 100 \
  --sdk_container_image gcr.io/[my-project]/segment-everything-minimal
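Once the command returns, the job runs entirely on Dataflow; you can watch it in the Dataflow section of the Cloud Console or list it from the same VM:

```bash
# Confirm the job was accepted and is running; use the same region you passed above.
gcloud dataflow jobs list --region=[my-region] --status=active
```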