cerulean-cloud

All cloud services including inference and database structure

Primary language: Python · License: Apache-2.0

Cerulean API Documentation

Introduction

For most users, we recommend using the Cerulean web application, which provides a visual interface for exploring the complete set of Cerulean data layers. For users who want to directly access and download oil slick detection data, we provide free programmatic access to an OGC-compliant API (api.cerulean.skytruth.org). Currently, only oil slick detections can be downloaded. Data used for source identification, including AIS tracks, vessel identities, and offshore oil platform locations, cannot be downloaded and can only be accessed via the Cerulean web application.

API queries can be made programmatically (e.g. a curl request or a Python script) for direct data access and download. You can also execute API queries within a browser by pasting an API command into your browser’s address bar, which will then show the results of your query, including a helpful paginated map, or download the data directly.

Below, we provide some working examples of common data queries from our API. This is only a small sample of the types of queries that are possible. To dig deeper, please see our full API docs and check out the current documentation for tipg and CQL-2, both of which are used by our API.

Example 1. Query by date and time range

For our first query, let’s return all slick detection data on December 8, 2023, sorted by slick_timestamp. To do this, we specify a sorting function (?sortby=slick_timestamp) and provide a start and end datetime:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-08T00:00:00Z/2023-12-09T00:00:00Z

The required date format is YYYY-MM-DDTHH:MM:SSZ, where the time is in UTC (which matches the timezone used in the S1 imagery naming convention). If you want to change the query dates, you can modify the datetime parameter to match the time range you are interested in. For example, the following command will fetch slick detections that appeared from 2023-12-01 through 2023-12-07:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-01T00:00:00Z/2023-12-08T00:00:00Z
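If you prefer to script this query rather than paste it into a browser, here is a minimal sketch in Python using the httpx library (any HTTP client would do); the base URL and parameters are exactly the ones above, and the response is assumed to be the default GeoJSON FeatureCollection:

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
params = {
    "sortby": "slick_timestamp",
    "datetime": "2023-12-08T00:00:00Z/2023-12-09T00:00:00Z",
}
resp = httpx.get(BASE, params=params, timeout=60)
resp.raise_for_status()
collection = resp.json()  # assumed GeoJSON FeatureCollection
print(len(collection.get("features", [])), "slick detections returned")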

Example 2. Basic filtering

Our API also allows you to filter results using various properties of the slick detection data. For example, let’s repeat the query from Example 1, but limit results to detections with a machine_confidence greater-than-or-equal-to (GTE) 60%, and an area greater than (GT) 20 square km:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-01T00:00:00Z/2023-12-08T00:00:00Z&filter=machine_confidence GTE 0.6 AND area GT 20000000

To create this query, we added the following commands to Example 1:

  • &filter=, which should always precede all the combined filters you apply, like those below
  • machine_confidence GTE 0.6, which specifies a machine confidence score greater-than-or-equal-to 60%
  • AND area GT 20000000, which specifies an area greater-than 20,000,000 square meters, equivalent to 20 square km.

Note that these filter commands include spaces and abbreviated operators such as GTE (greater-than-or-equal-to), which are patterns enabled by CQL-2. There are a large number of fields available for filtering. We’ll cover a few more common examples below, but for full documentation, see our standard API docs.
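As a rough sketch of how you might send this filtered query from Python (again using httpx, which URL-encodes the spaces in the CQL-2 filter for you); the printed property names are assumptions based on the fields described in this documentation:

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
params = {
    "sortby": "slick_timestamp",
    "datetime": "2023-12-01T00:00:00Z/2023-12-08T00:00:00Z",
    # CQL-2 text filter; spaces are URL-encoded automatically
    "filter": "machine_confidence GTE 0.6 AND area GT 20000000",
}
resp = httpx.get(BASE, params=params, timeout=60)
resp.raise_for_status()
for feature in resp.json().get("features", []):
    props = feature["properties"]
    print(feature.get("id"), props.get("machine_confidence"), props.get("area"))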

Example 3. Filtering by source

For higher-confidence slicks detected by Cerulean, we apply a second model that finds any vessels or offshore oil infrastructure recorded in the vicinity of those slicks. Let’s repeat our query from Example 1, but limit the results to slicks with a possible vessel or infrastructure source identified nearby:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-01T00:00:00Z/2023-12-08T00:00:00Z&filter=(NOT source_type_1_ids IS NULL OR NOT source_type_2_ids IS NULL) AND cls != 1

This one is a little complicated. Let’s break it down piece by piece:

  • &filter=(NOT source_type_1_ids IS NULL OR NOT source_type_2_ids IS NULL). This command returns slicks where Cerulean has identified at least one potential source of type 1 (vessel) or type 2 (infrastructure). The syntax is a little confusing because of the double negative, but the command NOT source_type_1_ids IS NULL tells the API to fetch all slicks where the source_type_1_ids field has at least one entry, and the command NOT source_type_2_ids IS NULL does the same thing for source_type_2_ids.
  • AND cls != 1. This is a class filter that excludes all slicks of Class 1. Class 1 is “background,” which includes detections over land and other regions where oil slicks won’t plausibly occur. We recommend including this filter in most API queries.
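In Python, the only change from the sketch in Example 2 is the CQL-2 filter string; here is the same source-filtered query as a minimal sketch:

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
params = {
    "sortby": "slick_timestamp",
    "datetime": "2023-12-01T00:00:00Z/2023-12-08T00:00:00Z",
    # exactly the filter explained above, split across lines for readability
    "filter": (
        "(NOT source_type_1_ids IS NULL OR NOT source_type_2_ids IS NULL) "
        "AND cls != 1"
    ),
}
resp = httpx.get(BASE, params=params, timeout=60)
resp.raise_for_status()
print(len(resp.json().get("features", [])), "slicks with a candidate source nearby")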

Example 4. Direct download of data as a .csv or .geojson

If you wanted to return the query from Example 2 as a .csv for direct download, you would append &f=csv to the API query like so:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-01T00:00:00Z/2023-12-08T00:00:00Z&filter=machine_confidence GTE 0.6 AND area GTE 20000000&f=csv

NOTE: This functionality may be limited until this GitHub Issue is resolved.

If you prefer a .geojson, you can append &f=geojson to the query instead, like this:

api.cerulean.skytruth.org/collections/public.slick_plus/items?sortby=slick_timestamp&datetime=2023-12-01T00:00:00Z/2023-12-08T00:00:00Z&filter=machine_confidence GTE 0.6 AND area GTE 20000000&f=geojson

NOTE: All requests default to &limit=10 if the limit parameter is unspecified. If you want to return more than 10 results, adjust that number and add it to the URL. To make use of pagination, you can also use the parameter &offset=60 to return entries starting at any arbitrary row (shown here returning from row 61 onwards).
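Putting the pieces together, here is a sketch of downloading the CSV from Python with an explicit limit and offset; the 200-row limit and the slicks.csv filename are arbitrary choices for illustration:

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
params = {
    "sortby": "slick_timestamp",
    "datetime": "2023-12-01T00:00:00Z/2023-12-08T00:00:00Z",
    "filter": "machine_confidence GTE 0.6 AND area GTE 20000000",
    "f": "csv",    # or "geojson"
    "limit": 200,  # raise the default of 10
    "offset": 0,   # increase to page through additional rows
}
resp = httpx.get(BASE, params=params, timeout=120)
resp.raise_for_status()
with open("slicks.csv", "wb") as fh:
    fh.write(resp.content)  # raw CSV bytes returned by the API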

Example 5. Return a specific slick by its ID

If you know which slick you want to pull from the API - let’s say it’s slick 171370 - you can fetch it using a query like this:

api.cerulean.skytruth.org/collections/public.slick_plus/items?id=171370

Example 6. Return all slicks detected in a specific Sentinel-1 scene

If you want to return all slick detections in a specific Sentinel-1 scene, use a query like this:

api.cerulean.skytruth.org/collections/public.slick_plus/items?s1_scene_id=S1A_IW_GRDH_1SDV_20231104T135322_20231104T135347_051068_062873_F9A6

Now, let’s limit our results to slick detections in that Sentinel-1 scene with an area greater than 10 square km and a machine confidence greater-than-or-equal-to 50%:

api.cerulean.skytruth.org/collections/public.slick_plus/items?s1_scene_id=S1A_IW_GRDH_1SDV_20231104T135322_20231104T135347_051068_062873_F9A6&filter=machine_confidence GTE 0.5 AND area GT 10000000
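If you want to run this scene query from Python, a minimal sketch (using the same parameters as the URLs above, with s1_scene_id passed as a plain query parameter) could look like this:

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
params = {
    "s1_scene_id": (
        "S1A_IW_GRDH_1SDV_20231104T135322_20231104T135347_051068_062873_F9A6"
    ),
    "filter": "machine_confidence GTE 0.5 AND area GT 10000000",
}
resp = httpx.get(BASE, params=params, timeout=60)
resp.raise_for_status()
print(len(resp.json().get("features", [])), "slicks in this Sentinel-1 scene")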

Example 7. Return slicks within a bounding box

To download slicks within a specific geographic area, you can use the bounding box (bbox) pattern. For example, this command will download all model detections between 2023-12-01 and 2023-12-08 in an area near the Strait of Hormuz:

api.cerulean.skytruth.org/collections/public.slick_plus/items?datetime=2023-12-01T00:00:00Z/2023-12-08T23:59:59Z&bbox=53.6,23.6,59.9,28.1

A bounding box is specified as min_longitude,min_latitude,max_longitude,max_latitude.

Note: This command will return all model detections in that bounding box. We recommend restricting the returned slick detections using parameters like machine confidence, area, and non-background class (see above examples) to only return the detections most likely to be real slicks.
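For larger areas or longer time windows, the default 10-item limit matters. Below is a hedged Python sketch that pages through all results in the bounding box using limit and offset, adding confidence, area, and class filters as recommended above (the 0.5 confidence and 10 square km thresholds are arbitrary illustrative values):

import httpx

BASE = "https://api.cerulean.skytruth.org/collections/public.slick_plus/items"
features, offset, page_size = [], 0, 100
while True:
    params = {
        "datetime": "2023-12-01T00:00:00Z/2023-12-08T23:59:59Z",
        "bbox": "53.6,23.6,59.9,28.1",  # min_lon,min_lat,max_lon,max_lat
        "filter": "machine_confidence GTE 0.5 AND area GT 10000000 AND cls != 1",
        "limit": page_size,
        "offset": offset,
    }
    resp = httpx.get(BASE, params=params, timeout=60)
    resp.raise_for_status()
    page = resp.json().get("features", [])
    features.extend(page)
    if len(page) < page_size:  # last (possibly partial) page reached
        break
    offset += page_size
print(len(features), "slick detections near the Strait of Hormuz")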

Conclusion

We hope this summary helps you get started with Cerulean’s API. This is a small sample of the data queries that are currently possible with Cerulean’s API. We describe more queryable fields in the table below. For full documentation, please see our standard API docs.

cerulean-cloud

Pulumi repository with infrastructure for Cerulean, including all cloud services and database structure.

Architecture

The cerulean-cloud architecture diagram can be found here.

Cerulean Cloud Architecture

Deployment

Deployment is fully managed by GitHub Actions and Pulumi; the entire workflow is defined in this YAML file.

We have defined three development stages / stacks:

  • TEST: this is the Work in Progress (WIP) deployment stage. Will often be broken. Deployment can be triggered to TEST using the workflow_dispatch from any branch.
  • STAGING: this is the stable development deployment. Should be used to perform more complex integration tests, with close to real data. Deployment to STAGING is triggered with any merge commit to main.
  • PRODUCTION: this is the production / publicly available deployment. Any deployment into PRODUCTION should pass rigorous integration tests in STAGING. Deployment can be triggered into PRODUCTION by adding a tag to a commit (git tag v0.0.1 && git push --tags).

Pulumi deployments

In order to make development easier we have defined two pulumi deployments that are intended to work in tandem:

  • cerulean-cloud-images: deploys all necessary docker images into Google Container Registry (GCR), in preparation for deploying the Cloud Run services that require those images.
  • cerulean-cloud: deploys all cerulean-cloud infrastructure (except docker images).

For each of these deployments there exists a configuration directory that includes a YAML configuration file per stage / stack (named with the stage name itself i.e. Pulumi.test.yaml, Pulumi.staging.yaml, Pulumi.production.yaml). These files include configuration that is stage / stack specific, such as deployment regions, usernames and passwords for external services, etc. They should be managed using the Pulumi CLI (pulumi config set someparam) but can also be edited directly.

Initial deployment

If you are deploying a completely new stack, make sure to create matching configuration files in cerulean-cloud-images and cerulean-cloud, with matching stack names. In addition, specifically for the tipg deployment: since the database is empty when a stack is deployed for the first time (alembic migrations run only after the initial deployment), if you want to access tipg after this initial deployment, make sure to poll the /register endpoint of the resulting URL in order to correctly load the tables (i.e. curl https://some-tipg-url.app/register). For any deployments after the first one, this is not required.

Decreasing cold starts for Cloud Run

In order to decrease response time for Cloud Run (especially for the production services, since for test and staging this would only increase costs), you can set the minimum number of instances to 1, so that one instance is running at any given moment in time (see documentation). Due to a pulumi limitation, this property cannot be set when the service is first created, so we advise setting this value manually once the deployment has completed.

Metrics

Google Cloud provides nice dashboards for tracking the stability, response time, and resource consumption of the cloud resources. The links below point to the PRODUCTION deployment, but you'll find similar dashboards for every stack's components:

Development

In order to develop in the cerulean-cloud repository, we recommend the following system-wide requirements (for macOS), in addition to the Python-specific requirements listed below:

Setup cloud authentication

GCP authentication

gcloud config set account rodrigo@developmentseed.org
gcloud config configurations create cerulean --project cerulean-338116 --account rodrigo@developmentseed.org
gcloud config configurations activate cerulean

Also, make sure to authenticate into docker with GCP to allow interaction with GCR:

gcloud auth configure-docker

AWS authentication

aws configure --profile cerulean
export AWS_PROFILE=cerulean

Setup your python virtualenv

WARNING: Setting up your local virtualenv can vary slightly depending on your Operating System (OS) and python installation (native or conda). Proceed with caution!

Make sure that you have set up your shell script as defined in the mkvirtualenv documentation. This will vary slightly with your python installation, but you will need to change your bash profile file by adding the following variables:

export VIRTUALENVWRAPPER_PYTHON=/Users/jonathanraphael/mambaforge/bin/python # the path to the python installation where you installed mkvirtualenv
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh # this path can vary depending on your installation

Then you'll be able to run:

mkvirtualenv cerulean-cloud --python=$(which python3.8)
pip install -r requirements.txt
pip install -r requirements-test.txt
# Additional requirements files
pip install -r cerulean_cloud/cloud_run_offset_tiles/requirements.txt
pip install -r cerulean_cloud/cloud_run_orchestrator/requirements.txt
pip install -r cerulean_cloud/cloud_run_tipg/requirements.txt
pip install -r cerulean_cloud/titiler_sentinel/requirements.txt
# Setup pre-commit
pre-commit install

To activate your virtual environment:

workon cerulean-cloud

For notebook development

pip install ipykernel
python -m ipykernel install --user --name=cerulean-cloud

Running tests

You can run tests using pytest commands:

pytest
pytest test/test_cerulean_cloud/test_tiling.py # run only tests in a specific module
pytest test/test_cerulean_cloud/test_tiling.py::test_from_base_tiles_create_offset_tiles # run only a specific test

If you get an error while running tests mentioning that psycopg is not installed run:

pip install "psycopg[binary]"

Pulumi

Check available stages

pulumi stack ls

Select another stage

pulumi stack select test

Set config

Set secret values with (passwords, keys, etc):

pulumi config set db:db-password --secret

Set other config values with:

pulumi config set infra_distance

Preview changes (no need to run locally)

Make sure docker is running on your machine before running this command.

pulumi preview

This would be the output:

Previewing update (test):

docker:index:RemoteImage cerulean-cloud-images-test-remote-offset  completing deletion from previous update
docker:index:RemoteImage cerulean-cloud-images-test-remote-orchestrator  completing deletion from previous update
docker:index:RemoteImage cerulean-cloud-images-test-remote-tipg  completing deletion from previous update
-  docker:index:RemoteImage cerulean-cloud-images-test-remote-offset delete completing deletion from previous update
-  docker:index:RemoteImage cerulean-cloud-images-test-remote-tipg delete completing deletion from previous update
-  docker:index:RemoteImage cerulean-cloud-images-test-remote-orchestrator delete completing deletion from previous update
...
pulumi:pulumi:Stack cerulean-cloud-test running Creating lambda package [running in Docker]...
pulumi:pulumi:Stack cerulean-cloud-test running Building docker image...
pulumi:pulumi:Stack cerulean-cloud-test running Copying package.zip ...
pulumi:pulumi:Stack cerulean-cloud-test running Copied package package.zip ...
pulumi:pulumi:Stack cerulean-cloud-test  4 messages

This process runs on push for any open PRs, and you'll be able to see the output as a comment in your PR, like this one.

Deploy changes (no need to run locally)

pulumi up

Destroy and rebuild

  1. pulumi destroy (use the GUI to delete the database)
  2. pulumi refresh
  3. pulumi state delete {URN of any sticky resources}
  4. pulumi refresh
  5. pulumi destroy
  6. Test and deploy

If there is a lock on the stack, you can delete that lock in gs://cerulean-cloud-state/cerulean-cloud-images

Database

Database Schema as of 2023-08-01

Connecting

In order to connect to the deployed database, you can use the Cloud SQL proxy for authentication. First install the proxy on your local machine (instructions here).

You can then find the instance connection name and the connection string in the outputs of your active pulumi stack:

pulumi stack --show-secrets
# use `database_instance_name` in Cloud SQL proxy
# use `database_url_alembic` to connect in your client

Start the Cloud SQL proxy (make sure you are properly authenticated with GCP):

cd /path/to/cloud_sql_proxy  # directory where you installed the proxy
./cloud_sql_proxy -instances=${database_instance_name}=tcp:0.0.0.0:5432

In order to connect in pgAdmin, you can take apart the connection string that you get from the pulumi output:

postgresql://cerulean-cloud-test-database:some_password@127.0.0.1:5432/cerulean-cloud-test-database
# postgresql://${USER}:${PASSWORD}@${HOST}:${PORT}/${DATABASE_NAME}
# HOST and PORT refer to the cloud sql proxy host (your localhost)

In another process connect to the database (i.e. with psql):

psql ${database_url_alembic}

Migrations

We are using alembic to run migrations in our database. You can create a new revision using:

alembic revision -m "Add new table"

And apply this revision with:

# Ensure you have access to your database and have setup DB_URL environment variable with the connection string above
alembic upgrade head

If you want to look at common operations with alembic, make sure to check out the previously run migrations in the alembic/versions folder.
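For orientation, this is roughly what a revision file looks like; the revision IDs, table, and column below are purely hypothetical placeholders, not an actual migration from this repo:

"""Add an example column (hypothetical migration sketch)

Revision ID: abc123def456
Revises: 000000000000
"""
import sqlalchemy as sa
from alembic import op

# hypothetical revision identifiers; alembic generates real ones for you
revision = "abc123def456"
down_revision = "000000000000"
branch_labels = None
depends_on = None


def upgrade():
    # hypothetical example: add a nullable text column to an existing table
    op.add_column("slick", sa.Column("example_notes", sa.Text(), nullable=True))


def downgrade():
    op.drop_column("slick", "example_notes")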

During the deployment process with GitHub Actions, migrations will be automatically run when new revisions are included in the branch/commit.

Authentication

Most services deployed with cerulean-cloud are safeguarded against abuse by outside actors using API key authentication. This means that when interacting with the majority of the endpoints in your client of choice (e.g. httpx in Python, curl in your terminal, Postman, or QGIS), you should make sure to include the following authentication header:

{"Authorization": "Bearer SOME_API_KEY"}

The API key we use is set in the stack configuration file with pulumi and is encrypted. In order to access the API key for the currently selected stack you can run:

pulumi stack output api_key

You could then save this value as an environment variable for later use.

As an example, to place a request to the cloud run orchestrator, using httpx you'd do the following:

import httpx
URL = "https://cerulean-cloud-test-cloud-run-orch-5qkjkyomta-ew.a.run.app"
API_KEY = "SOME_API_KEY"
SCENES = [
    "S1A_IW_GRDH_1SDV_20230523T224049_20230523T224114_048667_05DA7A_91D1", # INDONESIA
    "S1A_IW_GRDH_1SDV_20230320T062953_20230320T063018_047724_05BB92_AC28", # UK
    "S1A_IW_GRDH_1SDV_20210523T005625_20210523T005651_038008_047C68_FE94", # INDIA
    "S1A_IW_GRDH_1SDV_20230711T160632_20230711T160657_049378_05F013_448A", # EGYPT
    "S1A_IW_GRDH_1SDV_20230330T015937_20230330T020003_047867_05C077_A6AC", # OMAN
    "S1A_IW_GRDH_1SDV_20230302T001557_20230302T001622_047458_05B29B_DF03", # MEXICO
    "S1A_IW_GRDH_1SDV_20230618T232014_20230618T232039_049047_05E5E0_718C", # USA
]
for sceneid in SCENES:
    orchestrator_result = httpx.post(
        URL + "/orchestrate",
        json={"sceneid": sceneid},
        timeout=None,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    print(orchestrator_result)

The services deployed by cerulean-cloud that DO NOT require this API key are:

  • tipg Cloud Run
  • Historical run Cloud Function
  • Scene relevancy Cloud Function

Adding a new Pulumi Stack

Don't forget to edit the following places:

  • historical_run.py (add the stack name to stage options)
  • TODO add other places/steps
  • create a new git branch with the same name
  • copy changes in Git hash 7f1dcda and b55e6c7 (but use more modern yamls as base), commit and push
  • if commit fails due to pre-commit, review files and accept changes, retry
  • go to GitHub Actions, and manually run Test and Deploy on the new branch (takes about 25 minutes)

If you are getting the following error on test & deploy, you should try pulumi refresh and run the GitHub Action again:

Error 409: Conflict for resource 'cerulean-cloud-test-cloud-run-tipg': version '1690549980909798' was specified but current version is '1690554168522592'

Updating the trained model

If you are going to deploy a new scripted model, first save it as a tracing model using the function "save_icevision_model_state_dict_and_tracing" in the cerulean-ml repo. Then, upload the experiment folder and its contents to the GCP ceruleanml bucket. Update the value of the Pulumi parameter cerulean-cloud-images:weights_name found in your local version of cerulean-cloud/images/stack_config/Pulumi.STACK_NAME_OF_INTEREST.yaml to match the experiment name you just uploaded. You must then push these changes to the git repo. Finally, do the following steps:

  1. Go to https://github.com/SkyTruth/cerulean-cloud/actions
  2. Then click "Test and Deploy"
  3. Click on "Run Workflow" and choose the branch you are working on
  4. Click "Run Workflow" to kick off the model upload

Troubleshooting

If pulumi throws funky errors at deployment, you can run in your current stack:

pulumi refresh

If you need to completely restart your stack, these steps have been found to work:

  1. pulumi destroy (use the GUI to delete the database)
  2. pulumi refresh
  3. pulumi state delete {URN of any sticky resources}
  4. pulumi refresh
  5. pulumi destroy

If there is a lock on the stack, you can delete that lock in gs://cerulean-cloud-state/cerulean-cloud-images.

If you get the following error, make sure you have docker running on your machine:

Exception: invoke of docker:index/getRegistryImage:getRegistryImage failed: invocation of docker:index/getRegistryImage:getRegistryImage returned an error: 1 error occurred:
        * Error pinging Docker server: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
    error: an unhandled error occurred: Program exited with non-zero exit code: 1

If you want to run tests on the database migrations locally, you can do the following steps:

  1. Launch Docker on your machine
  2. Run 'docker-compose up --build' in the folder where the alembic.ini is
  3. Run 'DB_URL=postgresql://user:password@localhost:5432/db alembic upgrade head' (or downgrade base) in the folder where the alembic.ini is (you may have to activate an environment and install dependencies, like tempdbenv)
  4. To connect with pgAdmin, add a new server with these properties:
  • Name: local
  • Host name: localhost
  • Port: 5432
  • Username: user
  • Password: password

Human in the loop (HITL) workflows using SQL

In general the HITL process is as follows in all cases:

  • Select the target slick ID(s) that you intend to change;
  • Update the slick "active" field to False;
  • Select all the "stable" values from the slick ID(s);
  • Save these values along with the changed values using INSERT INTO (when there are multiple slicks in the mix, make sure to use aggregation functions);
  • Make sure to set the slick column to the original slick ID(s) to keep an audit of changes across time;
  • Make sure to set the active column in the new slick(s) to True.

You can then use the function slick_history with a given slick_id to inspect the changes that occurred in the slick.

All of these examples have been run in the test environment, so you can inspect the results with the slick_history query there.

As a side note, ideally this process would be mediated by the streamlit app that we had envisioned - these queries could all be encoded in a module that interacts with a frontend, to make it really easy to run these common HITL changes to the slick table.

Validate Slick Class (Include Confidence Level)

UPDATE slick SET active=False WHERE id=34913;
INSERT INTO slick (
    slick_timestamp,
    geometry,
    machine_confidence,
    human_confidence,
    active,
    validated,
    slick,
    notes,
    meta,
    orchestrator_run,
    slick_class
)
SELECT
    slick_timestamp,
    geometry,
    machine_confidence,
    0.9,
    True,
    True,
    '{ 34913 }',
    notes,
    meta,
    orchestrator_run,
    slick_class
FROM slick WHERE id=34913
RETURNING id;
SELECT * FROM slick_history(34927);

Change slick class

UPDATE slick SET active=False WHERE id=34927;
INSERT INTO slick (
    slick_timestamp,
    geometry,
    machine_confidence,
    human_confidence,
    active,
    validated,
    slick,
    notes,
    meta,
    orchestrator_run,
    slick_class
)
SELECT
    slick_timestamp,
    geometry,
    machine_confidence,
    0.9,
    True,
    True,
    '{ 34927 }',
    notes,
    meta,
    orchestrator_run,
    3
FROM slick WHERE id=34927
RETURNING id;
SELECT * FROM slick_history(34930);

Combine two slicks into one

UPDATE slick SET active=False WHERE id IN (34817, 34816);
INSERT INTO slick (
    slick_timestamp,
    geometry,
    machine_confidence,
    human_confidence,
    active,
    validated,
    slick,
    notes,
    meta,
    orchestrator_run,
    slick_class
)
SELECT
    MAX(slick_timestamp),
    ST_Union(geometry::geometry),
    MIN(machine_confidence),
    0.9,
    True,
    True,
    '{ 34817, 34816 }',
    string_agg(notes, ','),
    jsonb_agg(meta),
    MAX(orchestrator_run),
    2
FROM slick WHERE id IN (34817, 34816)
RETURNING id;
SELECT * FROM slick_history(34931);

Break one slick into two (or n slicks)

UPDATE slick SET active=False WHERE id IN (34817);
INSERT INTO slick (
    slick_timestamp,
    geometry,
    machine_confidence,
    human_confidence,
    active,
    validated,
    slick,
    notes,
    meta,
    orchestrator_run,
    slick_class
)
SELECT
    slick_timestamp,
    ST_Multi(ST_Subdivide(geometry::geometry)),
    machine_confidence,
    0.9,
    True,
    True,
    '{ 34817 }',
    notes,
    meta,
    orchestrator_run,
    3
FROM slick WHERE id IN (34817)
RETURNING id; -- Returns 3 records
SELECT * FROM slick_history(34933);
SELECT * FROM slick_history(34934);
SELECT * FROM slick_history(34935);

Add note field to slick (e.g. record some action taken, or a reference ID from NOAA, ...)

UPDATE slick SET active=False WHERE id=34935;
INSERT INTO slick (
    slick_timestamp,
    geometry,
    machine_confidence,
    human_confidence,
    active,
    validated,
    slick,
    notes,
    meta,
    orchestrator_run,
    slick_class
)
SELECT
    slick_timestamp,
    geometry,
    machine_confidence,
    0.9,
    True,
    True,
    '{ 34935 }',
    'This is a funny slick!',
    meta,
    orchestrator_run,
    slick_class
FROM slick WHERE id=34935
RETURNING id;
SELECT * FROM slick_history(34936);