/analytics-toolbox-core

A set of UDFs and Procedures to extend BigQuery, Snowflake, Redshift and Postgres with Spatial Analytics capabilities

Primary LanguageJavaScriptOtherNOASSERTION

CARTO Analytics Toolbox Core

The CARTO Analytics Toolbox is a set of UDFs and Stored Procedures to unlock Spatial Analytics. It is organized into modules based on the functionality they offer. This toolbox is cloud-native, which means it is available for different data warehouses: BigQuery, Snowflake, and Redshift. It is built on top of the data warehouse's GIS features, extending and complementing this functionality.

This repo contains the core open-source modules of the toolbox. CARTO offers a set of premium modules that are available for CARTO users.

Getting started

Using the functions on this project depends on the Datawarehouse you are using. In BigQuery and Snowflake you can access them directly as a shared resoueces without having to install them, for the rest you will have to install them locally on your database.

BigQuery

You can use directly the functions as they are globally shared in the US region.

SELECT `carto-os.carto.H3_CENTER`('847b59dffffffff')

If you need to use them from the Europe region use:

SELECT `carto-os-eu.carto.H3_CENTER`('847b59dffffffff')

If you need to install them on your own VPC or in a different region, follow the instructions later on.

Snowflake

The easiest way to start using these functions is to add them to your Datawarehouse through the Snowflake Marketplace. Go there and install it using theregular methods. After that you should be able to use them on the location you have installed them. For example try:

SELECT carto.H3_FROMGEOGPOINT(ST_POINT(40.4168, -3.7038), 4)

If you need to install them directly, not through the data share process, follow the instructions later on.

Redshift

Right now the only way to get access the Analytics toolbox is by installing it directly on your database. Follow the instructions later on.

Documentation

Cloud Documentation
BigQuery https://docs.carto.com/analytics-toolbox-bigquery
Snowflake https://docs.carto.com/analytics-toolbox-snowflake
Redshift https://docs.carto.com/analytics-toolbox-redshift

Development

The repo contains the implementation of the toolbox for all the clouds. The functions are organized in modules. Each module has the following structure:

  • doc: contains the SQL reference of the functions
  • lib: contains the library code (JavaScript/Python)
  • sql: contains the function's code (SQL)
  • test: contains both the unit and integration tests

Inside a module, you can run the following commands. These commands are available for any {module}/{cloud}. For example, you can enter the module cd modules/accessors/bigquery and then run the command:

  • make help: shows the commands available in the Makefile.
  • make lint: runs a linter (using eslint or flake8).
  • make lint-fix: runs a linter (using eslint or flake8) and fixes the trivial issues.
  • make build: builds the bundles (using rollup or zip).
  • make deploy: builds the bundles and deploys and SQL functions to the data warehouse using the env variables.
  • make test-unit: builds the bundles and runs the unit tests for these bundles (using jest or pytest).
  • make test-integration: runs just the integration tests (using jest or pytest). It performs requests to real data warehouses.
  • make test-integration-full: runs the full-path integration tests (using jest or pytest). It is equivalent to deploy + test-integration + clean-deploy.
  • make clean: removes the installed dependencies and generated files.
  • make clean-deploy: removes all the assets, functions, procedures, tables uploaded in the deploy.

These commands can be used with all the modules at once from the root folder. For example, make deploy CLOUD=bigquery.

Additionally, this tool has been developed to generate code templates for new modules and functions.

BigQuery

The Analytics Toolbox for BigQuery contains SQL functions and JavaScript libraries. The functions are deployed in a dataset called carto inside a specific project. In BigQuery, datasets are associated with a region so the functions can only be used with tables stored in datasets with the same region. The JavaScript libraries are deployed in a Google Cloud Storage bucket and referenced by the functions.

Tools

Make sure you have installed the following tools:

Environment variables

The .env file contains the variables required to deploy and run the toolbox.

# BigQuery
BQ_PROJECT=your-bigquery-project
BQ_BUCKET=gs://your-gcs-bucket/
BQ_REGION=your-region
BQ_DATASET_PREFIX=
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service/account/or/adc.json

Note: you may need to run gcloud auth login to generate the acd.json file.

Snowflake

The Analytics Toolbox for Snowflake contains SQL functions and JavaScript libraries. The functions are deployed in a schema called carto inside a specific database. The JavaScript libraries are deployed inside the SQL functions. In Snowflake, the functions can be used with tables of any database in the same account.

Tools

Make sure you have installed the following tools:

Environment variables

The .env file contains the variables required to deploy and run the toolbox.

# Snowflake
SF_ACCOUNT=your-snowflake-account
SF_DATABASE=your-snowflake-database
SF_SCHEMA_PREFIX=
SF_USER=your-snowflake-user
SF_PASSWORD=your-snowflake-password
SF_SHARE_PREFIX=
SF_SHARE_ENABLED=0

Redshift

The Analytics Toolbox for Redshift contains SQL functions and Python libraries. The functions are deployed in a schema called carto inside a specific database. The Python libraries are installed in the Redshift cluster, so they can be used by all the databases in the cluster. An S3 bucket is required as intermediate storage to create the libraries. In Redshift, the functions can be used with tables of the same database, but different schemas.

Note: Redshift UDFs only support Python2 but the Python Redshift connector is only available in Python3. Therefore, both Python versions are required to develop the toolbox.

Tools

Make sure you have installed the following tools:

Environment variables

The .env file contains the variables required to deploy and run the toolbox.

# Redshift
RS_HOST=your-redshift-host-url.redshift.amazonaws.com
RS_CLUSTER_ID=your-redshift-cluster
RS_REGION=your-region
RS_BUCKET=s3://your-s3-bucket/
RS_DATABASE=your-redshift-database
RS_SCHEMA_PREFIX=
RS_USER=your-redshift-user
RS_PASSWORD=your-redshift-password
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-access-key

Contribute

This project is public. We are more than happy of receiving feedback and contributions. Feel free to open a ticket with a bug, a doubt or a discussion, or open a pull request with a fix or a new feature.