Self-contained ready-to-use Python scripts to help Data Citizens who work with Google Cloud Data Catalog.
- 1. Get to know the concepts behind this code
- 2. Environment setup
- 3. Quickstart
- 4. Load Tag Templates from CSV files
- 5. Load Tag Templates from Google Sheets
- 6. How to contribute
-
Data Catalog hands-on guide: a mental model @ Google Cloud Community / Medium
-
Data Catalog hands-on guide: search, get & lookup with Python @ Google Cloud Community / Medium
-
Data Catalog hands-on guide: templates & tags with Python @ Google Cloud Community / Medium
git clone https://github.com/ricardolsmendes/gcp-datacatalog-python.git
cd gcp-datacatalog-python
2.2.1. Create a service account and grant it the roles below
- BigQuery Admin
- Data Catalog Admin
2.2.2. Download a JSON key and save it as
./credentials/datacatalog-samples.json
Using virtualenv is optional, but strongly recommended unless you use Docker.
2.3.1. Install Python 3.6+
2.3.2. Create and activate an isolated Python environment
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
2.3.3. Install the dependencies
pip install --upgrade -r requirements.txt
2.3.4. Set environment variables
export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-samples.json
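All scripts authenticate through this variable, so a quick sanity check before running anything can save debugging time. A minimal sketch, assuming a service-account JSON key (the `check_credentials` helper below is illustrative, not part of this repository):

```python
import json
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Hypothetical helper: verify the env var points to a readable
    service-account key file before running any script."""
    path = os.environ.get(env_var)
    if not path or not os.path.isfile(path):
        return False
    try:
        with open(path) as f:
            key = json.load(f)
    except ValueError:
        return False
    # Service-account key files carry a "type" field set to "service_account".
    return key.get("type") == "service_account"
```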
Docker may be used to run all the scripts. In that case, please disregard the Set up Virtualenv instructions.
Integration tests help to make sure Google Cloud APIs and Service Account IAM roles have been properly set up before running a script. They actually communicate with the APIs and create temporary resources that are deleted right after use.
- pytest
export GOOGLE_CLOUD_TEST_ORGANIZATION_ID=<YOUR-ORGANIZATION-ID>
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/quickstart_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_ORGANIZATION_ID=<YOUR-ORGANIZATION-ID> \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/quickstart_test.py
- python
python quickstart.py --organization-id <YOUR-ORGANIZATION-ID> --project-id <YOUR-PROJECT-ID>
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python quickstart.py --organization-id <YOUR-ORGANIZATION-ID> --project-id <YOUR-PROJECT-ID>
- A master file named with the Template ID (i.e., template-abc.csv if your Template ID is template_abc). This file may contain as many lines as needed to represent the template. The first line is always discarded, as it's supposed to contain headers. Each field line must have 3 values: the first is the Field ID; the second is its Display Name; the third is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent a field's predefined values (more on this below...).
- If the template has ENUM fields, the script looks for a "display names file" for each of them. The files shall be named with the fields' names (i.e., enum-field-xyz.csv if an ENUM Field ID is enum_field_xyz). Each file must have just one value per line, representing a display name.
- If the template has multivalued fields, the script looks for a "values file" for each of them. The files shall be named with the fields' names (i.e., multivalued-field-xyz.csv if a multivalued Field ID is multivalued_field_xyz). Each file must have just one value per line, representing a short description for the value. The script will generate the Field's ID and Display Name based on it.
- All Field IDs generated by the script will be formatted in snake case (e.g., foo_bar_baz); the script does the formatting for you, so just provide the IDs as plain strings.
TIP: keep all template-related files in the same folder (sample-input/load-template-csv for reference).
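The snake-case normalization mentioned above can be sketched as follows (an illustrative approximation; the actual script's formatting rules may differ in edge cases):

```python
import re


def to_snake_case(text):
    """Normalize an arbitrary ID string to snake case."""
    # Collapse spaces and hyphens into underscores.
    text = re.sub(r"[\s-]+", "_", text.strip())
    # Split camelCase boundaries with an underscore.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    return text.lower()
```

For example, both "Foo Bar-Baz" and "fooBarBaz" would normalize to foo_bar_baz.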
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/load_template_csv_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/load_template_csv_test.py
- python
python load_template_csv.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --files-folder <FILES-FOLDER> \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python load_template_csv.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --files-folder <FILES-FOLDER> \
[--delete-existing]
https://console.developers.google.com/apis/library/sheets.googleapis.com
- A master sheet named with the Template ID (i.e., template-abc if your Template ID is template_abc). This sheet may contain as many lines as needed to represent the template. The first line is always discarded, as it's supposed to contain headers. Each field line must have 3 values: column A is the Field ID; column B is its Display Name; column C is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent a field's predefined values (more on this below...).
- If the template has ENUM fields, the script looks for a "display names sheet" for each of them. The sheets shall be named with the fields' names (i.e., enum-field-xyz if an ENUM Field ID is enum_field_xyz). Each sheet must have just one value per line (column A), representing a display name.
- If the template has multivalued fields, the script looks for a "values sheet" for each of them. The sheets shall be named with the fields' names (i.e., multivalued-field-xyz if a multivalued Field ID is multivalued_field_xyz). Each sheet must have just one value per line (column A), representing a short description for the value. The script will generate the Field's ID and Display Name based on it.
- All Field IDs generated by the script will be formatted in snake case (e.g., foo_bar_baz); the script does the formatting for you, so just provide the IDs as plain strings.
TIP: keep all template-related sheets in the same document (Data Catalog Sample Tag Template for reference).
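Assuming the Sheets API has already returned the master sheet rows as a list of lists, the parsing described above might look like this (`parse_master_rows` and the sample rows are illustrative, not the script's actual API):

```python
def parse_master_rows(rows):
    """Turn raw master-sheet rows into field descriptors."""
    fields = []
    for row in rows[1:]:  # the first row holds headers and is discarded
        field_id, display_name, field_type = row[:3]
        fields.append({
            "id": field_id,
            "display_name": display_name,
            "type": field_type.upper(),
        })
    return fields


# Sample rows mirroring the columns A, B, and C described above.
rows = [
    ["Field ID", "Display Name", "Type"],
    ["approved", "Approved?", "BOOL"],
    ["owner_team", "Owner Team", "STRING"],
]
```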
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/load_template_google_sheets_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/load_template_google_sheets_test.py
- python
python load_template_google_sheets.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --spreadsheet-id <SPREADSHEET-ID> \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python load_template_google_sheets.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --spreadsheet-id <SPREADSHEET-ID> \
[--delete-existing]
Please make sure to take a moment and read the Code of Conduct.
Please report bugs and suggest features via the GitHub Issues.
Before opening an issue, search the tracker for possible duplicates. If you find a duplicate, please add a comment saying that you encountered the problem as well.
Please make sure to read the Contributing Guide before making a pull request.