Self-contained ready-to-use Python scripts to help Data Citizens who work with Google Cloud Data Catalog.
- 1. Get to know the concepts behind this code
- 2. Environment setup
- 3. Quickstart
- 4. Load Tag Templates from CSV files
- 5. Load Tag Templates from Google Sheets
- 6. How to contribute
-
Data Catalog hands-on guide: a mental model @ Google Cloud Community / Medium
-
Data Catalog hands-on guide: search, get & lookup with Python @ Google Cloud Community / Medium
-
Data Catalog hands-on guide: templates & tags with Python @ Google Cloud Community / Medium
git clone https://github.com/ricardolsmendes/gcp-datacatalog-python.git
cd gcp-datacatalog-python
2.2.1. Create a service account and grant it the roles below
- BigQuery Admin
- Data Catalog Admin
2.2.2. Download a JSON key and save it as
./credentials/datacatalog-samples.json
Using virtualenv is optional, but strongly recommended unless you use Docker.
2.3.1. Install Python 3.6+
2.3.2. Create and activate an isolated Python environment
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
2.3.3. Install the dependencies
pip install --upgrade -r requirements.txt
2.3.4. Set environment variables
export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-samples.json
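All scripts authenticate through this variable, so a quick sanity check before running anything can save debugging time. A minimal sketch, assuming a service-account JSON key (the `check_credentials` helper below is illustrative, not part of this repository):

```python
import json
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Hypothetical helper: verify the env var points to a readable
    service-account key file before running any script."""
    path = os.environ.get(env_var)
    if not path or not os.path.isfile(path):
        return False
    try:
        with open(path) as f:
            key = json.load(f)
    except ValueError:
        return False
    # Service-account key files carry a "type" field set to "service_account".
    return key.get("type") == "service_account"
```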
Docker may be used to run all the scripts. In that case, please disregard the Set up Virtualenv instructions.
Integration tests help to make sure Google Cloud APIs and Service Account IAM roles have been properly set up before running a script. They actually communicate with the APIs and create temporary resources that are deleted right after use.
- pytest
export GOOGLE_CLOUD_TEST_ORGANIZATION_ID=<YOUR-ORGANIZATION-ID>
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/quickstart_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_ORGANIZATION_ID=<YOUR-ORGANIZATION-ID> \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/quickstart_test.py
- python
python quickstart.py --organization-id <YOUR-ORGANIZATION-ID> --project-id <YOUR-PROJECT-ID>
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python quickstart.py --organization-id <YOUR-ORGANIZATION-ID> --project-id <YOUR-PROJECT-ID>
- A master file named with the Template ID (i.e., template-abc.csv if your Template ID is template_abc). This file may contain as many lines as needed to represent the template. The first line is always discarded, as it's supposed to contain headers. Each field line must have 3 values: the first is the Field ID; the second is its Display Name; the third is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent a field's predefined values (more on this below...).
- If the template has ENUM fields, the script looks for a "display names file" for each of them. The files shall be named with the fields' names (i.e., enum-field-xyz.csv if an ENUM Field ID is enum_field_xyz). Each file must have just one value per line, representing a display name.
- If the template has multivalued fields, the script looks for a "values file" for each of them. The files shall be named with the fields' names (i.e., multivalued-field-xyz.csv if a multivalued Field ID is multivalued_field_xyz). Each file must have just one value per line, representing a short description for the value. The script will generate the Field's ID and Display Name based on it.
- All Field IDs generated by the script will be formatted in snake case (e.g., foo_bar_baz); the script does the formatting for you, so just provide the IDs as plain strings.
TIP: keep all template-related files in the same folder (sample-input/load-template-csv for reference).
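The snake-case normalization mentioned above can be sketched as follows (an illustrative approximation; the actual script's formatting rules may differ in edge cases):

```python
import re


def to_snake_case(text):
    """Normalize an arbitrary ID string to snake case."""
    # Collapse spaces and hyphens into underscores.
    text = re.sub(r"[\s-]+", "_", text.strip())
    # Split camelCase boundaries with an underscore.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    return text.lower()
```

For example, both "Foo Bar-Baz" and "fooBarBaz" would normalize to foo_bar_baz.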
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/load_template_csv_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/load_template_csv_test.py
- python
python load_template_csv.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --files-folder <FILES-FOLDER> \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python load_template_csv.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --files-folder <FILES-FOLDER> \
[--delete-existing]
https://console.developers.google.com/apis/library/sheets.googleapis.com
- A master sheet named with the Template ID (i.e., template-abc if your Template ID is template_abc). This sheet may contain as many lines as needed to represent the template. The first line is always discarded, as it's supposed to contain headers. Each field line must have 3 values: column A is the Field ID; column B is its Display Name; column C is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent a field's predefined values (more on this below...).
- If the template has ENUM fields, the script looks for a "display names sheet" for each of them. The sheets shall be named with the fields' names (i.e., enum-field-xyz if an ENUM Field ID is enum_field_xyz). Each sheet must have just one value per line (column A), representing a display name.
- If the template has multivalued fields, the script looks for a "values sheet" for each of them. The sheets shall be named with the fields' names (i.e., multivalued-field-xyz if a multivalued Field ID is multivalued_field_xyz). Each sheet must have just one value per line (column A), representing a short description for the value. The script will generate the Field's ID and Display Name based on it.
- All Field IDs generated by the script will be formatted in snake case (e.g., foo_bar_baz); the script does the formatting for you, so just provide the IDs as plain strings.
TIP: keep all template-related sheets in the same document (Data Catalog Sample Tag Template for reference).
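Assuming the Sheets API has already returned the master sheet rows as a list of lists, the parsing described above might look like this (`parse_master_rows` and the sample rows are illustrative, not the script's actual API):

```python
def parse_master_rows(rows):
    """Turn raw master-sheet rows into field descriptors."""
    fields = []
    for row in rows[1:]:  # the first row holds headers and is discarded
        field_id, display_name, field_type = row[:3]
        fields.append({
            "id": field_id,
            "display_name": display_name,
            "type": field_type.upper(),
        })
    return fields


# Sample rows mirroring the columns A, B, and C described above.
rows = [
    ["Field ID", "Display Name", "Type"],
    ["approved", "Approved?", "BOOL"],
    ["owner_team", "Owner Team", "STRING"],
]
```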
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID>
pytest ./tests/integration/load_template_google_sheets_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=<YOUR-PROJECT-ID> \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python pytest ./tests/integration/load_template_google_sheets_test.py
- python
python load_template_google_sheets.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --spreadsheet-id <SPREADSHEET-ID> \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--volume <CREDENTIALS-FILE-FOLDER>:/credentials \
gcp-datacatalog-python python load_template_google_sheets.py \
--template-id <TEMPLATE-ID> --display-name <DISPLAY-NAME> \
--project-id <YOUR-PROJECT-ID> --spreadsheet-id <SPREADSHEET-ID> \
[--delete-existing]
Please make sure to take a moment and read the Code of Conduct.
Please report bugs and suggest features via the GitHub Issues.
Before opening an issue, search the tracker for possible duplicates. If you find a duplicate, please add a comment saying that you encountered the problem as well.
Please make sure to read the Contributing Guide before making a pull request.