terraform-google-bigquery

This module allows you to create opinionated Google Cloud Platform BigQuery datasets and tables. This will allow the user to programmatically create an empty table schema inside of a dataset, ready for loading. Additional user accounts and permissions are necessary to begin querying the newly created table(s).

Compatibility

This module is meant for use with Terraform 0.13+ and tested using Terraform 1.0+. If you find incompatibilities using Terraform >=0.13, please open an issue. If you haven't upgraded and need a Terraform 0.12.x-compatible version of this module, the last released version intended for Terraform 0.12.x is v4.5.0.

Upgrading

The current version is 4.X. The following guides are available to assist with upgrades:

Usage

Basic usage of this module is as follows:

module "bigquery" {
  source  = "terraform-google-modules/bigquery/google"
  version = "~> 8.0"

  dataset_id                  = "foo"
  dataset_name                = "foo"
  description                 = "some description"
  project_id                  = "<PROJECT ID>"
  location                    = "US"
  default_table_expiration_ms = 3600000

  tables = [
  {
    table_id           = "foo",
    schema             =  "<SCHEMA JSON DATA>",
    time_partitioning  = {
      type                     = "DAY",
      field                    = null,
      require_partition_filter = false,
      expiration_ms            = null,
    },
    range_partitioning = null,
    expiration_time = null,
    clustering      = ["fullVisitorId", "visitId"],
    labels          = {
      env      = "dev"
      billable = "true"
      owner    = "joedoe"
    },
  },
  {
    table_id           = "bar",
    schema             =  "<SCHEMA JSON DATA>",
    time_partitioning  = null,
    range_partitioning = {
      field = "customer_id",
      range = {
        start    = "1"
        end      = "100",
        interval = "10",
      },
    },
    expiration_time    = 2524604400000, # 2050/01/01
    clustering         = [],
    labels = {
      env      = "devops"
      billable = "true"
      owner    = "joedoe"
    }
  }
  ],

  views = [
    {
      view_id    = "barview",
      use_legacy_sql = false,
      query          = <<EOF
      SELECT
       column_a,
       column_b,
      FROM
        `project_id.dataset_id.table_id`
      WHERE
        approved_user = SESSION_USER
      EOF,
      labels = {
        env      = "devops"
        billable = "true"
        owner    = "joedoe"
      }
    }
  ]
  dataset_labels = {
    env      = "dev"
    billable = "true"
  }
}

Functional examples are included in the examples directory.

Variable `tables` detailed description

The tables variable should be provided as a list of object with the following keys:

{
  table_id = "some_id"                        # Unique table id (will be used as ID for table).
  table_name = "Friendly Name"                # Optional friendly name for table. If not set, the "table_id" will be used by default.
  schema = file("path/to/schema.json")        # Schema as JSON string.
  time_partitioning = {                       # Set it to `null` to omit partitioning configuration for the table.
        type                     = "DAY",     # The only type supported is DAY, which will generate one partition per day based on data loading time.
        field                    = null,      # The field used to determine how to create a time-based partition. If time-based partitioning is enabled without this value, the table is partitioned based on the load time. Set it to `null` to omit configuration.
        require_partition_filter = false,     # If set to true, queries over this table require a partition filter that can be used for partition elimination to be specified. Set it to `null` to omit configuration.
        expiration_ms            = null,      # Number of milliseconds for which to keep the storage for a partition.
      },
  range_partitioning = {                      # Set it to `null` to omit partitioning configuration for the table.
    field = "integer_column",                 # The column used to create the integer range partitions.
    range = {
      start    = "1"                          # The start of range partitioning, inclusive.
      end      = "100",                       # The end of range partitioning, exclusive.
      interval = "10",                        # The width of each range within the partition.
        },
      },
  clustering = ["fullVisitorId", "visitId"]   # Specifies column names to use for data clustering. Up to four top-level columns are allowed, and should be specified in descending priority order. Partitioning should be configured in order to use clustering.
  expiration_time = 2524604400000             # The time when this table expires, in milliseconds since the epoch. If set to `null`, the table will persist indefinitely.
  deletion_protection = true                  # Optional. Configures deletion_protection for the table. If unset, module-level deletion_protection setting will be used.
  labels = {                                  # A mapping of labels to assign to the table.
      env      = "dev"
      billable = "true"
    }
}

Variable `views` detailed description

The views variable should be provided as a list of object with the following keys:

{
  view_id = "some_id"                                                # Unique view id. it will be set to friendly name as well
  query = "Select user_id, name from `project_id.dataset_id.table`"  # the Select query that will create the view. Tables should be created before.
  use_legacy_sql = false                                             # whether to use legacy sql or standard sql
  labels = {                                                         # A mapping of labels to assign to the view.
      env      = "dev"
      billable = "true"
  }
}

Variable `routines` detailed description

The routines variable should be provided as a list of object with the following keys:

{
  routine_id = "some_id"                     # The ID of the routine. The ID must contain only letters, numbers, or underscores. The maximum length is 256 characters.
  routine_type = "PROCEDURE"                 # The type of routine. Possible values are SCALAR_FUNCTION and PROCEDURE.
  language = "SQL"                           # The language of the routine. Possible values are SQL and JAVASCRIPT.
  definition_body = "CREATE FUNCTION test return x*y;"  # The body of the routine. For functions, this is the expression in the AS clause. If language=SQL, it is the substring inside (but excluding) the parentheses.
  return_type     = null                     # A JSON schema for the return type. Optional if language = "SQL"; required otherwise. If absent, the return type is inferred from definitionBody at query time in each query that references this routine. If present, then the evaluated result will be cast to the specified returned type at query time.
  description = "Description"               # The description of the routine if defined.
  arguments = [                             # Set it to `null` to omit arguments block configuration for the routine.
    {
      name      = "x",                      # The name of this argument. Can be absent for function return argument.
      data_type = null,                     # A JSON schema for the data type. Required unless argumentKind = ANY_TYPE.
      argument_kind = "ANY_TYPE"            # Defaults to FIXED_TYPE. Default value is FIXED_TYPE. Possible values are FIXED_TYPE and ANY_TYPE.
      mode = null                           # Specifies whether the argument is input or output. Can be set for procedures only. Possible values are IN, OUT, and INOUT.
    }
  ]
}

A detailed example with authorized views can be found here.

Features

This module provisions a dataset and a list of tables with associated JSON schemas and views from queries.

Inputs

Name	Description	Type	Default	Required
access	An array of objects that define dataset access for one or more entities.	`any`	[ { "role": "roles/bigquery.dataOwner", "special_group": "projectOwners" } ]	no
dataset_id	Unique ID for the dataset being provisioned.	`string`	n/a	yes
dataset_labels	Key value pairs in a map for dataset labels	`map(string)`	`{}`	no
dataset_name	Friendly name for the dataset being provisioned.	`string`	`null`	no
default_partition_expiration_ms	The default partition expiration for all partitioned tables in the dataset, in MS	`number`	`null`	no
default_table_expiration_ms	TTL of tables using the dataset in MS	`number`	`null`	no
delete_contents_on_destroy	(Optional) If set to true, delete all the tables in the dataset when destroying the resource; otherwise, destroying the resource will fail if tables are present.	`bool`	`null`	no
deletion_protection	Whether or not to allow deletion of tables and external tables defined by this module. Can be overriden by table-level deletion_protection configuration.	`bool`	`false`	no
description	Dataset description.	`string`	`null`	no
encryption_key	Default encryption key to apply to the dataset. Defaults to null (Google-managed).	`string`	`null`	no
external_tables	A list of objects which include table_id, expiration_time, external_data_configuration, and labels.	list(object({ table_id = string, description = optional(string), autodetect = bool, compression = string, ignore_unknown_values = bool, max_bad_records = number, schema = string, source_format = string, source_uris = list(string), csv_options = object({ quote = string, allow_jagged_rows = bool, allow_quoted_newlines = bool, encoding = string, field_delimiter = string, skip_leading_rows = number, }), google_sheets_options = object({ range = string, skip_leading_rows = number, }), hive_partitioning_options = object({ mode = string, source_uri_prefix = string, }), expiration_time = string, max_staleness = optional(string), deletion_protection = optional(bool), labels = map(string), }))	`[]`	no
location	The regional location for the dataset only US and EU are allowed in module	`string`	`"US"`	no
materialized_views	A list of objects which includes view_id, view_query, clustering, time_partitioning, range_partitioning, expiration_time and labels	list(object({ view_id = string, description = optional(string), query = string, enable_refresh = bool, refresh_interval_ms = string, clustering = list(string), time_partitioning = object({ expiration_ms = string, field = string, type = string, require_partition_filter = bool, }), range_partitioning = object({ field = string, range = object({ start = string, end = string, interval = string, }), }), expiration_time = string, max_staleness = optional(string), labels = map(string), }))	`[]`	no
max_time_travel_hours	Defines the time travel window in hours	`number`	`null`	no
project_id	Project where the dataset and table are created	`string`	n/a	yes
routines	A list of objects which include routine_id, routine_type, routine_language, definition_body, return_type, routine_description and arguments.	list(object({ routine_id = string, routine_type = string, language = string, definition_body = string, return_type = string, description = string, arguments = list(object({ name = string, data_type = string, argument_kind = string, mode = string, })), }))	`[]`	no
storage_billing_model	Specifies the storage billing model for the dataset. Set this flag value to LOGICAL to use logical bytes for storage billing, or to PHYSICAL to use physical bytes instead. LOGICAL is the default if this flag isn't specified.	`string`	`null`	no
tables	A list of objects which include table_id, table_name, schema, clustering, time_partitioning, range_partitioning, expiration_time and labels.	list(object({ table_id = string, description = optional(string), table_name = optional(string), schema = string, clustering = list(string), require_partition_filter = optional(bool), time_partitioning = object({ expiration_ms = string, field = string, type = string, }), range_partitioning = object({ field = string, range = object({ start = string, end = string, interval = string, }), }), expiration_time = string, deletion_protection = optional(bool), labels = map(string), }))	`[]`	no
views	A list of objects which include view_id and view query	list(object({ view_id = string, description = optional(string), query = string, use_legacy_sql = bool, labels = map(string), }))	`[]`	no

Outputs

Name	Description
bigquery_dataset	Bigquery dataset resource.
bigquery_external_tables	Map of BigQuery external table resources being provisioned.
bigquery_tables	Map of bigquery table resources being provisioned.
bigquery_views	Map of bigquery view resources being provisioned.
external_table_ids	Unique IDs for any external tables being provisioned
external_table_names	Friendly names for any external tables being provisioned
project	Project where the dataset and tables are created
routine_ids	Unique IDs for any routine being provisioned
table_ids	Unique id for the table being provisioned
table_names	Friendly name for the table being provisioned
view_ids	Unique id for the view being provisioned
view_names	friendlyname for the view being provisioned

Requirements

These sections describe requirements for using this module.

Software

The following dependencies must be available:

Terraform >= 0.13.0
Terraform Provider for GCP plugin v3

Service Account

A service account with the following roles must be used to provision the resources of this module:

BigQuery Data Owner: roles/bigquery.dataOwner

The Project Factory module and the IAM module may be used in combination to provision a service account with the necessary roles applied.

Script Helper

A helper script for configuring a Service Account is located at (./helpers/setup-sa.sh).

APIs

A project with the following APIs enabled must be used to host the resources of this module:

BigQuery JSON API: bigquery-json.googleapis.com

The Project Factory module can be used to provision a project with the necessary APIs enabled.

Contributing

Refer to the contribution guidelines for information on contributing to this module.

SamuZad/terraform-google-bigquery