When creating a Google Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.
Initialization actions are stored in a Google Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:
gcloud dataproc clusters create CLUSTER-NAME
[--initialization-actions [GCS_URI,...]]
[--initialization-action-timeout TIMEOUT]
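As a concrete illustration of the synopsis above, the following minimal sketch fills in the placeholders and prints the resulting command so it can be reviewed before it is run. The cluster name, the bucket `gs://my-bucket`, the script name `my-init-action.sh`, and the helper `build_create_command` are all hypothetical; only the two flags come from the synopsis.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your own cluster name and script location.
CLUSTER_NAME="example-cluster"
INIT_ACTION_URI="gs://my-bucket/my-init-action.sh"   # hypothetical script in your own bucket
INIT_ACTION_TIMEOUT="10m"                            # time limit applied to each action

# Hypothetical helper: print the full command instead of executing it.
# Arguments: cluster name, initialization action URI, timeout.
build_create_command() {
  printf 'gcloud dataproc clusters create %s --initialization-actions %s --initialization-action-timeout %s\n' \
    "$1" "$2" "$3"
}

build_create_command "$CLUSTER_NAME" "$INIT_ACTION_URI" "$INIT_ACTION_TIMEOUT"
```

Once the printed command looks right, paste it into a shell where gcloud is installed and authenticated. To pass several initialization actions, separate their URIs with commas, as the `[GCS_URI,...]` form in the synopsis indicates.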
For convenience, copies of the initialization actions in this repository are stored in the following publicly accessible Cloud Storage bucket:
gs://dataproc-initialization-actions
The folder structure of this Cloud Storage bucket mirrors this repository. You should be able to use this Cloud Storage bucket (and the initialization scripts within it) for your clusters.
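Because the bucket mirrors the repository layout, the URI of a script can be derived from its path in the repository. The sketch below shows that mapping; the helper `public_action_uri` and the path `some-folder/some-action.sh` are hypothetical placeholders, not names taken from this repository.

```shell
#!/bin/sh
PUBLIC_BUCKET="gs://dataproc-initialization-actions"

# Hypothetical helper: map a path relative to the repository root
# to the URI of its copy in the public bucket.
public_action_uri() {
  # Strip one leading slash, if present, so the result has no double slash.
  printf '%s/%s\n' "$PUBLIC_BUCKET" "${1#/}"
}

# Placeholder path (assumption): substitute a real folder and script from the repository.
public_action_uri "some-folder/some-action.sh"
```

The printed URI is the value to pass to `--initialization-actions` when creating a cluster.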
These samples show how various packages and components can be installed on Cloud Dataproc clusters. You should understand how these samples work before running them on your clusters. The initialization actions in this repository are provided without support, and you use them at your own risk.
This repository presently offers the following actions for use with Cloud Dataproc clusters.
- Install packages/software on the cluster
- Configure the cluster
- Configure a nice shell environment
- Share an NFS consistency cache
- Share a Google Cloud SQL Hive Metastore
For more information, review the Dataproc documentation. You can also post questions to the Stack Overflow community with the tag google-cloud-dataproc.
See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.
- See CONTRIBUTING.md
- See LICENSE