Dataproc clusters feature the following types of components:
- Installed components: Components that are installed in the image and activated when the cluster is created.
- Optional components: Components that you select to install and use on your cluster when you create the cluster. Dataproc installs and activates optional components depending on the cluster image version as follows:
  - 2.2 and earlier image versions: Optional components are automatically installed. Selected optional components are activated and non-selected optional components are uninstalled at cluster creation.
  - 2.3 and later image versions: Optional components are installed during cluster creation. For more information, see Dataproc 2.3.x release versions.
- Initialization action components: Components installed on a cluster as part of an initialization action that you specify when you create a cluster.
Optional components are installed on a cluster before initialization actions are run on the cluster.
The Dataproc image version pages list the components and component types available in the latest Dataproc image releases.
Optional components have the following advantages over initialization actions used to install components:
- Optional components are tested as compatible with specific Dataproc versions.
- Optional components are enabled with a cluster creation parameter; initialization actions require a script.
Available optional components
Optional component | Component name in Google Cloud CLI commands and API requests | Image Version | Release Stage |
---|---|---|---|
Delta Lake | DELTA | 2.2.46 and later | GA |
Docker | DOCKER | 1.5 and later | GA |
Flink | FLINK | 1.5 and later | GA |
HBase | HBASE | 1.5 and later (not available in 2.1 and later) | Deprecated |
Hive WebHCat | HIVE_WEBHCAT | 1.3 and later | GA |
Hudi | HUDI | 1.5 and later | GA |
Iceberg | ICEBERG | 2.2 and later | GA |
Jupyter Notebook | JUPYTER | 1.3 and later | GA |
Pig | PIG | 1.5* and later | GA |
Presto | PRESTO | 1.3 and later (not available in 2.1 and later) | GA |
Ranger | RANGER | 1.3 and later | GA |
Solr | SOLR | 1.3 and later | GA |
Trino | TRINO | 2.1 and later | GA |
Zeppelin Notebook | ZEPPELIN | 1.3 and later | GA |
Zookeeper | ZOOKEEPER | 1.0 and later | GA |
Notes:
- Apache Pig is an optional component in image versions 2.3 and later. It was pre-installed in 2.2 and earlier image versions.
Add optional components
Console
- In the Google Cloud console, go to the Dataproc Create a cluster page. The Set up cluster panel is selected.
- In the Components section, under Optional components, select one or more components to install on your cluster.
Google Cloud CLI
To create a Dataproc cluster and install one or more optional components on the cluster, use the gcloud dataproc clusters create CLUSTER_NAME command with the --optional-components flag. Separate multiple component names with commas.

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=COMPONENT-NAME(s) \
    ... other flags
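
As an illustration of the command form above, the following sketch assembles a complete command line and prints it for review instead of running it. The cluster name, region, and image version are hypothetical placeholders, and the helper functions join_components and build_create_command are illustrative names, not part of the Google Cloud CLI.

```shell
# Sketch only: builds and prints a cluster-creation command.
# All argument values used below are hypothetical placeholders.

# Join component names with commas, the form --optional-components expects.
join_components() {
  ( IFS=,; echo "$*" )
}

# Print (do not execute) the gcloud command for the given settings.
# Usage: build_create_command CLUSTER_NAME REGION IMAGE_VERSION COMPONENT...
build_create_command() {
  cluster_name="$1"
  region="$2"
  image_version="$3"
  shift 3
  printf 'gcloud dataproc clusters create %s --region=%s --image-version=%s --optional-components=%s\n' \
    "$cluster_name" "$region" "$image_version" "$(join_components "$@")"
}

# Example: request the Jupyter and Docker components on a 2.2 image.
build_create_command example-cluster us-central1 2.2-debian12 JUPYTER DOCKER
```

Printing the command first makes it easy to confirm that the component names match the table of available optional components before any cluster is created.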
REST API
Optional components can be specified through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
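
A minimal sketch of such a request body follows. It assumes the components are listed in the optionalComponents field of softwareConfig; the cluster name is a hypothetical placeholder, and build_request_body is an illustrative helper, not part of any Google tool. The script only prints the JSON. The commented curl call shows one possible way to send it and assumes that PROJECT_ID and REGION are set and that gcloud is authenticated.

```shell
# Sketch only: prints a JSON body for a clusters.create request.
# Usage: build_request_body CLUSTER_NAME COMPONENT...
build_request_body() {
  cluster_name="$1"
  shift
  components=""
  # Build a quoted, comma-separated list such as "JUPYTER", "ZEPPELIN".
  for c in "$@"; do
    components="${components:+$components, }\"$c\""
  done
  cat <<EOF
{
  "clusterName": "$cluster_name",
  "config": {
    "softwareConfig": {
      "optionalComponents": [$components]
    }
  }
}
EOF
}

# Example: a body requesting the Jupyter and Zeppelin components.
build_request_body example-cluster JUPYTER ZEPPELIN

# One possible way to send the request (not executed here):
# curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d "$(build_request_body example-cluster JUPYTER ZEPPELIN)" \
#   "https://dataproc.googleapis.com/v1/projects/${PROJECT_ID}/regions/${REGION}/clusters"
```

A real request typically carries further cluster configuration, such as machine types and an image version, alongside the software configuration shown here.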