Cloudera Data Hub Installation¶
Introduction¶
This document describes the installation steps for Gluent Data Platform on Cloudera Data Hub.
Gluent Data Platform Software Installation¶
In addition to the mandatory installation on Oracle Database servers, for production deployments it is recommended that Gluent Data Platform is also installed on at least one other server. This can be any server that satisfies Gluent Data Platform Supported Operating Systems and Versions. A Gluent node may have been provisioned to satisfy this requirement.
The additional server enables Data Daemon to be sized appropriately for throughput without consuming resources on the Oracle Database server(s). It may also be beneficial for the following reasons:
Password-less SSH connectivity between Oracle Database servers and Hadoop nodes is not permitted
Separation of duties: Orchestration commands can be run by the Hadoop administrators rather than the Oracle database team
This document assumes that the user created during Provision a Gluent Data Platform OS User is gluent.
Unpack Software¶
Perform the following actions as gluent.
Unpack the install tarball (gluent_offload_<version>.tar.bz2):
Note
When unpacking, an offload directory will be created if it does not exist. The offload directory is referred to as <OFFLOAD_HOME> and an environment variable ($OFFLOAD_HOME) will be set when offload.env is sourced.
$ cd <Gluent Data Platform Base Directory>
$ tar xpf <Gluent Data Platform Installation Media Directory>/gluent_offload_<version>.tar.bz2
Gluent Data Platform Environment File¶
Copy offload.env from an Oracle Database server into $OFFLOAD_HOME/conf
Set both HDFS_CMD_HOST and OFFLOAD_TRANSPORT_CMD_HOST to localhost in offload.env (see the example following this list)
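As a minimal sketch, assuming password-less SSH access to an Oracle Database server (the hostname and remote path below are placeholders) and that offload.env uses the export KEY=value format, the copy and edits could be performed as follows:
$ scp oracle@<oracle-db-server>:<OFFLOAD_HOME>/conf/offload.env <OFFLOAD_HOME>/conf/
$ sed -i 's/^export HDFS_CMD_HOST=.*/export HDFS_CMD_HOST=localhost/' <OFFLOAD_HOME>/conf/offload.env
$ sed -i 's/^export OFFLOAD_TRANSPORT_CMD_HOST=.*/export OFFLOAD_TRANSPORT_CMD_HOST=localhost/' <OFFLOAD_HOME>/conf/offload.env
Both values can equally be set by editing offload.env in a text editor.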
Creation of User-Defined Functions¶
If Gluent Data Platform has been installed on a server in addition to the Oracle Database server, the connect command to create the user-defined functions (UDFs) detailed below should be run from that server. Otherwise, run this command using the Gluent Data Platform installation on an Oracle Database server.
Tip
By default, UDFs are created in the default Impala database. This database can be changed by specifying the database name in the OFFLOAD_UDF_DB parameter in offload.env.
The storage location of the library that is referenced by the Gluent UDFs is determined by the values of parameters in offload.env. See Integrating with Cloud Storage. Ad hoc overrides to a different cloud or HDFS location are available with the --offload-fs-scheme, --offload-fs-container, --offload-fs-prefix and --hdfs-home parameters of the connect --install-udfs command (an illustrative override follows the command below).
To create the UDFs, run the supplied connect command with the --install-udfs option:
$ cd $OFFLOAD_HOME/bin
$ . ../conf/offload.env
$ ./connect --install-udfs
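As a hypothetical illustration of the ad hoc overrides described above, the UDF library could be installed to an alternative cloud storage location by adding the relevant parameters to the same command. All values below are placeholders and depend on the target storage location:
$ ./connect --install-udfs --offload-fs-scheme=s3a --offload-fs-container=<bucket-name> --offload-fs-prefix=<prefix>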
Note
In systems using Sentry to control authorization, the ALL ON SERVER or CREATE ON SERVER privilege will be required in order to install UDFs. The privilege can be safely removed once this task is complete.
In systems using Ranger to control authorization, appropriate Ranger permissions are required in order to install UDFs. See Ranger Privileges.
If the user with which Gluent Data Platform will authenticate to Impala is not permitted to have the necessary privileges to create UDFs, even on a temporary basis, then a script can be generated for execution by a system administrator. Use the --sql-file option to specify a file where commands should be written instead of being executed:
$ cd $OFFLOAD_HOME/bin
$ . ../conf/offload.env
$ ./connect --install-udfs --sql-file=/tmp/gluent_udfs.sql
The /tmp/gluent_udfs.sql file can then be run by an Impala user with the required Sentry privileges.
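As a sketch, the administrator could execute the generated file with impala-shell; the connection, Kerberos and SSL options shown are illustrative and depend on the cluster configuration:
$ impala-shell -i <impalad-host> -k --ssl -f /tmp/gluent_udfs.sql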
Creation of Sequence Table¶
Note
The creation of the sequence table can be skipped on Cloudera Data Hub versions 5.10.x and above.
Cloudera Data Hub versions earlier than 5.10.x contain a performance issue with Impala queries that contain a large number of constants in an in-list (refer to IMPALA-4302 for details).
Gluent Data Platform includes an optimization that overcomes this by transforming large in-lists into a semi-join using a sequence table. In order for this optimization to function, the sequence table must be installed using Connect.
If Gluent Data Platform has been installed on a Hadoop node, the connect command should be run from the Hadoop node to create the sequence table. Otherwise, run this command using the Gluent Data Platform installation on an Oracle Database server.
By default, the sequence table is created in the default Impala database and is named gluent_sequence. The database and table name can be changed by specifying them in the IN_LIST_JOIN_TABLE parameter in offload.env.
To create the sequence table, run the supplied connect script with the --create-sequence-table flag:
$ cd $OFFLOAD_HOME/bin
$ . ../conf/offload.env
$ ./connect --create-sequence-table
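If desired, the presence of the sequence table can then be confirmed with impala-shell; the example below assumes the default database and table name and uses a placeholder hostname:
$ impala-shell -i <impalad-host> -q "SHOW TABLES IN default LIKE 'gluent_sequence'"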
Note
In systems using Sentry to control authorization, the CREATE TABLE privilege will be required to create the sequence table. For continuing operations, the SELECT ON DATABASE <database containing sequence table> privilege is required.
HDFS Client Configuration File¶
An HDFS client configuration file is required if any of the following Cluster Configuration options are enabled:
DataNode Data Transfer Protection
Data Transfer Encryption
Encryption Zones
Hadoop RPC Protection
HDFS High Availability
Kerberos
If any are enabled, create a $OFFLOAD_HOME/conf/hdfs-client.xml file with the following initial content:
<configuration>
</configuration>
Add the properties detailed in the sections below between the <configuration> tags in the XML file.
Once the file is complete with the relevant properties, propagate it to $OFFLOAD_HOME/conf on all other Gluent Data Platform installations.
Set LIBHDFS3_CONF to $OFFLOAD_HOME/conf/hdfs-client.xml in offload.env.
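As a minimal sketch, assuming offload.env uses the export KEY=value format and that OFFLOAD_HOME is defined earlier in the file (otherwise use the absolute path), the variable could be appended and the completed file propagated as follows (hostname and remote path are placeholders):
$ echo 'export LIBHDFS3_CONF=$OFFLOAD_HOME/conf/hdfs-client.xml' >> $OFFLOAD_HOME/conf/offload.env
$ scp $OFFLOAD_HOME/conf/hdfs-client.xml gluent@<other-gluent-server>:<OFFLOAD_HOME>/conf/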
Important
Any changes made to the Gluent Data Platform environment file (offload.env) must be propagated across all installations.
DataNode Data Transfer Protection¶
If the Cluster Configuration prerequisite shows that DataNode Data Transfer Protection is set, add the following properties:
<property>
<name>dfs.data.transfer.protection</name>
<value>[protection_option]</value>
</property>
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
The value for [protection_option] should be replaced with the setting for DataNode Data Transfer Protection (authentication, integrity or privacy).
Data Transfer Encryption¶
If the Cluster Configuration prerequisite shows that Data Transfer Encryption is enabled, add the following properties:
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
Encryption Zones¶
If HDFS_LOAD points to an HDFS encryption zone, add the following properties:
<property>
<name>dfs.encryption.key.provider.uri</name>
<value>kms://https@[kms server]:[kms port]/kms</value>
</property>
<property>
<name>hadoop.kms.authentication.type</name>
<value>kerberos</value>
</property>
The values for [kms server] and [kms port] should be replaced with the values for the Cloudera Data Hub Java KeyStore KMS in use.
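If the KMS hostname and port are not readily known, they can often be found in the Hadoop client configuration on a gateway host; for example (the file location and exact property name vary between versions):
$ grep -A1 'key.provider' /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml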
Hadoop RPC Protection¶
If the Cluster Configuration prerequisite shows that Hadoop RPC Protection is set, add the following properties:
<property>
<name>hadoop.rpc.protection</name>
<value>[protection_option]</value>
</property>
The value for [protection_option] should be replaced with the setting for Hadoop RPC Protection (authentication, integrity or privacy).
HDFS High Availability¶
If HDFS High Availability is configured, add the following properties:
<property>
<name>dfs.nameservices</name>
<value>[nameservice ID]</value>
</property>
<property>
<name>dfs.ha.namenodes.[nameservice ID]</name>
<value>[name node 1 ID],[name node 2 ID]</value>
</property>
<property>
<name>dfs.namenode.rpc-address.[nameservice ID].[name node 1 ID]</name>
<value>[full name node 1 address]:[name node 1 port]</value>
</property>
<property>
<name>dfs.namenode.rpc-address.[nameservice ID].[name node 2 ID]</name>
<value>[full name node 2 address]:[name node 2 port]</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.[nameservice ID]</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
The values for the [nameservice ID], [name node 1 ID], [name node 2 ID], [full name node 1 address], [name node 1 port], [full name node 2 address] and [name node 2 port] placeholders must be replaced with the correct settings for the environment.
Refer to Hadoop documentation for further information.
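Alternatively, the required values can usually be read from the hdfs-site.xml client configuration on a Hadoop gateway host; the path below assumes a standard Cloudera client configuration deployment:
$ grep -A1 -E 'dfs.nameservices|dfs.ha.namenodes|dfs.namenode.rpc-address' /etc/hadoop/conf/hdfs-site.xml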
Kerberos¶
If the Cluster Configuration prerequisite shows that Kerberos is enabled, add the following properties:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@[realm]</value>
</property>
The value for [realm] should be replaced with the Kerberos realm.
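If the realm is not known, it can usually be determined from the default_realm entry in /etc/krb5.conf on a cluster node, or from klist output when a valid ticket is held:
$ grep default_realm /etc/krb5.conf
$ klist | grep 'Default principal'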