Cloudera Data Hub Prerequisites

Introduction

This document describes the prerequisite steps for installing Gluent Data Platform on a Cloudera Data Hub cluster.

Cluster Configuration

A number of cluster options require specific Gluent Data Platform configuration parameters to be set when completing the Gluent Data Platform Environment File Creation installation step. Confirm these values as follows and note them down:

DataNode Data Transfer Protection
    Non-Cloudera Manager installations: grep -i 'dfs.data.transfer.protection' hdfs-site.xml
    Cloudera Manager installations:     Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.data.transfer.protection)

Data Transfer Encryption
    Non-Cloudera Manager installations: grep -i 'dfs.encrypt.data.transfer' hdfs-site.xml
    Cloudera Manager installations:     Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.encrypt.data.transfer)

Hadoop RPC Protection
    Non-Cloudera Manager installations: grep -i 'hadoop.rpc.protection' hdfs-site.xml
    Cloudera Manager installations:     Clusters → Cluster Name → HDFS Service → Configuration → Search (hadoop.rpc.protection)

HDFS High Availability
    Non-Cloudera Manager installations: grep -i 'dfs.nameservices' hdfs-site.xml
    Cloudera Manager installations:     Clusters → Cluster Name → HDFS Service → Instances → Federation and High Availability

Impala HS2 Port
    Non-Cloudera Manager installations: grep -i 'hs2_port' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (hs2_port)

Kerberos Enabled
    Non-Cloudera Manager installations: grep -i 'principal' /etc/default/impala
    Cloudera Manager installations:     Administration → Security

Kerberos Principal
    Non-Cloudera Manager installations: grep -i 'principal' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (Kerberos Principal)

LDAP Enabled
    Non-Cloudera Manager installations: grep -i 'ldap_uri' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (enable_ldap_auth)

Sentry Enabled
    Non-Cloudera Manager installations: grep -i 'authorization_policy_provider_class' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (sentry)

SSL Certificate
    Non-Cloudera Manager installations: grep -i 'ssl_server_certificate' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (ssl_server_certificate)

SSL Enabled
    Non-Cloudera Manager installations: grep -i 'ssl_server_certificate' /etc/default/impala
    Cloudera Manager installations:     Clusters → Cluster Name → Impala Service → Configuration → Search (client_services_ssl_enabled)
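For non-Cloudera Manager installations, the checks above can be gathered in one pass. A minimal sketch, assuming the typical /etc/hadoop/conf/hdfs-site.xml and /etc/default/impala locations (adjust the paths to match the environment):

$ grep -iE 'dfs.data.transfer.protection|dfs.encrypt.data.transfer|hadoop.rpc.protection|dfs.nameservices' /etc/hadoop/conf/hdfs-site.xml
$ grep -iE 'hs2_port|principal|ldap_uri|authorization_policy_provider_class|ssl_server_certificate' /etc/default/impala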

Provision a Gluent Data Platform OS User

A Gluent Data Platform OS user (assumed to be gluent for the remainder of this document) is required on the Hadoop node(s) on which Gluent Offload Engine commands will be initiated.

This user should be provisioned using the appropriate method for the environment, e.g. LDAP, local users, etc. There are no specific group membership requirements for this user.

Verify the user is present using the following command:

$ id gluent
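If local OS users are appropriate for the environment, a minimal provisioning sketch (as root) is shown below; the home directory and shell options are assumptions, and LDAP-managed environments should use their own tooling instead:

# useradd -m -s /bin/bash gluent
# id gluent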

Storage Requirements

Note

This prerequisite is needed only if Gluent Data Platform is to be installed on Hadoop node(s).

A filesystem location must be created for Gluent Data Platform installation.

Gluent Data Platform occupies approximately 1GB of storage once unpacked.

During operation, Gluent Data Platform will write log and trace files within its installation directory. Sufficient space will need to be allocated for continuing operations.

The filesystem location must be owned by the provisioned Gluent Data Platform OS user.
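As an illustration, the commands below create a hypothetical /u01/app/gluent installation directory and assign ownership to the gluent user (as root); the path is an assumption and should be adapted to local filesystem standards:

# mkdir -p /u01/app/gluent
# chown gluent /u01/app/gluent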

Default Shell

The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should end in bash for that user:

$ echo $SHELL
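If a different shell is reported, the default shell can usually be changed as follows (as root), assuming Bash is installed at /bin/bash:

# usermod -s /bin/bash gluent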

Create HDFS Directories

Gluent Data Platform requires up to three locations within HDFS depending on the use of cloud storage:

HDFS_DATA
    Purpose:              Stores a persistent copy of data offloaded from Oracle Database, and Incremental Update metadata
    Necessity:            Mandatory if any data is to be persisted in HDFS
    Required Permissions: Read, write for HADOOP_SSH_USER; read, write for hive group
    Default Location:     /user/gluent/offload

HDFS_HOME
    Purpose:              Stores the Gluent UDF library file
    Necessity:            Mandatory if UDFs are to be based in HDFS
    Required Permissions: Read, write for HADOOP_SSH_USER; read for hive group
    Default Location:     /user/gluent

HDFS_LOAD
    Purpose:              Transient staging area used by the data transport phase of Offload
    Necessity:            Mandatory
    Required Permissions: Read, write for HADOOP_SSH_USER; read for hive group
    Default Location:     /user/gluent/offload

The steps to create the default locations with the correct permissions are detailed below.

Create gluent directory in HDFS (as hdfs):

hdfs dfs -mkdir /user/gluent

Change ownership of gluent directory (as hdfs):

hdfs dfs -chown gluent:hive /user/gluent

Create offload directory (as gluent):

hdfs dfs -mkdir /user/gluent/offload

Change permissions on offload directory to allow group write (as gluent):

hdfs dfs -chmod 770 /user/gluent/offload

Verify permissions on offload directory (as gluent):

hdfs dfs -ls -d /user/gluent/offload

Note

The offload directory should be group writable, i.e., the final ls command above should show permissions of drwxrwx---.
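For reference, the verification command should return a single line similar to the following sketch (the timestamp will differ):

drwxrwx---   - gluent hive          0 2021-01-01 00:00 /user/gluent/offload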

Oracle JDBC Drivers

Oracle’s JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed to the location shown below. The location is dependent on the method that will be used by Offload to Transport Data to Staging. The driver should be installed on all nodes where offload transport jobs will be initiated.

Offload Transport Method   Location
Sqoop                      /var/lib/sqoop
Spark                      $SPARK_HOME/jars
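As an example, assuming the downloaded driver file is named ojdbc8.jar (the exact file name depends on the driver version), installation might look like the following for each transport method:

# cp ojdbc8.jar /var/lib/sqoop/
# cp ojdbc8.jar $SPARK_HOME/jars/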

Sqoop

If Sqoop will be used to Transport Data to Staging then save the example command below into a temporary script (e.g. gl_sqoop.sh) and modify the placeholders in --connect, --username, --password and --target-dir with appropriate environment values:

gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop

Note

If the database password contains a single-quote character (‘) then this must be escaped with a backslash.

Run the test Sqoop job (as gluent):

$ ./gl_sqoop.sh

Verify the test Sqoop job completes without error.
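As an additional check, the staged Avro files written by the test job can be listed; this assumes the --target-dir value from the example script above:

$ hdfs dfs -ls /user/gluent/offload/test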

Oracle OS Package

Install the operating system libaio package if it is not already present (as root):

# yum install libaio
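If required, confirm the package is installed; a minimal check on RPM-based systems:

# rpm -q libaio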

Kerberos

Note

This prerequisite is needed in a Kerberized cluster only if Gluent Data Platform is to be installed on a Hadoop node or if HDFS commands are to be run from a Hadoop node.

The keytab of the Kerberos principal that will be used to authenticate must be accessible by the Gluent Data Platform OS user on the Hadoop node.

Verify that a Kerberos ticket can be obtained for the principal and keytab created (as gluent):

$ kinit -kt <path_to_keytab_file> <principal_name>
$ klist
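The klist output should show a valid ticket for the principal, broadly along the lines of the following sketch (cache location, realm and timestamps are examples only):

Ticket cache: FILE:/tmp/krb5cc_12345
Default principal: gluent@EXAMPLE.COM

Valid starting       Expires              Service principal
01/01/2021 00:00:00  01/02/2021 00:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM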

Sentry

When Sentry is enabled the user with which Gluent Data Platform authenticates to Impala needs privileges for both one-time installation and configuration tasks, and for continuing operations. Granting the ALL ON SERVER Sentry privilege to this user allows all Gluent Data Platform operations to function.

If the ALL ON SERVER privilege is not permitted for continuing operations and least privileges are required, then the required privileges are detailed below in Installation and Configuration and Continuing Operations.

Before covering these privileges, it is important to understand the Impala databases that are required by Gluent Data Platform.

Impala Databases

Gluent Data Platform requires two Impala databases for each Oracle Database schema that will be offloaded:

Impala Database Name:   DB_NAME_PREFIX_<schema>
HDFS Database Location: HDFS_DATA/DB_NAME_PREFIX_<schema>.HDFS_DB_PATH_SUFFIX
Database Purpose:       Persistent copy of data offloaded from Oracle Database

Impala Database Name:   DB_NAME_PREFIX_<schema>_load
HDFS Database Location: HDFS_LOAD/DB_NAME_PREFIX_<schema>_load.HDFS_DB_PATH_SUFFIX
Database Purpose:       Transient staging area used by the data transport phase of Offload

For example, when offloading from the SH Oracle Database schema, with both HDFS_DATA and HDFS_LOAD set to /user/gluent/offload and DB_NAME_PREFIX and HDFS_DB_PATH_SUFFIX at their default values, the following Impala databases are required:

Database Name   HDFS Database Location
sh              /user/gluent/offload/sh.db
sh_load         /user/gluent/offload/sh_load.db
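If these databases are to be created manually by an administrator rather than by Offload, the statements would look broadly like the following sketch (impala-shell authentication and connection options are omitted and depend on the cluster); the names and locations continue the example above:

$ impala-shell -q "CREATE DATABASE IF NOT EXISTS sh LOCATION '/user/gluent/offload/sh.db'"
$ impala-shell -q "CREATE DATABASE IF NOT EXISTS sh_load LOCATION '/user/gluent/offload/sh_load.db'"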

Installation and Configuration

The privileges required for the creation of the Gluent Data Platform User-Defined Functions (UDFs), the sequence table and the Impala databases are:

ALL ON SERVER (CDH5) / CREATE ON SERVER (CDH6)
    Scope:  Installation and Upgrade only
    Reason: Required for CREATE FUNCTION, DROP FUNCTION and conditionally CREATE DATABASE commands issued when installing UDFs. Refer to Creation of User-Defined Functions

CREATE TABLE
    Scope:  Installation and Upgrade only
    Reason: Required for the optional Creation of Sequence Table installation step [1]

ALL ON SERVER (CDH5) / CREATE ON SERVER (CDH6)
    Scope:  Offloading
    Reason: Required for CREATE DATABASE commands issued when the --create-backend-db option is used to initially create the required Impala databases during offload

[1] Only recommended for Cloudera Data Hub versions earlier than 5.10.x.

In the absence of the Sentry privileges listed above, Connect will be unable to create the Gluent Data Platform UDFs and sequence table, and Offload will be unable to create the Impala databases. In this case, these objects must be created manually by an administrator with the required privileges.
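As an illustration, the installation-time server privilege might be granted along the following lines; the gluent_admin_role role and gluent group names are examples only, impala-shell connection options are omitted, and on CDH6 CREATE ON SERVER would be granted in place of ALL ON SERVER:

$ impala-shell -q "CREATE ROLE gluent_admin_role"
$ impala-shell -q "GRANT ROLE gluent_admin_role TO GROUP gluent"
$ impala-shell -q "GRANT ALL ON SERVER TO ROLE gluent_admin_role"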

Continuing Operations

The privileges required for continuing operations are:

  • ALL ON DATABASE sh

  • ALL ON DATABASE sh_load

  • SELECT ON DATABASE <database containing UDFs>

  • SELECT ON DATABASE <database containing sequence table> (if the sequence table exists)

  • ALL ON URI for HDFS_DATA URI

  • ALL ON URI for HDFS_LOAD URI

  • ALL ON URI for HDFS_SNAPSHOT_PATH URI

Here, the sh and sh_load database names continue from the example above.

By default UDFs are installed into the default Impala database. This database can be changed by specifying the database name with the OFFLOAD_UDF_DB option.

By default the sequence table is created in the default Impala database and is named gluent_sequence. The database and table name can be changed by specifying the database and table name with the IN_LIST_JOIN_TABLE option.
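A sketch of the corresponding grants for continuing operations, continuing the sh example with the default locations and with the UDFs and sequence table in the default database; the gluent_ops_role role and gluent group names are examples only, the HDFS URI should be fully qualified for the environment, and an equivalent URI grant would be added for HDFS_SNAPSHOT_PATH where applicable:

$ impala-shell -q "CREATE ROLE gluent_ops_role"
$ impala-shell -q "GRANT ROLE gluent_ops_role TO GROUP gluent"
$ impala-shell -q "GRANT ALL ON DATABASE sh TO ROLE gluent_ops_role"
$ impala-shell -q "GRANT ALL ON DATABASE sh_load TO ROLE gluent_ops_role"
$ impala-shell -q "GRANT SELECT ON DATABASE default TO ROLE gluent_ops_role"
$ impala-shell -q "GRANT ALL ON URI 'hdfs://<nameservice>/user/gluent/offload' TO ROLE gluent_ops_role"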

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com