Cloudera Data Hub Prerequisites¶
Introduction¶
This document includes the prerequisite steps for Cloudera Data Hub.
Cluster Configuration¶
Several cluster options require specific Gluent Data Platform configuration parameters to be set when completing the Gluent Data Platform Environment File Creation installation step. Confirm the following values and note them down:
Option | Non-Cloudera Manager Installations | Cloudera Manager Installations
---|---|---
DataNode Data Transfer Protection |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.data.transfer.protection)
Data Transfer Encryption |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.encrypt.data.transfer)
Hadoop RPC Protection |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (hadoop.rpc.protection)
HDFS High Availability |  | Clusters → Cluster Name → HDFS Service → Instances → Federation and High Availability
Impala HS2 Port |  | Clusters → Cluster Name → Impala Service → Configuration → Search (hs2_port)
Kerberos Enabled |  | Administration → Security
Kerberos Principal |  | Clusters → Cluster Name → Impala Service → Configuration → Search (Kerberos Principal)
LDAP Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (enable_ldap_auth)
Sentry Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (sentry)
SSL Certificate |  | Clusters → Cluster Name → Impala Service → Configuration → Search (ssl_server_certificate)
SSL Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (client_services_ssl_enabled)
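For non-Cloudera Manager installations, the effective client-side value of the HDFS settings in the table can usually be confirmed from the command line. A minimal sketch, assuming the HDFS client configuration files on the node reflect the cluster settings:

$ hdfs getconf -confKey dfs.data.transfer.protection
$ hdfs getconf -confKey dfs.encrypt.data.transfer
$ hdfs getconf -confKey hadoop.rpc.protection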
Provision a Gluent Data Platform OS User¶
A Gluent Data Platform OS user (assumed to be gluent for the remainder of this document) is required on the Hadoop node(s) on which Gluent Offload Engine commands will be initiated.
This user should be provisioned using the appropriate method for the environment, e.g. LDAP, local users, etc. There are no specific group membership requirements for this user.
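If local users are appropriate for the environment, the user might be created as follows (a minimal sketch, as root; the home directory and shell shown are assumptions):

# useradd -m -s /bin/bash gluent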
Verify the user is present using the following command:
$ id gluent
Storage Requirements¶
Note
This prerequisite is needed only if Gluent Data Platform is to be installed on Hadoop node(s).
A filesystem location must be created for Gluent Data Platform installation.
Gluent Data Platform occupies approximately 1GB of storage once unpacked.
During operation, Gluent Data Platform will write log and trace files within its installation directory. Sufficient space will need to be allocated for continuing operations.
The filesystem location must be owned by the provisioned Gluent Data Platform OS user.
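As an example only, assuming a hypothetical installation location of /opt/gluent (adapt the path to the environment), the directory might be created and assigned to the provisioned OS user as follows (as root):

# mkdir -p /opt/gluent
# chown gluent /opt/gluent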
Default Shell¶
The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should be bash for that user:
$ echo $SHELL
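If a different shell is reported, it can typically be changed with chsh (a sketch for locally managed users; LDAP-managed accounts require the equivalent change in the directory):

# chsh -s /bin/bash gluent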
Create HDFS Directories¶
Gluent Data Platform requires up to three locations within HDFS depending on the use of cloud storage:
Parameter | Purpose | Necessity | Required Permissions | Default Location
---|---|---|---|---
 | Stores a persistent copy of data offloaded from Oracle Database, and Incremental Update metadata | Mandatory if any data is to be persisted in HDFS | Read, write for HADOOP_SSH_USER; read, write for hive group | 
 | Stores the Gluent UDF library file | Mandatory if UDFs are to be based in HDFS | Read, write for HADOOP_SSH_USER; read for hive group | 
 | Transient staging area used by the data transport phase of Offload | Mandatory | Read, write for HADOOP_SSH_USER; read for hive group | 
The steps to create the default locations with the correct permissions are detailed below.
Create gluent directory in HDFS (as hdfs):
hdfs dfs -mkdir /user/gluent
Change ownership of gluent directory (as hdfs):
hdfs dfs -chown gluent:hive /user/gluent
Create offload directory (as gluent):
hdfs dfs -mkdir /user/gluent/offload
Change permissions on offload directory to allow group write (as gluent):
hdfs dfs -chmod 770 /user/gluent/offload
Verify permissions on offload directory (as gluent):
hdfs dfs -ls -d /user/gluent/offload
Note
The offload directory should be group writable, i.e. the final ls command above should show permissions of drwxrwx---.
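The output of the verification command should resemble the following (the date, time and size shown are illustrative only):

drwxrwx---   - gluent hive          0 2021-06-01 09:30 /user/gluent/offload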
Oracle JDBC Drivers¶
Oracle’s JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed to the location shown below. The location is dependent on the method that will be used by Offload to Transport Data to Staging. The driver should be installed on all nodes where offload transport jobs will be initiated.
Offload Transport Method | Location
---|---
Sqoop | 
Spark | 
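For example, if Sqoop is the chosen transport method and the Sqoop library directory is /var/lib/sqoop (an assumption; use the location applicable to the environment), the downloaded driver file (ojdbc8.jar is used here only as an example) might be installed as follows (as root):

# cp ojdbc8.jar /var/lib/sqoop/
# chmod 644 /var/lib/sqoop/ojdbc8.jar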
Sqoop¶
If Sqoop will be used to Transport Data to Staging then save the example command below into a temporary script (e.g. gl_sqoop.sh) and modify the placeholders in --connect, --username, --password and --target-dir with appropriate environment values:
gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop
Note
If the database password contains a single-quote character (') then this must be escaped with a backslash.
Run the test Sqoop job (as gluent):
$ ./gl_sqoop.sh
Verify the test Sqoop job completes without error.
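Once the job has been verified, the staged test data can be inspected and then removed (a sketch, as gluent; the directory matches the --target-dir used above):

$ hdfs dfs -ls /user/gluent/offload/test
$ hdfs dfs -rm -r -skipTrash /user/gluent/offload/test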
Oracle OS Package¶
Install the operating system libaio package if it is not already present (as root):
# yum install libaio
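Presence of the package can be confirmed with rpm (as any user):

$ rpm -q libaio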
Kerberos¶
Note
This prerequisite is needed in a Kerberized cluster only if Gluent Data Platform is to be installed on a Hadoop node or if HDFS commands are to be run from a Hadoop node.
The keytab of the Kerberos principal that will be used to authenticate must be accessible by the Gluent Data Platform OS user on the Hadoop node.
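To confirm the keytab is readable by the gluent user and contains the expected principal, something like the following can be used (as gluent; the keytab path is environment-specific):

$ ls -l <path_to_keytab_file>
$ klist -kt <path_to_keytab_file>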
Verify that a Kerberos ticket can be obtained for the principal and keytab created (as gluent):
$ kinit -kt <path_to_keytab_file> <principal_name>
$ klist
Sentry¶
When Sentry is enabled, the user with which Gluent Data Platform authenticates to Impala needs privileges both for one-time installation and configuration tasks and for continuing operations. Granting the ALL ON SERVER Sentry privilege to this user allows all Gluent Data Platform operations to function.
If the ALL ON SERVER privilege is not permitted for continuing operations and a least-privilege approach is required, the privileges needed are detailed below in Installation and Configuration and Continuing Operations.
Before covering these privileges, it is important to understand the Impala databases that are required by Gluent Data Platform.
Impala Databases¶
Gluent Data Platform requires two Impala databases for each Oracle Database schema that will be offloaded:
Impala Database Name | HDFS Database Location | Database Purpose
---|---|---
 |  | Persistent copy of data offloaded from Oracle Database
 |  | Transient staging area used by the data transport phase of Offload
For example, when offloading from the SH Oracle Database schema, with both HDFS_DATA and HDFS_LOAD set to /user/gluent/offload and DB_NAME_PREFIX and HDFS_DB_PATH_SUFFIX at their default values, the following Impala databases are required:
Database Name | HDFS Database Location
---|---
sh | 
sh_load | 
Installation and Configuration¶
The privileges required for the creation of Gluent Data Platform User-Defined Functions (UDFs), sequence table and Impala databases are:
Privilege | Scope | Reason
---|---|---
 | Installation and Upgrade only | Required for 
 | Installation and Upgrade only | Required for the optional Creation of Sequence Table installation step [1]
 | Offloading | Required for 

[1] Only recommended for Cloudera Data Hub versions earlier than 5.10.x.
In the absence of the Sentry privileges listed above, Connect will be unable to create the Gluent Data Platform UDFs and sequence table, and Offload will be unable to create the Impala databases. They must be created manually by an administrator with the required privileges.
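If the databases must be created manually, an administrator with sufficient privileges could create them ahead of offloading, for example through impala-shell. The following is a sketch using the sh example from above; the database names, and the locations (including the .db suffix shown), are assumptions and must match the environment's HDFS_DATA, HDFS_LOAD and HDFS_DB_PATH_SUFFIX settings:

$ impala-shell -q "CREATE DATABASE sh LOCATION '/user/gluent/offload/sh.db'"
$ impala-shell -q "CREATE DATABASE sh_load LOCATION '/user/gluent/offload/sh_load.db'"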
Continuing Operations¶
The privileges required for continuing operations are:
- ALL ON DATABASE sh
- ALL ON DATABASE sh_load
- SELECT ON DATABASE <database containing UDFs>
- SELECT ON DATABASE <database containing sequence table> (depending on its existence)
- ALL ON URI for HDFS_DATA URI
- ALL ON URI for HDFS_LOAD URI
- ALL ON URI for HDFS_SNAPSHOT_PATH URI
Where the sh and sh_load database names continue from the example above.
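A sketch of the corresponding least-privilege grants through impala-shell, assuming the same hypothetical gluent_role, UDFs and the sequence table residing in the default database, and the default locations used earlier in this document (adjust database names, the namenode placeholder and URIs to the environment):

$ impala-shell -q "GRANT ALL ON DATABASE sh TO ROLE gluent_role"
$ impala-shell -q "GRANT ALL ON DATABASE sh_load TO ROLE gluent_role"
$ impala-shell -q "GRANT SELECT ON DATABASE default TO ROLE gluent_role"
$ impala-shell -q "GRANT ALL ON URI 'hdfs://<namenode>:<port>/user/gluent/offload' TO ROLE gluent_role"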
By default, UDFs are installed into the default Impala database. This database can be changed by specifying the database name with the OFFLOAD_UDF_DB option.
By default, the sequence table is created in the default Impala database and is named gluent_sequence. The database and table name can be changed with the IN_LIST_JOIN_TABLE option.