Cloudera Data Hub Prerequisites¶
Introduction¶
This document includes the prerequisite steps for Cloudera Data Hub.
Cluster Configuration¶
Several cluster options require specific Gluent Data Platform configuration parameters to be set when completing the Gluent Data Platform Environment File Creation installation step. Confirm the following values and note them down:
Option | Non-Cloudera Manager Installations | Cloudera Manager Installations
---|---|---
DataNode Data Transfer Protection |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.data.transfer.protection)
Data Transfer Encryption |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (dfs.encrypt.data.transfer)
Hadoop RPC Protection |  | Clusters → Cluster Name → HDFS Service → Configuration → Search (hadoop.rpc.protection)
HDFS High Availability |  | Clusters → Cluster Name → HDFS Service → Instances → Federation and High Availability
Impala HS2 Port |  | Clusters → Cluster Name → Impala Service → Configuration → Search (hs2_port)
Kerberos Enabled |  | Administration → Security
Kerberos Principal |  | Clusters → Cluster Name → Impala Service → Configuration → Search (Kerberos Principal)
LDAP Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (enable_ldap_auth)
Sentry Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (sentry)
SSL Certificate |  | Clusters → Cluster Name → Impala Service → Configuration → Search (ssl_server_certificate)
SSL Enabled |  | Clusters → Cluster Name → Impala Service → Configuration → Search (client_services_ssl_enabled)
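For non-Cloudera Manager installations, the effective client-side value of the HDFS settings in the table can usually be confirmed from the command line. A minimal sketch, assuming the HDFS client configuration files on the node reflect the cluster settings:

$ hdfs getconf -confKey dfs.data.transfer.protection
$ hdfs getconf -confKey dfs.encrypt.data.transfer
$ hdfs getconf -confKey hadoop.rpc.protection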
Provision a Gluent Data Platform OS User¶
A Gluent Data Platform OS user (assumed to be gluent for the remainder of this document) is required on the Hadoop node(s) on which Gluent Offload Engine commands will be initiated.
This user should be provisioned using the appropriate method for the environment, e.g. LDAP, local users, etc. There are no specific group membership requirements for this user.
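If local users are appropriate for the environment, the user might be created as follows (a minimal sketch, as root; the home directory and shell shown are assumptions):

# useradd -m -s /bin/bash gluent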
Verify the user is present using the following command:
$ id gluent
Storage Requirements¶
Note
This prerequisite is needed only if Gluent Data Platform is to be installed on Hadoop node(s).
A filesystem location must be created for Gluent Data Platform installation.
Gluent Data Platform occupies approximately 1GB of storage once unpacked.
During operation, Gluent Data Platform will write log and trace files within its installation directory. Sufficient space will need to be allocated for continuing operations.
The filesystem location must be owned by the provisioned Gluent Data Platform OS user.
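As an example only, assuming a hypothetical installation location of /opt/gluent (adapt the path to the environment), the directory might be created and assigned to the provisioned OS user as follows (as root):

# mkdir -p /opt/gluent
# chown gluent /opt/gluent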
Default Shell¶
The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should be bash for that user:
$ echo $SHELL
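If a different shell is reported, it can typically be changed with chsh (a sketch for locally managed users; LDAP-managed accounts require the equivalent change in the directory):

# chsh -s /bin/bash gluent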
Create HDFS Directories¶
Gluent Data Platform requires up to three locations within HDFS depending on the use of cloud storage:
Parameter | Purpose | Necessity | Required Permissions | Default Location
---|---|---|---|---
 | Stores a persistent copy of data offloaded from Oracle Database, and Incremental Update metadata | Mandatory if any data is to be persisted in HDFS | Read, write for HADOOP_SSH_USER; read, write for hive group | 
 | Stores the Gluent UDF library file | Mandatory if UDFs are to be based in HDFS | Read, write for HADOOP_SSH_USER; read for hive group | 
 | Transient staging area used by the data transport phase of Offload | Mandatory | Read, write for HADOOP_SSH_USER; read for hive group | 
The steps to create the default locations with the correct permissions are detailed below.
Create gluent directory in HDFS (as hdfs):
hdfs dfs -mkdir /user/gluent
Change ownership of gluent directory (as hdfs):
hdfs dfs -chown gluent:hive /user/gluent
Create offload directory (as gluent):
hdfs dfs -mkdir /user/gluent/offload
Change permissions on offload directory to allow group write (as gluent):
hdfs dfs -chmod 770 /user/gluent/offload
Verify permissions on offload directory (as gluent):
hdfs dfs -ls -d /user/gluent/offload
Note
The offload directory should be group writable, i.e. the final ls command above should show permissions of drwxrwx---.
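The output of the verification command should resemble the following (the date, time and size shown are illustrative only):

drwxrwx---   - gluent hive          0 2021-06-01 09:30 /user/gluent/offload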
Oracle JDBC Drivers¶
Oracle’s JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed to the location shown below. The location is dependent on the method that will be used by Offload to Transport Data to Staging. The driver should be installed on all nodes where offload transport jobs will be initiated.
Offload Transport Method | Location
---|---
Sqoop | 
Spark | 
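For example, if Sqoop is the chosen transport method and the Sqoop library directory is /var/lib/sqoop (an assumption; use the location applicable to the environment), the downloaded driver file (ojdbc8.jar is used here only as an example) might be installed as follows (as root):

# cp ojdbc8.jar /var/lib/sqoop/
# chmod 644 /var/lib/sqoop/ojdbc8.jar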
Sqoop¶
If Sqoop will be used to Transport Data to Staging then save the example command below into a temporary script (e.g. gl_sqoop.sh) and modify the placeholders in --connect, --username, --password and --target-dir with appropriate environment values:
gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop
Note
If the database password contains a single-quote character (') then this must be escaped with a backslash.
Run the test Sqoop job (as gluent):
$ ./gl_sqoop.sh
Verify the test Sqoop job completes without error.
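Once the job has been verified, the staged test data can be inspected and then removed (a sketch, as gluent; the directory matches the --target-dir used above):

$ hdfs dfs -ls /user/gluent/offload/test
$ hdfs dfs -rm -r -skipTrash /user/gluent/offload/test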
Oracle OS Package¶
Install the operating system libaio package if it is not already present (as root):
# yum install libaio
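Presence of the package can be confirmed with rpm (as any user):

$ rpm -q libaio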
Kerberos¶
Note
This prerequisite is needed in a Kerberized cluster only if Gluent Data Platform is to be installed on a Hadoop node or if HDFS commands are to be run from a Hadoop node.
The keytab of the Kerberos principal that will be used to authenticate must be accessible by the Gluent Data Platform OS user on the Hadoop node.
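To confirm the keytab is readable by the gluent user and contains the expected principal, something like the following can be used (as gluent; the keytab path is environment-specific):

$ ls -l <path_to_keytab_file>
$ klist -kt <path_to_keytab_file>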
Verify that a Kerberos ticket can be obtained for the principal and keytab created (as gluent):
$ kinit -kt <path_to_keytab_file> <principal_name>
$ klist
Sentry¶
When Sentry is enabled, the user with which Gluent Data Platform authenticates to Impala needs privileges both for one-time installation and configuration tasks and for continuing operations. Granting the ALL ON SERVER Sentry privilege to this user allows all Gluent Data Platform operations to function.
If the ALL ON SERVER privilege is not permitted for continuing operations and a least-privilege approach is required, the privileges needed are detailed below in Installation and Configuration and Continuing Operations.
Before covering these privileges, it is important to understand the Impala databases that are required by Gluent Data Platform.
Impala Databases¶
Gluent Data Platform requires two Impala databases for each Oracle Database schema that will be offloaded:
Impala Database Name | HDFS Database Location | Database Purpose
---|---|---
 |  | Persistent copy of data offloaded from Oracle Database
 |  | Transient staging area used by the data transport phase of Offload
For example, when offloading from the SH Oracle Database schema, with both HDFS_DATA and HDFS_LOAD set to /user/gluent/offload and DB_NAME_PREFIX and HDFS_DB_PATH_SUFFIX at their default values, the following Impala databases are required:
Database Name | HDFS Database Location
---|---
sh | 
sh_load | 
Installation and Configuration¶
The privileges required for the creation of Gluent Data Platform User-Defined Functions (UDFs), sequence table and Impala databases are:
Privilege | Scope | Reason
---|---|---
 | Installation and Upgrade only | Required for 
 | Installation and Upgrade only | Required for the optional Creation of Sequence Table installation step [1]
 | Offloading | Required for 

[1] Only recommended for Cloudera Data Hub versions earlier than 5.10.x.
In the absence of the Sentry privileges listed above, Connect will be unable to create the Gluent Data Platform UDFs and sequence table, and Offload will be unable to create the Impala databases. They must be created manually by an administrator with the required privileges.
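If the databases must be created manually, an administrator with sufficient privileges could create them ahead of offloading, for example through impala-shell. The following is a sketch using the sh example from above; the database names, and the locations (including the .db suffix shown), are assumptions and must match the environment's HDFS_DATA, HDFS_LOAD and HDFS_DB_PATH_SUFFIX settings:

$ impala-shell -q "CREATE DATABASE sh LOCATION '/user/gluent/offload/sh.db'"
$ impala-shell -q "CREATE DATABASE sh_load LOCATION '/user/gluent/offload/sh_load.db'"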
Continuing Operations¶
The privileges required for continuing operations are:
- ALL ON DATABASE sh
- ALL ON DATABASE sh_load
- SELECT ON DATABASE <database containing UDFs>
- SELECT ON DATABASE <database containing sequence table> (depending on its existence)
- ALL ON URI for HDFS_DATA URI
- ALL ON URI for HDFS_LOAD URI
- ALL ON URI for HDFS_SNAPSHOT_PATH URI
Where the sh and sh_load database names continue from the example above.
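A sketch of the corresponding least-privilege grants through impala-shell, assuming the same hypothetical gluent_role, UDFs and the sequence table residing in the default database, and the default locations used earlier in this document (adjust database names, the namenode placeholder and URIs to the environment):

$ impala-shell -q "GRANT ALL ON DATABASE sh TO ROLE gluent_role"
$ impala-shell -q "GRANT ALL ON DATABASE sh_load TO ROLE gluent_role"
$ impala-shell -q "GRANT SELECT ON DATABASE default TO ROLE gluent_role"
$ impala-shell -q "GRANT ALL ON URI 'hdfs://<namenode>:<port>/user/gluent/offload' TO ROLE gluent_role"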
By default, UDFs are installed into the default Impala database. This database can be changed by specifying the database name with the OFFLOAD_UDF_DB option.
By default, the sequence table is created in the default Impala database and is named gluent_sequence. The database and table name can be changed with the IN_LIST_JOIN_TABLE option.