Cloudera Data Platform Public Cloud Prerequisites

Introduction

This document describes the prerequisite steps for Gluent Data Platform on Cloudera Data Platform Public Cloud.

Provision Infrastructure

Data Hub

Gluent Offload Engine requires a Data Hub with the following services:

  • HDFS

  • YARN

  • Impala

There are no minimum requirements for the specification of the Data Hub. It should be sized according to the throughput required for the Offload process.

Tip

The Data Hub cluster can be resized after initial creation to easily scale capacity up or down.

Data Warehouse

Gluent Query Engine can use either the Data Hub or a Data Warehouse. While the use of a Data Warehouse is optional, it will usually deliver faster response times than the Data Hub.

Provision a CDP User for Gluent Data Platform

A Gluent Data Platform user (assumed to be gluent for the remainder of this document) is required.

This user should be provisioned in User Management in the Cloudera Management Console.

The following actions should be performed for the user:

  • Set a Workload Password: User Management > Users > gluent > Set Workload Password

  • Add SSH Key [1]: User Management > Users > gluent > SSH Keys > Add SSH Key

  • Grant Environment Access: Environments > [environment name] > Actions > Manage Access > Access > type gluent into Select group or user > select the EnvironmentUser role

  • Grant Cloud Storage Access: Environments > [environment name] > Actions > Manage Access > IDBroker Mappings > Current Mappings > Edit > assign an appropriate role [2] to gluent

  • Get Keytab [3]: User Management > Users > gluent > Actions > Get Keytab

[1] The public key for the user on the server from which Gluent Offload Engine commands will be initiated should be added.

[2] Refer to Onboarding CDP users and groups for cloud storage (AWS) or Onboarding CDP users and groups for cloud storage (Azure).

[3] Only required for manual Kerberos ticket management. See Kerberos.

The user must be synchronized:

  1. Synchronize Users: User Management > Users > Actions > Synchronize Users

  2. Synchronize Users to FreeIPA: Environments > [environment name] > Actions > Synchronize Users to FreeIPA

Verify the user is present using the following command:

$ id gluent

Storage Requirements

Note

This prerequisite is needed only if Gluent Data Platform is to be installed on an existing Data Hub server.

A filesystem location must be created for Gluent Data Platform installation.

Gluent Data Platform occupies approximately 1GB of storage once unpacked.

During operation, Gluent Data Platform writes log and trace files within its installation directory, so sufficient space must be allocated for ongoing operations.

The filesystem location must be owned by the provisioned Gluent Data Platform user.
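As an example, the installation directory could be created as follows (a minimal sketch run as root; the path /opt/gluent is an illustrative assumption and should be adjusted to the target environment):

# mkdir -p /opt/gluent
# chown gluent:gluent /opt/gluent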

Default Shell

The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should be a Bash shell (e.g. /bin/bash) for that user:

$ echo $SHELL
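If the default shell is not Bash, it can be changed with chsh, for example (run as root; /bin/bash is an assumed path). In environments where the user is managed centrally, such as FreeIPA, the shell may need to be changed in the identity management system instead:

# chsh -s /bin/bash gluent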

Create HDFS Directories

Gluent Data Platform requires one location within HDFS on the Data Hub:

  • Parameter: HDFS_LOAD

  • Purpose: Transient staging area used by the data transport phase of Offload

  • Necessity: Mandatory

  • Required Permissions: Read and write for HADOOP_SSH_USER; read for the impala user

  • Default Location: /user/gluent/offload

Provisioning the Gluent Data Platform user via User Management creates the /user/gluent directory in HDFS.

The steps to create the default location with the correct permissions are detailed below.

Create offload directory (as gluent):

hdfs dfs -mkdir /user/gluent/offload

Change permissions on offload directory to allow group write (as gluent):

hdfs dfs -chmod 770 /user/gluent/offload

Grant read and execute permissions on offload directory to Impala (as gluent):

hdfs dfs -setfacl -m user:impala:r-x /user/gluent/offload

Verify permissions on offload directory (as gluent):

hdfs dfs -ls -d /user/gluent/offload

Note

The offload directory should be group writable and show an ACL is active, i.e., the final ls command above should show permissions of drwxrwx---+.
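Optionally, inspect the full ACL on the offload directory as a supplementary check (as gluent); the output should include the user:impala:r-x entry added above:

hdfs dfs -getfacl /user/gluent/offload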

Oracle JDBC Drivers

Oracle's JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed in the location shown below. The location depends on the method that Offload will use to Transport Data to Staging. Ensure the file permissions are world-readable. The driver should be installed on the Data Hub node from which offload transport jobs will be initiated. This is typically a Gateway node, but it can be any Data Hub node running a YARN role (e.g. YARN ResourceManager).

  • Sqoop: /var/lib/sqoop

  • Spark: $SPARK_HOME/jars
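For example, for the Sqoop transport method the driver jar could be installed as follows (a sketch run as root; the file name ojdbc8.jar is an assumption and depends on the driver version downloaded):

# cp ojdbc8.jar /var/lib/sqoop/
# chmod 644 /var/lib/sqoop/ojdbc8.jar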

Sqoop

If Sqoop will be used to Transport Data to Staging, save the example command below into a temporary script (e.g. gl_sqoop.sh) and replace the placeholder values in --connect, --username, --password and --target-dir with values appropriate to the environment:

gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop

Note

If the database password contains a single-quote character (') then this must be escaped with a backslash.
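As an illustration, a hypothetical password of pass'word would be supplied as:

--password $'pass\'word' \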

Run the test Sqoop job (as gluent) from the node to which the Oracle JDBC Drivers were copied:

$ ./gl_sqoop.sh

Verify the test Sqoop job completes without error.

Oracle OS Package

If Gluent Data Platform is to be installed on an existing Data Hub server, install the operating system libaio package on that server if it is not already present (as root):

# yum install libaio
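Whether the package is already present can be confirmed beforehand, for example:

$ rpm -q libaio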

Kerberos

The gluent user requires a valid Kerberos ticket to perform the following Gluent Data Platform operations on the Data Hub:

  • Interact with Impala

  • Issue HDFS commands

  • Run Sqoop or Spark on YARN jobs

Interactions with Impala on Data Hub are via the Apache Knox Gateway which manages Kerberos tickets automatically. No manual Kerberos ticket management is necessary.

Depending on the topology of the Data Hub, manual Kerberos ticket management may be required for the remaining operations. If so, the keytab for the gluent user should be obtained and copied to the Data Hub server(s) on which HDFS commands will be issued and Sqoop or Spark on YARN jobs will be initiated on behalf of Gluent Data Platform. An appropriate mechanism to maintain an active Kerberos ticket on those Data Hub server(s) should be implemented.
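For example, a ticket could be obtained from the copied keytab with kinit and verified with klist (a sketch; the keytab path and realm are environment-specific assumptions). This could be scheduled, e.g. via cron, to keep an active ticket:

$ kinit -kt /home/gluent/gluent.keytab gluent@<REALM>
$ klist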

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com