Cloudera Data Platform Public Cloud Prerequisites¶
Introduction¶
This document includes the prerequisite steps for Cloudera Data Platform Public Cloud.
Provision Infrastructure¶
Data Hub¶
Gluent Offload Engine requires a Data Hub with the following services:
HDFS
YARN
Impala
There are no minimum requirements for the specification of the Data Hub. It should be sized according to the throughput required for the Offload process.
Tip
The Data Hub cluster can be resized after initial creation to easily scale capacity up or down.
Data Warehouse¶
Gluent Query Engine can utilize either the Data Hub or a Data Warehouse. While use of a Data Warehouse is optional, it will usually deliver faster response times than a Data Hub.
Provision a CDP User for Gluent Data Platform¶
A Gluent Data Platform user (assumed to be gluent for the remainder of this document) is required.
This user should be provisioned in User Management in the Cloudera Management Console.
The following actions should be performed for the user:
| Action | Details |
|---|---|
| Set a Workload Password | User Management > Users > gluent > Set Workload Password |
| Add SSH Key 1 | User Management > Users > gluent > SSH Keys > Add SSH Key |
| Grant Environment Access | Environments > [environment name] > Actions > Manage Access > Access > type gluent into Select group or user > Select EnvironmentUser role |
| Grant Cloud Storage Access | Environments > [environment name] > Actions > Manage Access > IDBroker Mappings > Current Mappings > Edit > assign an appropriate role 2 to gluent |
| Get Keytab 3 | User Management > Users > gluent > Actions > Get Keytab |
1. The public key for the user on the server from which Gluent Offload Engine commands will be initiated should be added.
2. Refer to Onboarding CDP users and groups for cloud storage (AWS) or Onboarding CDP users and groups for cloud storage (Azure).
3. Only required for manual Kerberos ticket management. See Kerberos.
The user must be synchronized:

- Synchronize Users: User Management > Users > Actions > Synchronize Users
- Synchronize Users to FreeIPA: Environments > [environment name] > Actions > Synchronize Users to FreeIPA
Verify the user is present using the following command:
$ id gluent
Storage Requirements¶
Note
This prerequisite is needed only if Gluent Data Platform is to be installed on an existing Data Hub server.
A filesystem location must be created for Gluent Data Platform installation.
Gluent Data Platform occupies approximately 1GB of storage once unpacked.
During operation, Gluent Data Platform will write log and trace files within its installation directory. Sufficient space will need to be allocated for continuing operations.
The filesystem location must be owned by the provisioned Gluent Data Platform user.
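As a sketch, the location can be created and assigned to the gluent user as follows. The path /u01/app/gluent is an assumption for illustration only; choose any location with sufficient space for the installation, log and trace files:

```shell
# Create an installation directory for Gluent Data Platform (as root).
# The path /u01/app/gluent is illustrative only.
mkdir -p /u01/app/gluent
chown gluent:gluent /u01/app/gluent
chmod 750 /u01/app/gluent
```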
Default Shell¶
The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should be bash for that user:
$ echo $SHELL
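If the output is not bash, the default shell can be changed. A minimal sketch (as root), assuming bash is installed at /bin/bash:

```shell
# Inspect the login shell recorded for the gluent user
getent passwd gluent | cut -d: -f7

# Change the default shell to bash if required (as root)
chsh -s /bin/bash gluent
```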
Create HDFS Directories¶
Gluent Data Platform requires one location within HDFS on the Data Hub:
| Parameter | Purpose | Necessity | Required Permissions | Default Location |
|---|---|---|---|---|
|  | Transient staging area used by the data transport phase of Offload | Mandatory | Read, write for HADOOP_SSH_USER; read for impala user |  |
Provisioning the Gluent Data Platform user via User Management creates the /user/gluent directory in HDFS.
The steps to create the default location with the correct permissions are detailed below.
Create offload directory (as gluent):
hdfs dfs -mkdir /user/gluent/offload
Change permissions on offload directory to allow group write (as gluent):
hdfs dfs -chmod 770 /user/gluent/offload
Grant read and execute permissions on offload directory to Impala (as gluent):
hdfs dfs -setfacl -m user:impala:r-x /user/gluent/offload
Verify permissions on offload directory (as gluent):
hdfs dfs -ls -d /user/gluent/offload
Note
The offload directory should be group writable and show that an ACL is active, i.e. the final ls command above should show permissions of drwxrwx---+.
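The ACL can also be inspected directly to confirm the Impala entry was applied; a sketch (as gluent):

```shell
# Show the full ACL on the offload directory; the output should include
# an entry similar to user:impala:r-x
hdfs dfs -getfacl /user/gluent/offload
```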
Oracle JDBC Drivers¶
Oracle’s JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed to the location shown below. The location is dependent on the method that will be used by Offload to Transport Data to Staging. Ensure the file permissions are world readable. The driver should be installed on the Data Hub node where offload transport jobs will be initiated. This is typically a Gateway node but can be any Data Hub node running a YARN role (e.g. YARN ResourceManager).
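As an illustration only (the actual destination comes from the table below, and the driver file name depends on the version downloaded, e.g. ojdbc8.jar):

```shell
# Copy the downloaded driver to the required location and make it
# world-readable. <location> is a placeholder for the path in the table.
cp ojdbc8.jar <location>/
chmod 644 <location>/ojdbc8.jar
```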
| Offload Transport Method | Location |
|---|---|
| Sqoop |  |
| Spark |  |
Sqoop¶
If Sqoop will be used to Transport Data to Staging, save the example command below into a temporary script (e.g. gl_sqoop.sh) and modify the placeholders in --connect, --username, --password and --target-dir with appropriate environment values:
gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop
Note
If the database password contains a single-quote character (') then it must be escaped with a backslash.
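The $'...' quoting used for --password in the example above is Bash ANSI-C quoting, in which a literal single quote is written as \'. A small demonstration:

```shell
# A password containing a single quote, quoted with $'...':
# the \' escape produces a literal ' inside the value.
password=$'pass\'word'
printf '%s\n' "$password"   # prints pass'word
```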
Run the test Sqoop job (as gluent) from the node to which the Oracle JDBC Drivers were copied:
$ ./gl_sqoop.sh
Verify the test Sqoop job completes without error.
Oracle OS Package¶
If Gluent Data Platform is to be installed on an existing Data Hub server, install the operating system libaio package on that server if it is not already present (as root):
# yum install libaio
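To make this step idempotent, the package can be checked first; a sketch for RPM-based systems:

```shell
# Install libaio only if it is not already present (as root)
rpm -q libaio >/dev/null 2>&1 || yum install -y libaio
```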
Kerberos¶
The gluent user requires a valid Kerberos ticket to perform the following Gluent Data Platform operations on the Data Hub:
Interact with Impala
Issue HDFS commands
Run Sqoop or Spark on YARN jobs
Interactions with Impala on Data Hub are via the Apache Knox Gateway which manages Kerberos tickets automatically. No manual Kerberos ticket management is necessary.
Depending on the topology of the Data Hub, manual Kerberos ticket management may be required for the remaining operations. If this is the case, the keytab for the gluent user should be obtained and copied to the Data Hub server(s) on which HDFS commands will be issued, and Sqoop or Spark on YARN jobs initiated on behalf of Gluent Data Platform. An appropriate mechanism to maintain an active Kerberos ticket on those Data Hub server(s) should be implemented.
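Where manual ticket management is required, a minimal sketch follows. The keytab path and principal shown are assumptions for illustration; use the keytab obtained in the Get Keytab step and the Kerberos realm of your environment:

```shell
# Obtain a Kerberos ticket from the downloaded keytab (as gluent).
# The keytab path and realm below are illustrative assumptions.
kinit -kt /home/gluent/gluent.keytab gluent@EXAMPLE.COM

# Confirm an active ticket exists
klist
```

An entry in the gluent user's crontab that re-runs the kinit command periodically is one possible mechanism for keeping the ticket active on those servers.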