Cloudera Data Platform Public Cloud Prerequisites

Introduction

This document describes the prerequisite steps for Gluent Data Platform on Cloudera Data Platform Public Cloud.

Provision Infrastructure

Data Hub

Gluent Offload Engine requires a Data Hub with the following services:

  • HDFS

  • YARN

  • Impala

There are no minimum requirements for the specification of the Data Hub. It should be sized according to the throughput required for the Offload process.

Tip

The Data Hub cluster can be resized after initial creation to easily scale capacity up or down.

Data Warehouse

Gluent Query Engine can use either the Data Hub or a Data Warehouse. While the use of a Data Warehouse is optional, it will usually deliver faster response times than the Data Hub.

Provision a CDP User for Gluent Data Platform

A Gluent Data Platform user (assumed to be gluent for the remainder of this document) is required.

This user should be provisioned in User Management in the Cloudera Management Console.

The following actions should be performed for the user:

  • Set a Workload Password: User Management > Users > gluent > Set Workload Password

  • Add SSH Key [1]: User Management > Users > gluent > SSH Keys > Add SSH Key

  • Grant Environment Access: Environments > [environment name] > Actions > Manage Access > Access > type gluent into Select group or user > select the EnvironmentUser role

  • Grant Cloud Storage Access: Environments > [environment name] > Actions > Manage Access > IDBroker Mappings > Current Mappings > Edit > assign an appropriate role [2] to gluent

  • Get Keytab [3]: User Management > Users > gluent > Actions > Get Keytab

[1] The public key for the user on the server from which Gluent Offload Engine commands will be initiated should be added.

[2] Refer to Onboarding CDP users and groups for cloud storage (AWS) or Onboarding CDP users and groups for cloud storage (Azure).

[3] Only required for manual Kerberos ticket management. See Kerberos.

The user must be synchronized:

  1. Synchronize Users: User Management > Users > Actions > Synchronize Users

  2. Synchronize Users to FreeIPA: Environments > [environment name] > Actions > Synchronize Users to FreeIPA

Verify the user is present using the following command:

$ id gluent

Storage Requirements

Note

This prerequisite is needed only if Gluent Data Platform is to be installed on an existing Data Hub server.

A filesystem location must be created for Gluent Data Platform installation.

Gluent Data Platform occupies approximately 1GB of storage once unpacked.

During operation, Gluent Data Platform writes log and trace files within its installation directory, so sufficient space must be allocated for ongoing operations.

The filesystem location must be owned by the provisioned Gluent Data Platform user.
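As an example, the installation directory could be created as follows (a minimal sketch run as root; the path /opt/gluent is an illustrative assumption and should be adjusted to the target environment):

# mkdir -p /opt/gluent
# chown gluent:gluent /opt/gluent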

Default Shell

The owner of the Gluent Data Platform software requires the Bash shell. The output of the following command should be a Bash shell (e.g. /bin/bash) for that user:

$ echo $SHELL
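If the default shell is not Bash, it can be changed with chsh, for example (run as root; /bin/bash is an assumed path). In environments where the user is managed centrally, such as FreeIPA, the shell may need to be changed in the identity management system instead:

# chsh -s /bin/bash gluent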

Create HDFS Directories

Gluent Data Platform requires one location within HDFS on the Data Hub:

  • Parameter: HDFS_LOAD

  • Purpose: Transient staging area used by the data transport phase of Offload

  • Necessity: Mandatory

  • Required Permissions: Read and write for HADOOP_SSH_USER; read for the impala user

  • Default Location: /user/gluent/offload

Provisioning the Gluent Data Platform user via User Management creates the /user/gluent directory in HDFS.

The steps to create the default location with the correct permissions are detailed below.

Create offload directory (as gluent):

hdfs dfs -mkdir /user/gluent/offload

Change permissions on offload directory to allow group write (as gluent):

hdfs dfs -chmod 770 /user/gluent/offload

Grant read and execute permissions on offload directory to Impala (as gluent):

hdfs dfs -setfacl -m user:impala:r-x /user/gluent/offload

Verify permissions on offload directory (as gluent):

hdfs dfs -ls -d /user/gluent/offload

Note

The offload directory should be group writable and show an ACL is active, i.e., the final ls command above should show permissions of drwxrwx---+.
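Optionally, inspect the full ACL on the offload directory as a supplementary check (as gluent); the output should include the user:impala:r-x entry added above:

hdfs dfs -getfacl /user/gluent/offload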

Oracle JDBC Drivers

Oracle's JDBC driver should be downloaded from Oracle's JDBC and UCP Downloads page and installed in the location shown below. The location depends on the method that Offload will use to Transport Data to Staging. Ensure the file permissions are world-readable. The driver should be installed on the Data Hub node from which offload transport jobs will be initiated. This is typically a Gateway node, but it can be any Data Hub node running a YARN role (e.g. YARN ResourceManager).

  • Sqoop: /var/lib/sqoop

  • Spark: $SPARK_HOME/jars
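For example, for the Sqoop transport method the driver jar could be installed as follows (a sketch run as root; the file name ojdbc8.jar is an assumption and depends on the driver version downloaded):

# cp ojdbc8.jar /var/lib/sqoop/
# chmod 644 /var/lib/sqoop/ojdbc8.jar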

Sqoop

If Sqoop will be used to Transport Data to Staging, save the example command below into a temporary script (e.g. gl_sqoop.sh) and replace the placeholder values in --connect, --username, --password and --target-dir with values appropriate to the environment:

gluent$ sqoop import -Doracle.sessionTimeZone=UTC \
-Doraoop.timestamp.string=true \
-Doraoop.jdbc.url.verbatim=true \
--connect \
jdbc:oracle:thin:@<db_host|vip>:<port>/<service> \
--username <database_username> \
--password $'<database_password>' \
--table SYS.DBA_OBJECTS \
--split-by OBJECT_ID \
--target-dir=/user/gluent/offload/test \
--delete-target-dir \
-m4 \
--direct \
--as-avrodatafile \
--outdir=.glsqoop

Note

If the database password contains a single-quote character (') then this must be escaped with a backslash.
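As an illustration, a hypothetical password of pass'word would be supplied as:

--password $'pass\'word' \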

Run the test Sqoop job (as gluent) from the node to which the Oracle JDBC Drivers were copied:

$ ./gl_sqoop.sh

Verify the test Sqoop job completes without error.

Oracle OS Package

If Gluent Data Platform is to be installed on an existing Data Hub server, install the operating system libaio package on that server if it is not already present (as root):

# yum install libaio
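Whether the package is already present can be confirmed beforehand, for example:

$ rpm -q libaio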

Kerberos

The gluent user requires a valid Kerberos ticket to perform the following Gluent Data Platform operations on the Data Hub:

  • Interact with Impala

  • Issue HDFS commands

  • Run Sqoop or Spark on YARN jobs

Interactions with Impala on Data Hub are via the Apache Knox Gateway which manages Kerberos tickets automatically. No manual Kerberos ticket management is necessary.

Depending on the topology of the Data Hub, manual Kerberos ticket management may be required for the remaining operations. If so, the keytab for the gluent user should be obtained and copied to the Data Hub server(s) on which HDFS commands will be issued and Sqoop or Spark on YARN jobs will be initiated on behalf of Gluent Data Platform. An appropriate mechanism to maintain an active Kerberos ticket on those Data Hub server(s) should be implemented.
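For example, a ticket could be obtained from the copied keytab with kinit and verified with klist (a sketch; the keytab path and realm are environment-specific assumptions). This could be scheduled, e.g. via cron, to keep an active ticket:

$ kinit -kt /home/gluent/gluent.keytab gluent@<REALM>
$ klist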

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com