Gluent Node Provisioning¶
Introduction¶
Gluent Node is the logical name given to a server or cluster that hosts the following functions:
Data Daemon: This is included in the Gluent Data Platform package
Gluent Transport: This package contains Spark Standalone, required for offloading RDBMS data to cloud storage buckets (for temporary staging)
The configuration of Gluent Node is very flexible. For example:
Gluent Node can be either a single server or a cluster
Gluent Node can be an on-premises physical or virtual server or cluster
Gluent Node can be a cloud server or cluster (for example, Google Cloud Platform VMs for operating with Google BigQuery)
Gluent Node can be an edge node of a Hadoop cluster (for example, for Cloudera Data Hub or Cloudera Data Platform environments)
Both packages can be installed on the same or separate servers or clusters, with different resource allocations
The Gluent Transport package can be omitted if an existing Spark cluster (or Hadoop cluster with Spark) is available
Gluent Node could be the RDBMS server or cluster (not recommended)
This document includes both manual and Google Cloud Platform Marketplace provisioning. Contact Gluent Support for details on provisioning not covered in this document.
Specification¶
The minimum supported specification of the Gluent Node is:
Operating System | Red Hat Enterprise Linux, CentOS, or Oracle Linux, 64-bit, version 7
CPU | 2 cores
Memory | 8GB
Disk | 20GB
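To confirm that a candidate server meets the minimum specification, standard OS commands such as the following can be used:

$ cat /etc/redhat-release
$ nproc
$ free -g
$ df -h /opt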
Google Cloud Platform Marketplace Provisioning¶
To run the Gluent Node on Google Cloud Platform, the recommended method of provisioning is to use one of the Gluent Data Platform images available in Google Cloud Platform Marketplace. The images contain an installation of Spark Standalone, Gluent Data Platform, and a pre-configured Data Daemon. Contact Gluent Support for details on the licensing options available.
The following steps should be performed after the Google Compute Engine instance has been created from the Google Cloud Platform Marketplace image:
Service Account JSON Key File¶
The service account JSON key file should be copied to /opt/gluent.
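For example, assuming the key file has already been transferred to the instance:

$ cp <replace-with-service-account-key-file-name>.json /opt/gluent/
$ chmod 600 /opt/gluent/<replace-with-service-account-key-file-name>.json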
Configure Spark Standalone¶
The Spark Standalone configuration must be updated with the contents of the JSON key file. A script is provided to achieve this; it should be run as the gluent OS user:
$ cd /opt/gluent
$ ./configure_spark_for_gcp <replace-with-service-account-key-file-name>.json
Restart Spark Standalone¶
For the changes to take effect, Spark Standalone must be restarted.
To stop Spark Standalone, issue the following commands:
$ $SPARK_HOME/sbin/stop-all.sh
$ $SPARK_HOME/sbin/stop-history-server.sh
To start Spark Standalone again, issue the following commands:
$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh
Manual Provisioning¶
The following sections describe the mandatory actions when manually provisioning the Gluent Node. Additional actions are required to install and configure Spark Standalone when it is needed for Google BigQuery and Snowflake backends.
All steps should be performed as the root user unless indicated otherwise.
Gluent Data Platform OS User¶
Provision the Gluent Data Platform OS user (in this example named gluent):
# useradd gluent -u 1066
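The UID shown is an example only. To confirm the user was created:

# id gluent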
Gluent Software Parent Directory¶
Create a parent directory for the Gluent Data Platform and Spark Standalone software:
# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent
Gluent Data Platform OS User Profile¶
Perform the following actions as the Gluent Data Platform OS User:
$ cat << EOF >> ~/.bashrc
export SPARK_HOME=/opt/gluent/transport/spark
export PATH=\$PATH:\$SPARK_HOME/bin
EOF
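To verify the profile change, reload the profile and check the variable:

$ source ~/.bashrc
$ echo $SPARK_HOME
/opt/gluent/transport/spark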
Gluent Data Platform OS User SSH¶
Configure SSH to localhost for the Gluent Data Platform OS User:
# su - gluent -c 'ssh-keygen -t rsa -N "" -f /home/gluent/.ssh/id_rsa'
# umask 077
# cat /home/gluent/.ssh/id_rsa.pub >> /home/gluent/.ssh/authorized_keys
# chown gluent:gluent /home/gluent/.ssh/authorized_keys
# su - gluent -c 'umask 077 && ssh-keyscan -t ecdsa localhost >> ~/.ssh/known_hosts'
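Passwordless SSH can then be verified with the following command, which should return the hostname without prompting for a password:

# su - gluent -c 'ssh localhost hostname'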
Install Spark Standalone¶
Perform the following actions as the Gluent Data Platform OS User:
$ tar xf <Gluent Data Platform Installation Media Directory>/gluent_transport_spark_<version>.tar.bz2 -C /opt/gluent
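Assuming the OS user profile created earlier is in effect, the installation can be verified with:

$ $SPARK_HOME/bin/spark-submit --version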
Configure Cloud Storage¶
During an offload, Gluent Data Platform will use Spark to stage data in a cloud storage bucket or container.
The privileges and authentication configuration Spark requires depend on the cloud storage provider to be used.
Privileges¶
The Gluent Data Platform backend prerequisites shown below provide the required Spark privileges. Ensure that the relevant prerequisite has been completed.
Backend | Cloud Storage Provider
---|---
Google BigQuery | Google Cloud Storage
Snowflake | Amazon S3, Google Cloud Storage, Microsoft Azure
Authentication Configuration¶
Note
Alternate methods of authentication with Amazon S3 are available and are listed below. Any of these can be used instead of populating the Spark Configuration File.

Method | Details
---|---
Instance Attached IAM Role (Recommended) | No Spark-level configuration is required. Ensure the IAM role is attached to all Gluent Node instances running Spark
Environment Variables | The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables can be set in the environment of the Gluent Data Platform OS user
AWS Credentials File | Authentication can be configured via a credentials file, typically ~/.aws/credentials
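For example, the Environment Variables method can be configured by appending the standard AWS variables to the Gluent Data Platform OS user profile (placeholder values shown):

$ cat << EOF >> ~/.bashrc
export AWS_ACCESS_KEY_ID=<access-key-id>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>
EOF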
Spark Configuration File¶
Authentication configuration can be made available to Spark via the configuration file.
Edit /opt/gluent/transport/spark/conf/spark-defaults.conf as the Gluent Data Platform OS User, adding the parameters and values for the cloud storage provider to be used. The parameters are the standard Hadoop connector properties for each provider, prefixed with spark.hadoop:
Cloud Storage Provider | Parameter | Value
---|---|---
Amazon S3 | spark.hadoop.fs.s3a.access.key | Access key ID
Amazon S3 | spark.hadoop.fs.s3a.secret.key | Secret access key
Google Cloud Storage | spark.hadoop.google.cloud.auth.service.account.enable | true
Google Cloud Storage | spark.hadoop.fs.gs.project.id | project_id value from the JSON key for the service account
Google Cloud Storage | spark.hadoop.fs.gs.auth.service.account.email | client_email value from the JSON key for the service account
Google Cloud Storage | spark.hadoop.fs.gs.auth.service.account.private.key.id | private_key_id value from the JSON key for the service account
Google Cloud Storage | spark.hadoop.fs.gs.auth.service.account.private.key | private_key value from the JSON key for the service account
Microsoft Azure | spark.hadoop.fs.azure.account.key.<storage account name>.blob.core.windows.net | Storage account key

Replace <storage account name> with the name of the storage account to be used.
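As an illustration, a completed Google Cloud Storage configuration would take the following form in spark-defaults.conf (all values are placeholders taken from a hypothetical service account JSON key file):

# Example Google Cloud Storage entries (placeholder values)
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.project.id example-project
spark.hadoop.fs.gs.auth.service.account.email example-sa@example-project.iam.gserviceaccount.com
spark.hadoop.fs.gs.auth.service.account.private.key.id 0123456789abcdef0123456789abcdef01234567
spark.hadoop.fs.gs.auth.service.account.private.key -----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n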
Automatically Start Spark Standalone¶
Add the following to /etc/rc.d/rc.local for automatic startup of Spark Standalone:
# Startup of Spark Standalone
su - gluent -c '$SPARK_HOME/sbin/start-all.sh'
su - gluent -c '$SPARK_HOME/sbin/start-history-server.sh'
Ensure it is executable:
# chmod +x /etc/rc.d/rc.local
Start Spark Standalone¶
Spark Standalone should now be started.
To start Spark Standalone manually, issue the following commands:
$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh
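Once started, the Spark Standalone processes can be confirmed with the jps utility (assuming a JDK is installed), which should list the Master, Worker and HistoryServer processes:

$ jps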