Gluent Node Provisioning¶

Table of Contents

Introduction
Specification
Gluent Node for Azure Synapse Analytics
Gluent Node for Cloudera Data Hub
Gluent Node for Cloudera Data Platform Private Cloud
Gluent Node for Cloudera Data Platform Public Cloud
Gluent Node for Google BigQuery
Gluent Node for Snowflake
Manual Configuration of Cloud Storage
Documentation Feedback

Introduction ¶

Gluent Node is the logical name given to a server or cluster that hosts the following functions:

Data Daemon: This is included in the Gluent Data Platform package
Gluent Transport: This package contains Spark Standalone, required for offloading RDBMS data to cloud storage buckets (for temporary staging)

The configuration of Gluent Node is very flexible. For example:

Gluent Node can be either a single server or a cluster
Gluent Node can be an on-premises physical or virtual server or cluster
Gluent Node can be a cloud server or cluster (for example, Google Cloud Platform VMs for operating with Google BigQuery)
Gluent Node can be an edge node of a Hadoop cluster (for example, for Cloudera Data Hub or Cloudera Data Platform environments)
Both packages can be installed on the same or separate servers or clusters and have different resource allocations
The Gluent Transport package can be omitted if an existing Spark cluster (or Hadoop cluster with Spark) is available
Gluent Node could be the RDBMS server or cluster (not recommended)

The location of Gluent Node(s) and the associated staging location is logically dictated by the location of the backend system. Both the Gluent Node(s) and the staging location should be as close as possible to the backend system.

Gluent Data Platform OS User¶

The owner of the Gluent Data Platform and/or Gluent Transport software is referred to as the Gluent Data Platform OS User.

Specification ¶

The minimum supported specification of a Gluent Node is:

Operating System	Red Hat Enterprise Linux, CentOS, Oracle Linux, 64-bit, version 7
CPU	2
Memory	8GB
Disk	20GB
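
As a pre-installation check, a candidate host could be compared with the minimum specification above using a short script along the following lines. This is only a sketch: the `check_min` function, the 7500MB memory tolerance and the choice to measure free space under `/opt` are assumptions made for illustration and are not part of Gluent Data Platform.

```shell
#!/usr/bin/env bash
# Sketch: compare this host with the minimum Gluent Node specification
# (2 CPUs, 8GB memory, 20GB disk). All names here are illustrative.

# check_min LABEL ACTUAL REQUIRED: prints a result line, fails when ACTUAL < REQUIRED
check_min() {
    local label=$1 actual=$2 required=$3
    if [ "$actual" -ge "$required" ]; then
        echo "OK   $label: $actual (minimum $required)"
    else
        echo "LOW  $label: $actual (minimum $required)"
        return 1
    fi
}

cpus=$(nproc 2>/dev/null || echo 0)

# MemTotal is in kB. An "8GB" VM reports a little under 8192MB because the
# kernel reserves some memory, so 7500MB is used as an assumed tolerance.
mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null)

# Available space (in GB) on the filesystem that would hold /opt/gluent.
target=/opt
[ -d "$target" ] || target=/
disk_gb=$(df -Pk "$target" | awk 'NR==2 {print int($4 / 1024 / 1024)}')

status=0
check_min "CPUs" "$cpus" 2 || status=1
check_min "Memory (MB)" "${mem_mb:-0}" 7500 || status=1
check_min "Free disk under $target (GB)" "${disk_gb:-0}" 20 || status=1
echo "overall status: $status"
```

A final status of 0 indicates that all three checks met the assumed thresholds. The thresholds can be adjusted if the software parent directory is placed elsewhere.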

Gluent Node for Azure Synapse Analytics ¶

For an Azure Synapse Analytics environment, both Data Daemon and Spark Standalone are typically run on one or more VM instances within Microsoft Azure. Offloaded data is typically staged in Microsoft Azure Storage.

Manual Provisioning¶

Once the Microsoft Azure VM instance(s) have been created the following actions are required to manually provision a Gluent Node:

Assign Instance Privileges
OS Packages
Gluent Data Platform OS User
Gluent Software Parent Directory
Gluent Data Platform OS User Environment
Gluent Data Platform OS User SSH
Install Spark Standalone
Automatic Startup of Spark Standalone
Start Spark Standalone

Assign Instance Privileges¶

Gluent Data Platform can make use of managed service identities to authenticate with the dedicated SQL pool database.

Note

Authentication with Azure storage by Gluent Offload Engine and Spark must be performed using storage account keys.

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Data Platform OS User Environment¶

Perform the following actions as the Gluent Data Platform OS User:

$ cat << EOF >> ~/.bashrc
export SPARK_HOME=/opt/gluent/transport/spark
export PATH=\$PATH:\$SPARK_HOME/bin
EOF

Gluent Data Platform OS User SSH¶

Configure SSH to localhost for the Gluent Data Platform OS User:

# su - gluent -c 'ssh-keygen -t rsa -N "" -f /home/gluent/.ssh/id_rsa'
# umask 077
# cat /home/gluent/.ssh/id_rsa.pub >> /home/gluent/.ssh/authorized_keys
# chown gluent:gluent /home/gluent/.ssh/authorized_keys
# su - gluent -c 'umask 077 && ssh-keyscan -t ecdsa localhost >> ~/.ssh/known_hosts'

Note

SSH access to localhost is required in order to allow the Spark Standalone management scripts to connect to localhost.
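
The SSH setup above can be sanity-checked before Spark Standalone is installed. The following is a minimal sketch, intended to be run as the Gluent Data Platform OS User. The `mode_is` helper is hypothetical, and the expectation of mode 600 is an assumption based on the `umask 077` used in the steps above.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the localhost SSH setup for the Gluent Data Platform OS User.

# mode_is PATH MODE: succeeds when PATH has exactly the octal permissions MODE
mode_is() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "$2" ]
}

ssh_dir="$HOME/.ssh"
for f in id_rsa authorized_keys; do
    if mode_is "$ssh_dir/$f" 600; then
        echo "OK   $ssh_dir/$f is mode 600"
    else
        echo "WARN $ssh_dir/$f is missing or not mode 600"
    fi
done

# BatchMode=yes makes ssh fail rather than prompt for a password or host key
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
    echo "OK   passwordless SSH to localhost works"
else
    echo "WARN passwordless SSH to localhost failed"
fi
```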

Install Spark Standalone¶

Perform the following actions as the Gluent Data Platform OS User:

$ tar xf <Gluent Data Platform Installation Media Directory>/gluent_transport_spark_<version>.tar.bz2 -C /opt/gluent
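
After extraction, it may be worth confirming that the startup scripts used in the following sections are present. The sketch below assumes the `SPARK_HOME` location set earlier (`/opt/gluent/transport/spark`); the `spark_install_ok` function is illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: confirm the Spark Standalone startup scripts exist after extraction.

# spark_install_ok SPARK_HOME: succeeds when both startup scripts are executable
spark_install_ok() {
    local home=$1 script
    for script in start-all.sh start-history-server.sh; do
        if [ ! -x "$home/sbin/$script" ]; then
            echo "missing or not executable: $home/sbin/$script"
            return 1
        fi
    done
    echo "Spark Standalone scripts found under $home"
}

if ! spark_install_ok "${SPARK_HOME:-/opt/gluent/transport/spark}"; then
    echo "check the tar extraction and the SPARK_HOME setting"
fi
```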

Automatic Startup of Spark Standalone¶

Add the following to /etc/rc.d/rc.local for automatic startup of Spark Standalone:

# Startup of Spark Standalone
su - gluent -c '$SPARK_HOME/sbin/start-all.sh'
su - gluent -c '$SPARK_HOME/sbin/start-history-server.sh'

Ensure /etc/rc.d/rc.local is executable:

# chmod +x /etc/rc.d/rc.local

Start Spark Standalone¶

Spark Standalone should now be started.

To start Spark Standalone manually, issue the following commands:

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh
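
Once the start scripts have run, a quick way to confirm that the services are listening is to probe their TCP ports. The sketch below assumes the stock Apache Spark defaults (7077 for the master, 8080 for the master web UI, 18080 for the history server); the Gluent Transport package may configure different ports, so treat these values as assumptions. The `port_open` helper is illustrative and relies on the `/dev/tcp` feature of bash.

```shell
#!/usr/bin/env bash
# Sketch: probe the default Spark Standalone ports on this host.

# port_open HOST PORT: succeeds when a TCP connection can be opened
port_open() {
    # /dev/tcp is a bash feature; the subshell closes the socket on exit
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# name:port pairs, using assumed Apache Spark default ports
for entry in "master:7077" "master-ui:8080" "history-server:18080"; do
    name=${entry%%:*}
    port=${entry##*:}
    if port_open localhost "$port"; then
        echo "UP   $name ($port)"
    else
        echo "DOWN $name ($port)"
    fi
done
```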

Gluent Node for Cloudera Data Hub ¶

For a Cloudera Data Hub environment, Data Daemon is typically run on one or more “edge/gateway nodes” in the Cloudera Data Hub cluster, and the HDFS storage of the cluster is used as the staging location. In this configuration, Gluent Data Platform is installed on the edge/gateway nodes and Data Daemon is configured to start automatically on system startup. There is no requirement for the Gluent Transport package.

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Node for Cloudera Data Platform Private Cloud ¶

For a Cloudera Data Platform Private Cloud environment, Data Daemon is typically run on one or more “edge/gateway nodes” in the Cloudera Data Platform Private Cloud cluster, and the HDFS storage of the cluster is used as the staging location. In this configuration, Gluent Data Platform is installed on the edge/gateway nodes and Data Daemon is configured to start automatically on system startup. There is no requirement for the Gluent Transport package.

Edge/Gateway Node Preparation¶

OS Packages
Gluent Data Platform OS User
Gluent Software Parent Directory

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Node for Cloudera Data Platform Public Cloud ¶

For a Cloudera Data Platform Public Cloud environment, Data Daemon is typically run on one or more “edge/gateway nodes” in the Cloudera Data Platform Public Cloud cluster, and the HDFS storage of the cluster is used as the staging location. In this configuration, Gluent Data Platform is installed on the edge/gateway nodes and Data Daemon is configured to start automatically on system startup. There is no requirement for the Gluent Transport package.

Edge/Gateway Node Preparation¶

OS Packages
Gluent Data Platform OS User
Gluent Software Parent Directory

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Node for Google BigQuery ¶

For a Google BigQuery environment both Data Daemon and Spark Standalone are typically run on one or more VM instances within Google Cloud Platform in the same location as the Google BigQuery dataset. Offloaded data is typically staged in Google Cloud Storage in the same location as the Google BigQuery dataset.

Gluent Node hosts can be provisioned using the Gluent Data Platform listings on Google Cloud Platform Marketplace, which simplifies the deployment of Gluent Data Platform. Alternatively, Gluent Data Platform can be installed manually on hosts running a supported operating system.

Google Cloud Platform Marketplace Provisioning¶

The Gluent Data Platform images available in Google Cloud Platform Marketplace contain an installation of Spark Standalone, Gluent Data Platform and a pre-configured Data Daemon. This significantly simplifies the provisioning of Gluent Node instances.

Once the Google Compute Engine instance(s) have been created via the relevant Google Cloud Platform Marketplace listing, the service account created for Gluent Data Platform should be assigned to the instance(s); see Service Account.

Manual Provisioning¶

Once the Google Compute Engine instance(s) have been created the following actions are required to manually provision a Gluent Node:

Assign Service Account
OS Packages
Gluent Data Platform OS User
Gluent Software Parent Directory
Gluent Data Platform OS User Environment
Gluent Data Platform OS User SSH
Install Spark Standalone
Automatic Startup of Spark Standalone
Start Spark Standalone

Assign Service Account¶

The service account created for Gluent Data Platform should be assigned to the instance(s); see Service Account.

All steps should be performed as the root user unless indicated otherwise.

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Data Platform OS User Environment¶

Perform the following actions as the Gluent Data Platform OS User:

$ cat << EOF >> ~/.bashrc
export SPARK_HOME=/opt/gluent/transport/spark
export PATH=\$PATH:\$SPARK_HOME/bin
EOF

Gluent Data Platform OS User SSH¶

Configure SSH to localhost for the Gluent Data Platform OS User:

# su - gluent -c 'ssh-keygen -t rsa -N "" -f /home/gluent/.ssh/id_rsa'
# umask 077
# cat /home/gluent/.ssh/id_rsa.pub >> /home/gluent/.ssh/authorized_keys
# chown gluent:gluent /home/gluent/.ssh/authorized_keys
# su - gluent -c 'umask 077 && ssh-keyscan -t ecdsa localhost >> ~/.ssh/known_hosts'

Note

SSH access to localhost is required in order to allow the Spark Standalone management scripts to connect to localhost.

Install Spark Standalone¶

Perform the following actions as the Gluent Data Platform OS User:

$ tar xf <Gluent Data Platform Installation Media Directory>/gluent_transport_spark_<version>.tar.bz2 -C /opt/gluent

Automatic Startup of Spark Standalone¶

Add the following to /etc/rc.d/rc.local for automatic startup of Spark Standalone:

# Startup of Spark Standalone
su - gluent -c '$SPARK_HOME/sbin/start-all.sh'
su - gluent -c '$SPARK_HOME/sbin/start-history-server.sh'

Ensure /etc/rc.d/rc.local is executable:

# chmod +x /etc/rc.d/rc.local

Start Spark Standalone¶

Spark Standalone should now be started.

To start Spark Standalone manually, issue the following commands:

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh

Gluent Node for Snowflake ¶

For a Snowflake environment both Data Daemon and Spark Standalone are typically run on one or more VM instances within the same cloud platform as the Snowflake environment and in the same location. Offloaded data is typically staged in the cloud storage provided by the same cloud platform as the Snowflake environment and in the same location.

Manual Provisioning¶

Once the VM instance(s) have been created in the same cloud platform and same region as the Snowflake environment the following actions are required to manually provision a Gluent Node:

Assign Instance Privileges
OS Packages
Gluent Data Platform OS User
Gluent Software Parent Directory
Gluent Data Platform OS User Environment
Gluent Data Platform OS User SSH
Install Spark Standalone
Automatic Startup of Spark Standalone
Start Spark Standalone

Assign Instance Privileges¶

The most secure way to provide the instances running Spark Standalone with the required access to the relevant cloud storage location is via the cloud platform mechanism to assign privileges directly to instances, namely:

Google Cloud Platform: assign an appropriately privileged service account to the instance
Amazon Web Services: assign an appropriately privileged role to the instance

OS Packages¶

Install the following required packages:

# yum install -y bzip2 java-11 libaio

Gluent Data Platform OS User¶

Provision the Gluent Data Platform OS user, e.g. gluent.

To create a local user with the standard UID use:

# useradd gluent -u 1066

Note

The Gluent Data Platform OS user is not required to be a local user and may be provisioned by an alternative method, e.g. LDAP.

Gluent Software Parent Directory¶

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Note

An alternative location can be used if required by organization standards.

Gluent Data Platform OS User Environment¶

Perform the following actions as the Gluent Data Platform OS User:

$ cat << EOF >> ~/.bashrc
export SPARK_HOME=/opt/gluent/transport/spark
export PATH=\$PATH:\$SPARK_HOME/bin
EOF

Gluent Data Platform OS User SSH¶

Configure SSH to localhost for the Gluent Data Platform OS User:

# su - gluent -c 'ssh-keygen -t rsa -N "" -f /home/gluent/.ssh/id_rsa'
# umask 077
# cat /home/gluent/.ssh/id_rsa.pub >> /home/gluent/.ssh/authorized_keys
# chown gluent:gluent /home/gluent/.ssh/authorized_keys
# su - gluent -c 'umask 077 && ssh-keyscan -t ecdsa localhost >> ~/.ssh/known_hosts'

Note

SSH access to localhost is required in order to allow the Spark Standalone management scripts to connect to localhost.

Install Spark Standalone¶

Perform the following actions as the Gluent Data Platform OS User:

$ tar xf <Gluent Data Platform Installation Media Directory>/gluent_transport_spark_<version>.tar.bz2 -C /opt/gluent

Automatic Startup of Spark Standalone¶

Add the following to /etc/rc.d/rc.local for automatic startup of Spark Standalone:

# Startup of Spark Standalone
su - gluent -c '$SPARK_HOME/sbin/start-all.sh'
su - gluent -c '$SPARK_HOME/sbin/start-history-server.sh'

Ensure /etc/rc.d/rc.local is executable:

# chmod +x /etc/rc.d/rc.local

Start Spark Standalone¶

Spark Standalone should now be started.

To start Spark Standalone manually, issue the following commands:

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh

Manual Configuration of Cloud Storage ¶

The sections above cover how to create Gluent Node VM instance(s) using the preferred approach of assigning privileges to the VM instances directly using the mechanism applicable to the relevant cloud platform. It is also possible and fully supported to configure authentication using other methods as detailed below.

The privileges and authentication configuration Spark requires are dependent on the cloud storage provider to be used.

Privileges¶

The Gluent Data Platform backend prerequisites shown below provide the required Spark privileges. Ensure that the relevant prerequisite has been completed.

Backend	Cloud Storage Provider
Azure Synapse Analytics	Microsoft Azure
Google BigQuery	Google Cloud Storage
Snowflake	Amazon S3, Google Cloud Storage, Microsoft Azure

Authentication Configuration¶

Spark Configuration File¶

Authentication configuration can be made available to Spark via the configuration file.

Edit /opt/gluent/transport/spark/conf/spark-defaults.conf as the Gluent Data Platform OS User adding the parameters and values for the cloud storage provider to be used:

Cloud Storage Provider	Parameter	Value
Amazon S3	`spark.hadoop.fs.s3a.impl`	`org.apache.hadoop.fs.s3a.S3AFileSystem`
	`spark.hadoop.fs.s3a.access.key`	Access key ID
	`spark.hadoop.fs.s3a.secret.key`	Secret access key
Google Cloud Storage	`spark.hadoop.fs.gs.impl`	`com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem`
	`spark.hadoop.fs.AbstractFileSystem.gs.impl`	`com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS`
	`spark.hadoop.fs.gs.project.id`	project_id value from the JSON key for the service account
	`spark.hadoop.fs.gs.auth.service.account.email`	client_email value from the JSON key for the service account
	`spark.hadoop.fs.gs.auth.service.account.enable`	`true`
	`spark.hadoop.fs.gs.auth.service.account.private.key.id`	private_key_id value from the JSON key for the service account
	`spark.hadoop.fs.gs.auth.service.account.private.key`	private_key value from the JSON key for the service account
	`spark.authenticate.secret`	Replace `change_me` with a unique random string
Microsoft Azure	`spark.hadoop.fs.{SCHEME}.impl` 1	`org.apache.hadoop.fs.azure.NativeAzureFileSystem`
	`spark.hadoop.fs.azure.account.key.{ACCOUNTNAME}.blob.core.windows.net` 2	Storage account key
	`spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped` 3	`true`

1: {SCHEME} should be replaced with one of wasb, wasbs, abfs, abfss.
2: {ACCOUNTNAME} should be replaced with the name of the storage account to be used.
3: Only required when the storage account is a standard general-purpose v2 with hierarchical namespace enabled.
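
As an illustration of the table above, the Amazon S3 parameters could be appended to the configuration file with a heredoc in the same style used for `~/.bashrc` earlier. This is a sketch under stated assumptions: the `append_s3_settings` function and its arguments are hypothetical, and restricting the file to mode 600 is a suggested precaution rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch: append the Amazon S3 parameters from the table above to a Spark
# configuration file. The function name and arguments are illustrative.

# append_s3_settings CONF_FILE ACCESS_KEY_ID SECRET_ACCESS_KEY
append_s3_settings() {
    local conf=$1 access_key=$2 secret_key=$3
    # spark-defaults.conf takes one "parameter value" pair per line
    cat >> "$conf" << EOF
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key  $access_key
spark.hadoop.fs.s3a.secret.key  $secret_key
EOF
    # The file now holds credentials, so restrict it to its owner (assumed precaution)
    chmod 600 "$conf"
}
```

Run as the Gluent Data Platform OS User, a call would look like `append_s3_settings /opt/gluent/transport/spark/conf/spark-defaults.conf "<access key ID>" "<secret access key>"`. The same pattern applies to the Google Cloud Storage and Microsoft Azure parameters.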

Note

Alternative methods of authentication with Amazon S3 are available and are listed below. Any of these can be used instead of populating the Spark Configuration File.

Environment Variables	The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables can be made available to Spark via standard operating system techniques such as a `.bashrc` file or via the Spark environment file `$SPARK_HOME/conf/spark-env.sh`
AWS Credentials File	Authentication can be configured via a credentials file with `aws configure`. Ensure this is done for all Gluent Nodes running Spark
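
For the environment variable method, the lines added to `$SPARK_HOME/conf/spark-env.sh` (or `~/.bashrc`) might look like the sketch below. The bracketed values are placeholders to be replaced with real credentials.

```shell
# Sketch of lines that could be appended to $SPARK_HOME/conf/spark-env.sh
# (or ~/.bashrc). The bracketed values are placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="<access key ID>"
export AWS_SECRET_ACCESS_KEY="<secret access key>"
```

Variables set this way are read when the Spark processes start, so a restart of Spark Standalone is likely to be needed after a change.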

Documentation Feedback ¶

Send feedback on this documentation to: feedback@gluent.com