Gluent Node Provisioning

Introduction

Gluent Node is the logical name given to a server or cluster that hosts the following functions:

  • Data Daemon: This is included in the Gluent Data Platform package

  • Gluent Transport: This package contains Spark Standalone, used to offload RDBMS data to cloud storage buckets for temporary staging

The configuration of Gluent Node is very flexible. For example:

  • Gluent Node can be either a single server or a cluster

  • Gluent Node can be an on-premises physical or virtual server or cluster

  • Gluent Node can be a cloud server or cluster (for example, Google Cloud Platform VMs for operating with Google BigQuery)

  • Gluent Node can be an edge node of a Hadoop cluster (for example, for Cloudera Data Hub or Cloudera Data Platform environments)

  • Both packages can be installed on the same or separate servers or clusters, with different resource allocations

  • The Gluent Transport package can be omitted if an existing Spark cluster (or Hadoop cluster with Spark) is available

  • Gluent Node can be the RDBMS server or cluster itself (not recommended)

This document covers both manual provisioning and Google Cloud Platform Marketplace provisioning. Contact Gluent Support for details of provisioning scenarios not covered in this document.

Specification

The minimum supported specification of the Gluent Node is:

  • Operating System: Red Hat Enterprise Linux, CentOS, or Oracle Linux, version 7, 64-bit

  • CPU: 2 cores

  • Memory: 8GB

  • Disk: 20GB

Google Cloud Platform Marketplace Provisioning

To run the Gluent Node on Google Cloud Platform, the recommended method of provisioning is to use one of the Gluent Data Platform images available in Google Cloud Platform Marketplace. The images contain an installation of Spark Standalone, Gluent Data Platform, and a pre-configured Data Daemon. Contact Gluent Support for details of the licensing options available.

The following steps should be performed after the Google Compute Engine instance has been created from the Google Cloud Platform Marketplace image:

Service Account JSON Key File

The Service Account JSON key file should be copied to /opt/gluent.
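
For example, assuming a key file named gluent-node-key.json (a placeholder name; substitute the actual key file name) already present in the gluent user's home directory:

$ cp ~/gluent-node-key.json /opt/gluent/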

Configure Spark Standalone

The Spark Standalone configuration must be updated with values from the JSON key file. A script is provided for this purpose and should be run as the gluent OS user:

$ cd /opt/gluent
$ ./configure_spark_for_gcp <replace-with-service-account-key-file-name>.json

Restart Spark Standalone

For the changes to take effect, Spark Standalone must be restarted.

To stop Spark Standalone, issue the following commands:

$ $SPARK_HOME/sbin/stop-all.sh
$ $SPARK_HOME/sbin/stop-history-server.sh

To start Spark Standalone again, issue the following commands:

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh

Manual Provisioning

The following steps are required to manually provision the Gluent Node. Installation and configuration of Spark Standalone is required only when the backend is Google BigQuery. All steps should be performed as the root user unless indicated otherwise:

OS Packages

Install the following required packages:

# yum install -y bzip2 java-11-openjdk libaio

Gluent Data Platform OS User

Provision the Gluent Data Platform OS user (in this example named gluent):

# useradd gluent -u 1066

Gluent Software Parent Directory

Create a parent directory for the Gluent Data Platform and Spark Standalone software:

# mkdir /opt/gluent
# chown gluent:gluent /opt/gluent

Gluent Data Platform OS User Profile

Perform the following actions as the Gluent Data Platform OS User:

$ cat << EOF >> ~/.bashrc
export SPARK_HOME=/opt/gluent/transport/spark
export PATH=\$PATH:\$SPARK_HOME/bin
EOF
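
The new environment variables take effect at the next login. To apply them to the current session, source the profile:

$ source ~/.bashrc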

Gluent Data Platform OS User SSH

Configure SSH to localhost for the Gluent Data Platform OS User:

# su - gluent -c 'ssh-keygen -t rsa -N "" -f /home/gluent/.ssh/id_rsa'
# umask 077
# cat /home/gluent/.ssh/id_rsa.pub >> /home/gluent/.ssh/authorized_keys
# chown gluent:gluent /home/gluent/.ssh/authorized_keys
# su - gluent -c 'umask 077 && ssh-keyscan -t ecdsa localhost >> ~/.ssh/known_hosts'
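
To verify that passwordless SSH to localhost is working, the following command should complete without a password or host key prompt:

# su - gluent -c 'ssh localhost hostname'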

Install Spark Standalone

Perform the following actions as the Gluent Data Platform OS User:

$ tar xf <Gluent Data Platform Installation Media Directory>/gluent_transport_spark_<version>.tar.bz2 -C /opt/gluent
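
The archive extracts beneath /opt/gluent/transport/spark, matching the SPARK_HOME value set in the profile above. The installation can be confirmed by printing the Spark version:

$ $SPARK_HOME/bin/spark-submit --version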

Configure Spark Standalone

Edit /opt/gluent/transport/spark/conf/spark-defaults.conf as the Gluent Data Platform OS User, adding the following parameters and values:

  • spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

  • spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

  • spark.hadoop.fs.gs.project.id: the project_id value from the Service Account JSON key

  • spark.hadoop.fs.gs.auth.service.account.email: the client_email value from the Service Account JSON key

  • spark.hadoop.fs.gs.auth.service.account.enable: true

  • spark.hadoop.fs.gs.auth.service.account.private.key.id: the private_key_id value from the Service Account JSON key

  • spark.hadoop.fs.gs.auth.service.account.private.key: the private_key value from the Service Account JSON key

  • spark.authenticate.secret: a unique random string (replace the change_me placeholder)
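
For illustration, a completed spark-defaults.conf might contain entries like the following (all values are hypothetical placeholders, not real credentials; take the actual values from the Service Account JSON key file):

spark.hadoop.fs.gs.impl                                com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl             com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
spark.hadoop.fs.gs.project.id                          example-project-123456
spark.hadoop.fs.gs.auth.service.account.email          gluent-node@example-project-123456.iam.gserviceaccount.com
spark.hadoop.fs.gs.auth.service.account.enable         true
spark.hadoop.fs.gs.auth.service.account.private.key.id 0123456789abcdef0123456789abcdef01234567
spark.hadoop.fs.gs.auth.service.account.private.key    -----BEGIN PRIVATE KEY-----\nMIIE...\n-----END PRIVATE KEY-----\n
spark.authenticate.secret                              f3b9c2e7a1d84f06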

Automatically Start Spark Standalone

Add the following to /etc/rc.d/rc.local for automatic startup of Spark Standalone:

# Startup of Spark Standalone
su - gluent -c '$SPARK_HOME/sbin/start-all.sh'
su - gluent -c '$SPARK_HOME/sbin/start-history-server.sh'

Ensure it is executable:

# chmod +x /etc/rc.d/rc.local

Start Spark Standalone

With configuration complete, Spark Standalone can now be started.

To start Spark Standalone manually, issue the following commands:

$ $SPARK_HOME/sbin/start-all.sh
$ $SPARK_HOME/sbin/start-history-server.sh
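
To verify that the daemons are running, list the Spark processes; a Master, a Worker, and a HistoryServer process should be present. By default, the Spark master web UI listens on port 8080 and the history server UI on port 18080.

$ ps -ef | grep org.apache.spark.deploy | grep -v grep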

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com