Integrating with Cloud Storage

Introduction

Gluent Offload Engine can offload data to cloud storage and can present Hadoop tables whose data resides in cloud storage back to the RDBMS. Presenting such tables is transparent to Gluent Offload Engine, whereas offloading data to cloud storage requires a small amount of configuration. Gluent UDFs are stored in either cloud storage or HDFS, depending on the chosen configuration options.

Supported Cloud Storage

Gluent Offload Engine supports the following cloud storage services:

  • Amazon S3

  • Microsoft Azure Data Lake Storage Generation 1

  • Microsoft Azure Data Lake Storage Generation 2

The parameters required for cloud storage offload are:

  • OFFLOAD_FS_SCHEME: The storage scheme in which the offloaded data will be persisted. Ad hoc override available with --offload-fs-scheme.

  • OFFLOAD_FS_CONTAINER: The name of the bucket or container to be used for offloads. Ad hoc override available with --offload-fs-container.

  • OFFLOAD_FS_PREFIX: Set this to a subdirectory defined within the bucket or container, or to an empty string. Ad hoc override available with --offload-fs-prefix.
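
As an illustration, the three parameters might be set as follows for an Amazon S3 target. This is a sketch only: it assumes the parameters are exported as shell variables in the Gluent environment file, and the bucket name my-offload-bucket and the prefix gluent/offload are hypothetical placeholders.

```shell
# Sketch of cloud storage settings for an Amazon S3 target.
# Assumption: these are exported shell variables in the Gluent environment file.

# Storage scheme: s3a, adl, abfs, abfss, hdfs or inherit
export OFFLOAD_FS_SCHEME=s3a

# Hypothetical bucket name; replace with your own bucket or container
export OFFLOAD_FS_CONTAINER=my-offload-bucket

# Hypothetical subdirectory inside the bucket; may be an empty string
export OFFLOAD_FS_PREFIX=gluent/offload
```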

Note

Before attempting to interact with cloud storage using Gluent Data Platform, confirm that the Hadoop cluster can read from and write to the target bucket or container. Use native hdfs dfs commands to confirm this.
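
One way to run this check is sketched below. The helper names offload_uri and check_cloud_rw are hypothetical, and the sketch assumes the target can be addressed as scheme://container/prefix, which is the usual form for S3A. Azure targets normally also carry an account host name in the URI, so adjust the URI to match your environment. Only standard hdfs dfs subcommands (-ls, -put, -cat, -rm) are used.

```shell
#!/usr/bin/env bash

# Build a target URI from scheme, container and optional prefix.
# Assumption: the scheme://container/prefix form (typical for S3A).
offload_uri() {
    local scheme="$1" container="$2" prefix="${3:-}"
    if [ -n "$prefix" ]; then
        printf '%s://%s/%s\n' "$scheme" "$container" "$prefix"
    else
        printf '%s://%s\n' "$scheme" "$container"
    fi
}

# List the target, write a small probe file, read it back, then delete it.
# Returns non-zero as soon as any step fails.
check_cloud_rw() {
    local uri="$1"
    local probe="${uri}/gluent_rw_probe_$$.txt"
    local tmp
    tmp="$(mktemp)" || return 1
    echo "probe" > "$tmp"

    hdfs dfs -ls "$uri" &&
    hdfs dfs -put -f "$tmp" "$probe" &&
    hdfs dfs -cat "$probe" &&
    hdfs dfs -rm -skipTrash "$probe"
    local rc=$?

    rm -f "$tmp"
    return "$rc"
}
```

For example, check_cloud_rw "$(offload_uri s3a my-offload-bucket gluent/offload)" exercises both read and write access from a cluster node. If a step after the write fails, the probe file may be left behind and can be removed manually.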

Offload Scenarios

There are three likely scenarios when planning to offload to cloud storage:

  • RDBMS schemas are offloaded to cloud storage by default, and a limited number of offloads use HDFS.

  • RDBMS schemas are offloaded to HDFS by default, and a limited number of offloads use cloud storage.

  • The RDBMS contains a mix of schemas: some should use HDFS while others should use cloud storage.

The scenario that applies determines how the configuration should be completed.

The Default Offload Location is HDFS

For this case, it is recommended that OFFLOAD_FS_SCHEME be left at the default value, inherit. All tables created by offload will then inherit their location from the parent database. If databases are created using --create-backend-db, they will default to an HDFS location.

Offloads to cloud storage can be completed on an ad hoc basis using --offload-fs-scheme s3a|adl|abfs|abfss.
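
A minimal sketch of such an ad hoc offload follows. The three --offload-fs-* options are those described above. The $OFFLOAD_HOME/bin/offload path, the -t (table) and -x (execute) options, the table SH.SALES, the bucket name and the prefix are assumptions used for illustration, and the wrapper name adhoc_cloud_offload is hypothetical.

```shell
#!/usr/bin/env bash

# Offload a single table to Amazon S3 for this run only, leaving the
# configured default (inherit or hdfs) untouched.
# Assumptions: offload lives in $OFFLOAD_HOME/bin, -t names the table and
# -x requests execution; bucket and prefix are placeholders.
adhoc_cloud_offload() {
    local table="$1"    # e.g. SH.SALES
    "$OFFLOAD_HOME"/bin/offload -t "$table" -x \
        --offload-fs-scheme s3a \
        --offload-fs-container my-offload-bucket \
        --offload-fs-prefix gluent/offload
}
```

Calling adhoc_cloud_offload SH.SALES would then offload that one table to the bucket, while other offloads continue to use the default location.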

The Default Offload Location is Cloud Storage

In this case, set OFFLOAD_FS_SCHEME to the correct value for your cloud storage target. All tables created by offload will be offloaded to cloud storage, and any databases created using --create-backend-db will include a default cloud storage location.

Offloads to HDFS can be completed on an ad hoc basis using --offload-fs-scheme hdfs.

The Default Offload Location is Mixed Depending on the Schema

For this case, it is recommended that OFFLOAD_FS_SCHEME be left at the default value, inherit. All tables created by offload will then inherit their location from the parent database. When databases are created using --create-backend-db, it is important to include the correct value for --offload-fs-scheme, i.e. either hdfs or the correct value for your cloud storage target. If Hadoop databases are created outside of Gluent Data Platform, be sure to define the appropriate location for each.
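
The per-schema choice can be sketched as a small lookup that feeds --offload-fs-scheme when the backend database is first created. The schema names, the function names scheme_for_schema and create_db_and_offload, the $OFFLOAD_HOME/bin/offload path and the -t and -x options are all assumptions for illustration. The sketch also assumes OFFLOAD_FS_CONTAINER and OFFLOAD_FS_PREFIX are already set in the configuration.

```shell
#!/usr/bin/env bash

# Hypothetical mapping from RDBMS schema to storage scheme.
scheme_for_schema() {
    case "$1" in
        SH|FINANCE) echo "s3a" ;;    # these schemas go to cloud storage
        *)          echo "hdfs" ;;   # everything else stays on HDFS
    esac
}

# Create the backend database with the location that matches the schema,
# then offload the table. Later offloads into the same database can rely
# on OFFLOAD_FS_SCHEME=inherit.
create_db_and_offload() {
    local table="$1"                 # e.g. SH.SALES
    local schema="${table%%.*}"      # text before the first dot
    local scheme
    scheme="$(scheme_for_schema "$schema")"
    "$OFFLOAD_HOME"/bin/offload -t "$table" -x \
        --create-backend-db \
        --offload-fs-scheme "$scheme"
}
```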

Ad hoc offloads to the non-default filesystem can be completed using --offload-fs-scheme.

Environment Verification (Connect) will verify the cloud storage configuration.

User Defined Functions

When OFFLOAD_FS_SCHEME is set to a cloud storage target, the Gluent UDF library will be copied to the cloud storage location specified by OFFLOAD_FS_CONTAINER and OFFLOAD_FS_PREFIX and the UDFs will be created referencing that location.

When OFFLOAD_FS_SCHEME is set to hdfs or inherit, the Gluent UDF library will be copied to the HDFS location specified by HDFS_HOME and the UDFs will be created referencing that location.
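
The two rules above can be summarised in a short sketch. It only models the decision described in this section: the function name udf_library_location is hypothetical, the cloud URI is assumed to take the scheme://container/prefix form, and the exact path at which the library is placed beneath these locations is not specified here.

```shell
#!/usr/bin/env bash

# Print the base location where the Gluent UDF library is expected to be
# copied, based on the current configuration variables.
udf_library_location() {
    case "${OFFLOAD_FS_SCHEME:-inherit}" in
        hdfs|inherit)
            # HDFS case: the location comes from HDFS_HOME
            printf '%s\n' "${HDFS_HOME:-}"
            ;;
        *)
            # Cloud storage case: container plus optional prefix
            if [ -n "${OFFLOAD_FS_PREFIX:-}" ]; then
                printf '%s://%s/%s\n' "$OFFLOAD_FS_SCHEME" \
                    "${OFFLOAD_FS_CONTAINER:-}" "$OFFLOAD_FS_PREFIX"
            else
                printf '%s://%s\n' "$OFFLOAD_FS_SCHEME" \
                    "${OFFLOAD_FS_CONTAINER:-}"
            fi
            ;;
    esac
}
```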

Important

Gluent UDFs must be installed on cloud storage when Gluent Query Engine uses Data Warehouse on Cloudera Data Platform Public Cloud.
