Integrating with Cloud Storage

Introduction

Gluent Offload Engine can offload data to cloud storage and present Hadoop tables with data in cloud storage back to the RDBMS. Presenting cloud storage tables is transparent to Gluent Offload Engine and needs no additional setup. Offloading data to cloud storage requires a small amount of configuration, described below.

Supported Cloud Storage

Gluent Offload Engine supports the following cloud storage services:

  • Amazon S3

  • Microsoft Azure Data Lake Storage Generation 1

  • Microsoft Azure Data Lake Storage Generation 2

Parameters required for cloud storage offload are:

Parameter              Reference
OFFLOAD_FS_SCHEME      The storage scheme in which the offloaded data will be persisted. Ad hoc override available with --offload-fs-scheme.
OFFLOAD_FS_CONTAINER   The name of the bucket or container to be used for offloads. Ad hoc override available with --offload-fs-container.
OFFLOAD_FS_PREFIX      Set this to a subdirectory defined within the bucket or container, or to an empty string. Ad hoc override available with --offload-fs-prefix.
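
As an illustration, the three parameters might appear in offload.env as shown below. This is a sketch only: the bucket name and prefix are hypothetical, the export form assumes that offload.env is a shell-sourced file, and s3a is used simply because it is one of the scheme values mentioned in this guide.

```shell
# Hypothetical offload.env fragment for an Amazon S3 target.
# The bucket name and prefix are placeholders, not real values.
export OFFLOAD_FS_SCHEME=s3a
export OFFLOAD_FS_CONTAINER=my-offload-bucket
export OFFLOAD_FS_PREFIX=gluent/offload
```

With these values, offloaded data would be expected to land beneath the gluent/offload subdirectory of the my-offload-bucket bucket.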

Note

Before attempting to interact with cloud storage using Gluent Data Platform, confirm that the Hadoop cluster can read from and write to the target bucket or container. Use native hdfs dfs commands to confirm this.
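
One way to carry out the check described in the note is a small write, read and delete round trip with hdfs dfs. The script below is a minimal sketch, not a supplied utility: the bucket and prefix are hypothetical, and the URI layout assumes an S3-style target (Azure schemes also need the storage account host name in the URI). It should be run on a node that has the Hadoop client configured.

```shell
#!/bin/sh
# Hypothetical target: replace with the bucket and prefix used for offloads.
scheme="s3a"
container="my-offload-bucket"
prefix="gluent/offload"

# Assumed S3-style URI layout: scheme://container/prefix
uri="${scheme}://${container}/${prefix}"
# Probe file name includes the shell PID to avoid clashing with real data.
probe="${uri}/gluent_rw_probe_$$.txt"

if command -v hdfs >/dev/null 2>&1; then
  # Create the directory if needed, write a small file from stdin,
  # read it back, then remove it without using the trash.
  hdfs dfs -mkdir -p "$uri" \
    && echo "rw probe" | hdfs dfs -put - "$probe" \
    && hdfs dfs -cat "$probe" \
    && hdfs dfs -rm -skipTrash "$probe" \
    && echo "read/write OK: $uri" \
    || echo "read/write check FAILED: $uri" >&2
else
  echo "hdfs client not found on PATH; run this on a cluster node" >&2
fi
```

If any step fails, the error reported by hdfs dfs usually indicates whether the problem lies with credentials, permissions or the URI itself.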

Scenarios

There are three likely scenarios when planning to offload to cloud storage:

  • By default, RDBMS schemas are offloaded to cloud storage. A limited number of offloads will use HDFS.

  • By default, RDBMS schemas are offloaded to HDFS. A limited number of offloads will use cloud storage.

  • There is a mix of schemas in the RDBMS: some should use HDFS while others should use cloud storage.

Identifying which of these scenarios applies determines how the configuration should be completed.

The Default Offload Location is HDFS

For this case, it is recommended that OFFLOAD_FS_SCHEME is left at the default value, inherit. All tables created by offload will inherit the location from the parent database. If databases are created using --create-backend-db, a default location of HDFS will be used.

Offloads to cloud storage can be completed on an ad hoc basis using --offload-fs-scheme s3a|adl|abfs|abfss.
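
The ad hoc override above can be sketched as follows. The table name, bucket and prefix are hypothetical, and the -t (table) and -x (execute) options are assumed from typical offload usage rather than taken from this page. The script validates the scheme against the values listed above and only prints the resulting command, so it can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical inputs for a one-off offload to cloud storage.
table="SH.SALES"
scheme="s3a"                     # one of: s3a, adl, abfs, abfss
container="my-offload-bucket"    # placeholder bucket name
prefix="gluent/offload"          # placeholder subdirectory

# Reject anything outside the cloud schemes listed in this section.
case "$scheme" in
  s3a|adl|abfs|abfss) ;;
  *) echo "unsupported cloud scheme: $scheme" >&2; exit 1 ;;
esac

# Assumption: -t names the table and -x executes the offload.
cmd="offload -t $table -x --offload-fs-scheme $scheme"
cmd="$cmd --offload-fs-container $container --offload-fs-prefix $prefix"

# Print only; run the command manually once it has been checked.
echo "$cmd"
```

The container and prefix overrides are included because, when HDFS is the default, OFFLOAD_FS_CONTAINER and OFFLOAD_FS_PREFIX may not be set in the environment file.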

The Default Offload Location is Cloud Storage

In this case, set OFFLOAD_FS_SCHEME to the correct value for your cloud storage target. All tables created by offload will be offloaded to cloud storage, and any databases created using --create-backend-db will include a default cloud storage location.

Offloads to HDFS can be completed on an ad hoc basis using --offload-fs-scheme hdfs.

The Default Offload Location is Mixed Depending on the Schema

For this case, it is recommended that OFFLOAD_FS_SCHEME is left at the default value, inherit. All tables created by offload will inherit the location from the parent database. When databases are created using --create-backend-db, it is important to include the correct value for --offload-fs-scheme, i.e. either hdfs or the correct value for your cloud storage target. If Hadoop databases are created outside of Gluent Data Platform, be sure to define the appropriate location for each.
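
For the mixed case, it can help to record which scheme each schema should use and derive the first offload command from that record. The sketch below assumes a hypothetical mapping (SH and HR on HDFS, FINANCE on S3) and hypothetical table names, and again assumes the -t and -x options. It prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical schema-to-scheme mapping; adjust to the real environment.
scheme_for_schema() {
  case "$1" in
    SH|HR)   echo "hdfs" ;;
    FINANCE) echo "s3a" ;;
    *)       echo "no scheme recorded for schema: $1" >&2; return 1 ;;
  esac
}

# First offload per schema: create the backend database in the right place.
for table in SH.SALES FINANCE.LEDGER; do
  schema="${table%%.*}"                          # text before the first dot
  scheme="$(scheme_for_schema "$schema")" || exit 1
  # Assumption: -t names the table and -x executes the offload.
  echo "offload -t $table -x --create-backend-db --offload-fs-scheme $scheme"
done
```

Once each backend database exists with the correct location, later offloads into it can rely on the inherit default and omit the scheme option.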

Ad hoc offloads to the non-default filesystem can be completed using --offload-fs-scheme.

Important

Any changes made to the Gluent Data Platform environment file (offload.env) must be propagated across all installations.

Environment Verification (Connect) will verify the cloud storage configuration.
