Integrating with Cloud Storage
Introduction
Gluent Offload Engine can offload data to cloud storage and present Hadoop tables with data in cloud storage back to the RDBMS. Presenting cloud storage tables is transparent to Gluent Offload Engine. Offloading data to cloud storage requires a small amount of configuration. Gluent UDFs will be stored in cloud storage or HDFS depending on the chosen configuration options.
Supported Cloud Storage
Gluent Offload Engine supports the following cloud storage:
Amazon S3
Microsoft Azure Data Lake Storage Generation 1
Microsoft Azure Data Lake Storage Generation 2
Parameters required for cloud storage offload are:
Parameter | Reference |
---|---|
OFFLOAD_FS_SCHEME | The storage scheme in which the offloaded data will be persisted. An ad hoc override is available (--offload-fs-scheme) |
OFFLOAD_FS_CONTAINER | The name of the bucket or container to be used for offloads. An ad hoc override is available |
OFFLOAD_FS_PREFIX | Set this to a subdirectory defined within the bucket or container, or to an empty string. An ad hoc override is available |
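To illustrate how the three parameters relate to one another, the following is a minimal sketch that composes a base location from a scheme, a container and an optional prefix. The function name, the bucket name and the scheme://container/prefix layout are assumptions made for illustration; the document does not state how Gluent Offload Engine builds the final path, which will also include database and table directories.

```python
def storage_uri(scheme: str, container: str, prefix: str = "") -> str:
    """Compose a base URI from a scheme, a container and an optional prefix.

    Assumption: the location follows the common Hadoop form
    scheme://container/prefix. This is not taken from Gluent code.
    """
    base = f"{scheme}://{container.strip('/')}"
    prefix = prefix.strip("/")
    # An empty prefix means data sits directly under the container.
    return f"{base}/{prefix}" if prefix else base


# Hypothetical values standing in for OFFLOAD_FS_SCHEME,
# OFFLOAD_FS_CONTAINER and OFFLOAD_FS_PREFIX.
print(storage_uri("s3a", "my-offload-bucket", "gluent"))
print(storage_uri("s3a", "my-offload-bucket", ""))
```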
Note
Before attempting to interact with cloud storage using Gluent Data Platform, confirm that the Hadoop cluster can read from and write to the target bucket or container. Use native hdfs dfs commands to confirm this.
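One way to make the read/write check repeatable is to generate the hdfs dfs commands for a simple round trip: upload a small file, list it, read it back and remove it. The sketch below only builds the command lines and does not run them. The helper name, the probe file name and the example location are hypothetical; -put, -ls, -cat and -rm are standard hdfs dfs subcommands.

```python
import posixpath
from typing import List


def round_trip_commands(uri: str, local_file: str = "probe.txt") -> List[List[str]]:
    """Build hdfs dfs commands that write, list, read and remove a probe file."""
    # The remote object takes the local file's base name under the given location.
    target = f"{uri.rstrip('/')}/{posixpath.basename(local_file)}"
    return [
        ["hdfs", "dfs", "-put", local_file, target],  # write test
        ["hdfs", "dfs", "-ls", target],               # listing test
        ["hdfs", "dfs", "-cat", target],              # read test
        ["hdfs", "dfs", "-rm", target],               # clean up
    ]


# Hypothetical target location; substitute the real bucket or container.
for cmd in round_trip_commands("s3a://my-offload-bucket/gluent"):
    print(" ".join(cmd))
```

Run the printed commands as a user with the same cluster credentials that offloads will use, so that the check exercises the same permissions.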
Offload Scenarios
There are three likely scenarios when planning to offload to cloud storage:
By default RDBMS schemas shall be offloaded to cloud storage. A limited number of offloads will use HDFS
By default RDBMS schemas shall be offloaded to HDFS. A limited number of offloads will use cloud storage
There is a mix of schemas in the RDBMS and some should use HDFS while others should use cloud storage
The scenario that applies determines how the configuration should be completed.
The Default Offload Location is HDFS
For this case, it is recommended that OFFLOAD_FS_SCHEME is left at the default value inherit. All tables created by offload will inherit the location from the parent database. If databases are created using --create-backend-db a default location of HDFS will be used.
Offloads to cloud storage can be completed on an ad hoc basis using --offload-fs-scheme s3a|adl|abfs|abfss.
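The interaction between the configured default, the inherit value and an ad hoc override can be modelled in a few lines. This is a sketch of the behaviour as described here, not Gluent's implementation: the function name and the precedence order (ad hoc override first, then the configured value, with inherit deferring to the parent database) are assumptions drawn from the text.

```python
from typing import Optional


def effective_scheme(configured: str, database_scheme: str,
                     override: Optional[str] = None) -> str:
    """Return the storage scheme a new offloaded table would use.

    configured      -- value of OFFLOAD_FS_SCHEME (for example "inherit")
    database_scheme -- scheme of the parent database location
    override        -- value passed with --offload-fs-scheme, if any
    """
    if override:
        # Assumption: an ad hoc override takes precedence over configuration.
        return override
    if configured == "inherit":
        # The table location follows the parent database.
        return database_scheme
    return configured


print(effective_scheme("inherit", "hdfs"))         # default offload to HDFS
print(effective_scheme("inherit", "hdfs", "s3a"))  # ad hoc offload to cloud storage
```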
The Default Offload Location is Cloud Storage
In this case set OFFLOAD_FS_SCHEME to the correct value for your cloud storage target. All tables created by offload will be offloaded to cloud storage and any databases created using --create-backend-db will include a default cloud storage location.
Offloads to HDFS can be completed on an ad hoc basis using --offload-fs-scheme hdfs.
The Default Offload Location is Mixed Depending on the Schema
For this case, it is recommended that OFFLOAD_FS_SCHEME is left at the default value inherit. All tables created by offload will inherit the location from the parent database. When databases are created using --create-backend-db it is important to include the correct value for --offload-fs-scheme, i.e. either hdfs or the correct value for your cloud storage target. If Hadoop databases are created outside of Gluent Data Platform then be sure to define the appropriate location.
Ad hoc offloads to the non-default filesystem can be completed using --offload-fs-scheme.
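For a mixed environment it can help to record the intended scheme for each schema in one place and derive the database creation options from it. The sketch below is illustrative only: the schema names and the plan are hypothetical, the way the option value is passed (as a separate argument) is an assumption, and any other arguments the offload command requires are omitted.

```python
from typing import Dict, List

VALID_SCHEMES = {"hdfs", "s3a", "adl", "abfs", "abfss"}

# Hypothetical plan: which storage scheme each RDBMS schema should use.
SCHEMA_SCHEMES: Dict[str, str] = {"SALES": "hdfs", "ARCHIVE": "s3a"}


def create_db_options(schema: str, plan: Dict[str, str]) -> List[str]:
    """Return the storage-related options for creating a schema's backend database."""
    scheme = plan[schema]
    if scheme not in VALID_SCHEMES:
        raise ValueError(f"Unsupported scheme for {schema}: {scheme}")
    # Both options are named in this document; the value format is assumed.
    return ["--create-backend-db", "--offload-fs-scheme", scheme]


for name in SCHEMA_SCHEMES:
    print(name, " ".join(create_db_options(name, SCHEMA_SCHEMES)))
```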
Environment Verification (Connect) will verify the cloud storage configuration.
User Defined Functions
When OFFLOAD_FS_SCHEME is set to a cloud storage target, the Gluent UDF library will be copied to the cloud storage location specified by OFFLOAD_FS_CONTAINER and OFFLOAD_FS_PREFIX and the UDFs will be created referencing that location.
When OFFLOAD_FS_SCHEME is set to hdfs or inherit, the Gluent UDF library will be copied to the HDFS location specified by HDFS_HOME and the UDFs will be created referencing that location.
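The rule for where the UDF library is placed can be summarised as a small decision function. This is a sketch under assumptions: the function name and example values are hypothetical, and the scheme://container/prefix form used for the cloud location is a common Hadoop convention rather than something this document specifies. Any subdirectory that Gluent adds beneath these locations is not modelled.

```python
def udf_library_location(scheme: str, container: str, prefix: str,
                         hdfs_home: str) -> str:
    """Return the base location the UDF library would be copied to."""
    if scheme in ("hdfs", "inherit"):
        # HDFS and inherit both use the HDFS_HOME location.
        return hdfs_home
    # Cloud storage target: location derives from container and prefix.
    base = f"{scheme}://{container.strip('/')}"
    prefix = prefix.strip("/")
    return f"{base}/{prefix}" if prefix else base


# Hypothetical values for OFFLOAD_FS_CONTAINER, OFFLOAD_FS_PREFIX and HDFS_HOME.
print(udf_library_location("inherit", "my-offload-bucket", "gluent", "/user/gluent"))
print(udf_library_location("s3a", "my-offload-bucket", "gluent", "/user/gluent"))
```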
Important
Gluent UDFs must be installed on cloud storage when Gluent Query Engine uses Data Warehouse on Cloudera Data Platform Public Cloud.