Offload

Documentation Conventions

  • Commands and keywords are in this font.

  • $OFFLOAD_HOME is set when the environment file (offload.env) is sourced, unless already set, and refers to the directory named offload that is created when the software is unpacked. This is also referred to as <OFFLOAD_HOME> in sections of this guide where the environment file has not been created/sourced.

  • Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.

Introduction

Offload is a core component of Gluent Offload Engine that provides a simple and flexible way to copy data from an RDBMS (such as Oracle Database) to a variety of backend data platforms (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery, Snowflake, Azure Synapse Analytics). This guide describes how to use Offload in a number of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Offload guide, particularly the Gluent Offload Engine and Offload concepts.

Overview of the Offload Process

To understand the Offload software and its use-cases, options and operation, it is useful to start with a high-level understanding of an offload process in general. Offloading data from an RDBMS to a backend requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems. When running an offload, the various steps executed by Offload are visible both on the terminal and in a log file, but a broad description of what these steps are doing is provided below.

An offload process has four broad logical phases:

  1. Phase 1: Setup

  2. Phase 2: Offload Transport

  3. Phase 3: Hybrid Objects

  4. Phase 4: Verification

Phase 1: Setup

In the setup phase, Offload will gather a range of information about the data to be offloaded, such as the structure of the data, its data types, metadata from previous offloads and so on. With this information, Offload is able to prepare the target backend system for the Offload Transport phase. This includes the creation of target databases and supporting objects (if required), the creation of a staging location and objects (if required) for the offloaded data and, for first-time offloads, a final table in the target backend database.

Phase 2: Offload Transport

In the transport phase, Offload copies the source RDBMS data to the backend staging location. Once the staged data is validated, it is loaded into the backend target table. This two-phased approach provides two main benefits:

  1. Atomicity: the load from staged data to target table either fully succeeds or completely fails

  2. Partitioning: for backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

Phase 3: Hybrid Objects

In the hybrid objects phase, Gluent Offload Engine creates a number of database objects in the RDBMS hybrid schema and saves metadata to Gluent Metadata Repository to support both the operation of Gluent Data Platform and provide an interface for users to transparently query offloaded data. Statistics, synonyms and privileges are also created as required. See Offload Objects below for an example of the hybrid objects that are created by a standard Offload.

Where applicable, Offload will also create hybrid copies of any views that are dependent on the offloaded RDBMS table. If any such dependent hybrid views already exist, Offload will also re-generate these with any changes that are required as a result of the latest offload operation.

Phase 4: Verification

The verification phase is the final operation in an offload process. In this phase, Offload queries both the hybrid view and the source RDBMS table to verify that the offloaded data set matches in both databases. This has a secondary benefit of validating the hybrid objects. Following successful verification of the offloaded data, the offload is complete. Any notices or warnings identified during the various phases are logged and Offload exits with success (i.e. an exit code of 0). See the --verify option for available validation methods. By default, the verification SQL runs serially, but can be executed in parallel by using the --verify-parallelism option.
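
As an illustrative sketch (table name assumed), the following command offloads a table and runs the default verification queries with a parallel degree of 8. See the --verify option reference for the available validation methods.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --verify-parallelism=8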

Controlling the Offload Process

The offload process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per offload. In addition to the remainder of this Offload guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.

Offload Objects

The following table lists the standard set of RDBMS and backend database objects that are created by an offload of SCHEMA.TABLE.

Table 1: Objects Created by Offload

Database | Object Type         | Object Naming Format       | Hybrid Object Type
Backend  | Table               | SCHEMA_LOAD.TABLE          | Temporary staging table
Backend  | Table               | SCHEMA.TABLE               | Offloaded table
RDBMS    | View                | SCHEMA_H.TABLE             | Hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_EXT         | Hybrid external table
RDBMS    | View                | SCHEMA_H.TABLE_AGG         | Aggregate hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_AGG_EXT     | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | SCHEMA_H.TABLE_AGG         | Aggregate rewrite rule
RDBMS    | View                | SCHEMA_H.TABLE_CNT_AGG     | Aggregate hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_CNT_AGG_EXT | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | SCHEMA_H.TABLE_CNT_AGG     | Aggregate rewrite rule

Google BigQuery identifiers (dataset, table and view names) are case-sensitive. Snowflake identifiers are by default case-sensitive when quoted. Azure Synapse Analytics identifiers are case-sensitive when quoted or when the Synapse dedicated SQL pool is created as a case-sensitive database. Gluent Offload Engine allows a default case to be set for all Google BigQuery, Snowflake and Azure Synapse Analytics identifiers created by Offload to ensure consistency. See BACKEND_IDENTIFIER_CASE for more details.

Offload adds a variety of suffixes to the offloaded table name when naming the hybrid objects. If the length of a hybrid object name exceeds the RDBMS limit, Offload will truncate the hybrid object name to below the limit and pad the remainder with a unique string to avoid potential clashes with other truncated identifiers.

See also

Glossary for definitions of the hybrid object types in addition to the Concepts and Performance guides for further information on their purpose and usage.

Offloading Data

Offload supports several scenarios for offloading data from the RDBMS:

Full Offload

Heap, partitioned and index-organized tables can be fully-offloaded from the RDBMS as follows.

Example 1: Offload a Full Table

$ . $OFFLOAD_HOME/conf/offload.env

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x

Note

This example includes the sourcing of the Gluent Data Platform environment file for an offload session. All other examples in this guide will assume this step has occurred.

A log file containing all of the steps in the offload process will be generated in $OFFLOAD_HOME/log. Using the -v or --vv option will result in more screen output during the offload.

Backend tables can be optionally partitioned, even if the source RDBMS table is not partitioned. See Managing Backend Partitioning below for more details.

It is possible to rename the target backend table when offloading with the --target-name option. This can be useful for a number of reasons, including when:

  • The source RDBMS table name includes a character that is not supported by the backend (for example, Oracle Database allows $ to be used for identifiers but this is not supported by Impala or BigQuery)

  • The naming standards used for applications in the backend system are different to the source RDBMS application (either for database names or object names)

  • The RDBMS schema name is too long to accommodate necessary Offload extensions such as the _load suffix for the backend staging database or the value of DB_NAME_PREFIX (for example, Impala has a database name limit of 128 characters, which means that the maximum supported length of the RDBMS schema is 123 when DB_NAME_PREFIX is not in use or 122 minus the length of DB_NAME_PREFIX when it is in use)

Example 2: Change the Offload Target Name

The following example offloads an RDBMS table named SH.SALES$ to a backend that doesn’t support the $ character in table names, meaning that the target table needs to be renamed to remove the $.

$ $OFFLOAD_HOME/bin/offload -t 'SH.SALES$' -x --target-name=SH.SALES

See also

This guide describes a number of Offload options, but a full list is provided in the Offload Parameters section of the Reference manual.

Partition-Based Offload

Partition-Based Offload enables some or all partitions of a partitioned table to be offloaded across one or more offload sessions. A common use of this feature is to initially offload the historical partitions for a table and then periodically offload new partitions as they arrive in the RDBMS, appending the data to the backend version of the table.

Partition-Based Offload can be used for the following scenarios:

Offloading Range-Partitioned Tables

Tables that are range-partitioned on DATE, TIMESTAMP, NUMBER or [N]VARCHAR2 columns can be fully or partially offloaded. Partitions can be offloaded contiguously up to a boundary by specifying a high water mark in the offload command. Partition boundaries can be increased in subsequent offloads to append additional partitions without affecting historical partitions.

Partition Boundary Options

To offload a contiguous range of partitions from a range-partitioned table, one of the following boundary options must be used:

  • --older-than-date: Offload partitions with a high water mark older than this date (exclusive, i.e. the boundary partition is not offloaded), e.g.

--older-than-date=2015-10-01

  • --older-than-days: Offload partitions older than this number of days (exclusive, i.e. the boundary partition is not offloaded). This option is suitable for keeping data up to a certain age in the source table and is an alternative to the --older-than-date option. If both are supplied, --older-than-date will be used, e.g.

--older-than-days=90

  • --less-than-value: Offload partitions with a high water mark less than this value. Will accept integer, string or date values and allows intraday date and timestamp values to be specified, e.g.

--less-than-value=100000000
--less-than-value=2015-01-01
--less-than-value="2015-01-01 12:00:00"
--less-than-value=M

  • --partition-names: Offload partitions with a high water mark matching that of the named partition, e.g.

--partition-names=P201509

Boundary options for range-partitioned tables are exclusive and the partition that contains data for the specified high water mark will not be offloaded. Furthermore, Offload will not offload partial partitions. For example, suppose a partition has data with a value range of 2000-01-01 to 2000-01-31. Using --older-than-date=2000-01-15 will not offload data from 2000-01-01 to 2000-01-14 because the partition has the potential for data up to and including 2000-01-31. Instead, data will be offloaded up to the boundary of the previous range partition. To offload the example partition, an --older-than-date=2000-02-01 value must be used, but only when loading for the entire partition has completed.

Note

Backend partitioning concepts are not applicable when offloading to Snowflake or Azure Synapse Analytics.

Offloading with Date Partition Boundaries

The following is an example of offloading all partitions below a given date boundary.

Example 3: Offload a Range of Date Partitions
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01

By default, Example 3 above will use the RDBMS partition key column as the backend partition column for backends that support partitioning (they can be different if required; see User-Defined Partitioning for details). It is common to use a different granularity for the backend partitions, which is especially useful when the backend system better supports larger or fewer partitions. For example, an RDBMS table partitioned by day might be better offloaded to monthly partitions in a Hadoop-based backend. See --partition-granularity for details of the default granularities for various scenarios and backends.
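
For example, the following illustrative command offloads the same daily-partitioned table but stores the data in monthly backend partitions. The M granularity value is an assumption here; see the --partition-granularity reference for the supported values.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01 --partition-granularity=M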

Offloading with Numeric Partition Boundaries

For range-partitioned tables with numeric partition keys, the partition granularity of the backend table must be specified. The command syntax differs slightly according to the target backend, as the following examples demonstrate.

Example 4: Offload a Range of Numeric Partitions (Hadoop)

This example offloads all partitions with a partition key value of less than 5,000,000,000 and creates backend partitions with a range of 10,000,000. This applies when offloading to Hadoop-based platforms.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000

Note

It is also possible to specify a limit to the size of numeric partition keys (see --partition-digits).

Example 5: Offload a Range of Numeric Partitions (BigQuery)

For Google BigQuery, the full range of potential partition key values must be specified when creating a table with numeric partitions, hence the additional Offload options in the following example.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000 \
      --partition-lower-value=1 \
      --partition-upper-value=99999999999999999

Note

Any partition key data that falls outside the range specified by the lower and upper values will be loaded into the __UNPARTITIONED__ partition by BigQuery.

Offloading with String Partition Boundaries

For range-partitioned tables with string partition keys, the partition granularity of the backend table must be specified when offloading to Hadoop-based backends. Google BigQuery does not support partitioning on STRING columns; therefore the source partition column data must be converted to INT64 with a custom partition function (see --partition-functions) to enable the backend table to be synthetically partitioned. The following examples demonstrate the different options required when offloading to Hadoop or BigQuery backends.

Example 6: Offload a Range of String Partitions (Hadoop)

This example offloads all partitions with a partition key value of less than ‘M’ and creates backend partitions based on the first two characters of the source string data. This applies when offloading to Hadoop-based platforms.

$ $OFFLOAD_HOME/bin/offload -t SH.VARCHAR2_PARTITIONED_FACT -x \
      --less-than-value=M \
      --partition-granularity=2
Example 7: Offload a Range of String Partitions (BigQuery)

Google BigQuery does not have native STRING partitioning support, but the source string data can still be used to partition the backend table synthetically if a custom UDF is provided to convert the string data to an INT64 type. The following example shows a VARCHAR2 partition key used as the source for Google BigQuery partitioning. A custom UDF is provided to generate an ASCII value for the first character of the source data and the resulting INT64 value is used to partition the backend table. The full range of potential partition key values must be specified when offloading a table with a VARCHAR2 partition column.

$ $OFFLOAD_HOME/bin/offload -t SH.VARCHAR2_PARTITIONED_FACT -x \
      --less-than-value=F \
      --partition-granularity=1 \
      --partition-lower-value=65 \
      --partition-upper-value=150 \
      --partition-functions=MY_UDF_DATASET.STRING_TO_ASCII
Fully Offloading Range-Partitioned Tables with Offload Type

Range-partitioned tables are also supported for full offload, either by using the --offload-type option with a value of FULL or by omitting all of the partition boundary options described above. The following option combinations will fully offload a range-partitioned table (see the example after this list):

  • --offload-type=FULL: Offloads all partitions and creates a simple hybrid view that queries all the data from the backend

  • --offload-type=FULL plus one of the boundary options: Offloads all partitions and data to the backend but creates a composite hybrid view that queries the data below the specified boundary from the backend and data above the specified boundary from the RDBMS

  • None of the offload type or partition boundary options described above: This is the same as specifying --offload-type=FULL
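
As a minimal sketch of the second combination (table name assumed), the following command offloads all data but retains RDBMS access for data above the boundary via a 100-10 hybrid view:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-type=FULL \
      --older-than-date=2015-10-01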

See also

See Offload Type Considerations for more details on support for partial and full offloads of partitioned tables.

Moving the Range High Water Mark in a Hybrid View

The partition boundary in a hybrid view can be modified by running a subsequent offload with a boundary option matching a partition value that has already been offloaded. No data will be copied from the RDBMS to the backend but the associated hybrid view(s) will be reconfigured to reflect the new settings. If no partition is matched by the boundary option values then the partition boundary will be aligned with the data that has already been offloaded.

Moving the partition boundary does not affect any data that has already been offloaded. It is a metadata operation that simply changes where hybrid queries source their results from above and below the boundary. As described in Offload Type Considerations, this technique can be useful to get the best trade-off between making current data available in the backend system as soon as possible but having the most frequently accessed data queried from the RDBMS until it becomes less-used.
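
As a sketch, assuming SH.SALES has already been offloaded up to a boundary of 2015-10-01, the following command moves the hybrid view boundary back to 2015-07-01 without copying any data:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01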

MAXVALUE Partition Considerations

Offloads for range-partitioned tables with a MAXVALUE partition as the latest partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type option nor any of the partition boundary options are specified, the MAXVALUE partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition boundary option, the MAXVALUE partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition boundary option, the MAXVALUE partition will not be offloaded

  • If a partition boundary option is specified but not --offload-type, the MAXVALUE partition will not be offloaded

Note

Offloading the MAXVALUE partition for a table will prevent any further partition-based offloads for that table.

Multi-Column Partitioning

Oracle Database supports multi-column partitioning, whereby a table is partitioned on a set of columns rather than a single partition key (for example, PARTITION BY RANGE (yr, mon)). Offload supports this for range-partitioned tables. Any boundary condition must be provided via the --less-than-value option as comma-separated-values. For example:

--less-than-value=2015,10

Or:

--less-than-value=2015-10-31,200

By default, when offloading a multi-column partitioned table to Hadoop-based backends, the generated Hadoop table will be partitioned and subpartitioned to match the columns used in the RDBMS partitioning scheme. For example, an Oracle Database table partitioned by YR, MON will be partitioned in Hadoop by yr and then subpartitioned by mon. This can also be controlled using the --partition-columns option.

When offloading a multi-column partitioned table to Google BigQuery, the generated BigQuery table will be partitioned according to the leading partition key column only. The --partition-columns option will only allow one column to be specified when offloading to Google BigQuery.
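
For example, a hypothetical sketch for a table partitioned by (YR, MON), offloading up to the 2015,10 boundary and restricting the backend partitioning to the leading YR column (table and column names are assumptions):

$ $OFFLOAD_HOME/bin/offload -t SH.MULTI_COL_FACT -x \
      --less-than-value=2015,10 \
      --partition-columns=YR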

Offloading Interval-Partitioned Tables

Offload supports interval-partitioned tables in exactly the same way as range-partitioned tables.

Offloading List-Partitioned Tables

Tables that are list-partitioned on DATE, TIMESTAMP, NUMBER, [N]VARCHAR2 or [N]CHAR columns can be partially or fully offloaded. Discrete sets of one or more partitions can be offloaded by specifying either partition key values or partition names in the offload command. Additional partitions can be offloaded in subsequent offloads without affecting historical partitions.

List Partition Specification

Partitions in a list-partitioned table can be offloaded using either of the options below:

  • --equal-to-values: Offload partitions by partition key value, accepting partition key literals that match the RDBMS partitions, e.g.

--equal-to-values=2015-01-01
--equal-to-values=US --equal-to-values=AP --equal-to-values=EU
--equal-to-values=201501,201502,201503

  • --partition-names: Offload partitions by partition name, e.g.

--partition-names=TRANSACTIONS_P2015Q1,TRANSACTIONS_P2015Q2

When using --equal-to-values, each partition must have its own option specification, as shown above and in Example 8 below.

Example 8: Offloading Partitions from a List-Partitioned Table

The following example offloads two partitions from a date list-partitioned table:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --equal-to-values=2015-01-01 --equal-to-values=2015-01-02
Example 9: Offloading Multi-Valued Partitions from a List-Partitioned Table

When using --equal-to-values, each option must equal the full high values specification for a single partition. The following example offloads two multi-valued list partitions where the partition key is a numeric representation of year-month:

$ $OFFLOAD_HOME/bin/offload -t SH.TRANSACTIONS -x --equal-to-values=201501,201502,201503 --equal-to-values=201504,201505,201506

Offloads for tables with numeric list-partition keys do not require the --partition-granularity option, but for offloading to Google BigQuery the --partition-lower-value and --partition-upper-value options must still be used. See Managing Backend Partitioning for further details.

Example 10: Offloading List-Partitions by Partition Name

The --partition-names option is an alternative way to specify multiple partitions to offload and can be useful if partition names are well-formed and known. The following example offloads four partitions from a table that is list-partitioned on financial quarter:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-names=SALES_P2015Q1,SALES_P2015Q2,SALES_P2015Q3,SALES_P2015Q4
Resetting the List High Water Mark in a Hybrid View

List partition offloading is additive. A partition that has previously been offloaded will not be re-offloaded, even if it is specified in multiple offload commands. If there is a need to modify the hybrid view so that offloaded partitions are accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload. Any previously offloaded partitions matched by partition identification options will not be re-offloaded but will be reflected in the hybrid view predicates.
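
As a hedged sketch (partition key values assumed, following on from Example 8), the following command offloads a further partition and rebuilds the hybrid view so that only the partitions identified in this command are accessed from the backend; other previously offloaded partitions will be accessed from the RDBMS:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --equal-to-values=2015-01-02 \
      --equal-to-values=2015-01-03 \
      --reset-hybrid-view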

DEFAULT Partition Considerations

Offloads for list-partitioned tables with a DEFAULT partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type nor any of the partition identification options (--equal-to-values, --partition-names) are specified, the DEFAULT partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition identification option, the DEFAULT partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition identification option, the DEFAULT partition will not be offloaded

  • If a partition identification option is specified but not --offload-type, the DEFAULT partition will not be offloaded

Note

Offloading the DEFAULT partition for a table will prevent any further partition-based offloads for that table.

Offloading List-Partitioned Tables as Range

In cases where a list-partitioned table has been structured to mimic range partitioning, the table can be offloaded exactly as described in Offloading Range-Partitioned Tables. To use this feature the table should adhere to the following:

  • Each partition must have a single literal as its high value

  • All new partitions must be added with high values that are greater than those that have already been offloaded

  • Supported data types must match those referenced in Offloading Range-Partitioned Tables

In this scenario, a DEFAULT list partition will be treated in the same way that a MAXVALUE partition is treated for range partition offload (see MAXVALUE Partition Considerations for details).

Backend partition granularities differ between range-partitioned tables and list-partitioned tables offloaded as range. See Managing Backend Partitioning for details.

Example 11: Offload a Set of List Partitions Using a Date Range Partition Boundary

The following example offloads all list partitions with a partition key value of less than 2015-02-01.

An example using --older-than-date:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-02-01

Boundary options are exclusive and will not offload a partition with a key that equals the supplied value.

If using the --partition-names option to specify the partition to use as an offload boundary, the --offload-predicate-type option must also be specified with a value of LIST_AS_RANGE, otherwise Offload will treat the table as a basic list-partitioned table rather than use range-partitioned table behavior.
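
For example, an illustrative command using a hypothetical partition name as the boundary for a list-partitioned table offloaded with range semantics:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --partition-names=SALES_P201501 \
      --offload-predicate-type=LIST_AS_RANGE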

Subpartition-Based Offload

Tables that are range-subpartitioned on DATE, TIMESTAMP, NUMBER or [N]VARCHAR2 columns can be fully or partially offloaded. With Subpartition-Based Offload, partial offloading is supported even if the top-level partitioning scheme of a table is unsupported, because the subpartition high values are used to determine the offload boundary. To offload a range-subpartitioned table by subpartition, the --offload-by-subpartition option must be used.

Subpartition-Based Offload is useful for tables that are organized with ranged data (such as a time series or business date) at subpartition level, rather than partition level. In such cases, the subpartition boundaries can be used to determine the data to be offloaded, even when the partition-level data is also ranged.

When offloading a subpartitioned table with Partition-Based Offload, all subpartitions within the identified range or list partition(s) are offloaded as standard.

Example 12: Offload a Set of Range Subpartitions Using a Date Range Boundary

The following example demonstrates offloading all subpartitions below a date threshold of 2015-07-01:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01 --offload-by-subpartition

Subpartition Boundary Options

The Offload options for specifying subpartition boundaries are the same as for range partition offloading. See Partition Boundary Options for details.

Subpartition-Based Offload has an additional restriction, such that the boundary option value must guarantee that no subpartitions have potential for more data to be added. For example, suppose every partition in a table contains two range subpartitions with high values of 90 and 100. In this case, the boundary option of --less-than-value=100 identifies a common subpartition boundary across the entire partition set and this guarantees that all subpartitions below this threshold are “complete” and ready for offloading. However, suppose just one of the partitions has range subpartitions with high values of 90 and 110. In this case, the --less-than-value=100 option would not be able to offload all subpartitions below this threshold because the boundary is not common (i.e. the subpartition with a high value of 110 cannot be guaranteed to be “complete”).

In cases where the specified boundary is not valid, Offload will terminate with a warning and recommend that a common boundary is identified and used. In many cases it is possible to project the boundary value forward until a value is found that guarantees the safe offloading and appending of “complete” subpartitions.

Boundary options are exclusive and will not offload subpartitions with a key that equals the supplied value.

Other Subpartition-Based Offload Behavior

Subpartition-Based Offload behavior is the same as Partition-Based Offload in almost all cases. Considerations such as MAXVALUE subpartitions, full offloading with --offload-type, moving the high water mark in the hybrid view, using multi-column subpartition keys and so on are the same for range subpartitions as they are for range partitions. See Offloading Range-Partitioned Tables for details. Note that any examples would need the --offload-by-subpartition option to apply the same principle to range-subpartitioned tables.

Predicate-Based Offload

Offload supports the offloading of subsets of data (or an entire table) by predicate. This feature applies to any table type (partitioned or non-partitioned) and is known as Predicate-Based Offload.

With this feature, three offloading patterns are possible:

  1. Simple Predicate-Based Offload: Offload one or more non-overlapping subsets of data from a table

  2. Late-Arriving Predicate-Based Offload: Offload one or more late-arriving subsets of data from a table or partition that has already been offloaded

  3. Intra-Day Predicate-Based Offload: Offload a new partition in non-overlapping subsets as soon as each subset of data is loaded

Predicates are provided using a simple grammar as described in Predicate Grammar.

Simple Predicate-Based Offload

Example 13: Offload a Subset of Data from a Table Using a Predicate

The following example demonstrates a simple Predicate-Based Offload scenario by offloading all data for the ‘Electronics’ category.

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x --offload-predicate='column(PROD_CATEGORY) = string("Electronics")'

All data matching this predicate will be offloaded and the predicate will be added to the offload boundary in the hybrid view. Predicates are additive and any predicate that has been offloaded will not be re-offloaded, even if specified in future offload commands. If there is a need to modify the hybrid view so that offloaded data matching one or more predicates is accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload.

Backend tables can be optionally partitioned when using simple Predicate-Based Offload. See Managing Backend Partitioning for more details.

Late-Arriving Predicate-Based Offload

The late-arriving scenario caters for tables or (sub)partitions that sometimes have small subsets of data arriving after the original offload has taken place. This scenario benefits from the efficiencies of Full Offload or (Sub)Partition-Based Offload but with the flexibility of Predicate-Based Offload for “topping-up” the offloaded data. This pattern can be used with fully offloaded tables, range-(sub)partitioned tables or list-partitioned tables offloaded with range partition semantics.

  • When using Late-Arriving Predicate-Based Offload with tables previously offloaded with (Sub)Partition-Based Offload:

    • The Predicate-Based Offload command must include the --offload-predicate-type option with a value of RANGE or LIST_AS_RANGE

    • The late-arriving subset of data to be offloaded must not overlap with any data that has already been offloaded

    • The late-arriving subset of data to be offloaded must not reference data in (sub)partitions that have not yet been offloaded

  • When using Late-Arriving Predicate-Based Offload with tables previously offloaded with Full Offload:

    • The --offload-predicate-type option is not required

    • The late-arriving subset of data to be offloaded must not overlap with any data that has already been offloaded

The following examples demonstrate Late-Arriving Predicate-Based Offload for a table previously offloaded with Full Offload and a set of partitions previously offloaded with Partition-Based Offload.

Example 14: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Table

In the following example, the SH.CUSTOMERS table has been fully-offloaded using Full Offload. The Predicate-Based Offload command below is used to additionally offload a small set of data for new customers that have been inserted since the original offload took place.

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --offload-predicate='column(CUST_CREATED_DATE) > datetime(2019-01-15 13:23:34)'

All data matching this predicate will be offloaded but the predicate itself will not be added to the hybrid view or its metadata.

Example 15: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Range Partition

In the following example, the SH.SALES table has previously been offloaded using Partition-Based Offload up to and including data for 2019-01-15. The Predicate-Based Offload command below is used to additionally offload a small set of data for product 123 that has arrived late and been loaded into a partition that has already been offloaded.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2019-01-15) and column(PROD_ID) = numeric(123)' \
      --offload-predicate-type=RANGE

All data matching this predicate will be offloaded but the predicate itself will not be added to the offload boundary or metadata for the hybrid view (only the existing Partition-Based Offload boundary will be used).

Intra-Day Predicate-Based Offload

This offload pattern can be used to switch between Partition-Based Offload and Predicate-Based Offload as required. It can be useful as a means to offload the majority of a partitioned table by partition boundaries, but swap to offloading subsets of data for a new partition as soon as they arrive (rather than wait for the entire partition to be loaded before offloading). This can ensure that data is available in the target backend earlier than would otherwise be possible. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.

Example 16: Offload a Subset of Data for a New Range Partition (Intra-Day Offloading)

In the following example, the historical data for the SH.SALES table is offloaded by Partition-Based Offload (range partitioned by TIME_ID) at time T0 to begin the offload lifecycle for this table. For new data, rather than wait a full day for an entire partition of data to be ready, data is instead offloaded as soon as each product set is loaded into the new partition (3 separate loads at times T1, T2, T3). When all loads and offloads have completed, the table metadata is reset (at time T4) and the table is ready to repeat the same Predicate-Based Offload pattern for the next processing cycle.

T0: Offload all history by range partition

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-01

T1: Offload first set of products loaded at time T1

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(123), numeric(234))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T2: Offload second set of products loaded at time T2

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(345), numeric(456))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T3: Offload third set of products loaded at time T3

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(567), numeric(678))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T4: All products now loaded and offloaded, reset the metadata for the table

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-02

The Predicate-Based Offload commands for T1-T3 require that the --offload-predicate option includes a predicate for the Partition-Based Offload partition key and that it matches the value of the new data (TIME_ID=2020-07-01 in this example). The --offload-predicate-type option must also be used (see option reference for valid values).

When the metadata is reset after the end of a sequence of Predicate-Based Offload commands (as per T4 above), either the same pattern of offloading can continue for the next partition or, if preferred, Partition-Based Offload can resume (i.e. wait for the next partition to be fully-loaded before offloading in one command as described in Partition-Based Offload).

Predicate Grammar

At present, the offload predicate DSL or grammar used for the --offload-predicate option covers a simple range of possibilities, based on a <column> <operator> <value> model.

Predicate Form

Predicates can take several forms:

  • <column> <operator> <value>

  • <column> [NOT] IN (<value>[, <value>, ...])

  • <column> IS [NOT] NULL

The DSL allows specification of any number of predicates in a range of forms, which can be logically combined and nested with AND/OR, for example:

  • <predicate> AND|OR <predicate>

  • (<predicate> AND|OR <predicate>) AND|OR (<predicate>)

Parentheses are required to form logical predicate groups, as shown in the second example above. Whitespace is insignificant.
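
As an illustration of the grammar (column names are hypothetical), the following predicate combines two logical groups with parentheses, as it might be supplied to the --offload-predicate option:

--offload-predicate='(column(PROD_CATEGORY) = string("Electronics") or column(PROD_CATEGORY) = string("Software")) and column(TIME_ID) >= datetime(2020-01-01)'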

Column Form

Predicates always contain a single column written with the column keyword and a “function-call” syntax. Columns are not case-sensitive and can include aliases. The following examples are all valid:

  • column(CUST_ID)

  • column(SALES.CUST_ID)

  • column(cust_id)

Columns must exist in the table being offloaded and expressions/functions are not supported.

Operator Form

The basic SQL operators are supported and are written as in SQL. Member operators and null-value expressions are also supported. The full range is as follows:

  • =, !=, <, <=, >, >=

  • [NOT] IN

  • IS [NOT] NULL

Value Form

Values must be qualified with a data type classification and have a “function-call” syntax as follows:

  • string("This is a string")

  • numeric(123)

  • datetime(2020-07-01)

String values use the string keyword and double-quotes as shown above. Double-quote characters can be included in the string value by escaping with a backslash.

Numeric values use the numeric keyword and both signed integral and signed decimal values are supported with the same keyword.

Date/time values use the datetime keyword and can be specified at several levels of precision from dates to timestamps with nanosecond precision. The following are all valid examples of a datetime value:

  • datetime(2020-07-01)

  • datetime(2020-07-01 13:01:01)

  • datetime(2020-07-01 13:01:01.123)

  • datetime(2020-07-01 13:01:01.123456789)

Value lists for use in the [NOT] IN operator must be surrounded by parentheses and separated by commas. The values in a value list must be specified as normal, for example:

  • column(cust_id) IN (numeric(120), numeric(121), numeric(122))

The predicate DSL currently supports columns of the following Oracle Database data types:

Table 2: Predicate-Based Offload DSL Data Type Support

DSL Data Type Class | Supported Oracle Database Data Types
String              | VARCHAR2, CHAR, NVARCHAR2, NCHAR
Numeric             | NUMBER, FLOAT, BINARY_FLOAT, BINARY_DOUBLE
Datetime            | DATE, TIMESTAMP

Offloading Joins

Offload enables joins between fully and/or partially-offloaded tables to be offloaded to the backend. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided to the offload command with the --offload-join option, using a simple grammar as described in Offload Join Grammar.

Example 17: Offloading a Join

The following example offloads a three-table join to the backend system.

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"

In this example, the SH.SALES table is partially offloaded and the other tables are fully offloaded. The offload boundary for the SH.SALES table is used for the Offload Join to ensure that all offloaded data for this table is aligned to a common boundary.

A hybrid view named SALESCOUNTRY_JOIN will be created in the hybrid schema for this join. Hybrid views created by Offload Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Offload Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.

Offload Join Grammar

When offloading joins, each table in the join has its own --offload-join specification, as demonstrated in Example 17 above. The full grammar for this option is provided in Table 3 below.

Table 3: Offload Join Grammar

Each directive and its syntax is listed below with a description.

TABLE([<hybrid_schema>.]<table-name>)
  • The table name to be included in the join

  • This directive is mandatory

  • Can include the hybrid schema name if joining objects from different schemas

ALIAS(<table-alias>)
  • Alias to be used for the join

  • Defaults to <table-name> if not specified

PROJECT(<*|<column>[, ..]>)
  • List of columns selected from <table-name>. * is expanded to all columns

  • Columns in join clauses are added to the list

  • Partition key columns used are added to the list

INNER-JOIN(<join-alias>)
  • Alias to join with. Required for all but the first table

JOIN-CLAUSES(<alias>.<column> = <alias>.<column>[, ..]) or JOIN-CLAUSES(<column>)
  • Clause defining the relationship between <table-name> and <join-alias>

  • The short form JOIN-CLAUSES(<column>) is expanded to the full syntax, assuming <column> exists in both tables

Note

Expansion of * projections excludes the OFFLOAD_BUCKET_ID partition column that Offload automatically creates when offloading to Hadoop-based backends. For performance reasons (such as supporting parallel queries), it might be desirable to include OFFLOAD_BUCKET_ID in the final Hadoop table or view for an Offload Join. This can be achieved by explicitly including it in the PROJECT() clause. For example PROJECT(*,offload_bucket_id) will include all table columns plus the OFFLOAD_BUCKET_ID column. This is not required for offloads to Google BigQuery, Snowflake or Azure Synapse Analytics.

Materialized Offload Joins

By default, the results of an offloaded join are materialized in a backend table with the name provided by the -t option (SH.SALESCOUNTRY_JOIN in Example 17 above). Results do not have to be materialized: when the --no-materialize-join option is used, a view for the join is created in the backend instead of a populated table.
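
For example, repeating Example 17 with the --no-materialize-join option creates a backend view for the join rather than a populated table:

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --no-materialize-join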

Offload Join and Partially-Offloaded Tables

Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Offload Join. When an Offload Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the backend join view that is created in place of a materialized resultset.

Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:

  • Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Offload Join specification allows for different column names

  • Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match

  • Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly

Refreshing Offload Joins

Offload Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded. The guidelines are as follows:

  • If the Offload Join is materialized, the join needs to be refreshed

  • If the Offload Join is not materialized but contains one or more partially-offloaded tables, the join needs to be refreshed

  • If the Offload Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed

Refreshing Materialized Offload Joins

Materialized Offload Joins always need to be refreshed when the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in an Offload Join has been subject to further offloads:

  • If a fully-offloaded table is re-offloaded, the dependent materialized Offload Joins require a full refresh. To do this, the original Offload Join offload command can be repeated with the --reset-backend-table option. Note that this will fully reset and re-populate the materialized Offload Join table with the new results of the join

  • If a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Offload Joins can be incrementally refreshed. To do this, simply repeat the original Offload Join offload command and Offload will append new data to the materialized join table

Refreshing Non-Materialized Offload Joins

Non-materialized Offload Joins only need to be refreshed when the offload boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Offload Join offload command can be repeated but must include the --force or -f option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.
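
As a sketch, assuming the join in Example 17 was created as a non-materialized view, the following command re-creates the backend view with the new boundary predicates after SH.SALES has been further offloaded:

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --no-materialize-join \
      -f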

Updating Offloaded Data

Gluent Data Platform supports the updating of offloaded data in several ways with Incremental Update. See the Incremental Update guide for details.

Offload Type Considerations

As described in the Concepts guide, there are three formats for a hybrid view (see Hybrid View Types) that determine where hybrid queries access their data:

  • 100-0: All offloaded data is accessed from the backend table and none is accessed from the source RDBMS table

  • 90-10: Partially offloaded data is accessed from the backend table and non-offloaded data is accessed from the source RDBMS table

  • 100-10: All data is offloaded but only some of the offloaded data is accessed from the backend table, with the remainder accessed from the source RDBMS table (based on a given boundary)

In most cases, the hybrid view type is created as default for a given offload operation, but the --offload-type option can be used to provide greater flexibility over the initial and subsequent offloads of list-partitioned and range-partitioned tables. See Offload Type for a general description of offload type.

Default Offload Type Behavior

The following describes the default behavior for offload type when offloading a range-partitioned or list-partitioned table.

  • If no --offload-type is specified:

    • Use of any of the partition boundary options offloads the table as INCREMENTAL with a default 90-10 hybrid view

    • An offload without any of the partition boundary options offloads the table as FULL offload type with a default 100-0 hybrid view

  • If the --offload-type option is specified:

    • Specifying --offload-type=INCREMENTAL with or without any of the partition boundary options offloads the table to match the boundary partition and defines the table as an INCREMENTAL offload type with a supporting 90-10 hybrid view

    • Specifying --offload-type=FULL without any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a supporting 100-0 hybrid view

    • Specifying --offload-type=FULL with any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a specialized 100-10 hybrid view

  • If a table has a DEFAULT or MAXVALUE partition:

    • A FULL offload will offload the DEFAULT/MAXVALUE partition

    • An INCREMENTAL offload will not offload any DEFAULT/MAXVALUE partition

  • For subsequent offloads:

    • The offload type of a table is automatically picked up on all future offloads, unless overridden

    • The offload type of a table can be changed on future offloads by specifying a new --offload-type value (see Changing Offload Type below)

Note

Predicate-Based Offload does not support the 100-10 offload type pattern unless it is Late-Arriving Predicate-Based Offload.

Changing Offload Type

It is possible to change the offload type of a table on one or more subsequent offloads of the table (subject to some limitations). As described above, the offload type of a table is set on initial offload and is maintained for all future offloads (unless it is overridden by a new --offload-type value on a later offload).
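
For example, a minimal sketch that converts a previously INCREMENTAL table (name assumed) to FULL on a subsequent offload:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --offload-type=FULL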

The rules and behavior for changing offload type from INCREMENTAL to FULL are as follows:

  • Any partitions not yet offloaded are offloaded (including any DEFAULT/MAXVALUE if applicable), the table is re-defined as a FULL offload type and the hybrid view is re-created accordingly

The rules and behavior for changing offload type from FULL to INCREMENTAL are as follows:

  • If a partition boundary option is used, any corresponding partitions are offloaded, the table is re-defined as an INCREMENTAL offload type and the hybrid view is re-created accordingly

  • If the table has a DEFAULT or MAXVALUE partition, it cannot be converted from FULL to INCREMENTAL

  • If the table has Incremental Update enabled, it cannot be converted from FULL to INCREMENTAL

  • If the table was offloaded with Subpartition-Based Offload and was offloaded as FULL, it cannot be converted from FULL to INCREMENTAL

Offload Data Types

When offloading data, Offload will map data types from the source RDBMS to the backend database automatically. However, as with most operations in Offload, it is possible to override many of the default data type mappings if required. See Tables 4 and 6 below for details.

Supported Data Types

Gluent Data Platform currently supports all of the Oracle Database data types listed in Table 4 below. Oracle Database tables that contain columns of any data type not listed cannot currently be offloaded.

Default Data Type Mappings

Table 4 lists the default data type mappings when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform), Google BigQuery, Snowflake or Azure Synapse Analytics.

Table 4: Offload Default Data Type Mappings (Oracle Database)

Oracle Database          | Impala        | Google BigQuery       | Snowflake     | Azure Synapse Analytics | Comments
CHAR                     | STRING        | STRING                | VARCHAR       | char                    | See Offloading String Data to Azure Synapse Analytics
NCHAR                    | STRING        | STRING                | VARCHAR       | nchar                   |
CLOB                     | STRING        | STRING                | VARCHAR       | varchar(max)            | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics and Offloading String Data to Azure Synapse Analytics
NCLOB                    | STRING        | STRING                | VARCHAR       | nvarchar(max)           | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics
VARCHAR2                 | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading String Data to Azure Synapse Analytics
NVARCHAR2                | STRING        | STRING                | VARCHAR       | nvarchar                |
RAW                      | STRING        | BYTES                 | BINARY        | varbinary               |
BLOB                     | STRING        | BYTES                 | BINARY        | varbinary(max)          | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics
NUMBER(<=4,0)            | BIGINT        | INT64                 | NUMBER(p,0)   | smallint                | See Offloading Integral Data Types to Impala
NUMBER([5-9],0)          | BIGINT        | INT64                 | NUMBER(p,0)   | int                     | See Offloading Integral Data Types to Impala
NUMBER([10-18],0)        | BIGINT        | INT64                 | NUMBER(p,0)   | bigint                  | See Offloading Integral Data Types to Impala
NUMBER(>18,0)            | DECIMAL(38,0) | NUMERIC or BIGNUMERIC | NUMBER(38,0)  | numeric(38,0)           | See Offloading Numeric Data to Google BigQuery
NUMBER(*,*)              | DECIMAL(38,s) | NUMERIC or BIGNUMERIC | NUMBER(p,s)   | numeric(p,s)            | See Offloading Numeric Data to Google BigQuery
FLOAT                    | DECIMAL       | NUMERIC               | NUMBER(p,s)   | numeric(38,18)          |
BINARY_FLOAT             | FLOAT         | -                     | -             | real                    | See Floating Point Data Types in Google BigQuery and Snowflake and Floating Point Data Types in Azure Synapse Analytics
BINARY_DOUBLE            | DOUBLE        | FLOAT64               | FLOAT         | float                   | See Floating Point Data Types in Azure Synapse Analytics
DATE                     | TIMESTAMP     | DATETIME              | TIMESTAMP_NTZ | datetime2               |
TIMESTAMP                | TIMESTAMP     | DATETIME              | TIMESTAMP_NTZ | datetime2               | See Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics
TIMESTAMP WITH TIME ZONE | TIMESTAMP     | TIMESTAMP             | TIMESTAMP_TZ  | datetimeoffset          | See Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics and Offloading Time Zoned Data
INTERVAL DAY TO SECOND   | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading Interval Data Types
INTERVAL YEAR TO MONTH   | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading Interval Data Types

Data Sampling During Offload

Gluent Offload Engine fine-tunes the default data mappings by sampling data for all columns in the source RDBMS table that are either date or timestamp-based or defined as a number without a precision and scale.

By sampling numeric columns, Offload can choose an appropriate precision and scale for Impala DECIMAL data and this can lead to performance improvements due to reduced CPU consumption when querying. This behavior can be disabled by adding the --no-auto-detect-numbers option to the offload command.

The Impala TIMESTAMP data type supports a minimum date of 1400-01-01. Offload samples source date and timestamp-based columns to ensure that the data to be offloaded satisfies the Impala specification. In cases where Offload discovers data below the minimum specification, the column is offloaded to an Impala STRING column. This behavior can be disabled with the --no-auto-detect-dates option, but in this case the source data needs to be corrected before it can be offloaded.

Offload bases the volume of data to be sampled on the size of the RDBMS table. This can be overridden by adding the --data-sample-percent option to the offload command (specifying a percentage between 0 and 100, where 0 disables sampling altogether). Without sampling, all Impala decimal columns are defaulted to DECIMAL(38,18) and no date or timestamp checking is performed.

Offload determines the degree of parallelism to use when sampling data from the value of DATA_SAMPLE_PARALLELISM. This can be overridden by adding the --data-sample-parallelism option to the offload command (specifying a degree of 0 or a positive integer, where 0 disables parallelism).
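
For example, an illustrative command that samples 25% of the source table using a degree of parallelism of 4 (the values are arbitrary and the table name is assumed):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --data-sample-percent=25 \
      --data-sample-parallelism=4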

Offloading Not Null Columns to Google BigQuery, Snowflake and Azure Synapse Analytics

Google BigQuery, Snowflake and Azure Synapse Analytics allow columns to be defined as mandatory (i.e. NOT NULL). When offloading a table to one of these systems for the first time, by default Offload will copy mandatory column definitions from the RDBMS to the backend. This behavior is defined globally by the OFFLOAD_NOT_NULL_PROPAGATION configuration option, which defaults to AUTO (i.e. propagate all NOT NULL constraints from the RDBMS to the backend). To avoid propagating any NOT NULL constraints, this option can be set to NONE.

Offload only considers columns to be mandatory if they are defined by Oracle Database as not nullable (DBA_TAB_COLUMNS.NULLABLE = 'NO'). The following NOT NULL Oracle Database columns are not automatically propagated by Offload:

  • Columns with user-defined check constraints rather than a NOT NULL definition (e.g. CHECK (column_name IS NOT NULL))

  • Columns defined with NOT NULL ENABLE NOVALIDATE

To override the global OFFLOAD_NOT_NULL_PROPAGATION configuration value, or to include columns with mandatory constraints that are not automatically propagated, the --not-null-columns option can be added to an Offload command. This option accepts a list of one or more valid columns, one or more wildcards, or a combination of the two. Offload will define all specified columns as NOT NULL when creating the backend table, regardless of their status in the RDBMS.
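
For example, a hypothetical sketch (column names assumed) that forces two columns to be created as NOT NULL in the backend table regardless of their RDBMS definition:

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x --not-null-columns=CUST_ID,CUST_EMAIL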

Care should be taken when propagating constraints for columns that cannot be guaranteed to contain no NULLs. During validation of the staging data, Offload will specifically check for NULLs for any column that is defined as mandatory in the backend.

Offloading Integral Data Types to Impala

When offloading a table to Impala, integer columns (i.e. NUMBER(p,0)) with a precision in the range of an Impala BIGINT data type are automatically stored as BIGINT. No attempt is made by Offload to downscale the Impala integral type in case the data precision grows beyond its type specification in the future. Offload safely defaults to BIGINT because schema evolution of smaller integrals to BIGINT cannot be actioned by a simple ALTER TABLE statement in Impala (meaning a more complex data migration would be required to upscale the integral type if needed). However, it is possible to override this default behavior and users confident in a column’s future contents (for example, day of month or year) can override the mapping and safely utilize smaller integral types in Impala. See Override Data Type Mappings for details.

Source integer columns with a precision beyond the range of an Impala BIGINT are automatically stored as DECIMAL(38,0).

Offloading Numeric Data to Google BigQuery

Google BigQuery provides two decimal data types: NUMERIC and BIGNUMERIC. The NUMERIC data type has a specification of (38,9) with a fixed decimal point, meaning a maximum of 29 digits to the left of the decimal point and a maximum of 9 digits to the right. The BIGNUMERIC data type has a specification of (76,38) with a fixed decimal point, meaning a maximum of 38 digits to the left of the decimal point and a maximum of 38 digits to the right (more precisely, the specification of BIGNUMERIC is (76.76,38), allowing for some numbers with 39 digits to the left of the decimal point).

When offloading numeric data such as decimals or large integrals to Google BigQuery, Offload determines which BigQuery type is most appropriate, based on either the known precision and scale of Oracle columns of type NUMBER(p,s) or from sampled data for Oracle columns of unbounded type NUMBER. Data that offloads to NUMERIC by default can be offloaded to BIGNUMERIC with the --decimal-columns and --decimal-columns-type override options (see Table 6: Offload Override Data Type Mappings (Oracle Database) below).
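
As an illustration, a NUMBER column that would offload to NUMERIC by default can be forced to BIGNUMERIC by requesting a precision/scale beyond the NUMERIC specification, for example a scale greater than 9 (the column name and type specification below are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --decimal-columns-type=38,18 --decimal-columns=AMOUNT_SOLD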

For numeric data that exceeds the specifications of NUMERIC and BIGNUMERIC, Offload offers two options:

  1. Use decimal rounding options during offload (see Decimal Scale Rounding below)

  2. Offload numeric data to a floating point type (see Converting Numeric Data to Double below)

Important

Both of these options result in some data change and it will be for users to decide if this is a valid course of action.

Offloading String Data to Azure Synapse Analytics

At the time of writing, the Azure Synapse Analytics char and varchar data types do not support UTF-8 or Unicode. Depending on the source Oracle Database character set and the string data to be offloaded, it is possible that not all CHAR, VARCHAR2 or CLOB data can be offloaded to the equivalent Synapse data types without data loss. In such cases, CHAR, VARCHAR2 and CLOB columns will need to be offloaded to nchar or nvarchar as applicable, using the --unicode-string-columns override option.
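
For example, assuming a table with a VARCHAR2 column containing multi-byte string data, the column could be offloaded to nvarchar with a command of the following form (the table and column names are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --unicode-string-columns=CUST_LAST_NAME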

When offloading string data to Azure Synapse Analytics, it is also important to consider database collations, to ensure that the behavior of the Oracle Database collation (determined by NLS_COMP and NLS_SORT) matches the behavior of the default collation used by the Synapse dedicated SQL pool. The SYNAPSE_COLLATION configuration variable can be used to ensure that all string columns are offloaded and queried by Gluent Data Platform with a collation that matches that of the source RDBMS. By default, this is set to Latin1_General_100_BIN2, which matches the behavior of an Oracle Database NLS_COMP and NLS_SORT setting of BINARY.

Floating Point Data Types in Google BigQuery and Snowflake

Google BigQuery and Snowflake both provide a single 64-bit floating point data type (FLOAT64 and FLOAT, respectively). This means that the Oracle Database 32-bit BINARY_FLOAT data type does not have a corresponding type in BigQuery or Snowflake and cannot currently be offloaded. Support for an override mapping to the 64-bit floating point data type will be added in a future release of Gluent Data Platform (although this will be potentially lossy - see Lossy Data Operations below).

Floating Point Data Types in Azure Synapse Analytics

Azure Synapse Analytics’ real and float floating point data types do not support NaN or Infinity special values. When offloading BINARY_FLOAT or BINARY_DOUBLE, special values must be converted to NULL to allow the data to load into Synapse. For this reason, Offload will not allow BINARY_FLOAT or BINARY_DOUBLE columns to be offloaded unless an override option is provided (see Converting Floating Point Special Values below).

Important

Allowing floating point columns to be offloaded to Azure Synapse Analytics is potentially lossy (see Lossy Data Operations below for the implications of this).

Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics

At the time of writing, Google BigQuery’s DATETIME and TIMESTAMP data types both support microsecond precision (i.e. a precision of 6) and Azure Synapse Analytics datetime2 and datetimeoffset data types support a maximum precision of 7. Oracle Database supports up to nanosecond precision (i.e. a precision of 9) with its TIMESTAMP and TIMESTAMP WITH TIME ZONE data types. When Offload detects a source column with a higher level of precision than that supported by the backend (e.g. when attempting to offload TIMESTAMP(7) to BigQuery or TIMESTAMP(8) to Synapse), it will terminate. In such cases, the --allow-nanosecond-timestamp-columns option will allow the offload operation to continue, but should ideally be used only when the data to be offloaded is known to be at microsecond precision or lower.

Important

Allowing high-precision timestamp columns to be offloaded to Google BigQuery or Azure Synapse Analytics is potentially lossy (see Lossy Data Operations below for the implications of this).
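
For example, assuming a TIMESTAMP(9) column is known to contain no values below microsecond precision, the offload could be allowed to continue with a command of the following form (the table name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-nanosecond-timestamp-columns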

Offloading LOB Data Types to Snowflake and Azure Synapse Analytics

Snowflake restricts the size of string data to a maximum of 16MB and binary data to a maximum of 8MB. Oracle LOB data that exceeds the limits of the corresponding Snowflake data type cannot currently be offloaded.

Azure Synapse Analytics tables support LOBs up to 2GB in size with its varchar(max) and varbinary(max) specification. However, the Microsoft PolyBase engine that is invoked when loading external staging data into the target Synapse table has a 1,000,000 byte limit per row. Given that PolyBase treats each character as 2 bytes, this means that offloaded rows with LOBs are currently restricted to under 500KB in size.

Offloading Time Zoned Data

For Oracle Database columns of TIMESTAMP WITH TIME ZONE data type, data is normalized to Coordinated Universal Time (UTC) during offload to Hadoop-based backends and to Google BigQuery. When offloading to Snowflake or Azure Synapse Analytics, data is stored with the time zone offset of the source Oracle data. Offload converts named time zones, e.g. 2016-01-14 22:39:44 US/Pacific, to time zone offsets, e.g. 2016-01-14 10:39:44 -8:00 to ensure backend systems can process the values regardless of their time zone database edition. When returning offloaded data for hybrid queries, Smart Connector normalizes all data to UTC, regardless of the backend. For this reason, values may appear different to the eye although they are temporally the same as those in the source RDBMS data.

Offloading Interval Data Types

Oracle Database INTERVAL DAY TO SECOND and INTERVAL YEAR TO MONTH data types are offloaded as strings in a normalized format, as shown in Table 5 below.

Table 5: Interval Data Type String Formats

Interval Type | Sample Source Data | Impala String Format | Google BigQuery/Snowflake/Azure Synapse Analytics String Format
Day to second (+ve) | +000000001 02:15:30.123456000 | 1 02:15:30.123456 | +1 02:15:30.123456
Day to second (-ve) | -000000001 02:15:30.123456000 | -1 02:15:30.123456 | -1 02:15:30.123456
Year to month (+ve) | +000000001-01 | 1-1 | +1-1
Year to month (-ve) | -000000001-01 | -1-1 | -1-1

Smart Connector converts offloaded interval strings back to the correct interval type when executing hybrid queries from the RDBMS. Interval strings can also be presented back to Oracle Database interval types by using the appropriate Present option as required (see the Present guide for further details).

Override Data Type Mappings

When initially offloading a table, it is possible to override some of the default data type mappings to change the specification of the backend table. Table 6 lists the override mappings and associated options that are available when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform), Google BigQuery, Snowflake or Azure Synapse Analytics.

Table 6: Offload Override Data Type Mappings (Oracle Database)

Oracle Database | Offload Option | Impala | BigQuery | Snowflake | Synapse | Comments
CHAR | --unicode-string-columns | - | - | - | nchar | See Offloading String Data to Azure Synapse Analytics
VARCHAR2 | --unicode-string-columns | - | - | - | nvarchar | See Offloading String Data to Azure Synapse Analytics
CLOB | --unicode-string-columns | - | - | - | nvarchar(max) | See Offloading String Data to Azure Synapse Analytics
DATE | --date-columns | DATE | DATE | DATE | date | Impala versions below 3.3 will use TIMESTAMP
DATE | --timestamp-tz-columns | TIMESTAMP | TIMESTAMP | TIMESTAMP_TZ | datetimeoffset
DATE | --variable-string-columns | STRING | STRING | VARCHAR | varchar
TIMESTAMP | --date-columns | DATE | DATE | DATE | date | Impala versions below 3.3 will use TIMESTAMP
TIMESTAMP | --timestamp-tz-columns | TIMESTAMP | TIMESTAMP | TIMESTAMP_TZ | datetimeoffset
TIMESTAMP | --variable-string-columns | STRING | STRING | VARCHAR | varchar
NUMBER | --integer-1-columns | TINYINT | INT64 | NUMBER(3,0) | smallint | Use for NUMBER([1-2],0)
NUMBER | --integer-2-columns | SMALLINT | INT64 | NUMBER(5,0) | smallint | Use for NUMBER([3-4],0)
NUMBER | --integer-4-columns | INT | INT64 | NUMBER(10,0) | int | Use for NUMBER([5-9],0)
NUMBER | --integer-8-columns | BIGINT | INT64 | NUMBER(19,0) | bigint | Use for NUMBER([10-18],0)
NUMBER | --integer-38-columns | DECIMAL(38,0) | INT64 | NUMBER(38,0) | numeric(38,0) | Use for NUMBER(>18,0)
NUMBER | --decimal-columns | DECIMAL(p,s) | NUMERIC or BIGNUMERIC | NUMBER(p,s) | numeric(p,s) | Use for NUMBER(p,s)
NUMBER | --double-columns | DOUBLE | FLOAT64 | FLOAT | float | See Converting Numeric Data to Double
FLOAT | --double-columns | DOUBLE | FLOAT64 | FLOAT | float | See Converting Numeric Data to Double
BINARY_FLOAT | --double-columns | DOUBLE | - | - | float | See Converting Numeric Data to Double

Example 18: Overriding Data Types During an Offload

In the following example, the SH.SALES table is offloaded with several data type overrides.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --date-columns=TIME_ID \
      --integer-4-columns=PROD_ID,CHANNEL_ID \
      --decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD

Each override can be given one or more columns. The --decimal-columns override can be used to fine-tune the precision and scale of the Impala DECIMAL data type (which can lead to better space utilization and reduced CPU consumption) or the precision and scale of the Snowflake NUMBER data type. It can also be used to choose BigQuery BIGNUMERIC over NUMERIC (or vice-versa if required). This override option behaves differently from the others in that it must be paired with a --decimal-columns-type option to define the precision and scale for the columns listed. Multiple column/type pairings can be provided in the same command (see --decimal-columns for further details).

Data type override options support use of the wildcard character *. This is useful when a logical model has column naming rules indicating the data they contain. The example above could use --decimal-columns=AMOUNT* to match multiple columns containing monetary values.
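
For example, a hypothetical schema in which all monetary columns are prefixed with AMOUNT could pair a single type specification with a wildcard, along the following lines (the values are illustrative only; the wildcard is quoted to prevent shell expansion):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --decimal-columns-type=18,4 --decimal-columns='AMOUNT*'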

Oracle Database number precision is specified in terms of the number of digits. Impala number precision is specified by the data type. Care should be taken if selecting an override target data type. The DECIMAL data type can be used as a catch-all numeric data type in Impala but users are encouraged to consider native integral data types for reasons of performance and space utilization.

Managing Exceptional Data

No two systems align perfectly in their data type specifications and implementations. There are several cases where data in the source Oracle Database table might not be easily offloaded to the backend system. Some of these cases have already been described above and include:

  • Source RDBMS numbers with a precision greater than that supported by the backend (for example, the 38 digit limit of Impala DECIMAL, BigQuery BIGNUMERIC or Snowflake NUMBER)

  • Source RDBMS numbers with a scale greater than that supported by the backend (for example, Snowflake’s NUMBER scale limit of 37 digits)

  • Source RDBMS dates or timestamps with a value below Impala’s lower bound of 1400-01-01

  • Source RDBMS timestamps with a higher level of precision than BigQuery’s microsecond limit

  • Source RDBMS 32-bit floating point numbers offloading to a backend system without a 32-bit floating point data type (e.g. BigQuery)

  • Source RDBMS floating point numbers with special NaN or Infinity values that are not supported by the backend system (e.g. Synapse)

  • Source RDBMS large objects (LOBs) with data that exceeds the limit of the corresponding backend data type (e.g. Snowflake’s 16MB VARCHAR limit and 8MB BINARY limit)

When exceptional data is detected, Offload will in most cases terminate with an exception (and in some cases provide advice on how to proceed). The one exception to this rule is the offload of out-of-range dates and timestamps to Impala, which are automatically offloaded to a lossless string data type (unless this behavior is disabled by the user as described earlier).

Note

Offload will never attempt to automatically offload exceptional data when it would cause a potential data loss to do so. In all cases, the user must provide an explicit course of action by providing the appropriate options with the offload command.

Decimal Scale Rounding

The Oracle Database NUMBER data type is extremely flexible and caters for high-scale decimal data that exceeds the limits available in some target backend systems (such as Snowflake). To offload decimal data that exceeds the specification of the backend system, the --allow-decimal-scale-rounding option must be used in conjunction with the --decimal-columns-type and --decimal-columns options to define the target type for the data. By using the rounding option, the user acknowledges that some data loss due to decimal rounding is likely and acceptable.
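
For example, to acknowledge rounding and offload a high-scale decimal column to a target specification of (38,18), options along these lines could be combined (the column name and type specification are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-decimal-scale-rounding \
      --decimal-columns-type=38,18 --decimal-columns=AMOUNT_SOLD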

Converting Numeric Data to Double

Using the --double-columns override option to offload high-precision/scale numeric data or 32-bit floating point data to a 64-bit double can lead to data loss and should only be considered when there is no alternative and the consequences are both understood and acceptable.

Converting Floating Point Special Values

Not all backend systems support floating point special values such as NaN and Infinity. For such systems, Offload will not allow BINARY_FLOAT or BINARY_DOUBLE columns to be offloaded without the --allow-floating-point-conversions override option. Using this option will enable Offload to convert special floating point values to NULL. This option currently applies only to Azure Synapse Analytics.
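
For example, a table containing BINARY_FLOAT or BINARY_DOUBLE data could be offloaded to Azure Synapse Analytics, with any NaN or Infinity values converted to NULL, using a command of the following form (the table name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-floating-point-conversions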

Lossy Data Operations

To summarize, the potentially-lossy offload operations and their corresponding options are as follows:

  • Rounding decimals: Using the --allow-decimal-scale-rounding option to round decimals that exceed the backend specification for decimal data

  • Offloading sub-microsecond timestamps: Using the --allow-nanosecond-timestamp-columns option to offload columns with a sub-microsecond specification to a backend system that doesn’t support the same level of precision

  • Converting data to a 64-bit floating point number: Using the --double-columns option to offload decimal or 32-bit floating-point data that would not otherwise be possible

  • Offloading special floating point values (only required for Azure Synapse Analytics): Using the --allow-floating-point-conversions option will allow Offload to convert NaN and Infinity values to NULL

Important

Enabling lossy offloads can cause wrong results with some hybrid queries when data loss has occurred, such as when Smart Connector pushes down an equality or inequality predicate containing values from the lossy column(s). It is not advised to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins unless the offload is known to have been lossless in practice.

Offload Transport

Offload Transport describes the part of the offload process that:

  • Copies the source RDBMS data to a staging location

  • Validates the staged data

  • Loads the staged data into the backend target table

The two-phased approach of staging and loading data provides two main benefits:

  1. Atomicity: The load from staged data to target table either fully succeeds or completely fails

  2. Partitioning: For backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

The transport phase of an offload is split into four main operations:

Transport Data to Staging

Gluent Offload Engine uses one of a number of tools to extract data from the source RDBMS. The appropriate tool will be chosen automatically based on configuration preferences. Each tool reads the source data to offload and attempts to split the source data equally and in a non-overlapping way between concurrent reader processes.

Data is staged in either Avro or Parquet format (depending on the backend platform) and native data types are used where possible. For Avro staging files, STRING is used where there is no identical match between an Avro data type and the source RDBMS data type. For example, Oracle Database DATE, TIMESTAMP and NUMBER data types have no direct equivalent in Avro; therefore data is staged as STRING and converted to the appropriate backend data type during the final loading phase.

The data extraction tools available to Offload are:

Apache Sqoop

Apache Sqoop is only considered for use when offloading to Hadoop-based backends. Offload will invoke Sqoop on the host defined by OFFLOAD_TRANSPORT_CMD_HOST. Depending on the source RDBMS table, Sqoop will be used in either its standard table-centric mode or using a free-form query defined by Offload. The number of Sqoop mappers used when extracting data is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. The SQOOP_QUEUE_NAME parameter can be defined to submit the Sqoop job to a specific YARN queue.

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source table requires tuning of the following:

  • --offload-transport-fetch-size can be used on a per-offload basis to override OFFLOAD_TRANSPORT_FETCH_SIZE. This can be useful to manage memory requirements for tables with a large row size (such as when offloading LOBs), by reducing the fetch size (a combined command sketch is shown after this list)

  • SQOOP_OVERRIDES or the per-offload option --offload-transport-jvm-overrides can be used to inject JVM parameters into the Sqoop command line. This can be useful to override JVM memory defaults, for example, --offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"
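
A sketch of a command combining these per-offload Sqoop settings is shown below; the values are illustrative only and should be derived from the characteristics of the source table and cluster:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-parallelism=4 \
      --offload-transport-fetch-size=1000 \
      --offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"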

Apache Spark

Three interfaces to Spark are supported by Offload: Spark Submit, Spark Thrift Server and Apache Livy. The interface used by Offload is chosen automatically, based on configuration. If multiple interfaces are configured for use, the order of priority is Spark Thrift Server, Apache Livy and Spark Submit. Spark SQL is used in all cases (via a PySpark script for Apache Livy and Spark Submit, or as pure SQL for Spark Thrift Server) to extract the data to be offloaded from the source RDBMS table.

When offloading to Hadoop-based backends it is expected that the Spark cluster and interfaces are configured components of the target Hadoop cluster. When offloading to cloud warehouses such as Google BigQuery, Snowflake or Azure Synapse Analytics, it is typical to use Spark Standalone (Gluent Data Platform includes a Transport package containing Spark Standalone components for this purpose), although an existing Spark cluster can be utilized if available.

Spark Submit is available for use by Offload if OFFLOAD_TRANSPORT_CMD_HOST and OFFLOAD_TRANSPORT_SPARK_SUBMIT_EXECUTABLE are defined. When using a Spark Standalone cluster, jobs will be submitted to the cluster defined in OFFLOAD_TRANSPORT_SPARK_SUBMIT_MASTER_URL. In a Hadoop environment, YARN is used as the Spark master. OFFLOAD_TRANSPORT_SPARK_QUEUE_NAME or --offload-transport-queue-name can be used to submit the Spark job to a specific YARN queue.

Spark Thrift Server is available for use by Offload if OFFLOAD_TRANSPORT_SPARK_THRIFT_HOST and OFFLOAD_TRANSPORT_SPARK_THRIFT_PORT are defined. The Spark Thrift server can be configured to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible.

Apache Livy is available for use by Offload when offloading to Hadoop-based backends if OFFLOAD_TRANSPORT_LIVY_API_URL is defined. Livy can be configured (using OFFLOAD_TRANSPORT_LIVY_IDLE_SESSION_TIMEOUT) to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible. The maximum number of Livy sessions that Offload will start and use concurrently is defined by OFFLOAD_TRANSPORT_LIVY_MAX_SESSIONS.

The number of tasks in an offload transport Spark job is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. It should be noted that defining more tasks than there are available Spark executors will result in queuing. Therefore, OFFLOAD_TRANSPORT_PARALLELISM should ideally be no more than the number of available executors.
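
For example, to run an offload with eight Spark tasks and submit the job to a specific YARN queue, options of the following form could be used (the parallelism value and queue name are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-parallelism=8 \
      --offload-transport-queue-name=offload_queue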

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source RDBMS table requires tuning of the following:

Query Import

Query Import is used for low-volume offloads. Data is extracted and staged by Offload itself rather than by an external tool. This avoids delays incurred when invoking Sqoop or Spark Submit. Non-partitioned tables with a source RDBMS size smaller than --offload-transport-small-table-threshold are eligible for Query Import. Query Import does not run in parallel, i.e. OFFLOAD_TRANSPORT_PARALLELISM is ignored.

RDBMS Options

Connections to the source RDBMS are made to the address defined in OFFLOAD_TRANSPORT_DSN. This defaults to ORA_CONN but can be overridden if data needs to be extracted from an alternative address (such as from a replica database).

Concurrent offload transport processes open independent sessions in the source RDBMS. If a high value for OFFLOAD_TRANSPORT_PARALLELISM is required then consideration should be given to any session limits in the RDBMS. The OFFLOAD_TRANSPORT_CONSISTENT_READ parameter or the per offload option --offload-transport-consistent-read can be used to ensure that all concurrent extraction queries reference a specific point in time. When set to true, extraction queries will include an AS OF SCN clause. When the source RDBMS table or input partitions are known to be cold (i.e. not subject to any data modifications), this can be set to false to reduce resource consumption.
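
For example, assuming the partitions being offloaded are known to be cold, consistent read could be disabled for a single offload with a command of the following form (assuming the option accepts the same true/false values as its configuration counterpart):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-consistent-read=false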

If there is a requirement to modify offload transport RDBMS session settings, such as setting an Oracle Database initialization parameter (under the guidance of Gluent Support), then modification of OFFLOAD_TRANSPORT_RDBMS_SESSION_PARAMETERS will be required.

Validate Staged Data

Once data has been staged, it is validated to ensure that the number of staged rows matches the number of rows read from the source RDBMS. While doing this Offload might also:

  • Check for NULL values in any custom partition scheme defined using the --partition-columns option. A positive match results in a warning

  • Check for NULL values in any column defined as mandatory (i.e. NOT NULL)

  • Check for any NaN (Not a Number) special values if the source table has floating point numeric data types. This is because source RDBMS and target backend systems do not necessarily treat NaN values consistently. A positive match results in a warning

  • Check for lossy decimal rounding as described in Decimal Scale Rounding

  • Check for source data that is invalid for the chosen backend partition scheme (if applicable), such as numeric data outside of any --partition-lower-value/--partition-upper-value range. A positive match results in a warning

Validate Type Conversions

Data types used for staging data will rarely match those of the backend target table. Data is converted to the correct type when it is loaded into the final target table. This stage therefore verifies that there will be no invalid conversions when loading. While this is a duplication of type conversions in the Load Staged Data phase, it provides the advantage of checking the data before the more compute-intensive data load and is able to report all columns with data issues in a single pass.

Example 19: Catching Invalid Data Type Conversions

In the following example, the SH.SALES table is offloaded to a Hadoop cluster with an invalid data type override for two columns: the data in the PROD_ID and CUST_ID columns is not compatible with the user-requested single-byte integer data type.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --integer-1-columns=PROD_ID,CUST_ID

This results in the following exception:

CAST() of load data will cause data loss due to lack of precision in target data type in 6887232 rows
Failing casts are:
    (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
    (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
The SQL below will assist identification of problem data:
SELECT PROD_ID
,      CUST_ID
FROM   `sh_load`.`sales`
WHERE  (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
OR     (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
LIMIT 50

The exception provides three important pieces of information:

  • The number of rows with issues

  • The columns/conversions with issues

  • A SQL statement to use offline to review a sample of the problematic data

Load Staged Data

In this phase of an offload, staged data is converted to the correct data types as described in Validate Type Conversions and inserted into the target backend table. Where available, performance metrics from the backend system are recorded in the Offload log file written to $OFFLOAD_HOME/log.

Offload Transport Chunks

The source RDBMS data is offloaded in “chunks”. A chunk can comprise an entire table or a set of one or more partitions. The sequence of four operations described above is executed for each chunk until the whole input set has been offloaded. A final task in this phase, for backend systems that support it, is to update statistics on the freshly offloaded data.

For partitioned RDBMS tables, data is offloaded in sets of partitions. Input partitions are grouped into chunks, based on either the number of partitions or cumulative partition size (including subpartitions where relevant). For non-partitioned tables the table itself is considered a single chunk.

The size limit for a partition chunk is defined by MAX_OFFLOAD_CHUNK_SIZE or per offload with the --max-offload-chunk-size option. The maximum number of partitions in a chunk is defined by MAX_OFFLOAD_CHUNK_COUNT or per offload with the --max-offload-chunk-count option. Both size and count thresholds are active at the same time, therefore the first threshold to be breached is the one that closes the chunk. If the first partition in a chunk breaches the size threshold then it will be included for offload and that chunk will be greater in size than the configured threshold.

Sets of partitions are an effective way of managing resource consumption. For example, if a source RDBMS typically has partitions of 8GB in size and the backend system can comfortably load 64GB of data at a time without impacting other users, MAX_OFFLOAD_CHUNK_SIZE can be set to 64G and roughly 8 partitions will be offloaded per chunk.
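
Continuing the example above, the chunk thresholds could also be set for a single offload with a command of the following form (assuming the options accept the same value formats as their configuration counterparts; the values are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --max-offload-chunk-size=64G \
      --max-offload-chunk-count=8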

When calculating the cumulative size of partitions, Offload uses RDBMS segment sizes. This is important to note because compressed data in the RDBMS may increase in size when extracted and staged.

Partitions are considered for chunking in the logical order they are defined in the source RDBMS. For example, partitions from an Oracle Database range-partitioned table are split into chunks while maintaining the order of oldest to most recent partition. Offload does not attempt to optimize chunking by shuffling the order of partitions because that would offload partitions out of sequence and be a risk to atomicity.

Data Storage Options

Gluent Data Platform supports several storage schemes or services for offloaded data, listed in Table 7 below.

Table 7: Offload Storage Options

Storage Scheme or Service | Offload Staging Data | Offloaded Data | OFFLOAD_FS_SCHEME
HDFS | Hadoop only | Hadoop only | hdfs
Amazon Simple Storage Service (S3) | Snowflake only | Hadoop only | s3a
Microsoft Azure Data Lake Storage Generation 1 | - | Hadoop only | adl
Microsoft Azure Data Lake Storage Generation 2 | - | Hadoop only (Impala >= 6.2) | abfs, abfss
Microsoft Azure Blob Storage | Snowflake and Synapse only | - | wasb, wasbs, abfs, abfss
Google Cloud Storage (GCS) | BigQuery and Snowflake only | - | gcs

Storage preferences such as base HDFS file locations and/or cloud storage buckets can be specified as Offload command-line options, but it is more common to configure them once in the Gluent Data Platform environment file.

Gluent Offload Engine can offload data to cloud storage and requires a small amount of configuration, as follows:

When offloading to Hadoop-based backends:

  • The two-phase loading process stages data in an Avro table before persisting it to the Hadoop database

  • A value of inherit can be used for OFFLOAD_FS_SCHEME. With this value all tables created by Offload will inherit the location from the parent database. If databases are created using the --create-backend-db option, a default location of HDFS will be used

  • Specifying a storage scheme in cloud storage will persist the offloaded data to that location, but data must first be staged in HDFS as part of the two-phase loading process

  • Offloaded data is compressed. The default compression can be overridden with the --storage-compression option

  • The default file format for all persisted data is Parquet, regardless of the staging storage scheme

When offloading to Google BigQuery:

  • The two-phase loading process stages data as Avro file(s) in Google Cloud Storage before persisting it to Google BigQuery

  • Google BigQuery has its own proprietary storage format for the persisted data

When offloading to Snowflake:

  • The two-phase loading process stages data as Parquet file(s) in the configured cloud storage system before persisting it to Snowflake. The storage format can be overridden to use Avro with OFFLOAD_STAGING_FORMAT, but the default of Parquet is strongly recommended

  • Snowflake has its own proprietary storage format for the persisted data

When offloading to Azure Synapse Analytics:

  • The two-phase loading process stages data as Parquet file(s) in the configured cloud storage system before persisting it to Synapse

  • Azure Synapse Analytics has its own proprietary storage format for the persisted data

Regardless of the storage provider:

Managing Data Distribution

In addition to the variety of cloud storage options supported by Gluent Data Platform, Offload also provides options to manage the offloaded data distribution:

Managing Backend Partitioning

When offloading data, backend tables can be partitioned in several ways, depending on the backend platform and user preferences:

Note

This section does not apply to Snowflake or Azure Synapse Analytics.

Surrogate Partitioning for Bucketing

When offloading to Hadoop-based backends, Offload always includes a surrogate partition key named OFFLOAD_BUCKET_ID, in addition to any other partitioning scheme that might be defined for the table. This surrogate partition key is used internally by Gluent Data Platform to enable Smart Connector to scan backend data in parallel when executing a parallel hybrid query. The number of partition key values defaults to the DEFAULT_BUCKETS configuration but can be overridden with the --num-buckets option.

A hash function is used by Offload to distribute data across the OFFLOAD_BUCKET_ID partitions. Offload will try to pick the most selective column to distribute data evenly across the surrogate partitions, but in the absence of a primary key or column statistics for the source RDBMS table, it will default to the first table column. Users can override the hashing column by including the --bucket-hash-column option with the initial offload command.
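
For example, an initial offload could specify both the number of buckets and the hash column with a command of the following form (the bucket count and column name are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --num-buckets=16 \
      --bucket-hash-column=CUST_ID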

The surrogate partition column is not required when offloading to Google BigQuery.

Inherited Partitioning

Tables offloaded with Partition-Based Offload or Subpartition-Based Offload are automatically partitioned in the backend with the (sub)partition key of the source table, unless overridden by the user. The granularity of the inherited (sub)partitions can be different from the source RDBMS (sub)partitions if required. In some cases it is mandatory to specify the granularity of the backend partition scheme (see Partition Granularity below for details).

For Hadoop-based backends, the inherited partition key can comprise multiple columns and is used in addition to the surrogate OFFLOAD_BUCKET_ID partition key described above. Backend partition key columns are always implemented by the addition of synthetic partition key columns. See Synthetic Partitioning below for details.

For Google BigQuery, only the leading (sub)partition key column will be used. In some cases, the backend partitioning might be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.

User-Defined Partitioning

Tables can be offloaded with the --partition-columns option to define a custom partitioning scheme for the backend table. This enables non-partitioned RDBMS tables to be partitioned in the backend if required, in addition to allowing (sub)partitioned RDBMS tables to be offloaded with a different partitioning scheme in the backend. It is also possible (and in some cases mandatory) to specify the granularity of the backend partitions (see Partition Granularity below for details). User-defined partitioning also supports more RDBMS data types than inherited partitioning.

For Hadoop-based backends, the user-defined partition key can comprise multiple columns and is added to the surrogate OFFLOAD_BUCKET_ID partition key described above. The following example offloads a partitioned table to a Hadoop backend and subpartitions the offloaded data by year (derived from a date) and product ID (in steps of 1000).

Example 20: Offload with User-Defined Partitioning
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-columns=TIME_ID,PROD_ID --partition-granularity=Y,1000

When offloading to a Hadoop-based backend, user-defined partitioning can include date/timestamp, numeric or string columns.

When offloading to Google BigQuery, user-defined partitioning can include date/timestamp, numeric or string columns, but only one partition key column can be defined. In some cases, the backend partitioning will be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.

Synthetic Partitioning

Synthetic columns are used to partition backend tables instead of the corresponding natural columns. Depending on the data type and backend system, Offload will sometimes implement inherited and user-defined partitioning schemes with additional synthetic columns (one per source (sub)partition column). This is usually when:

  • The backend system has no native partitioning support for the data type of the partition key column

  • The backend system has no native partitioning support for the change in granularity requested for the offloaded partitions

Synthetic partition keys are used internally by Gluent Query Engine to generate partition-pruning predicates in hybrid queries and by Offload to ensure that the backend partition columns remain consistent across multiple Partition-Based Offload or Subpartition-Based Offload operations.

Synthetic partition key columns are named by Gluent Offload Engine as a derivative of the corresponding source column name (e.g. GL_PART_M_TIME_ID or GL_PART_U0_SOURCE_CODE).

When a table is offloaded with partitioning to a Hadoop-based backend, synthetic partition column(s) are always generated by Offload. The resulting synthetic column(s) are always of type STRING.

When a table is offloaded with partitioning to Google BigQuery, a synthetic partition key will only be generated when the natural partition column is of a NUMERIC, BIGNUMERIC or STRING BigQuery data type, and the resulting synthetic partition key column will be created as an INT64 type (the integral magnitude of the source numeric data must not exceed the INT64 limits). For STRING columns, or for extreme [BIG]NUMERIC data that cannot be reduced to INT64 values with --partition-granularity (i.e. the granularity itself would need to exceed INT64 limits), a custom user-defined function (UDF) must be created and used to enable Gluent Offload Engine to create an INT64 synthetic partition key representation of the source data (see Partition Functions). Native BigQuery partitioning will be used when the natural partition column has a data type of INT64, DATE, DATETIME or TIMESTAMP.

Offload populates synthetic partition key columns with generated data when offloading based on the type and granularity of the data or based on the type and a custom partition function. Gluent Query Engine will generate predicates for the synthetic partition columns when pushing down predicates for the corresponding natural columns.

Partition Granularity

Partition granularity defines the range or contents of a single backend partition. In some cases, granularity must be defined (using the --partition-granularity option) when offloading an RDBMS table for the first time, but in many cases, Offload will choose a default granularity based on several factors (including the type of RDBMS partitioning, the RDBMS (sub)partition key type and the backend platform). Any defaults chosen by Offload can be overridden with the --partition-granularity option if required.

Partition Granularity Behavior With Hadoop Backends

When offloading with partitioning to a Hadoop-based backend, the following granularity behavior applies.

  • If the RDBMS table is list-partitioned and the partition scheme is inherited:

    • Date/timestamp partition data defaults to a daily granularity

    • Numeric partition data defaults to a granularity of 1

    • String partition data defaults to the length of the RDBMS string partition column

  • In all other scenarios:

Partition Granularity Behavior With Google BigQuery

When offloading with partitioning to Google BigQuery, the following granularity behavior applies.

Partition Functions

The Partition Functions feature is only available when offloading to Google BigQuery. This extensibility feature allows users to provide a custom user-defined function (UDF) for Gluent Offload Engine to use when synthetically partitioning offloaded data. It enables users to choose a source partitioning column that would otherwise not be usable as a partition key in the backend system. For example, there is no native partitioning option for STRING data in BigQuery, which means that Gluent Offload Engine’s standard synthetic partitioning cannot be used. Also, some extreme [BIG]NUMERIC data cannot be reduced to INT64 values with Offload’s standard synthetic partitioning scheme (i.e. with the --partition-granularity option). The Partition Functions feature provides a way for users to create an INT64 representation of the source partitioning data to use as a synthetic partition key in both of these cases.

Example 21: Offloading with Partition Functions

In the following example, the SH.CUSTOMERS table is offloaded to BigQuery and synthetically partitioned by the CUST_LAST_NAME column. Because the source column is VARCHAR2, the Partition Functions feature is required. First, the BigQuery UDF must be created (or already exist) to enable Offload to generate an INT64 representation of the source string data. In this example, the backend table will be partitioned according to the first letter of the CUST_LAST_NAME data, using the following BigQuery SQL UDF:

CREATE FUNCTION UDFS.CUST_NAME_TO_PART_KEY (str STRING) RETURNS INT64 AS (ASCII(UPPER(str)));

With this UDF, Gluent Offload Engine is able to offload the SH.CUSTOMERS table and add a synthetic partition key with an INT64 representation of the CUST_LAST_NAME source data as follows:

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --partition-columns=CUST_LAST_NAME \
      --partition-functions=UDFS.CUST_NAME_TO_PART_KEY \
      --partition-lower-value=65 \
      --partition-upper-value=90 \
      --partition-granularity=1

With this command, Offload will apply the UDFS.CUST_NAME_TO_PART_KEY function to the CUST_LAST_NAME column to generate a synthetic INT64 partition key value between 65 and 90 (capital letters) and each backend synthetic partition will have a range of 1.

Note

In addition to Full Offload (as shown in Example 21), Partition Functions can also be used with (Sub)Partition-Based Offload (see Example 7: Offload a Range of String Partitions (BigQuery)) and Predicate-Based Offload.

There are several ways to reference and qualify the name and location of a custom UDF for use as a partition function:

  • Fully-qualified in the --partition-functions option (e.g. --partition-functions=MY_UDF_DATASET.MY_UDF_NAME)

  • If the dataset name is not provided in the --partition-functions option (e.g. --partition-functions=MY_UDF_NAME):

    • If OFFLOAD_UDF_DB is set in the offload.env configuration file, Offload will use this to identify the UDF dataset

    • If OFFLOAD_UDF_DB is not set in the offload.env configuration file, Offload will assume the UDF is in the same dataset as the offloaded table

Google BigQuery identifiers are case-sensitive, so in all cases the dataset and UDF values must be provided in the correct case.

When creating a UDF for use with Partition Functions, the following conditions must be met:

  • The UDF must be a Google BigQuery SQL UDF (JavaScript UDFs are not supported)

  • The UDF return data type must be INT64

  • The UDF must have a single parameter of data type STRING, [BIG]NUMERIC or INT64

  • The UDF must be deterministic (i.e. the same input will always yield the same output)

  • Synthetic backend partitioning is range partitioning; therefore, the SQL UDF must retain the same ordering relationship and characteristics as the source data. This is especially critical if the source partition column is likely to be queried with range operators (< <= > >= BETWEEN), otherwise queries can yield wrong results. In Example 21 above, the CUST_NAME_TO_PART_KEY function preserves the ordering of the CUST_LAST_NAME data (i.e. all names beginning with ‘A’ (ASCII 65) are less than names beginning with ‘B’ (ASCII 66), and so on)

  • See Google BigQuery Custom Partition Functions for permissions and other installation requirements

Backend Data Distribution and Sorting/Clustering

In addition to partitioning, offloaded data can optionally be distributed according to distribution, sorting or clustering options provided by the backend. This can lead to more efficient data access (such as range scans) for hybrid queries and other users of the offloaded data.

The following data distribution options are provided by the backends:

  • Impala: For Hadoop-based backends, Offload uses partitioning to hash distribute offloaded data into buckets, as described in Surrogate Partitioning for Bucketing

  • Azure Synapse Analytics: Tables can be hash distributed by one column

When offloading to Azure Synapse Analytics, most tables will be distributed by either HASH or ROUND_ROBIN (assuming that the table does not contain any data types that prevent clustered columnstore storage). The distribution method depends on the value of OFFLOAD_TRANSPORT_SMALL_TABLE_THRESHOLD globally or the table-level --offload-transport-small-table-threshold override option (smaller tables are distributed by ROUND_ROBIN and larger tables by HASH). The column used to distribute the data is automatically selected by Offload unless a specific column is provided with the --bucket-hash-column option.

The following data sorting and clustering options are provided by the backends:

  • Impala: Data can be stored sorted by one or more columns

  • Google BigQuery: Tables can be clustered by up to four columns

  • Snowflake: Tables can be clustered by more than one column, but it is recommended that no more than four columns are used

  • Azure Synapse Analytics: Tables can have a clustered columnstore index with more than one numeric column

To enable data sorting and clustering globally, set the OFFLOAD_SORT_ENABLED configuration parameter to true. Alternatively, this can be managed for individual offloads by adding the --offload-sort-enabled option to the offload command (set to true), along with the --sort-columns option to specify the columns for distributing the data in the backend.
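
For example, to enable sorting/clustering for a single offload and sort or cluster the backend table by its time column, options of the following form could be used (the column name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-sort-enabled=true \
      --sort-columns=TIME_ID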

Managing Hybrid Query Performance

In addition to creating the database objects and metadata required to enable Gluent Query Engine to execute hybrid queries, Offload also provides a solid foundation for good hybrid query performance. It does this through a combination of standard operations and configurable options, designed to prepare the hybrid environment for Smart Connector optimizations. Several of these are described below.

Managing Parallelism

By default, Offload prepares the hybrid environment for parallel query access in several ways, unless disabled by configuration:

Offload Buckets

As described earlier, in Hadoop-based environments, Gluent Offload Engine automatically partitions the backend table by a surrogate OFFLOAD_BUCKET_ID column to enable parallel access by multiple Smart Connector sessions. This can be configured globally as a default number of buckets (see DEFAULT_BUCKETS) and/or specified per table with the --num-buckets option. This should be set to the highest degree of parallelism required for tables that need to be accessed by parallel queries. It can be capped globally to avoid creating too many bucket partitions in the backend table. It can also be capped to limit the degree of parallel access (this applies to Impala only).

Small tables below a threshold (configured with the DEFAULT_BUCKETS_THRESHOLD global parameter) are automatically offloaded with a single bucket.

Google BigQuery, Snowflake and Azure Synapse Analytics do not require this Gluent Data Platform data storage optimization. Smart Connector utilizes the BigQuery Storage API for natively-managed parallel data retrieval from Google BigQuery. For parallel data retrieval from Snowflake and Azure Synapse Analytics, Smart Connector first caches and shards the results of the query and then reads the results in parallel.

Hybrid External Table Location Files

Hybrid external tables can be created with multiple location files. When doing so, parallel hybrid queries spawn multiple Smart Connector processes for parallel data retrieval. The number of location files for each hybrid external table is determined by either the NUM_LOCATION_FILES global default parameter or per table with the --num-location-files option.

Note that when offloading to Impala, the number of external table location files is always aligned with the number of offload buckets.

To disable parallel data retrieval for an external table, the --num-location-files option should be added to the offload command and should be set to 1.
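
For example, parallel data retrieval could be disabled for a table by offloading it with a single location file, as in the following sketch (applicable to non-Impala backends, since Impala aligns location files with the number of offload buckets):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --num-location-files=1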

Hybrid External Table Degree

By default, Offload copies the degree of parallelism (DOP) from the source RDBMS table to the hybrid external table when offloading. This ensures that hybrid queries are executed with the same default DOP as the original application queries. This can be configured globally with the HYBRID_EXT_TABLE_DEGREE parameter or set per table with the --ext-table-degree option. In most cases, the default suffices.

This does not apply to hybrid external tables that are created for Advanced Aggregation Pushdown; these are always created with a degree of 1.

Managing Statistics

Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Offload manages statistics for both the hybrid external table and the backend target table (where applicable).

Hybrid Object Statistics

When offloading an RDBMS table for the first time, Offload copies the table and column statistics from the source table to the base hybrid external table. No prorating or modification of the statistics occurs, even if the source table is partially-offloaded. In cases where partially-offloaded tables continue to be offloaded (for example, a Partition-Based Offload of the latest few partitions), Offload will not overwrite the external table statistics unless the source RDBMS statistics indicate that the source table has grown. This prevents the statistics from being downgraded if the source RDBMS table shrinks due to offloaded partitions being dropped from the RDBMS.

Hybrid external tables that support general, custom or count-only Advanced Aggregation Pushdown have their statistics derived from the backend table statistics. When offloading to Google BigQuery or Snowflake, statistics for these hybrid external tables are derived by Offload.

Backend Table Statistics

When offloading to Hadoop-based or Azure Synapse Analytics backends, Offload will compute statistics on the backend target table every time it is loaded.

For Impala, the default behavior for partitioned tables is to compute incremental statistics. For very large tables, or tables with a large number of partitions, this can sometimes lead to pressure on Impala resources. In such cases (and to save time on statistics operations in general), Offload provides a mechanism to copy statistics from the source RDBMS table to the backend target table. This is a metadata-only operation and as such is quick to process; the backend statistics are derived from the source RDBMS values. To specify that statistics should be copied rather than computed, the --offload-stats option should be added to the offload command with a value of COPY.
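
For example, to copy statistics rather than compute them during an offload to Impala, the option could be added as follows:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-stats=COPY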

For Azure Synapse Analytics, statistics are updated after every offload using a fixed 5% sample, unless disabled with the --offload-stats option (using a value of NONE). Unlike Impala, statistics cannot be copied from the RDBMS table to the Azure Synapse Analytics table.

This step is skipped when offloading to either Google BigQuery or Snowflake (BigQuery tables do not have statistics and Snowflake gathers statistics automatically while loading data).

Data Distribution and Sorting/Clustering

See Backend Data Distribution and Sorting/Clustering.

Resetting an Offloaded Table

Offload provides a --reset-backend-table option that can be added to an offload command to fully reset a previously offloaded table. This option is useful when refreshing small offloaded tables such as reference tables or dimensions (these are commonly offloaded to support Offload Join operations). This option will cause Offload to drop all offloaded data from the backend, so it should be used carefully.

Important

The --reset-backend-table option will drop the existing backend table and supporting objects. This option must only be used when it is certain that resetting the backend table will not cause data loss. Care must be taken to ensure that this option is never used if some of the previously-offloaded data cannot be replaced (for example, partitions that are dropped from the source RDBMS after offloading to the backend).
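
For example, assuming SH.PRODUCTS is a small dimension table that can be fully re-offloaded without any risk of data loss, it could be refreshed with a command of the following form:

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x \
      --reset-backend-table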

Offload Security

Gluent Data Platform interfaces with the security features of the RDBMS and backend data platforms, providing support for authentication, authorization and encryption features.

See the Security guide for details of supported security features and the Installation & Upgrade guide for implementation details.

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com