Offload

Documentation Conventions

  • Commands and keywords are in this font.

  • $OFFLOAD_HOME is set when the environment file (offload.env) is sourced, unless already set, and refers to the directory named offload that is created when the software is unpacked. This is also referred to as <OFFLOAD_HOME> in sections of this guide where the environment file has not been created/sourced.

  • Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.

Introduction

Offload is a core component of Gluent Offload Engine that provides a simple and flexible way to copy data from an RDBMS (such as Oracle Database) to a variety of backend data platforms (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery, Snowflake, Azure Synapse Analytics). This guide describes how to use Offload in a number of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Offload guide, particularly the Gluent Offload Engine and Offload concepts.

Overview of the Offload Process

To understand the Offload software and its use-cases, options and operation, it is useful to start with a high-level understanding of an offload process in general. Offloading data from an RDBMS to a backend requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems. When running an offload, the various steps executed by Offload are visible both on the terminal and in a log file, but a broad description of what these steps are doing is provided below.

An offload process has four broad logical phases:

  1. Phase 1: Setup

  2. Phase 2: Offload Transport

  3. Phase 3: Hybrid Objects

  4. Phase 4: Verification

Phase 1: Setup

In the setup phase, Offload will gather a range of information about the data to be offloaded, such as the structure of the data, its data types, metadata from previous offloads and so on. With this information, Offload is able to prepare the target backend system for the Offload Transport phase. This includes the creation of target databases and supporting objects (if required), the creation of a staging location and objects (if required) for the offloaded data and, for first-time offloads, a final table in the target backend database.

Phase 2: Offload Transport

In the transport phase, Offload copies the source RDBMS data to the backend staging location. Once the staged data is validated, it is loaded into the backend target table. This two-phased approach provides two main benefits:

  1. Atomicity: the load from staged data to target table either fully succeeds or completely fails

  2. Partitioning: for backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

Phase 3: Hybrid Objects

In the hybrid objects phase, Gluent Offload Engine creates a number of database objects in the RDBMS hybrid schema and saves metadata to Gluent Metadata Repository to support both the operation of Gluent Data Platform and provide an interface for users to transparently query offloaded data. Statistics, synonyms and privileges are also created as required. See Offload Objects below for an example of the hybrid objects that are created by a standard Offload.

Where applicable, Offload will also create hybrid copies of any views that are dependent on the offloaded RDBMS table. If any such dependent hybrid views already exist, Offload will also re-generate these with any changes that are required as a result of the latest offload operation.

Phase 4: Verification

The verification phase is the final operation in an offload process. In this phase, Offload queries both the hybrid view and the source RDBMS table to verify that the offloaded data set matches in both databases. This has a secondary benefit of validating the hybrid objects. Following successful verification of the offloaded data, the offload is complete. Any notices or warnings identified during the various phases are logged and Offload exits with success (i.e. an exit code of 0). See the --verify option for available validation methods. By default, the verification SQL runs serially, but can be executed in parallel by using the --verify-parallelism option.
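
As an illustrative sketch (table name assumed), the following command offloads a table and runs the default verification queries with a parallel degree of 8. See the --verify option reference for the available validation methods.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --verify-parallelism=8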

Controlling the Offload Process

The offload process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per offload. In addition to the remainder of this Offload guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.

Offload Objects

The following table lists the standard set of RDBMS and backend database objects that are created by an offload of SCHEMA.TABLE.

Table 1: Objects Created by Offload

Database | Object Type         | Object Naming Format       | Hybrid Object Type
Backend  | Table               | SCHEMA_LOAD.TABLE          | Temporary staging table
Backend  | Table               | SCHEMA.TABLE               | Offloaded table
RDBMS    | View                | SCHEMA_H.TABLE             | Hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_EXT         | Hybrid external table
RDBMS    | View                | SCHEMA_H.TABLE_AGG         | Aggregate hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_AGG_EXT     | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | SCHEMA_H.TABLE_AGG         | Aggregate rewrite rule
RDBMS    | View                | SCHEMA_H.TABLE_CNT_AGG     | Aggregate hybrid view
RDBMS    | Table               | SCHEMA_H.TABLE_CNT_AGG_EXT | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | SCHEMA_H.TABLE_CNT_AGG     | Aggregate rewrite rule

Google BigQuery identifiers (dataset, table and view names) are case-sensitive. Snowflake identifiers are by default case-sensitive when quoted. Azure Synapse Analytics identifiers are case-sensitive when quoted or when the Synapse dedicated SQL pool is created as a case-sensitive database. Gluent Offload Engine allows a default case to be set for all Google BigQuery, Snowflake and Azure Synapse Analytics identifiers created by Offload to ensure consistency. See BACKEND_IDENTIFIER_CASE for more details.

Offload adds a variety of suffixes to the offloaded table name when naming the hybrid objects. If the length of a hybrid object name exceeds the RDBMS limit, Offload will truncate the hybrid object name to below the limit and pad the remainder with a unique string to avoid potential clashes with other truncated identifiers.

See also

Glossary for definitions of the hybrid object types in addition to the Concepts and Performance guides for further information on their purpose and usage.

Offloading Data

Offload supports several scenarios for offloading data from the RDBMS:

Full Offload

Heap, partitioned and index-organized tables can be fully-offloaded from the RDBMS as follows.

Example 1: Offload a Full Table

$ . $OFFLOAD_HOME/conf/offload.env

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x

Note

This example includes the sourcing of the Gluent Data Platform environment file for an offload session. All other examples in this guide will assume this step has occurred.

A log file containing all of the steps in the offload process will be generated in $OFFLOAD_HOME/log. Using the -v or --vv option will result in more screen output during the offload.

Backend tables can be optionally partitioned, even if the source RDBMS table is not partitioned. See Managing Backend Partitioning below for more details.

It is possible to rename the target backend table when offloading with the --target-name option. This can be useful for a number of reasons, including when:

  • The source RDBMS table name includes a character that is not supported by the backend (for example, Oracle Database allows $ to be used for identifiers but this is not supported by Impala or BigQuery)

  • The naming standards used for applications in the backend system are different to the source RDBMS application (either for database names or object names)

  • The RDBMS schema name is too long to accommodate necessary Offload extensions such as the _load suffix for the backend staging database or the value of DB_NAME_PREFIX (for example, Impala has a database name limit of 128 characters, which means that the maximum supported length of the RDBMS schema is 123 when DB_NAME_PREFIX is not in use or 122 minus the length of DB_NAME_PREFIX when it is in use)

Example 2: Change the Offload Target Name

The following example offloads an RDBMS table named SH.SALES$ to a backend that doesn’t support the $ character in table names, meaning that the target table needs to be renamed to remove the $.

$ $OFFLOAD_HOME/bin/offload -t 'SH.SALES$' -x --target-name=SH.SALES

See also

This guide describes a number of Offload options, but a full list is provided in the Offload Parameters section of the Reference manual.

Partition-Based Offload

Partition-Based Offload enables some or all partitions of a partitioned table to be offloaded across one or more offload sessions. A common use of this feature is to initially offload the historical partitions for a table and then periodically offload new partitions as they arrive in the RDBMS, appending the data to the backend version of the table.

Partition-Based Offload can be used for the following scenarios:

Offloading Range-Partitioned Tables

Tables that are range-partitioned on DATE, TIMESTAMP, NUMBER or [N]VARCHAR2 columns can be fully or partially offloaded. Partitions can be offloaded contiguously up to a boundary by specifying a high water mark in the offload command. Partition boundaries can be increased in subsequent offloads to append additional partitions without affecting historical partitions.

Partition Boundary Options

To offload a contiguous range of partitions from a range-partitioned table, one of the following boundary options must be used:

  • --older-than-date: Offload partitions with a high water mark older than this date (exclusive, i.e. the boundary partition is not offloaded), e.g.

--older-than-date=2015-10-01

  • --older-than-days: Offload partitions older than this number of days (exclusive, i.e. the boundary partition is not offloaded). This option is suitable for keeping data up to a certain age in the source table and is an alternative to the --older-than-date option. If both are supplied, --older-than-date will be used, e.g.

--older-than-days=90

  • --less-than-value: Offload partitions with a high water mark less than this value. Will accept integer, string or date values and allows intraday date and timestamp values to be specified, e.g.

--less-than-value=100000000
--less-than-value=2015-01-01
--less-than-value="2015-01-01 12:00:00"
--less-than-value=M

  • --partition-names: Offload partitions with a high water mark matching that of the named partition, e.g.

--partition-names=P201509

Boundary options for range-partitioned tables are exclusive and the partition that contains data for the specified high water mark will not be offloaded. Furthermore, Offload will not offload partial partitions. For example, suppose a partition has data with a value range of 2000-01-01 to 2000-01-31. Using --older-than-date=2000-01-15 will not offload data from 2000-01-01 to 2000-01-14 because the partition has the potential for data up to and including 2000-01-31. Instead, data will be offloaded up to the boundary of the previous range partition. To offload the example partition, an --older-than-date=2000-02-01 value must be used, but only when loading for the entire partition has completed.

Note

Backend partitioning concepts are not applicable when offloading to Snowflake or Azure Synapse Analytics.

Offloading with Date Partition Boundaries

The following is an example of offloading all partitions below a given date boundary.

Example 3: Offload a Range of Date Partitions
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01

By default, Example 3 above will use the RDBMS partition key column as the backend partition column for backends that support partitioning (they can be different if required; see User-Defined Partitioning for details). It is common to use a different granularity for the backend partitions, which is especially useful when the backend system better supports larger or fewer partitions. For example, an RDBMS table partitioned by day might be better offloaded to monthly partitions in a Hadoop-based backend. See --partition-granularity for details of the default granularities for various scenarios and backends.
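
For example, the following illustrative command offloads the same daily-partitioned table but stores the data in monthly backend partitions. The M granularity value is an assumption here; see the --partition-granularity reference for the supported values.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01 --partition-granularity=M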

Offloading with Numeric Partition Boundaries

For range-partitioned tables with numeric partition keys, the partition granularity of the backend table must be specified. The command syntax differs slightly according to the target backend, as the following examples demonstrate.

Example 4: Offload a Range of Numeric Partitions (Hadoop)

This example offloads all partitions with a partition key value of less than 5,000,000,000 and creates backend partitions with a range of 10,000,000. This applies when offloading to Hadoop-based platforms.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000

Note

It is also possible to specify a limit to the size of numeric partition keys (see --partition-digits).

Example 5: Offload a Range of Numeric Partitions (BigQuery)

For Google BigQuery, the full range of potential partition key values must be specified when creating a table with numeric partitions, hence the additional Offload options in the following example.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000 \
      --partition-lower-value=1 \
      --partition-upper-value=99999999999999999

Note

Any partition key data that falls outside the range specified by the lower and upper values will be loaded into the __UNPARTITIONED__ partition by BigQuery.

Offloading with String Partition Boundaries

For range-partitioned tables with string partition keys, the partition granularity of the backend table must be specified when offloading to Hadoop-based backends. Google BigQuery does not support partitioning on STRING columns; therefore the source partition column data must be converted to INT64 with a custom partition function (see --partition-functions) to enable the backend table to be synthetically partitioned. The following examples demonstrate the different options required when offloading to Hadoop or BigQuery backends.

Example 6: Offload a Range of String Partitions (Hadoop)

This example offloads all partitions with a partition key value of less than ‘M’ and creates backend partitions based on the first two characters of the source string data. This applies when offloading to Hadoop-based platforms.

$ $OFFLOAD_HOME/bin/offload -t SH.VARCHAR2_PARTITIONED_FACT -x \
      --less-than-value=M \
      --partition-granularity=2
Example 7: Offload a Range of String Partitions (BigQuery)

Google BigQuery does not have native STRING partitioning support, but the source string data can still be used to partition the backend table synthetically if a custom UDF is provided to convert the string data to an INT64 type. The following example shows a VARCHAR2 partition key used as the source for Google BigQuery partitioning. A custom UDF is provided to generate an ASCII value for the first character of the source data and the resulting INT64 value is used to partition the backend table. The full range of potential partition key values must be specified when offloading a table with a VARCHAR2 partition column.

$ $OFFLOAD_HOME/bin/offload -t SH.VARCHAR2_PARTITIONED_FACT -x \
      --less-than-value=F \
      --partition-granularity=1 \
      --partition-lower-value=65 \
      --partition-upper-value=150 \
      --partition-functions=MY_UDF_DATASET.STRING_TO_ASCII
Fully Offloading Range-Partitioned Tables with Offload Type

Range-partitioned tables are also supported for full offload, either by using the --offload-type option with a value of FULL or by omitting all of the partition boundary options described above. The following option combinations will fully offload a range-partitioned table (see the example after this list):

  • --offload-type=FULL: Offloads all partitions and creates a simple hybrid view that queries all the data from the backend

  • --offload-type=FULL plus one of the boundary options: Offloads all partitions and data to the backend but creates a composite hybrid view that queries the data below the specified boundary from the backend and data above the specified boundary from the RDBMS

  • None of the offload type or partition boundary options described above: This is the same as specifying --offload-type=FULL
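
As a minimal sketch of the second combination (table name assumed), the following command offloads all data but retains RDBMS access for data above the boundary via a 100-10 hybrid view:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-type=FULL \
      --older-than-date=2015-10-01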

See also

See Offload Type Considerations for more details on support for partial and full offloads of partitioned tables.

Moving the Range High Water Mark in a Hybrid View

The partition boundary in a hybrid view can be modified by running a subsequent offload with a boundary option matching a partition value that has already been offloaded. No data will be copied from the RDBMS to the backend but the associated hybrid view(s) will be reconfigured to reflect the new settings. If no partition is matched by the boundary option values then the partition boundary will be aligned with the data that has already been offloaded.

Moving the partition boundary does not affect any data that has already been offloaded. It is a metadata operation that simply changes where hybrid queries source their results from above and below the boundary. As described in Offload Type Considerations, this technique can be useful to get the best trade-off between making current data available in the backend system as soon as possible but having the most frequently accessed data queried from the RDBMS until it becomes less-used.
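
As a sketch, assuming SH.SALES has already been offloaded up to a boundary of 2015-10-01, the following command moves the hybrid view boundary back to 2015-07-01 without copying any data:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01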

MAXVALUE Partition Considerations

Offloads for range-partitioned tables with a MAXVALUE partition as the latest partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type option nor any of the partition boundary options are specified, the MAXVALUE partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition boundary option, the MAXVALUE partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition boundary option, the MAXVALUE partition will not be offloaded

  • If a partition boundary option is specified but not --offload-type, the MAXVALUE partition will not be offloaded

Note

Offloading the MAXVALUE partition for a table will prevent any further partition-based offloads for that table.

Multi-Column Partitioning

Oracle Database supports multi-column partitioning, whereby a table is partitioned on a set of columns rather than a single partition key (for example, PARTITION BY RANGE (yr, mon)). Offload supports this for range-partitioned tables. Any boundary condition must be provided via the --less-than-value option as comma-separated-values. For example:

--less-than-value=2015,10

Or:

--less-than-value=2015-10-31,200

By default, when offloading a multi-column partitioned table to Hadoop-based backends, the generated Hadoop table will be partitioned and subpartitioned to match the columns used in the RDBMS partitioning scheme. For example, an Oracle Database table partitioned by YR, MON will be partitioned in Hadoop by yr and then subpartitioned by mon. This can also be controlled using the --partition-columns option.

When offloading a multi-column partitioned table to Google BigQuery, the generated BigQuery table will be partitioned according to the leading partition key column only. The --partition-columns option will only allow one column to be specified when offloading to Google BigQuery.
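
For example, a hypothetical sketch for a table partitioned by (YR, MON), offloading up to the 2015,10 boundary and restricting the backend partitioning to the leading YR column (table and column names are assumptions):

$ $OFFLOAD_HOME/bin/offload -t SH.MULTI_COL_FACT -x \
      --less-than-value=2015,10 \
      --partition-columns=YR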

Offloading Interval-Partitioned Tables

Offload supports interval-partitioned tables in exactly the same way as range-partitioned tables.

Offloading List-Partitioned Tables

Tables that are list-partitioned on DATE, TIMESTAMP, NUMBER, [N]VARCHAR2 or [N]CHAR columns can be partially or fully offloaded. Discrete sets of one or more partitions can be offloaded by specifying either partition key values or partition names in the offload command. Additional partitions can be offloaded in subsequent offloads without affecting historical partitions.

List Partition Specification

Partitions in a list-partitioned table can be offloaded using either of the options below:

  • --equal-to-values: Offload partitions by partition key value, accepting partition key literals that match the RDBMS partitions, e.g.

--equal-to-values=2015-01-01
--equal-to-values=US --equal-to-values=AP --equal-to-values=EU
--equal-to-values=201501,201502,201503

  • --partition-names: Offload partitions by partition name, e.g.

--partition-names=TRANSACTIONS_P2015Q1,TRANSACTIONS_P2015Q2

When using --equal-to-values, each partition must have its own option specification, as shown above and in Example 8 below.

Example 8: Offloading Partitions from a List-Partitioned Table

The following example offloads two partitions from a date list-partitioned table:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --equal-to-values=2015-01-01 --equal-to-values=2015-01-02
Example 9: Offloading Multi-Valued Partitions from a List-Partitioned Table

When using --equal-to-values, each option must equal the full high values specification for a single partition. The following example offloads two multi-valued list partitions where the partition key is a numeric representation of year-month:

$ $OFFLOAD_HOME/bin/offload -t SH.TRANSACTIONS -x --equal-to-values=201501,201502,201503 --equal-to-values=201504,201505,201506

Offloads for tables with numeric list-partition keys do not require the --partition-granularity option, but for offloading to Google BigQuery the --partition-lower-value and --partition-upper-value options must still be used. See Managing Backend Partitioning for further details.

Example 10: Offloading List-Partitions by Partition Name

The --partition-names option is an alternative way to specify multiple partitions to offload and can be useful if partition names are well-formed and known. The following example offloads four partitions from a table that is list-partitioned on financial quarter:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-names=SALES_P2015Q1,SALES_P2015Q2,SALES_P2015Q3,SALES_P2015Q4
Resetting the List High Water Mark in a Hybrid View

List partition offloading is additive. A partition that has previously been offloaded will not be re-offloaded, even if it is specified in multiple offload commands. If there is a need to modify the hybrid view so that offloaded partitions are accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload. Any previously offloaded partitions matched by partition identification options will not be re-offloaded but will be reflected in the hybrid view predicates.
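
As a hedged sketch (partition key values assumed, following on from Example 8), the following command offloads a further partition and rebuilds the hybrid view so that only the partitions identified in this command are accessed from the backend; other previously offloaded partitions will be accessed from the RDBMS:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --equal-to-values=2015-01-02 \
      --equal-to-values=2015-01-03 \
      --reset-hybrid-view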

DEFAULT Partition Considerations

Offloads for list-partitioned tables with a DEFAULT partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type nor any of the partition identification options (--equal-to-values, --partition-names) are specified, the DEFAULT partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition identification option, the DEFAULT partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition identification option, the DEFAULT partition will not be offloaded

  • If a partition identification option is specified but not --offload-type, the DEFAULT partition will not be offloaded

Note

Offloading the DEFAULT partition for a table will prevent any further partition-based offloads for that table.

Offloading List-Partitioned Tables as Range

In cases where a list-partitioned table has been structured to mimic range partitioning, the table can be offloaded exactly as described in Offloading Range-Partitioned Tables. To use this feature the table should adhere to the following:

  • Each partition must have a single literal as its high value

  • All new partitions must be added with high values that are greater than those that have already been offloaded

  • Supported data types must match those referenced in Offloading Range-Partitioned Tables

In this scenario, a DEFAULT list partition will be treated in the same way that a MAXVALUE partition is treated for range partition offload (see MAXVALUE Partition Considerations for details).

Backend partition granularities differ between range-partitioned tables and list-partitioned tables offloaded as range. See Managing Backend Partitioning for details.

Example 11: Offload a Set of List Partitions Using a Date Range Partition Boundary

The following example offloads all list partitions with a partition key value of less than 2015-02-01.

An example using --older-than-date:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-02-01

Boundary options are exclusive and will not offload a partition with a key that equals the supplied value.

If using the --partition-names option to specify the partition to use as an offload boundary, the --offload-predicate-type option must also be specified with a value of LIST_AS_RANGE, otherwise Offload will treat the table as a basic list-partitioned table rather than use range-partitioned table behavior.
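
For example, an illustrative command using a hypothetical partition name as the boundary for a list-partitioned table offloaded with range semantics:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --partition-names=SALES_P201501 \
      --offload-predicate-type=LIST_AS_RANGE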

Subpartition-Based Offload

Tables that are range-subpartitioned on DATE, TIMESTAMP, NUMBER or [N]VARCHAR2 columns can be fully or partially offloaded. With Subpartition-Based Offload, partial offloading is supported even if the top-level partitioning scheme of a table is unsupported, because the subpartition high values are used to determine the offload boundary. To offload a range-subpartitioned table by subpartition, the --offload-by-subpartition option must be used.

Subpartition-Based Offload is useful for tables that are organized with ranged data (such as a time series or business date) at subpartition level, rather than partition level. In such cases, the subpartition boundaries can be used to determine the data to be offloaded, even when the partition-level data is also ranged.

When offloading a subpartitioned table with Partition-Based Offload, all subpartitions within the identified range or list partition(s) are offloaded as standard.

Example 12: Offload a Set of Range Subpartitions Using a Date Range Boundary

The following example demonstrates offloading all subpartitions below a date threshold of 2015-07-01:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01 --offload-by-subpartition

Subpartition Boundary Options

The Offload options for specifying subpartition boundaries are the same as for range partition offloading. See Partition Boundary Options for details.

Subpartition-Based Offload has an additional restriction, such that the boundary option value must guarantee that no subpartitions have potential for more data to be added. For example, suppose every partition in a table contains two range subpartitions with high values of 90 and 100. In this case, the boundary option of --less-than-value=100 identifies a common subpartition boundary across the entire partition set and this guarantees that all subpartitions below this threshold are “complete” and ready for offloading. However, suppose just one of the partitions has range subpartitions with high values of 90 and 110. In this case, the --less-than-value=100 option would not be able to offload all subpartitions below this threshold because the boundary is not common (i.e. the subpartition with a high value of 110 cannot be guaranteed to be “complete”).

In cases where the specified boundary is not valid, Offload will terminate with a warning and recommend that a common boundary is identified and used. In many cases it is possible to project the boundary value forward until a value is found that guarantees the safe offloading and appending of “complete” subpartitions.

Boundary options are exclusive and will not offload subpartitions with a key that equals the supplied value.

Other Subpartition-Based Offload Behavior

Subpartition-Based Offload behavior is the same as Partition-Based Offload in almost all cases. Considerations such as MAXVALUE subpartitions, full offloading with --offload-type, moving the high water mark in the hybrid view, using multi-column subpartition keys and so on are the same for range subpartitions as they are for range partitions. See Offloading Range-Partitioned Tables for details. Note that any examples would need the --offload-by-subpartition option to apply the same principle to range-subpartitioned tables.

Predicate-Based Offload

Offload supports the offloading of subsets of data (or an entire table) by predicate. This feature applies to any table type (partitioned or non-partitioned) and is known as Predicate-Based Offload.

With this feature, three offloading patterns are possible:

  1. Simple Predicate-Based Offload: Offload one or more non-overlapping subsets of data from a table

  2. Late-Arriving Predicate-Based Offload: Offload one or more late-arriving subsets of data from a table or partition that has already been offloaded

  3. Intra-Day Predicate-Based Offload: Offload a new partition in non-overlapping subsets as soon as each subset of data is loaded

Predicates are provided using a simple grammar as described in Predicate Grammar.

Simple Predicate-Based Offload

Example 13: Offload a Subset of Data from a Table Using a Predicate

The following example demonstrates a simple Predicate-Based Offload scenario by offloading all data for the ‘Electronics’ category.

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x --offload-predicate='column(PROD_CATEGORY) = string("Electronics")'

All data matching this predicate will be offloaded and the predicate will be added to the offload boundary in the hybrid view. Predicates are additive and any predicate that has been offloaded will not be re-offloaded, even if specified in future offload commands. If there is a need to modify the hybrid view so that offloaded data matching one or more predicates is accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload.

Backend tables can be optionally partitioned when using simple Predicate-Based Offload. See Managing Backend Partitioning for more details.

Late-Arriving Predicate-Based Offload

The late-arriving scenario caters for tables or (sub)partitions that sometimes have small subsets of data arriving after the original offload has taken place. This scenario benefits from the efficiencies of Full Offload or (Sub)Partition-Based Offload but with the flexibility of Predicate-Based Offload for “topping-up” the offloaded data. This pattern can be used with fully offloaded tables, range-(sub)partitioned tables or list-partitioned tables offloaded with range partition semantics.

  • When using Late-Arriving Predicate-Based Offload with tables previously offloaded with (Sub)Partition-Based Offload:

    • The Predicate-Based Offload command must include the --offload-predicate-type option with a value of RANGE or LIST_AS_RANGE

    • The late-arriving subset of data to be offloaded must not overlap with any data that has already been offloaded

    • The late-arriving subset of data to be offloaded must not reference data in (sub)partitions that have not yet been offloaded

  • When using Late-Arriving Predicate-Based Offload with tables previously offloaded with Full Offload:

    • The --offload-predicate-type option is not required

    • The late-arriving subset of data to be offloaded must not overlap with any data that has already been offloaded

The following examples demonstrate Late-Arriving Predicate-Based Offload for a table previously offloaded with Full Offload and a set of partitions previously offloaded with Partition-Based Offload.

Example 14: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Table

In the following example, the SH.CUSTOMERS table has been fully-offloaded using Full Offload. The Predicate-Based Offload command below is used to additionally offload a small set of data for new customers that have been inserted since the original offload took place.

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --offload-predicate='column(CUST_CREATED_DATE) > datetime(2019-01-15 13:23:34)'

All data matching this predicate will be offloaded but the predicate itself will not be added to the hybrid view or its metadata.

Example 15: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Range Partition

In the following example, the SH.SALES table has previously been offloaded using Partition-Based Offload up to and including data for 2019-01-15. The Predicate-Based Offload command below is used to additionally offload a small set of data for product 123 that has arrived late and been loaded into a partition that has already been offloaded.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2019-01-15) and column(PROD_ID) = numeric(123)' \
      --offload-predicate-type=RANGE

All data matching this predicate will be offloaded but the predicate itself will not be added to the offload boundary or metadata for the hybrid view (only the existing Partition-Based Offload boundary will be used).

Intra-Day Predicate-Based Offload

This offload pattern can be used to switch between Partition-Based Offload and Predicate-Based Offload as required. It can be useful as a means to offload the majority of a partitioned table by partition boundaries, but swap to offloading subsets of data for a new partition as soon as they arrive (rather than wait for the entire partition to be loaded before offloading). This can ensure that data is available in the target backend earlier than would otherwise be possible. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.

Example 16: Offload a Subset of Data for a New Range Partition (Intra-Day Offloading)

In the following example, the historical data for the SH.SALES table is offloaded by Partition-Based Offload (range partitioned by TIME_ID) at time T0 to begin the offload lifecycle for this table. For new data, rather than wait a full day for an entire partition of data to be ready, data is instead offloaded as soon as each product set is loaded into the new partition (3 separate loads at times T1, T2, T3). When all loads and offloads have completed, the table metadata is reset (at time T4) and the table is ready to repeat the same Predicate-Based Offload pattern for the next processing cycle.

T0: Offload all history by range partition

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-01

T1: Offload first set of products loaded at time T1

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(123), numeric(234))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T2: Offload second set of products loaded at time T2

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(345), numeric(456))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T3: Offload third set of products loaded at time T3

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(567), numeric(678))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T4: All products now loaded and offloaded, reset the metadata for the table

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-02

The Predicate-Based Offload commands for T1-T3 require that the --offload-predicate option includes a predicate for the Partition-Based Offload partition key and that it matches the value of the new data (TIME_ID=2020-07-01 in this example). The --offload-predicate-type option must also be used (see option reference for valid values).

When the metadata is reset after the end of a sequence of Predicate-Based Offload commands (as per T4 above), either the same pattern of offloading can continue for the next partition or, if preferred, Partition-Based Offload can resume (i.e. wait for the next partition to be fully-loaded before offloading in one command as described in Partition-Based Offload).

Predicate Grammar

At present, the offload predicate DSL or grammar used for the --offload-predicate option covers a simple range of possibilities, based on a <column> <operator> <value> model.

Predicate Form

Predicates can take several forms:

  • <column> <operator> <value>

  • <column> [NOT] IN (<value>[, <value>, ...])

  • <column> IS [NOT] NULL

The DSL allows specification of any number of predicates in a range of forms, which can be logically combined and nested with AND/OR, for example:

  • <predicate> AND|OR <predicate>

  • (<predicate> AND|OR <predicate>) AND|OR (<predicate>)

Parentheses are required to form logical predicate groups, as shown in the second example above. Whitespace is insignificant.
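
As an illustration of the grammar (column names are hypothetical), the following predicate combines two logical groups with parentheses, as it might be supplied to the --offload-predicate option:

--offload-predicate='(column(PROD_CATEGORY) = string("Electronics") or column(PROD_CATEGORY) = string("Software")) and column(TIME_ID) >= datetime(2020-01-01)'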

Column Form

Predicates always contain a single column written with the column keyword and a “function-call” syntax. Columns are not case-sensitive and can include aliases. The following examples are all valid:

  • column(CUST_ID)

  • column(SALES.CUST_ID)

  • column(cust_id)

Columns must exist in the table being offloaded and expressions/functions are not supported.

Operator Form

The basic SQL operators are supported and are written as in SQL. Member operators and null-value expressions are also supported. The full range is as follows:

  • =, !=, <, <=, >, >=

  • [NOT] IN

  • IS [NOT] NULL

Value Form

Values must be qualified with a data type classification and have a “function-call” syntax as follows:

  • string("This is a string")

  • numeric(123)

  • datetime(2020-07-01)

String values use the string keyword and double-quotes as shown above. Double-quote characters can be included in the string value by escaping with a backslash.

Numeric values use the numeric keyword and both signed integral and signed decimal values are supported with the same keyword.

Date/time values use the datetime keyword and can be specified at several levels of precision from dates to timestamps with nanosecond precision. The following are all valid examples of a datetime value:

  • datetime(2020-07-01)

  • datetime(2020-07-01 13:01:01)

  • datetime(2020-07-01 13:01:01.123)

  • datetime(2020-07-01 13:01:01.123456789)

Value lists for use in the [NOT] IN operator must be surrounded by parentheses and separated by commas. The values in a value list must be specified as normal, for example:

  • column(cust_id) IN (numeric(120), numeric(121), numeric(122))

The predicate DSL currently supports columns of the following Oracle Database data types:

Table 2: Predicate-Based Offload DSL Data Type Support

DSL Data Type Class | Supported Oracle Database Data Types
String              | VARCHAR2, CHAR, NVARCHAR2, NCHAR
Numeric             | NUMBER, FLOAT, BINARY_FLOAT, BINARY_DOUBLE
Datetime            | DATE, TIMESTAMP

Offloading Joins

Offload enables joins between fully and/or partially-offloaded tables to be offloaded to the backend. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided to the offload command with the --offload-join option, using a simple grammar as described in Offload Join Grammar.

Example 17: Offloading a Join

The following example offloads a three-table join to the backend system.

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"

In this example, the SH.SALES table is partially offloaded and the other tables are fully offloaded. The offload boundary for the SH.SALES table is used for the Offload Join to ensure that all offloaded data for this table is aligned to a common boundary.

A hybrid view named SALESCOUNTRY_JOIN will be created in the hybrid schema for this join. Hybrid views created by Offload Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Offload Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.

Offload Join Grammar

When offloading joins, each table in the join has its own --offload-join specification, as demonstrated in Example 17 above. The full grammar for this option is provided in Table 3 below.

Table 3: Offload Join Grammar

Each directive and its syntax is listed below with a description.

TABLE([<hybrid_schema>.]<table-name>)
  • The table name to be included in the join

  • This directive is mandatory

  • Can include the hybrid schema name if joining objects from different schemas

ALIAS(<table-alias>)
  • Alias to be used for the join

  • Defaults to <table-name> if not specified

PROJECT(<*|<column>[, ..]>)
  • List of columns selected from <table-name>. * is expanded to all columns

  • Columns in join clauses are added to the list

  • Partition key columns used are added to the list

INNER-JOIN(<join-alias>)
  • Alias to join with. Required for all but the first table

JOIN-CLAUSES(<alias>.<column> = <alias>.<column>[, ..]) or JOIN-CLAUSES(<column>)
  • Clause defining the relationship between <table-name> and <join-alias>

  • The short form JOIN-CLAUSES(<column>) is expanded to the full syntax, assuming <column> exists in both tables

Note

Expansion of * projections excludes the OFFLOAD_BUCKET_ID partition column that Offload automatically creates when offloading to Hadoop-based backends. For performance reasons (such as supporting parallel queries), it might be desirable to include OFFLOAD_BUCKET_ID in the final Hadoop table or view for an Offload Join. This can be achieved by explicitly including it in the PROJECT() clause. For example PROJECT(*,offload_bucket_id) will include all table columns plus the OFFLOAD_BUCKET_ID column. This is not required for offloads to Google BigQuery, Snowflake or Azure Synapse Analytics.

Materialized Offload Joins

By default, the results of an offloaded join are materialized in a backend table with the name provided by the -t option (SH.SALESCOUNTRY_JOIN in Example 17 above). Results do not have to be materialized: when the --no-materialize-join option is used, a view for the join is created in the backend instead of a populated table.
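
For example, repeating Example 17 with the --no-materialize-join option creates a backend view for the join rather than a populated table:

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --no-materialize-join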

Offload Join and Partially-Offloaded Tables

Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Offload Join. When an Offload Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the backend join view that is created in place of a materialized resultset.

Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:

  • Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Offload Join specification allows for different column names

  • Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match

  • Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly

Refreshing Offload Joins

Offload Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded. The guidelines are as follows:

  • If the Offload Join is materialized, the join needs to be refreshed

  • If the Offload Join is not materialized but contains one or more partially-offloaded tables, the join needs to be refreshed

  • If the Offload Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed

Refreshing Materialized Offload Joins

Materialized Offload Joins always need to be refreshed when the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in an Offload Join has been subject to further offloads:

  • If a fully-offloaded table is re-offloaded, the dependent materialized Offload Joins require a full refresh. To do this, the original Offload Join offload command can be repeated with the --reset-backend-table option. Note that this will fully reset and re-populate the materialized Offload Join table with the new results of the join

  • If a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Offload Joins can be incrementally refreshed. To do this, simply repeat the original Offload Join offload command and Offload will append new data to the materialized join table

Refreshing Non-Materialized Offload Joins

Non-materialized Offload Joins only need to be refreshed when the offload boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Offload Join offload command can be repeated but must include the --force or -f option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.
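
As a sketch, assuming the join in Example 17 was created as a non-materialized view, the following command re-creates the backend view with the new boundary predicates after SH.SALES has been further offloaded:

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --no-materialize-join \
      -f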

Updating Offloaded Data

Gluent Data Platform supports the updating of offloaded data in several ways with Incremental Update. See the Incremental Update guide for details.

Offload Type Considerations

As described in the Concepts guide, there are three formats for a hybrid view (see Hybrid View Types) that determine where hybrid queries access their data:

  • 100-0: All offloaded data is accessed from the backend table and none is accessed from the source RDBMS table

  • 90-10: Partially offloaded data is accessed from the backend table and non-offloaded data is accessed from the source RDBMS table

  • 100-10: All data is offloaded but only some of the offloaded data is accessed from the backend table, with the remainder accessed from the source RDBMS table (based on a given boundary)

In most cases, the hybrid view type is created as default for a given offload operation, but the --offload-type option can be used to provide greater flexibility over the initial and subsequent offloads of list-partitioned and range-partitioned tables. See Offload Type for a general description of offload type.

Default Offload Type Behavior

The following describes the default behavior for offload type when offloading a range-partitioned or list-partitioned table.

  • If no --offload-type is specified:

    • Use of any of the partition boundary options offloads the table as INCREMENTAL with a default 90-10 hybrid view

    • An offload without any of the partition boundary options offloads the table as FULL offload type with a default 100-0 hybrid view

  • If the --offload-type option is specified:

    • Specifying --offload-type=INCREMENTAL with or without any of the partition boundary options offloads the table to match the boundary partition and defines the table as an INCREMENTAL offload type with a supporting 90-10 hybrid view

    • Specifying --offload-type=FULL without any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a supporting 100-0 hybrid view

    • Specifying --offload-type=FULL with any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a specialized 100-10 hybrid view

  • If a table has a DEFAULT or MAXVALUE partition:

    • A FULL offload will offload the DEFAULT/MAXVALUE partition

    • An INCREMENTAL offload will not offload any DEFAULT/MAXVALUE partition

  • For subsequent offloads:

    • The offload type of a table is automatically picked up on all future offloads, unless overridden

    • The offload type of a table can be changed on future offloads by specifying a new --offload-type value (see Changing Offload Type below)

Note

Predicate-Based Offload does not support the 100-10 offload type pattern unless it is Late-Arriving Predicate-Based Offload.

Changing Offload Type

It is possible to change the offload type of a table on one or more subsequent offloads of the table (subject to some limitations). As described above, the offload type of a table is set on initial offload and is maintained for all future offloads (unless it is overridden by a new --offload-type value on a later offload).
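
For example, a minimal sketch that converts a previously INCREMENTAL table (name assumed) to FULL on a subsequent offload:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --offload-type=FULL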

The rules and behavior for changing offload type from INCREMENTAL to FULL are as follows:

  • Any partitions not yet offloaded are offloaded (including any DEFAULT/MAXVALUE if applicable), the table is re-defined as a FULL offload type and the hybrid view is re-created accordingly

The rules and behavior for changing offload type from FULL to INCREMENTAL are as follows:

  • If a partition boundary option is used, any corresponding partitions are offloaded, the table is re-defined as an INCREMENTAL offload type and the hybrid view is re-created accordingly

  • If the table has a DEFAULT or MAXVALUE partition, it cannot be converted from FULL to INCREMENTAL

  • If the table has Incremental Update enabled, it cannot be converted from FULL to INCREMENTAL

  • If the table was offloaded with Subpartition-Based Offload and was offloaded as FULL, it cannot be converted from FULL to INCREMENTAL

Offload Data Types

When offloading data, Offload will map data types from the source RDBMS to the backend database automatically. However, as with most operations in Offload, it is possible to override many of the default data type mappings if required. See Tables 4 and 6 below for details.

Supported Data Types

Gluent Data Platform currently supports all of the Oracle Database data types listed in Table 4 below. Oracle Database tables that contain columns of any data type not listed cannot currently be offloaded.

Default Data Type Mappings

Table 4 lists the default data type mappings when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform), Google BigQuery, Snowflake or Azure Synapse Analytics.

Table 4: Offload Default Data Type Mappings (Oracle Database)

Oracle Database          | Impala        | Google BigQuery       | Snowflake     | Azure Synapse Analytics | Comments
CHAR                     | STRING        | STRING                | VARCHAR       | char                    | See Offloading String Data to Azure Synapse Analytics
NCHAR                    | STRING        | STRING                | VARCHAR       | nchar                   |
CLOB                     | STRING        | STRING                | VARCHAR       | varchar(max)            | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics and Offloading String Data to Azure Synapse Analytics
NCLOB                    | STRING        | STRING                | VARCHAR       | nvarchar(max)           | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics
VARCHAR2                 | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading String Data to Azure Synapse Analytics
NVARCHAR2                | STRING        | STRING                | VARCHAR       | nvarchar                |
RAW                      | STRING        | BYTES                 | BINARY        | varbinary               |
BLOB                     | STRING        | BYTES                 | BINARY        | varbinary(max)          | See Offloading LOB Data Types to Snowflake and Azure Synapse Analytics
NUMBER(<=4,0)            | BIGINT        | INT64                 | NUMBER(p,0)   | smallint                | See Offloading Integral Data Types to Impala
NUMBER([5-9],0)          | BIGINT        | INT64                 | NUMBER(p,0)   | int                     | See Offloading Integral Data Types to Impala
NUMBER([10-18],0)        | BIGINT        | INT64                 | NUMBER(p,0)   | bigint                  | See Offloading Integral Data Types to Impala
NUMBER(>18,0)            | DECIMAL(38,0) | NUMERIC or BIGNUMERIC | NUMBER(38,0)  | numeric(38,0)           | See Offloading Numeric Data to Google BigQuery
NUMBER(*,*)              | DECIMAL(38,s) | NUMERIC or BIGNUMERIC | NUMBER(p,s)   | numeric(p,s)            | See Offloading Numeric Data to Google BigQuery
FLOAT                    | DECIMAL       | NUMERIC               | NUMBER(p,s)   | numeric(38,18)          |
BINARY_FLOAT             | FLOAT         | -                     | -             | real                    | See Floating Point Data Types in Google BigQuery and Snowflake and Floating Point Data Types in Azure Synapse Analytics
BINARY_DOUBLE            | DOUBLE        | FLOAT64               | FLOAT         | float                   | See Floating Point Data Types in Azure Synapse Analytics
DATE                     | TIMESTAMP     | DATETIME              | TIMESTAMP_NTZ | datetime2               |
TIMESTAMP                | TIMESTAMP     | DATETIME              | TIMESTAMP_NTZ | datetime2               | See Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics
TIMESTAMP WITH TIME ZONE | TIMESTAMP     | TIMESTAMP             | TIMESTAMP_TZ  | datetimeoffset          | See Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics and Offloading Time Zoned Data
INTERVAL DAY TO SECOND   | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading Interval Data Types
INTERVAL YEAR TO MONTH   | STRING        | STRING                | VARCHAR       | varchar                 | See Offloading Interval Data Types

Data Sampling During Offload

Gluent Offload Engine fine-tunes the default data mappings by sampling data for all columns in the source RDBMS table that are either date or timestamp-based or defined as a number without a precision and scale.

By sampling numeric columns, Offload can choose an appropriate precision and scale for Impala DECIMAL data and this can lead to performance improvements due to reduced CPU consumption when querying. This behavior can be disabled by adding the --no-auto-detect-numbers option to the offload command.

The Impala TIMESTAMP data type supports a minimum date of 1400-01-01. Offload samples source date and timestamp-based columns to ensure that the data to be offloaded satisfies the Impala specification. In cases where Offload discovers data below the minimum specification, the column is offloaded to an Impala STRING column. This behavior can be disabled with the --no-auto-detect-dates option, but in this case the source data needs to be corrected before it can be offloaded.

Offload bases the volume of data to be sampled on the size of the RDBMS table. This can be overridden by adding the --data-sample-percent option to the offload command (specifying a percentage between 0 and 100, where 0 disables sampling altogether). Without sampling, all Impala decimal columns are defaulted to DECIMAL(38,18) and no date or timestamp checking is performed.

Offload determines the degree of parallelism to use when sampling data from the value of DATA_SAMPLE_PARALLELISM. This can be overridden by adding the --data-sample-parallelism option to the offload command (specifying a degree of 0 or a positive integer, where 0 disables parallelism).
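
For example, an illustrative command that samples 25% of the source table using a degree of parallelism of 4 (the values are arbitrary and the table name is assumed):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --data-sample-percent=25 \
      --data-sample-parallelism=4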

Offloading Not Null Columns to Google BigQuery, Snowflake and Azure Synapse Analytics

Google BigQuery, Snowflake and Azure Synapse Analytics allow columns to be defined as mandatory (i.e. NOT NULL). When offloading a table to one of these systems for the first time, by default Offload will copy mandatory column definitions from the RDBMS to the backend. This behavior is defined globally by the OFFLOAD_NOT_NULL_PROPAGATION configuration option, which defaults to AUTO (i.e. propagate all NOT NULL constraints from the RDBMS to the backend). To avoid propagating any NOT NULL constraints, this option can be set to NONE.

Offload only considers columns to be mandatory if they are defined by Oracle Database as not nullable (DBA_TAB_COLUMNS.NULLABLE = 'NO'). The following NOT NULL Oracle Database columns are not automatically propagated by Offload:

  • Columns with user-defined check constraints rather than a NOT NULL definition (e.g. CHECK (column_name IS NOT NULL))

  • Columns defined with NOT NULL ENABLE NOVALIDATE

To override the global OFFLOAD_NOT_NULL_PROPAGATION configuration value, or to include columns with mandatory constraints that are not automatically propagated, the --not-null-columns option can be added to an Offload command. This option accepts a list of one or more valid columns, one or more wildcards, or a combination of the two. Offload will define all specified columns as NOT NULL when creating the backend table, regardless of their status in the RDBMS.
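
For example, a hypothetical sketch (column names assumed) that forces two columns to be created as NOT NULL in the backend table regardless of their RDBMS definition:

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x --not-null-columns=CUST_ID,CUST_EMAIL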

Care should be taken when propagating constraints for columns that cannot be guaranteed to contain no NULLs. During validation of the staging data, Offload will specifically check for NULLs for any column that is defined as mandatory in the backend.

Offloading Integral Data Types to Impala

When offloading a table to Impala, integer columns (i.e. NUMBER(p,0)) with a precision in the range of an Impala BIGINT data type are automatically stored as BIGINT. No attempt is made by Offload to downscale the Impala integral type in case the data precision grows beyond its type specification in the future. Offload safely defaults to BIGINT because schema evolution of smaller integrals to BIGINT cannot be actioned by a simple ALTER TABLE statement in Impala (meaning a more complex data migration would be required to upscale the integral type if needed). However, it is possible to override this default behavior and users confident in a column’s future contents (for example, day of month or year) can override the mapping and safely utilize smaller integral types in Impala. See Override Data Type Mappings for details.

Source integer columns with a precision beyond the range of an Impala BIGINT are automatically stored as DECIMAL(38,0).

Offloading Numeric Data to Google BigQuery

Google BigQuery provides two decimal data types: NUMERIC and BIGNUMERIC. The NUMERIC data type has a specification of (38,9) with a fixed decimal point, meaning a maximum of 29 digits to the left of the decimal point and a maximum of 9 digits to the right. The BIGNUMERIC data type has a specification of (76,38) with a fixed decimal point, meaning a maximum of 38 digits to the left of the decimal point and a maximum of 38 digits to the right (more precisely, the specification of BIGNUMERIC is (76.76,38), allowing for some numbers with 39 digits to the left of the decimal point).

When offloading numeric data such as decimals or large integrals to Google BigQuery, Offload determines which BigQuery type is most appropriate, based on either the known precision and scale of Oracle columns of type NUMBER(p,s) or from sampled data for Oracle columns of unbounded type NUMBER. Data that offloads to NUMERIC by default can be offloaded to BIGNUMERIC with the --decimal-columns and --decimal-columns-type override options (see Table 6: Offload Override Data Type Mappings (Oracle Database) below).
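
As an illustration, a NUMBER column that would offload to NUMERIC by default can be forced to BIGNUMERIC by requesting a precision/scale beyond the NUMERIC specification, for example a scale greater than 9 (the column name and type specification below are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --decimal-columns-type=38,18 --decimal-columns=AMOUNT_SOLD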

For numeric data that exceeds the specifications of NUMERIC and BIGNUMERIC, Offload offers two options:

  1. Use decimal rounding options during offload (see Decimal Scale Rounding below)

  2. Offload numeric data to a floating point type (see Converting Numeric Data to Double below)

Important

Both of these options result in some data change and it will be for users to decide if this is a valid course of action.

Offloading String Data to Azure Synapse Analytics

At the time of writing, the Azure Synapse Analytics char and varchar data types do not support UTF-8 or Unicode. Depending on the source Oracle Database character set and the string data to be offloaded, it is possible that not all CHAR, VARCHAR2 or CLOB data can be offloaded to the equivalent Synapse data types without data loss. In such cases, CHAR, VARCHAR2 and CLOB columns will need to be offloaded to nchar or nvarchar as applicable, using the --unicode-string-columns override option.
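
For example, assuming a table with a VARCHAR2 column containing multi-byte string data, the column could be offloaded to nvarchar with a command of the following form (the table and column names are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --unicode-string-columns=CUST_LAST_NAME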

When offloading string data to Azure Synapse Analytics, it is also important to consider database collations, to ensure that the behavior of the Oracle Database collation (determined by NLS_COMP and NLS_SORT) matches the behavior of the default collation used by the Synapse dedicated SQL pool. The SYNAPSE_COLLATION configuration variable can be used to ensure that all string columns are offloaded and queried by Gluent Data Platform with a collation that matches that of the source RDBMS. By default, this is set to Latin1_General_100_BIN2, which matches the behavior of an Oracle Database NLS_COMP and NLS_SORT setting of BINARY.

Floating Point Data Types in Google BigQuery and Snowflake

Google BigQuery and Snowflake both provide a single 64-bit floating point data type (FLOAT64 and FLOAT, respectively). This means that the Oracle Database 32-bit BINARY_FLOAT data type does not have a corresponding type in BigQuery or Snowflake and cannot currently be offloaded. Support for an override mapping to the 64-bit floating point data type will be added in a future release of Gluent Data Platform (although this will be potentially lossy - see Lossy Data Operations below).

Floating Point Data Types in Azure Synapse Analytics

Azure Synapse Analytics’ real and float floating point data types do not support NaN or Infinity special values. When offloading BINARY_FLOAT or BINARY_DOUBLE, special values must be converted to NULL to allow the data to load into Synapse. For this reason, Offload will not allow BINARY_FLOAT or BINARY_DOUBLE columns to be offloaded unless an override option is provided (see Converting Floating Point Special Values below).

Important

Allowing floating point columns to be offloaded to Azure Synapse Analytics is potentially lossy (see Lossy Data Operations below for the implications of this).

Offloading High-Precision Timestamp Data to Google BigQuery and Azure Synapse Analytics

At the time of writing, Google BigQuery’s DATETIME and TIMESTAMP data types both support microsecond precision (i.e. a precision of 6) and Azure Synapse Analytics datetime2 and datetimeoffset data types support a maximum precision of 7. Oracle Database supports up to nanosecond precision (i.e. a precision of 9) with its TIMESTAMP and TIMESTAMP WITH TIME ZONE data types. When Offload detects a source column with a higher level of precision than that supported by the backend (e.g. when attempting to offload TIMESTAMP(7) to BigQuery or TIMESTAMP(8) to Synapse), it will terminate. In such cases, the --allow-nanosecond-timestamp-columns option will allow the offload operation to continue, but should ideally be used only when the data to be offloaded is known to be at microsecond precision or lower.

Important

Allowing high-precision timestamp columns to be offloaded to Google BigQuery or Azure Synapse Analytics is potentially lossy (see Lossy Data Operations below for the implications of this).
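
For example, assuming a TIMESTAMP(9) column is known to contain no values below microsecond precision, the offload could be allowed to continue with a command of the following form (the table name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-nanosecond-timestamp-columns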

Offloading LOB Data Types to Snowflake and Azure Synapse Analytics

Snowflake restricts the size of string data to a maximum of 16MB and binary data to a maximum of 8MB. Oracle LOB data that exceeds the limits of the corresponding Snowflake data type cannot currently be offloaded.

Azure Synapse Analytics tables support LOBs up to 2GB in size with its varchar(max) and varbinary(max) specification. However, the Microsoft PolyBase engine that is invoked when loading external staging data into the target Synapse table has a 1,000,000 byte limit per row. Given that PolyBase treats each character as 2 bytes, this means that offloaded rows with LOBs are currently restricted to under 500KB in size.

Offloading Time Zoned Data

For Oracle Database columns of TIMESTAMP WITH TIME ZONE data type, data is normalized to Coordinated Universal Time (UTC) during offload to Hadoop-based backends and to Google BigQuery. When offloading to Snowflake or Azure Synapse Analytics, data is stored with the time zone offset of the source Oracle data. Offload converts named time zones, e.g. 2016-01-14 22:39:44 US/Pacific, to time zone offsets, e.g. 2016-01-14 10:39:44 -8:00 to ensure backend systems can process the values regardless of their time zone database edition. When returning offloaded data for hybrid queries, Smart Connector normalizes all data to UTC, regardless of the backend. For this reason, values may appear different to the eye although they are temporally the same as those in the source RDBMS data.

Offloading Interval Data Types

Oracle Database INTERVAL DAY TO SECOND and INTERVAL YEAR TO MONTH data types are offloaded as strings in a normalized format, as shown in Table 5 below.

Table 5: Interval Data Type String Formats

Interval Type | Sample Source Data | Impala String Format | Google BigQuery/Snowflake/Azure Synapse Analytics String Format
Day to second (+ve) | +000000001 02:15:30.123456000 | 1 02:15:30.123456 | +1 02:15:30.123456
Day to second (-ve) | -000000001 02:15:30.123456000 | -1 02:15:30.123456 | -1 02:15:30.123456
Year to month (+ve) | +000000001-01 | 1-1 | +1-1
Year to month (-ve) | -000000001-01 | -1-1 | -1-1

Smart Connector converts offloaded interval strings back to the correct interval type when executing hybrid queries from the RDBMS. Interval strings can also be presented back to Oracle Database interval types by using the appropriate Present option as required (see the Present guide for further details).

Override Data Type Mappings

When initially offloading a table, it is possible to override some of the default data type mappings to change the specification of the backend table. Table 6 lists the override mappings and associated options that are available when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform), Google BigQuery, Snowflake or Azure Synapse Analytics.

Table 6: Offload Override Data Type Mappings (Oracle Database)

Oracle Database | Offload Option | Impala | BigQuery | Snowflake | Synapse | Comments
CHAR | --unicode-string-columns | - | - | - | nchar | See Offloading String Data to Azure Synapse Analytics
VARCHAR2 | --unicode-string-columns | - | - | - | nvarchar | See Offloading String Data to Azure Synapse Analytics
CLOB | --unicode-string-columns | - | - | - | nvarchar(max) | See Offloading String Data to Azure Synapse Analytics
DATE | --date-columns | DATE | DATE | DATE | date | Impala versions below 3.3 will use TIMESTAMP
DATE | --timestamp-tz-columns | TIMESTAMP | TIMESTAMP | TIMESTAMP_TZ | datetimeoffset
DATE | --variable-string-columns | STRING | STRING | VARCHAR | varchar
TIMESTAMP | --date-columns | DATE | DATE | DATE | date | Impala versions below 3.3 will use TIMESTAMP
TIMESTAMP | --timestamp-tz-columns | TIMESTAMP | TIMESTAMP | TIMESTAMP_TZ | datetimeoffset
TIMESTAMP | --variable-string-columns | STRING | STRING | VARCHAR | varchar
NUMBER | --integer-1-columns | TINYINT | INT64 | NUMBER(3,0) | smallint | Use for NUMBER([1-2],0)
NUMBER | --integer-2-columns | SMALLINT | INT64 | NUMBER(5,0) | smallint | Use for NUMBER([3-4],0)
NUMBER | --integer-4-columns | INT | INT64 | NUMBER(10,0) | int | Use for NUMBER([5-9],0)
NUMBER | --integer-8-columns | BIGINT | INT64 | NUMBER(19,0) | bigint | Use for NUMBER([10-18],0)
NUMBER | --integer-38-columns | DECIMAL(38,0) | INT64 | NUMBER(38,0) | numeric(38,0) | Use for NUMBER(>18,0)
NUMBER | --decimal-columns | DECIMAL(p,s) | NUMERIC or BIGNUMERIC | NUMBER(p,s) | numeric(p,s) | Use for NUMBER(p,s)
NUMBER | --double-columns | DOUBLE | FLOAT64 | FLOAT | float | See Converting Numeric Data to Double
FLOAT | --double-columns | DOUBLE | FLOAT64 | FLOAT | float | See Converting Numeric Data to Double
BINARY_FLOAT | --double-columns | DOUBLE | - | - | float | See Converting Numeric Data to Double

Example 18: Overriding Data Types During an Offload

In the following example, the SH.SALES table is offloaded with several data type overrides.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --date-columns=TIME_ID \
      --integer-4-columns=PROD_ID,CHANNEL_ID \
      --decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD

Each override can be given one or more columns. The --decimal-columns override can be used to fine-tune the precision and scale of the Impala DECIMAL data type (which can lead to better space utilization and reduced CPU consumption) or the precision and scale of the Snowflake NUMBER data type. It can also be used to choose BigQuery BIGNUMERIC over NUMERIC (or vice-versa if required). This override option behaves differently from the others in that it must be paired with a --decimal-columns-type option to define the precision and scale for the columns listed. Multiple column/type pairings can be provided in the same command (see --decimal-columns for further details).

Data type override options support use of the wildcard character *. This is useful when a logical model has column naming rules indicating the data they contain. The example above could use --decimal-columns=AMOUNT* to match multiple columns containing monetary values.
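
For example, a hypothetical schema in which all monetary columns are prefixed with AMOUNT could pair a single type specification with a wildcard, along the following lines (the values are illustrative only; the wildcard is quoted to prevent shell expansion):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --decimal-columns-type=18,4 --decimal-columns='AMOUNT*'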

Oracle Database number precision is specified in terms of the number of digits. Impala number precision is specified by the data type. Care should be taken if selecting an override target data type. The DECIMAL data type can be used as a catch-all numeric data type in Impala but users are encouraged to consider native integral data types for reasons of performance and space utilization.

Managing Exceptional Data

No two systems align perfectly in their data type specifications and implementations. There are several cases where data in the source Oracle Database table might not be easily offloaded to the backend system. Some of these cases have already been described above and include:

  • Source RDBMS numbers with a precision greater than that supported by the backend (for example, the 38 digit limit of Impala DECIMAL, BigQuery BIGNUMERIC or Snowflake NUMBER)

  • Source RDBMS numbers with a scale greater than that supported by the backend (for example, Snowflake’s NUMBER scale limit of 37 digits)

  • Source RDBMS dates or timestamps with a value below Impala’s lower bound of 1400-01-01

  • Source RDBMS timestamps with a higher level of precision than BigQuery’s microsecond limit

  • Source RDBMS 32-bit floating point numbers offloading to a backend system without a 32-bit floating point data type (e.g. BigQuery)

  • Source RDBMS floating point numbers with special NaN or Infinity values that are not supported by the backend system (e.g. Synapse)

  • Source RDBMS large objects (LOBs) with data that exceeds the limit of the corresponding backend data type (e.g. Snowflake’s 16MB VARCHAR limit and 8MB BINARY limit)

When exceptional data is detected, Offload will in most cases terminate with an exception (and in some cases provide advice on how to proceed). The one exception to this rule is the offload of out-of-range dates and timestamps to Impala, which are automatically offloaded to a lossless string data type (unless this behavior is disabled by the user as described earlier).

Note

Offload will never attempt to automatically offload exceptional data when it would cause a potential data loss to do so. In all cases, the user must provide an explicit course of action by providing the appropriate options with the offload command.

Decimal Scale Rounding

The Oracle Database NUMBER data type is extremely flexible and caters for high-scale decimal data that exceeds the limits available in some target backend systems (such as Snowflake). To offload decimal data that exceeds the specification of the backend system, the --allow-decimal-scale-rounding option must be used in conjunction with the --decimal-columns-type and --decimal-columns options to define the target type for the data. By using the rounding option, the user acknowledges that some data loss due to decimal rounding is likely and acceptable.
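
For example, to acknowledge rounding and offload a high-scale decimal column to a target specification of (38,18), options along these lines could be combined (the column name and type specification are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-decimal-scale-rounding \
      --decimal-columns-type=38,18 --decimal-columns=AMOUNT_SOLD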

Converting Numeric Data to Double

Using the --double-columns override option to offload high-precision/scale numeric data or 32-bit floating point data to a 64-bit double can lead to data loss and should only be considered when there is no alternative and the consequences are both understood and acceptable.

Converting Floating Point Special Values

Not all backend systems support floating point special values such as NaN and Infinity. For such systems, Offload will not allow BINARY_FLOAT or BINARY_DOUBLE columns to be offloaded without the --allow-floating-point-conversions override option. Using this option will enable Offload to convert special floating point values to NULL. This option currently applies only to Azure Synapse Analytics.
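
For example, a table containing BINARY_FLOAT or BINARY_DOUBLE data could be offloaded to Azure Synapse Analytics, with any NaN or Infinity values converted to NULL, using a command of the following form (the table name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --allow-floating-point-conversions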

Lossy Data Operations

To summarize, the potentially-lossy offload operations and their corresponding options are as follows:

  • Rounding decimals: Using the --allow-decimal-scale-rounding option to round decimals that exceed the backend specification for decimal data

  • Offloading sub-microsecond timestamps: Using the --allow-nanosecond-timestamp-columns option to offload columns with a sub-microsecond specification to a backend system that doesn’t support the same level of precision

  • Converting data to a 64-bit floating point number: Using the --double-columns option to offload decimal or 32-bit floating-point data that would not otherwise be possible

  • Offloading special floating point values (only required for Azure Synapse Analytics): Using the --allow-floating-point-conversions option will allow Offload to convert NaN and Infinity values to NULL

Important

Enabling lossy offloads can cause wrong results with some hybrid queries when data loss has occurred, such as when Smart Connector pushes down an equality or inequality predicate containing values from the lossy column(s). It is not advised to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins unless the offload is known to have been lossless in practice.

Offload Transport

Offload Transport describes the part of the offload process that:

  • Copies the source RDBMS data to a staging location

  • Validates the staged data

  • Loads the staged data into the backend target table

The two-phased approach of staging and loading data provides two main benefits:

  1. Atomicity: The load from staged data to target table either fully succeeds or completely fails

  2. Partitioning: For backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

The transport phase of an offload is split into four main operations:

Transport Data to Staging

Gluent Offload Engine uses one of a number of tools to extract data from the source RDBMS. The appropriate tool will be chosen automatically based on configuration preferences. Each tool reads the source data to offload and attempts to split the source data equally and in a non-overlapping way between concurrent reader processes.

Data is staged in either Avro or Parquet format (depending on the backend platform) and native data types are used where possible. For Avro staging files, STRING is used where there is no identical match between an Avro data type and the source RDBMS data type. For example, Oracle Database DATE, TIMESTAMP and NUMBER data types have no direct equivalent in Avro; therefore data is staged as STRING and converted to the appropriate backend data type during the final loading phase.

The data extraction tools available to Offload are:

Apache Sqoop

Apache Sqoop is only considered for use when offloading to Hadoop-based backends. Offload will invoke Sqoop on the host defined by OFFLOAD_TRANSPORT_CMD_HOST. Depending on the source RDBMS table, Sqoop will be used in either its standard table-centric mode or using a free-form query defined by Offload. The number of Sqoop mappers used when extracting data is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. The SQOOP_QUEUE_NAME parameter can be defined to submit the Sqoop job to a specific YARN queue.

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source table requires tuning of the following:

  • --offload-transport-fetch-size can be used on a per-offload basis to override OFFLOAD_TRANSPORT_FETCH_SIZE. This can be useful to manage memory requirements for tables with a large row size (such as when offloading LOBs), by reducing the fetch size (a combined command sketch is shown after this list)

  • SQOOP_OVERRIDES or the per-offload option --offload-transport-jvm-overrides can be used to inject JVM parameters into the Sqoop command line. This can be useful to override JVM memory defaults, for example, --offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"
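
A sketch of a command combining these per-offload Sqoop settings is shown below; the values are illustrative only and should be derived from the characteristics of the source table and cluster:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-parallelism=4 \
      --offload-transport-fetch-size=1000 \
      --offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"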

Apache Spark

Three interfaces to Spark are supported by Offload: Spark Submit, Spark Thrift Server and Apache Livy. The interface used by Offload is chosen automatically, based on configuration. If multiple interfaces are configured for use, the order of priority is Spark Thrift Server, Apache Livy and Spark Submit. Spark SQL is used in all cases (via a PySpark script for Apache Livy and Spark Submit, or as pure SQL for Spark Thrift Server) to extract the data to be offloaded from the source RDBMS table.

When offloading to Hadoop-based backends it is expected that the Spark cluster and interfaces are configured components of the target Hadoop cluster. When offloading to cloud warehouses such as Google BigQuery, Snowflake or Azure Synapse Analytics, it is typical to use Spark Standalone (Gluent Data Platform includes a Transport package containing Spark Standalone components for this purpose), although an existing Spark cluster can be utilized if available.

Spark Submit is available for use by Offload if OFFLOAD_TRANSPORT_CMD_HOST and OFFLOAD_TRANSPORT_SPARK_SUBMIT_EXECUTABLE are defined. When using a Spark Standalone cluster, jobs will be submitted to the cluster defined in OFFLOAD_TRANSPORT_SPARK_SUBMIT_MASTER_URL. In a Hadoop environment, YARN is used as the Spark master. OFFLOAD_TRANSPORT_SPARK_QUEUE_NAME or --offload-transport-queue-name can be used to submit the Spark job to a specific YARN queue.

Spark Thrift Server is available for use by Offload if OFFLOAD_TRANSPORT_SPARK_THRIFT_HOST and OFFLOAD_TRANSPORT_SPARK_THRIFT_PORT are defined. The Spark Thrift server can be configured to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible.

Apache Livy is available for use by Offload when offloading to Hadoop-based backends if OFFLOAD_TRANSPORT_LIVY_API_URL is defined. Livy can be configured (using OFFLOAD_TRANSPORT_LIVY_IDLE_SESSION_TIMEOUT) to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible. The maximum number of Livy sessions that Offload will start and use concurrently is defined by OFFLOAD_TRANSPORT_LIVY_MAX_SESSIONS.

The number of tasks in an offload transport Spark job is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. It should be noted that defining more tasks than there are available Spark executors will result in queuing. Therefore, OFFLOAD_TRANSPORT_PARALLELISM should ideally be no more than the number of available executors.
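
For example, to run an offload with eight Spark tasks and submit the job to a specific YARN queue, options of the following form could be used (the parallelism value and queue name are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-parallelism=8 \
      --offload-transport-queue-name=offload_queue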

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source RDBMS table requires tuning of the following:

Query Import

Query Import is used for low-volume offloads. Data is extracted and staged by Offload itself rather than by an external tool. This avoids delays incurred when invoking Sqoop or Spark Submit. Non-partitioned tables with a source RDBMS size smaller than --offload-transport-small-table-threshold are eligible for Query Import. Query Import does not run in parallel, i.e. OFFLOAD_TRANSPORT_PARALLELISM is ignored.

RDBMS Options

Connections to the source RDBMS are made to the address defined in OFFLOAD_TRANSPORT_DSN. This defaults to ORA_CONN but can be overridden if data needs to be extracted from an alternative address (such as from a replica database).

Concurrent offload transport processes open independent sessions in the source RDBMS. If a high value for OFFLOAD_TRANSPORT_PARALLELISM is required then consideration should be given to any session limits in the RDBMS. The OFFLOAD_TRANSPORT_CONSISTENT_READ parameter or the per offload option --offload-transport-consistent-read can be used to ensure that all concurrent extraction queries reference a specific point in time. When set to true, extraction queries will include an AS OF SCN clause. When the source RDBMS table or input partitions are known to be cold (i.e. not subject to any data modifications), this can be set to false to reduce resource consumption.
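
For example, assuming the partitions being offloaded are known to be cold, consistent read could be disabled for a single offload with a command of the following form (assuming the option accepts the same true/false values as its configuration counterpart):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-transport-consistent-read=false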

If there is a requirement to modify offload transport RDBMS session settings, such as setting an Oracle Database initialization parameter (under the guidance of Gluent Support), then modification of OFFLOAD_TRANSPORT_RDBMS_SESSION_PARAMETERS will be required.

Validate Staged Data

Once data has been staged, it is validated to ensure that the number of staged rows matches the number of rows read from the source RDBMS. While doing this Offload might also:

  • Check for NULL values in any custom partition scheme defined using the --partition-columns option. A positive match results in a warning

  • Check for NULL values in any column defined as mandatory (i.e. NOT NULL)

  • Check for any NaN (Not a Number) special values if the source table has floating point numeric data types. This is because source RDBMS and target backend systems do not necessarily treat NaN values consistently. A positive match results in a warning

  • Check for lossy decimal rounding as described in Decimal Scale Rounding

  • Check for source data that is invalid for the chosen backend partition scheme (if applicable), such as numeric data outside of any --partition-lower-value/--partition-upper-value range. A positive match results in a warning

Validate Type Conversions

Data types used for staging data will rarely match those of the backend target table. Data is converted to the correct type when it is loaded into the final target table. This stage therefore verifies that there will be no invalid conversions when loading. While this is a duplication of type conversions in the Load Staged Data phase, it provides the advantage of checking the data before the more compute-intensive data load and is able to report all columns with data issues in a single pass.

Example 19: Catching Invalid Data Type Conversions

In the following example, the SH.SALES table is offloaded to a Hadoop cluster with an invalid data type override for two columns: the data in the PROD_ID and CUST_ID columns is not compatible with the user-requested single-byte integer data type.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --integer-1-columns=PROD_ID,CUST_ID

This results in the following exception:

CAST() of load data will cause data loss due to lack of precision in target data type in 6887232 rows
Failing casts are:
    (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
    (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
The SQL below will assist identification of problem data:
SELECT PROD_ID
,      CUST_ID
FROM   `sh_load`.`sales`
WHERE  (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
OR     (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
LIMIT 50

The exception provides three important pieces of information:

  • The number of rows with issues

  • The columns/conversions with issues

  • A SQL statement to use offline to review a sample of the problematic data

Load Staged Data

In this phase of an offload, staged data is converted to the correct data types as described in Validate Type Conversions and inserted into the target backend table. Where available, performance metrics from the backend system are recorded in the Offload log file written to $OFFLOAD_HOME/log.

Offload Transport Chunks

The source RDBMS data is offloaded in “chunks”. A chunk can comprise an entire table or a set of one or more partitions. The sequence of four operations described above is executed for each chunk until the whole input set has been offloaded. A final task in this phase, for backend systems that support it, is to update statistics on the freshly offloaded data.

For partitioned RDBMS tables, data is offloaded in sets of partitions. Input partitions are grouped into chunks, based on either the number of partitions or cumulative partition size (including subpartitions where relevant). For non-partitioned tables the table itself is considered a single chunk.

The size limit for a partition chunk is defined by MAX_OFFLOAD_CHUNK_SIZE or per offload with the --max-offload-chunk-size option. The maximum number of partitions in a chunk is defined by MAX_OFFLOAD_CHUNK_COUNT or per offload with the --max-offload-chunk-count option. Both size and count thresholds are active at the same time, therefore the first threshold to be breached is the one that closes the chunk. If the first partition in a chunk breaches the size threshold then it will be included for offload and that chunk will be greater in size than the configured threshold.

Sets of partitions are an effective way of managing resource consumption. For example, if a source RDBMS typically has partitions of 8GB in size and the backend system can comfortably load 64GB of data at a time without impacting other users, MAX_OFFLOAD_CHUNK_SIZE can be set to 64G and roughly 8 partitions will be offloaded per chunk.
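
Continuing the example above, the chunk thresholds could also be set for a single offload with a command of the following form (assuming the options accept the same value formats as their configuration counterparts; the values are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --max-offload-chunk-size=64G \
      --max-offload-chunk-count=8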

When calculating the cumulative size of partitions, Offload uses RDBMS segment sizes. This is important to note because compressed data in the RDBMS may increase in size when extracted and staged.

Partitions are considered for chunking in the logical order they are defined in the source RDBMS. For example, partitions from an Oracle Database range-partitioned table are split into chunks while maintaining the order of oldest to most recent partition. Offload does not attempt to optimize chunking by shuffling the order of partitions because that would offload partitions out of sequence and be a risk to atomicity.

Data Storage Options

Gluent Data Platform supports several storage schemes or services for offloaded data, listed in Table 7 below.

Table 7: Offload Storage Options

Storage Scheme or Service | Offload Staging Data | Offloaded Data | OFFLOAD_FS_SCHEME
HDFS | Hadoop only | Hadoop only | hdfs
Amazon Simple Storage Service (S3) | Snowflake only | Hadoop only | s3a
Microsoft Azure Data Lake Storage Generation 1 | - | Hadoop only | adl
Microsoft Azure Data Lake Storage Generation 2 | - | Hadoop only (Impala >= 6.2) | abfs, abfss
Microsoft Azure Blob Storage | Snowflake and Synapse only | - | wasb, wasbs, abfs, abfss
Google Cloud Storage (GCS) | BigQuery and Snowflake only | - | gcs

Storage preferences such as base HDFS file locations and/or cloud storage buckets can be specified as Offload command-line options, but it is more common to configure them once in the Gluent Data Platform environment file.

Gluent Offload Engine can offload data to cloud storage and requires a small amount of configuration, as follows:

When offloading to Hadoop-based backends:

  • The two-phase loading process stages data in an Avro table before persisting it to the Hadoop database

  • A value of inherit can be used for OFFLOAD_FS_SCHEME. With this value all tables created by Offload will inherit the location from the parent database. If databases are created using the --create-backend-db option, a default location of HDFS will be used

  • Specifying a storage scheme in cloud storage will persist the offloaded data to that location, but data must first be staged in HDFS as part of the two-phase loading process

  • Offloaded data is compressed. The default compression can be overridden with the --storage-compression option

  • The default file format for all persisted data is Parquet, regardless of the staging storage scheme

When offloading to Google BigQuery:

  • The two-phase loading process stages data as Avro file(s) in Google Cloud Storage before persisting it to Google BigQuery

  • Google BigQuery has its own proprietary storage format for the persisted data

When offloading to Snowflake:

  • The two-phase loading process stages data as Parquet file(s) in the configured cloud storage system before persisting it to Snowflake. The storage format can be overridden to use Avro with OFFLOAD_STAGING_FORMAT, but the default of Parquet is strongly recommended

  • Snowflake has its own proprietary storage format for the persisted data

When offloading to Azure Synapse Analytics:

  • The two-phase loading process stages data as Parquet file(s) in the configured cloud storage system before persisting it to Synapse

  • Azure Synapse Analytics has its own proprietary storage format for the persisted data

Regardless of the storage provider:

Managing Data Distribution

In addition to the variety of cloud storage options supported by Gluent Data Platform, Offload also provides options to manage the offloaded data distribution:

Managing Backend Partitioning

When offloading data, backend tables can be partitioned in several ways, depending on the backend platform and user preferences:

Note

This section does not apply to Snowflake or Azure Synapse Analytics.

Surrogate Partitioning for Bucketing

When offloading to Hadoop-based backends, Offload always includes a surrogate partition key named OFFLOAD_BUCKET_ID, in addition to any other partitioning scheme that might be defined for the table. This surrogate partition key is used internally by Gluent Data Platform to enable Smart Connector to scan backend data in parallel when executing a parallel hybrid query. The number of partition key values defaults to the DEFAULT_BUCKETS configuration but can be overridden with the --num-buckets option.

A hash function is used by Offload to distribute data across the OFFLOAD_BUCKET_ID partitions. Offload will try to pick the most selective column to distribute data evenly across the surrogate partitions, but in the absence of a primary key or column statistics for the source RDBMS table, it will default to the first table column. Users can override the hashing column by including the --bucket-hash-column option with the initial offload command.
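
For example, an initial offload could specify both the number of buckets and the hash column with a command of the following form (the bucket count and column name are illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --num-buckets=16 \
      --bucket-hash-column=CUST_ID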

The surrogate partition column is not required when offloading to Google BigQuery.

Inherited Partitioning

Tables offloaded with Partition-Based Offload or Subpartition-Based Offload are automatically partitioned in the backend with the (sub)partition key of the source table, unless overridden by the user. The granularity of the inherited (sub)partitions can be different from the source RDBMS (sub)partitions if required. In some cases it is mandatory to specify the granularity of the backend partition scheme (see Partition Granularity below for details).

For Hadoop-based backends, the inherited partition key can comprise multiple columns and is used in addition to the surrogate OFFLOAD_BUCKET_ID partition key described above. Backend partition key columns are always implemented by the addition of synthetic partition key columns. See Synthetic Partitioning below for details.

For Google BigQuery, only the leading (sub)partition key column will be used. In some cases, the backend partitioning might be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.

User-Defined Partitioning

Tables can be offloaded with the --partition-columns option to define a custom partitioning scheme for the backend table. This enables non-partitioned RDBMS tables to be partitioned in the backend if required, in addition to allowing (sub)partitioned RDBMS tables to be offloaded with a different partitioning scheme in the backend. It is also possible (and in some cases mandatory) to specify the granularity of the backend partitions (see Partition Granularity below for details). User-defined partitioning also supports more RDBMS data types than inherited partitioning.

For Hadoop-based backends, the user-defined partition key can comprise multiple columns and is added to the surrogate OFFLOAD_BUCKET_ID partition key described above. The following example offloads a partitioned table to a Hadoop backend and subpartitions the offloaded data by year (derived from a date) and product ID (in steps of 1000).

Example 20: Offload with User-Defined Partitioning
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-columns=TIME_ID,PROD_ID --partition-granularity=Y,1000

When offloading to a Hadoop-based backend, user-defined partitioning can include date/timestamp, numeric or string columns.

When offloading to Google BigQuery, user-defined partitioning can include date/timestamp, numeric or string columns, but only one partition key column can be defined. In some cases, the backend partitioning will be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.

Synthetic Partitioning

Synthetic columns are used to partition backend tables instead of the corresponding natural columns. Depending on the data type and backend system, Offload will sometimes implement inherited and user-defined partitioning schemes with additional synthetic columns (one per source (sub)partition column). This is usually when:

  • The backend system has no native partitioning support for the data type of the partition key column

  • The backend system has no native partitioning support for the change in granularity requested for the offloaded partitions

Synthetic partition keys are used internally by Gluent Query Engine to generate partition-pruning predicates in hybrid queries and by Offload to ensure that the backend partition columns remain consistent across multiple Partition-Based Offload or Subpartition-Based Offload operations.

Synthetic partition key columns are named by Gluent Offload Engine as a derivative of the corresponding source column name (e.g. GL_PART_M_TIME_ID or GL_PART_U0_SOURCE_CODE).

When a table is offloaded with partitioning to a Hadoop-based backend, synthetic partition column(s) are always generated by Offload. The resulting synthetic column(s) are always of type STRING.

When a table is offloaded with partitioning to Google BigQuery, a synthetic partition key will only be generated when the natural partition column is of a NUMERIC, BIGNUMERIC or STRING BigQuery data type, and the resulting synthetic partition key column will be created as an INT64 type (the integral magnitude of the source numeric data must not exceed the INT64 limits). For STRING columns, or for extreme [BIG]NUMERIC data that cannot be reduced to INT64 values with --partition-granularity (i.e. the granularity itself would need to exceed INT64 limits), a custom user-defined function (UDF) must be created and used to enable Gluent Offload Engine to create an INT64 synthetic partition key representation of the source data (see Partition Functions). Native BigQuery partitioning will be used when the natural partition column has a data type of INT64, DATE, DATETIME or TIMESTAMP.

Offload populates synthetic partition key columns with generated data when offloading based on the type and granularity of the data or based on the type and a custom partition function. Gluent Query Engine will generate predicates for the synthetic partition columns when pushing down predicates for the corresponding natural columns.

Partition Granularity

Partition granularity defines the range or contents of a single backend partition. In some cases, granularity must be defined (using the --partition-granularity option) when offloading an RDBMS table for the first time, but in many cases, Offload will choose a default granularity based on several factors (including the type of RDBMS partitioning, the RDBMS (sub)partition key type and the backend platform). Any defaults chosen by Offload can be overridden with the --partition-granularity option if required.

Partition Granularity Behavior With Hadoop Backends

When offloading with partitioning to a Hadoop-based backend, the following granularity behavior applies.

  • If the RDBMS table is list-partitioned and the partition scheme is inherited:

    • Date/timestamp partition data defaults to a daily granularity

    • Numeric partition data defaults to a granularity of 1

    • String partition data defaults to the length of the RDBMS string partition column

  • In all other scenarios:

Partition Granularity Behavior With Google BigQuery

When offloading with partitioning to Google BigQuery, the following granularity behavior applies.

Partition Functions

The Partition Functions feature is only available when offloading to Google BigQuery. This extensibility feature allows users to provide a custom user-defined function (UDF) for Gluent Offload Engine to use when synthetically partitioning offloaded data. It enables users to choose a source partitioning column that would otherwise not be usable as a partition key in the backend system. For example, there is no native partitioning option for STRING data in BigQuery, which means that Gluent Offload Engine’s standard synthetic partitioning cannot be used. Also, some extreme [BIG]NUMERIC data cannot be reduced to INT64 values with Offload’s standard synthetic partitioning scheme (i.e. with the --partition-granularity option). The Partition Functions feature provides a way for users to create an INT64 representation of the source partitioning data to use as a synthetic partition key in both of these cases.

Example 21: Offloading with Partition Functions

In the following example, the SH.CUSTOMERS table is offloaded to BigQuery and synthetically partitioned by the CUST_LAST_NAME column. Because the source column is VARCHAR2, the Partition Functions feature is required. First, the BigQuery UDF must be created (or already exist) to enable Offload to generate an INT64 representation of the source string data. In this example, the backend table will be partitioned according to the first letter of the CUST_LAST_NAME data, using the following BigQuery SQL UDF:

CREATE FUNCTION UDFS.CUST_NAME_TO_PART_KEY (str STRING) RETURNS INT64 AS (ASCII(UPPER(str)));

With this UDF, Gluent Offload Engine is able to offload the SH.CUSTOMERS table and add a synthetic partition key with an INT64 representation of the CUST_LAST_NAME source data as follows:

$ $OFFLOAD_HOME/bin/offload -t SH.CUSTOMERS -x \
      --partition-columns=CUST_LAST_NAME \
      --partition-functions=UDFS.CUST_NAME_TO_PART_KEY \
      --partition-lower-value=65 \
      --partition-upper-value=90 \
      --partition-granularity=1

With this command, Offload will apply the UDFS.CUST_NAME_TO_PART_KEY function to the CUST_LAST_NAME column to generate a synthetic INT64 partition key value between 65 and 90 (capital letters) and each backend synthetic partition will have a range of 1.

Note

In addition to Full Offload (as shown in Example 21), Partition Functions can also be used with (Sub)Partition-Based Offload (see Example 7: Offload a Range of String Partitions (BigQuery)) and Predicate-Based Offload.

There are several ways to reference and qualify the name and location of a custom UDF for use as a partition function:

  • Fully-qualified in the --partition-functions option (e.g. --partition-functions=MY_UDF_DATASET.MY_UDF_NAME)

  • If the dataset name is not provided in the --partition-functions option (e.g. --partition-functions=MY_UDF_NAME):

    • If OFFLOAD_UDF_DB is set in the offload.env configuration file, Offload will use this to identify the UDF dataset

    • If OFFLOAD_UDF_DB is not set in the offload.env configuration file, Offload will assume the UDF is in the same dataset as the offloaded table

Google BigQuery identifiers are case-sensitive, so in all cases the dataset and UDF values must be provided in the correct case.

When creating a UDF for use with Partition Functions, the following conditions must be met:

  • The UDF must be a Google BigQuery SQL UDF (JavaScript UDFs are not supported)

  • The UDF return data type must be INT64

  • The UDF must have a single parameter of data type STRING, [BIG]NUMERIC or INT64

  • The UDF must be deterministic (i.e. the same input will always yield the same output)

  • Synthetic backend partitioning is range partitioning; therefore, the SQL UDF must retain the same ordering relationship and characteristics as the source data. This is especially critical if the source partition column is likely to be queried with range operators (< <= > >= BETWEEN), otherwise queries can yield wrong results. In Example 21 above, the CUST_NAME_TO_PART_KEY function preserves the ordering of the CUST_LAST_NAME data (i.e. all names beginning with ‘A’ (ASCII 65) are less than names beginning with ‘B’ (ASCII 66), and so on)

  • See Google BigQuery Custom Partition Functions for permissions and other installation requirements

Backend Data Distribution and Sorting/Clustering

In addition to partitioning, offloaded data can optionally be distributed according to distribution, sorting or clustering options provided by the backend. This can lead to more efficient data access (such as range scans) for hybrid queries and other users of the offloaded data.

The following data distribution options are provided by the backends:

  • Impala: For Hadoop-based backends, Offload uses partitioning to hash distribute offloaded data into buckets, as described in Surrogate Partitioning for Bucketing

  • Azure Synapse Analytics: Tables can be hash distributed by one column

When offloading to Azure Synapse Analytics, most tables will be distributed by either HASH or ROUND_ROBIN (assuming that the table does not contain any data types that prevent clustered columnstore storage). The distribution method depends on the value of OFFLOAD_TRANSPORT_SMALL_TABLE_THRESHOLD globally or the table-level --offload-transport-small-table-threshold override option (smaller tables are distributed by ROUND_ROBIN and larger tables by HASH). The column used to distribute the data is automatically selected by Offload unless a specific column is provided with the --bucket-hash-column option.

The following data sorting and clustering options are provided by the backends:

  • Impala: Data can be stored sorted by one or more columns

  • Google BigQuery: Tables can be clustered by up to four columns

  • Snowflake: Tables can be clustered by more than one column, but it is recommended that no more than four columns are used

  • Azure Synapse Analytics: Tables can have a clustered columnstore index with more than one numeric column

To enable data sorting and clustering globally, set the OFFLOAD_SORT_ENABLED configuration parameter to true. Alternatively, this can be managed for individual offloads by adding the --offload-sort-enabled option to the offload command (set to true), along with the --sort-columns option to specify the columns for distributing the data in the backend.
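
For example, to enable sorting/clustering for a single offload and sort or cluster the backend table by its time column, options of the following form could be used (the column name is illustrative only):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-sort-enabled=true \
      --sort-columns=TIME_ID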

Managing Hybrid Query Performance

In addition to creating the database objects and metadata required to enable Gluent Query Engine to execute hybrid queries, Offload also provides a solid foundation for good hybrid query performance. It does this through a combination of standard operations and configurable options, designed to prepare the hybrid environment for Smart Connector optimizations. Several of these are described below.

Managing Parallelism

By default, Offload prepares the hybrid environment for parallel query access in several ways, unless disabled by configuration:

Offload Buckets

As described earlier, in Hadoop-based environments, Gluent Offload Engine automatically partitions the backend table by a surrogate OFFLOAD_BUCKET_ID column to enable parallel access by multiple Smart Connector sessions. This can be configured globally as a default number of buckets (see DEFAULT_BUCKETS) and/or specified per table with the --num-buckets option. This should be set to the highest degree of parallelism required for tables that need to be accessed by parallel queries. It can be capped globally to avoid creating too many bucket partitions in the backend table. It can also be capped to limit the degree of parallel access (this applies to Impala only).

Small tables below a threshold (configured with the DEFAULT_BUCKETS_THRESHOLD global parameter) are automatically offloaded with a single bucket.

Google BigQuery, Snowflake and Azure Synapse Analytics do not require this Gluent Data Platform data storage optimization. Smart Connector utilizes the BigQuery Storage API for natively-managed parallel data retrieval from Google BigQuery. For parallel data retrieval from Snowflake and Azure Synapse Analytics, Smart Connector first caches and shards the results of the query and then reads the results in parallel.

Hybrid External Table Location Files

Hybrid external tables can be created with multiple location files. When doing so, parallel hybrid queries spawn multiple Smart Connector processes for parallel data retrieval. The number of location files for each hybrid external table is determined by either the NUM_LOCATION_FILES global default parameter or per table with the --num-location-files option.

Note that when offloading to Impala, the number of external table location files is always aligned with the number of offload buckets.

To disable parallel data retrieval for an external table, the --num-location-files option should be added to the offload command and should be set to 1.
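
For example, parallel data retrieval could be disabled for a table by offloading it with a single location file, as in the following sketch (applicable to non-Impala backends, since Impala aligns location files with the number of offload buckets):

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --num-location-files=1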

Hybrid External Table Degree

By default, Offload copies the degree of parallelism (DOP) from the source RDBMS table to the hybrid external table when offloading. This ensures that hybrid queries are executed with the same default DOP as the original application queries. This can be configured globally with the HYBRID_EXT_TABLE_DEGREE parameter or set per table with the --ext-table-degree option. In most cases, the default suffices.

This does not apply to hybrid external tables that are created for Advanced Aggregation Pushdown; these are always created with a degree of 1.

Managing Statistics

Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Offload manages statistics for both the hybrid external table and the backend target table (where applicable).

Hybrid Object Statistics

When offloading an RDBMS table for the first time, Offload copies the table and column statistics from the source table to the base hybrid external table. No prorating or modification of the statistics occurs, even if the source table is partially-offloaded. In cases where partially-offloaded tables continue to be offloaded (for example, a Partition-Based Offload of the latest few partitions), Offload will not overwrite the external table statistics unless the source RDBMS statistics indicate that the source table has grown. This prevents the statistics from being downgraded if the source RDBMS table shrinks due to offloaded partitions being dropped from the RDBMS.

Hybrid external tables that support general, custom or count-only Advanced Aggregation Pushdown have their statistics derived from the backend table statistics. When offloading to Google BigQuery or Snowflake, statistics for these hybrid external tables are derived by Offload.

Backend Table Statistics

When offloading to Hadoop-based or Azure Synapse Analytics backends, Offload will compute statistics on the backend target table every time it is loaded.

For Impala, the default behavior for partitioned tables is to compute incremental statistics. For very large tables, or tables with a large number of partitions, this can sometimes lead to pressure on Impala resources. In such cases (and to save time on statistics operations in general), Offload provides a mechanism to copy statistics from the source RDBMS table to the backend target table. This is a metadata-only operation and as such is quick to process; the backend statistics are derived from the source RDBMS values. To specify that statistics should be copied rather than computed, the --offload-stats option should be added to the offload command with a value of COPY.
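
For example, to copy statistics rather than compute them during an offload to Impala, the option could be added as follows:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-stats=COPY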

For Azure Synapse Analytics, statistics are updated after every offload using a fixed 5% sample, unless disabled with the --offload-stats option (using a value of NONE). Unlike Impala, statistics cannot be copied from the RDBMS table to the Azure Synapse Analytics table.

This step is skipped when offloading to either Google BigQuery or Snowflake (BigQuery tables do not have statistics and Snowflake gathers statistics automatically while loading data).

Data Distribution and Sorting/Clustering

See Backend Data Distribution and Sorting/Clustering.

Resetting an Offloaded Table

Offload provides a --reset-backend-table option that can be added to an offload command to fully reset a previously offloaded table. This option is useful when refreshing small offloaded tables such as reference tables or dimensions (these are commonly offloaded to support Offload Join operations). This option will cause Offload to drop all offloaded data from the backend, so it should be used carefully.

Important

The --reset-backend-table option will drop the existing backend table and supporting objects. This option must only be used when it is certain that resetting the backend table will not cause data loss. Care must be taken to ensure that this option is never used if some of the previously-offloaded data cannot be replaced (for example, partitions that are dropped from the source RDBMS after offloading to the backend).
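
For example, assuming SH.PRODUCTS is a small dimension table that can be fully re-offloaded without any risk of data loss, it could be refreshed with a command of the following form:

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x \
      --reset-backend-table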

Offload Security

Gluent Data Platform interfaces with the security features of the RDBMS and backend data platforms, providing support for authentication, authorization and encryption features.

See the Security guide for details of supported security features and the Installation & Upgrade guide for implementation details.

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com