Offload

Documentation Conventions

  • Commands and keywords are in this font.

  • $OFFLOAD_HOME is set when the environment file (offload.env) is sourced, unless already set, and refers to the directory named offload that is created when the software is unpacked. This is also referred to as <OFFLOAD_HOME> in sections of this guide where the environment file has not been created/sourced.

  • Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.

Introduction

Offload is a core component of Gluent Offload Engine that provides a simple and flexible way to copy data from an RDBMS (such as Oracle Database) to a variety of backend data platforms (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery). This guide describes how to use Offload in a number of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Offload guide, particularly the Gluent Offload Engine and Offload concepts.

Overview of the Offload Process

To understand the Offload software and its use-cases, options and operation, it is useful to start with a high-level understanding of an offload process in general. Offloading data from an RDBMS to a backend requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems. When running an offload, the various steps executed by Offload are visible both on the terminal and in a log file, but a broad description of what these steps are doing is provided below.

An offload process has four broad logical phases:

  1. Phase 1: Setup

  2. Phase 2: Offload Transport

  3. Phase 3: Hybrid Objects

  4. Phase 4: Verification

Phase 1: Setup

In the setup phase, Offload will gather a range of information about the data to be offloaded, such as the structure of the data, its data types, metadata from previous offloads and so on. With this information, Offload is able to prepare the target backend system ready for the Offload Transport phase. This includes the creation of target databases (if required), the creation of a staging location and table for the offloaded data and for first-time offloads, a final table in the target backend database.

Phase 2: Offload Transport

In the transport phase, Offload copies the source RDBMS data to the backend staging location. Once the staged data is validated, it is loaded into the backend target table. This two-phased approach provides two main benefits:

  1. Atomicity: the load from staging table to target table either fully succeeds or completely fails

  2. Partitioning: for backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

Phase 3: Hybrid Objects

In the hybrid objects phase, Gluent Offload Engine creates a number of database objects in the RDBMS hybrid schema and saves metadata to Gluent Metadata Repository to support both the operation of Gluent Data Platform and provide an interface for users to transparently query offloaded data. Statistics, synonyms and privileges are also created as required. See Offload Objects below for an example of the hybrid objects that are created by a standard Offload.

Where applicable, Offload will also create hybrid copies of any views that are dependent on the offloaded RDBMS table. If any such dependent hybrid views already exist, Offload will also re-generate these with any changes that are required as a result of the latest offload operation.

Phase 4: Verification

The verification phase is the final operation in an offload process. In this phase, Offload queries both the hybrid view and the source RDBMS table to verify that the offloaded data set matches in both databases. This has a secondary benefit of validating the hybrid objects. Following successful verification of the offloaded data, the offload is complete. Any notices or warnings identified during the various phases are logged and Offload exits with success (i.e. an exit code of 0). See the --verify option for available validation methods.

Controlling the Offload Process

The offload process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per offload. In addition to the remainder of this Offload guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.

Offload Objects

The following table lists the standard set of RDBMS and backend database objects that are created by an offload of SCHEMA.TABLE.

Table 1: Objects Created by Offload

Database

Object Type

Object Naming Format

Hybrid Object Type

Backend

Table

SCHEMA_LOAD.TABLE

Temporary staging table

Backend

Table

SCHEMA.TABLE

Offloaded table

RDBMS

View

SCHEMA_H.TABLE

Hybrid view

RDBMS

Table

SCHEMA_H.TABLE_EXT

Hybrid external table

RDBMS

View

SCHEMA_H.TABLE_AGG

Aggregate hybrid view

RDBMS

Table

SCHEMA_H.TABLE_AGG_EXT

Aggregate hybrid external table

RDBMS

Rewrite Equivalence

SCHEMA_H.TABLE_AGG

Aggregate rewrite rule

RDBMS

View

SCHEMA_H.TABLE_CNT_AGG

Aggregate hybrid view

RDBMS

Table

SCHEMA_H.TABLE_CNT_AGG_EXT

Aggregate hybrid external table

RDBMS

Rewrite Equivalence

SCHEMA_H.TABLE_CNT_AGG

Aggregate rewrite rule

Google BigQuery identifier names are case-sensitive. Gluent Offload Engine allows a default case to be set for all Google BigQuery identifiers created by Offload to ensure consistency. See BACKEND_IDENTIFIER_CASE for more details.

Offload adds a variety of suffixes to the offloaded table name when naming the hybrid objects. If the length of a hybrid object name exceeds the RDBMS limit, Offload will truncate the hybrid object name to below the limit and pad the remainder with a unique string to avoid potential clashes with other truncated identifiers.

See also

Glossary for definitions of the hybrid object types in addition to the Concepts and Performance guides for further information on their purpose and usage.

Offloading Data

Offload supports several scenarios for offloading data from the RDBMS:

Offloading Entire Tables

Heap, partitioned and index-organized tables can be fully-offloaded from the RDBMS as follows.

Example 1: Offload a Full Table

$ . $OFFLOAD_HOME/conf/offload.env

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x

Note

This example includes the sourcing of the Gluent Data Platform environment file for an offload session. All other examples in this guide will assume this step has occurred.

A log file containing all of the steps in the offload process will be generated in $OFFLOAD_HOME/log. Using the -v or --vv option will result in more screen output during the offload.

Backend tables can be optionally partitioned, even if the source RDBMS table is not partitioned. See Managing Backend Partitioning below for more details.

It is possible to rename the target backend table when offloading with the --target-name option. This can be useful for a number of reasons, including when:

  • The source RDBMS table name includes a character that is not supported by the backend (for example, Oracle Database allows $ to be used for identifiers but this is not supported by Impala)

  • The naming standards used for applications in the backend system are different to the source RDBMS application (either for database names or object names)

Example 2: Change the Offload Target Name

The following example offloads a RDBMS table named SH.SALES$ to an Impala backend, meaning that the target table needs to be renamed to remove the $.

$ $OFFLOAD_HOME/bin/offload -t 'SH.SALES$' -x --target-name=SH.SALES

See also

This guide describes a number of Offload options, but a full list is provided in the Offload Parameters section of the Reference manual.

Partition-Based Offload

Partition-Based Offload enables some or all partitions of a partitioned table to be offloaded across one or more offload sessions. A common use of this feature is to initially offload the historical partitions for a table and then periodically offload new partitions as they arrive in the RDBMS, appending the data to the backend version of the table.

Partition-Based Offload can be used for the following scenarios:

Offloading Range-Partitioned Tables

Tables that are range-partitioned on DATE, TIMESTAMP or NUMBER columns can be fully or partially offloaded. Partitions can be offloaded contiguously up to a boundary by specifying a high water mark in the offload command. Partition boundaries can be increased in subsequent offloads to append additional partitions without affecting historical partitions.

Partition Boundary Options

To offload a contiguous range of partitions from a range-partitioned table, one of the following boundary options must be used:

--older-than-date=2015-10-01
  • --older-than-days: Offload partitions older than this number of days (exclusive, i.e. the boundary partition is not offloaded). This option is suitable for keeping data up to a certain age in the source table and is an alternative to the --older-than-date option. If both are supplied, --older-than-date will be used, e.g.

--older-than-days=90
  • --less-than-value: Offload partitions with a high water mark less than this value. Will accept integer or date values and allows intraday date and timestamp values to be specified, e.g.

--less-than-value=100000000
--less-than-value=2015-01-01
--less-than-value="2015-01-01 12:00:00"
  • --partition-names: Offload partitions with a high water mark matching that of the named partition, e.g.

--partition-names=P201509

Boundary options for range-partitioned tables are exclusive and the partition that contains data for the specified high water mark will not be offloaded. Furthermore, Offload will not offload partial partitions. For example, suppose a partition has data with a value range of 2000-01-01 to 2000-01-31. Using --older-than-date=2000-01-15 will not offload data from 2000-01-01 to 2000-01-14 because the partition has the potential for data up to and including 2000-01-31. Instead, data will be offloaded up to the boundary of the previous range partition. To offload the example partition, an --older-than-date=2000-02-01 value must be used.

The following is an example of offloading all partitions below a given date boundary.

Example 3: Offload a Range of Date Partitions
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01

By default, Example 3 above will use the RDBMS partition key column as the backend partition column (they can be different if required; see User-Defined Partitioning for details). It is common to use a different granularity for the backend partitions, especially useful when the backend system better supports either larger or fewer partitions. For example, an RDBMS table partitioned by day might be better offloaded to monthly partitions in a Hadoop-based backend. See --partition-granularity for details of the default granularities for various scenarios and backends.

For range-partitioned tables with numeric partition keys, the partition granularity of the backend table must be specified. The command syntax differs slightly according to the target backend, as the following examples demonstrate.

Example 4: Offload a Range of Numeric Partitions (Hadoop)

This example offloads all partitions with a partition key value of less than 5,000,000,000 and creates backend partitions with a range of 10,000,000. This applies when offloading to Hadoop-based platforms.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000

Note

It is also possible to specify a limit to the size of numeric partition keys (see --partition-digits).

Example 5: Offload a Range of Numeric Partitions (BigQuery)

For Google BigQuery, the full range of potential partition key values must be specified when creating a table with numeric partitions, hence the additional Offload options in the following example.

$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
      --less-than-value=5000000000 \
      --partition-granularity=10000000 \
      --partition-lower-value=1 \
      --partition-upper-value=99999999999999999
Fully Offloading Range-Partitioned Tables with Offload Type

Range-partitioned tables are also supported for full offload, using either the --offload-type option with a value of FULL or by excluding any of the partition boundary options described above. The following option combinations will fully offload a range-partitioned table:

  • --offload-type=FULL: Offloads all partitions and creates a simple hybrid view that queries all the data from the backend

  • --offload-type=FULL plus one of the boundary options: Offloads all partitions and data to the backend but creates a composite hybrid view that queries the data below the specified boundary from the backend and data above the specified boundary from the RDBMS

  • None of the offload type or partition boundary options described above: This is the same as specifying --offload-type=FULL

See also

See Offload Type Considerations for more details on support for partial and full offloads of partitioned tables.

Moving the Range High Water Mark in a Hybrid View

The partition boundary in a hybrid view can be modified by running a subsequent offload with a boundary option matching a partition value that has already been offloaded. No data will be copied from the RDBMS to the backend but the associated hybrid view(s) will be reconfigured to reflect the new settings. If no partition is matched by the boundary option values then the partition boundary will be aligned with the data that has already been offloaded.

Moving the partition boundary does not affect any data that has already been offloaded. It is a metadata operation that simply changes where hybrid queries source their results from above and below the boundary. As described in Offload Type Considerations, this technique can be useful to get the best trade-off between making current data available in the backend system as soon as possible but having the most frequently accessed data queried from the RDBMS until it becomes less-used.

MAXVALUE Partition Considerations

Offloads for range-partitioned tables with a MAXVALUE partition as the latest partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type option nor any of the partition boundary options are specified, the MAXVALUE partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition boundary option, the MAXVALUE partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition boundary option, the MAXVALUE partition will not be offloaded

  • If a partition boundary option is specified but not --offload-type, the MAXVALUE partition will not be offloaded

Note

Offloading the MAXVALUE partition for a table will prevent any further partition-based offloads for that table.

Multi-Column Partitioning

Oracle Database supports multi-column partitioning, whereby a table is partitioned on a set of columns rather than a single partition key (for example, PARTITION BY RANGE (yr, mon)). Offload supports this for range-partitioned tables. Any boundary condition must be provided via the --less-than-value option as comma-separated-values. For example:

--less-than-value=2015,10

Or:

--less-than-value=2015-10-31,200

By default, when offloading a multi-column partitioned table to Hadoop-based backends, the generated Hadoop table will be partitioned and subpartitioned to match the columns used in the RDBMS partitioning scheme. For example, an Oracle Database table partitioned by YR, MON will be partitioned in Hadoop by yr and then subpartitioned by mon. This can also be controlled using the --partition-columns option.

When offloading a multi-column partitioned table to Google BigQuery, the generated BigQuery table will be partitioned according to the leading partition key column only. The --partition-columns option will only allow one column to be specified when offloading to Google BigQuery.

Offloading Interval-Partitioned Tables

Offload supports interval-partitioned tables in exactly the same way as range-partitioned tables.

Offloading List-Partitioned Tables

Tables that are list-partitioned on DATE, TIMESTAMP, NUMBER, [N]VARCHAR2 or [N]CHAR columns can be partially or fully offloaded. Discrete sets of one or more partitions can be offloaded by specifying either partition key values or partition names in the offload command. Additional partitions can be offloaded in subsequent offloads without affecting historical partitions.

List Partition Specification

Partitions in a list-partitioned table can be offloaded using either of the options below:

  • --equal-to-values: Offload partitions by partition key value, accepting partition key literals that match the RDBMS partitions, e.g.

--equal-to-values=2015-01-01
--equal-to-values=US --equal-to-values=AP --equal-to-values=EU
--equal-to-values=201501,201502,201503
--partition-names=TRANSACTIONS_P2015Q1,TRANSACTIONS_P2015Q2

When using --equal-to-values, each partition must have its own option specification, as shown above and in Example 6 below.

Example 6: Offloading Partitions from a List-Partitioned Table

The following example offloads two partitions from a date list-partitioned table:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --equal-to-values=2015-01-01 --equal-to-values=2015-01-02
Example 7: Offloading Multi-Valued Partitions from a List-Partitioned Table

When using --equal-to-values, each option must equal the full high values specification for a single partition. The following example offloads two multi-valued list partitions where the partition key is a numeric representation of year-month:

$ $OFFLOAD_HOME/bin/offload -t SH.TRANSACTIONS -x --equal-to-values=201501,201502,201503 --equal-to-values=201504,201505,201506

Offloads for tables with numeric list-partition keys does not require the --partition-granularity option, but for offloading to Google BigQuery the --partition-lower-value and --partition-upper-value options must still be used. See Managing Backend Partitioning for further details.

Example 8: Offloading List-Partitions by Partition Name

The --partition-names option is an alternative way to specify multiple partitions to offload and can be useful if partition names are well-formed and known. The following example offloads four partitions from a table that is list-partitioned on financial quarter:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-names=SALES_P2015Q1,SALES_P2015Q2,SALES_P2015Q3,SALES_P2015Q4
Resetting the List High Water Mark in a Hybrid View

List partition offloading is additive. A partition that has previously been offloaded will not be re-offloaded, even if it is specified in multiple offload commands. If there is a need to modify the hybrid view so that offloaded partitions are accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload. Any previously offloaded partitions matched by partition identification options will not be re-offloaded but will be reflected in the hybrid view predicates.

DEFAULT Partition Considerations

Offloads for list-partitioned tables with a DEFAULT partition will behave differently depending on the offload options used, as follows:

  • If neither the --offload-type nor any of the partition identification options (--equal-to-values, --partition-names) are specified, the DEFAULT partition will be offloaded

  • If --offload-type=FULL is specified, either with or without a partition identification option, the DEFAULT partition will be offloaded

  • If --offload-type=INCREMENTAL is specified, either with or without a partition identification option, the DEFAULT partition will not be offloaded

  • If a partition identification option is specified but not --offload-type, the DEFAULT partition will not be offloaded

Note

Offloading the DEFAULT partition for a table will prevent any further partition-based offloads for that table.

Offloading List-Partitioned Tables as Range

In cases where a list-partitioned table has been structured to mimic range partitioning, the table can be offloaded exactly as described in Offloading Range-Partitioned Tables. To use this feature the table should adhere to the following:

  • Each partition must have a single literal as its high value

  • All new partitions must be added with higher/newer high values than those that have already been offloaded

  • Supported data types must match those referenced in Offloading Range-Partitioned Tables

In this scenario, a DEFAULT list partition will be treated in the same way that a MAXVALUE partition is treated for range partition offload (see MAXVALUE Partition Considerations for details).

Backend partition granularities differ between range-partitioned tables and list-partitioned tables offloaded as range. See Managing Backend Partitioning for details.

Example 9: Offload a Set of List Partitions Using a Date Range Partition Boundary

The following example offloads all list partitions with a partition key value of less than 2015-02-01.

An example using --older-than-date:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-02-01

Boundary options are exclusive and will not offload a partition with a key that equals the supplied value.

If using the --partition-names option to specify the partition to use as an offload boundary, the --offload-predicate-type option must also be specified with a value of LIST_AS_RANGE, otherwise Offload will treat the table as a basic list-partitioned table rather than use range-partitioned table behavior.

Subpartition-Based Offload

Tables that are range-subpartitioned on DATE, TIMESTAMP or NUMBER columns can be fully or partially offloaded. With Subpartition-Based Offload, partial offloading is supported even if the top-level partitioning scheme of a table is unsupported, because the subpartition high values are used to determine the offload boundary. To offload a range-subpartitioned table by subpartition, the --offload-by-subpartition option must be used.

Subpartition-Based Offload is useful for tables that are organized with ranged data (such as a time series or business date) at subpartition level, rather than partition level. In such cases, the subpartition boundaries can be used to determine the data to be offloaded, even when the partition-level data is also ranged.

When offloading a subpartitioned table with Partition-Based Offload, all subpartitions within the identified range or list partition(s) are offloaded as standard.

Example 10: Offload a Set of Range Subpartitions Using a Date Range Boundary

The following example demonstrates offloading all subpartitions below a date threshold of 2015-07-01:

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01 --offload-by-subpartition

Subpartition Boundary Options

The Offload options for specifying subpartition boundaries are the same as for range partition offloading. See Partition Boundary Options for details.

Subpartition-Based Offload has an additional restriction, such that the boundary option value must guarantee that no subpartitions have potential for more data to be added. For example, suppose every partition in a table contains two range subpartitions with high values of 90 and 100. In this case, the boundary option of --less-than-value=100 identifies a common subpartition boundary across the entire partition set and this guarantees that all subpartitions below this threshold are “complete” and ready for offloading. However, suppose just one of the partitions has range subpartitions with high values of 90 and 110. In this case, --less-than-value=100 option would not be able to offload all subpartitions below this threshold because the boundary is not common (i.e. the subpartition with a high value of 110 cannot be guaranteed to be “complete”).

In cases where the specified boundary is not valid, Offload will terminate with a warning and recommend that a common boundary is identified and used. In many cases it is possible to project the boundary value forward until a value is found that guarantees the safe offloading and appending of “complete” subpartitions.

Boundary options are exclusive and will not offload subpartitions with a key that equals the supplied value.

Other Subpartition-Based Offload Behavior

Subpartition-Based Offload behavior is the same as Partition-Based Offload in almost all cases. Considerations such as MAXVALUE subpartitions, full offloading with --offload-type, moving the high water mark in the hybrid view, using multi-column subpartition keys and so on are the same for range subpartitions as they are for range partitions. See Offloading Range-Partitioned Tables for details. Note that any examples would need the --offload-by-subpartition option to apply the same principle to range-subpartitioned tables.

Predicate-Based Offload

Offload supports the offloading of subsets of data (or an entire table) by predicate. This feature applies to any table type (partitioned or non-partitioned) and is known as Predicate-Based Offload.

With this feature, three offloading patterns are possible:

  1. Simple Predicate-Based Offload: Offload one or more non-overlapping subsets of data from a table

  2. Late-Arriving Predicate-Based Offload: Offload one or more late-arriving subsets of data from a partition that has already been offloaded with Partition-Based Offload

  3. Intra-Day Predicate-Based Offload: Offload a new partition in non-overlapping subsets as soon as each subset of data is loaded

Predicates are provided using a simple grammar as described in Predicate Grammar.

Simple Predicate-Based Offload

Example 11: Offload a Subset of Data from a Table Using a Predicate

The following example demonstrates a simple Predicate-Based Offload scenario by offloading all data for the ‘Electronics` category.

$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x --offload-predicate='column(PROD_CATEGORY) = string("Electronics")'

All data matching this predicate will be offloaded and the predicate will be added to the offload boundary in the hybrid view. Predicates are additive and any predicate that has been offloaded will not be re-offloaded, even if specified in future offload commands. If there is a need to modify the hybrid view so that offloaded data matching one or more predicates is accessed from the RDBMS rather than the backend system, the --reset-hybrid-view option can be used with the next offload.

Backend tables can be optionally partitioned when using simple Predicate-Based Offload. See Managing Backend Partitioning for more details.

Late-Arriving Predicate-Based Offload

The late-arriving scenario caters for tables that are primarily offloaded by partition, but sometimes have small subsets of data arriving after the partition has been offloaded. This scenario benefits from the efficiencies of Partition-Based Offload but with the flexibility of Predicate-Based Offload for “topping-up” the offloaded data for one or more partitions. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.

Example 12: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Range Partition

In the following example, the SH.SALES table has previously been offloaded using Partition-Based Offloading up to and including data for 2019-01-15. The Predicate-Based Offload command below is used to additionally offload a small set of data for product 123` that has arrived late and been loaded into a partition that has already been offloaded.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2019-01-15) and column(PROD_ID) = numeric(123)' \
      --offload-predicate-type=RANGE

To offload late-arriving data by predicate, the --offload-predicate option must include the current offloaded boundary value for the partition key (TIME_ID = 2019-01-15 in this example), in addition to the predicate that identifies the late-arriving data (PROD_ID = 123). All data matching this predicate will be offloaded but the predicate itself will not be added to the offload boundary or metadata for the hybrid view (only the existing Partition-Based Offload boundary will be used).

The --offload-predicate-type option must also be used for this scenario, but as stated above, the table must be offloaded as a RANGE or LIST_AS_RANGE offload predicate type.

Intra-Day Predicate-Based Offload

This offload pattern can be used to switch between Partition-Based Offload and Predicate-Based Offload as required. It can be useful as a means to offload the majority of a partitioned table by partition boundaries, but swap to offloading subsets of data for a new partition as soon as they arrive (rather than wait for the entire partition to be loaded before offloading). This can ensure that data is available in the target backend earlier than would otherwise be possible. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.

Example 13: Offload a Subset of Data for a New Range Partition (Intra-Day Offloading)

In the following example, the historical data for the SH.SALES table is offloaded by Partition-Based Offload (range partitioned by TIME_ID) at time T0 to begin the offload lifecycle for this table. For new data, rather than wait a full day for an entire partition of data to be ready, data is instead offloaded as soon as each product set is loaded into the new partition (3 separate loads at times T1, T2, T3). When all loads and offloads have completed, the table metadata is reset (at time T4) and the table is ready to repeat the same Predicate-Based Offload pattern for the next processing cycle.

Time

Description

Command

T0

Offload all history by range partition

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-01

T1

Offload first set of products loaded at time T1

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(123), numeric(234))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T2

Offload second set of products loaded at time T2

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(345), numeric(456))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T3

Offload third set of products loaded at time T3

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(567), numeric(678))' \
      --offload-predicate-type=RANGE_AND_PREDICATE

T4

All products now loaded and offloaded, reset the metadata for the table

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-02

The Predicate-Based Offload commands for T1-T3 require that the --offload-predicate option includes a predicate for the Partition-Based Offload partition key and that it matches the value of the new data (TIME_ID=2020-07-01 in this example). The --offload-predicate-type option must also be used (see option reference for valid values).

When the metadata is reset after the end of a sequence of Predicate-Based Offload commands (as per T4 above), either the same pattern of offloading can continue for the next partition or, if preferred, Partition-Based Offload can resume (i.e. wait for the next partition to be fully-loaded before offloading in one command as described in Partition-Based Offload).

Predicate Grammar

At present, the offload predicate DSL or grammar used for the --offload-predicate option covers a simple range of possibilities, based on a <column> <operator> <value> model.

Predicate Form

Predicates can take several forms:

  • <column> <operator> <value>

  • <column> [NOT] IN (<value>[, <value>, ...])

  • <column> IS [NOT] NULL

The DSL allows specification of any number of predicates in a range of forms, which can be logically combined and nested with AND/OR, for example:

  • <predicate>) AND|OR <predicate>

  • (<predicate> AND|OR <predicate>) AND|OR (<predicate>)

Parentheses are required to form logical predicate groups, as shown in the second example above. Whitespace is insignificant.

Column Form

Predicates always contain a single column written with the column keyword and a “function-call” syntax. Columns are not case-sensitive and can include aliases. The following examples are all valid:

  • column(CUST_ID)

  • column(SALES.CUST_ID)

  • column(cust_id)

Columns must exist in the table being offloaded and expressions/functions are not supported.

Operator Form

The basic SQL operators are supported and are written as in SQL. Member operators and null-value expressions are also supported. The full range is as follows:

  • =, !=, <, <=, >, >=

  • [NOT] IN

  • IS [NOT] NULL

Value Form

Values must be qualified with a data type classification and have a “function-call” syntax as follows:

  • string("This is a string")

  • numeric(123)

  • datetime(2020-07-01)

String values use the string keyword and double-quotes as shown above. Double-quote characters can be included in the string value by escaping with a backslash.

Numeric values use the numeric keyword and both signed integral and signed decimal values are supported with the same keyword.

Date/time values use the datetime keyword and can be specified at several levels of precision from dates to timestamps with nanosecond precision. The following are all valid examples of a datetime value:

  • datetime(2020-07-01)

  • datetime(2020-07-01 13:01:01)

  • datetime(2020-07-01 13:01:01.123)

  • datetime(2020-07-01 13:01:01.123456789)

Value lists for use in the [NOT] IN operator must be surrounded by parentheses and separated by commas. The values in a value list must be specified as normal, for example:

  • column(cust_id) IN (numeric(120), numeric(121), numeric(122))

The predicate DSL currently supports columns of the following Oracle Database data types:

Table 2: Predicate-Based Offload DSL Data Type Support

DSL Data Type Class

Supported Oracle Database Data Types

String

VARCHAR2, CHAR, NVARCHAR2, NCHAR

Numeric

NUMBER, FLOAT, BINARY_FLOAT, BINARY_DOUBLE

Datetime

DATE, TIMESTAMP

Offloading Joins

Offload enables joins between fully and/or partially-offloaded tables to be offloaded to the backend. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided using offload with the --offload-join option, based on a simple grammar as described in Offload Join Grammar.

Example 14: Offloading a Join

The following example offloads a three-table join to the backend system.

$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"

In this example, the SH.SALES table is partially offloaded and the other tables are fully offloaded. The offload boundary for the SH.SALES table is used for the Offload Join to ensure that all offloaded data for this table is aligned to a common boundary.

A hybrid view named SALESCOUNTRY_JOIN will be created in the hybrid schema for this join. Hybrid views created by Offload Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Offload Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.

Offload Join Grammar

When offloading joins, each table in the join has its own --offload-join specification, as demonstrated in Example 14 above. The full grammar for this option is provided in Table 3 below.

Table 3: Offload Join Grammar

Directive and Syntax

Description

TABLE([<hybrid_schema>.]<table-name>)
  • The table name to be included in the join

  • This directive is mandatory

  • Can include the hybrid schema name if joining objects from different schemas

ALIAS(<table-alias>)
  • Alias to be used for the join

  • Defaults to <table-name> if not specified

PROJECT(<*|<column>[, ..]>)
  • List of columns selected from <table-name>. * is expanded to all columns.

  • Columns in join clauses are added to the list

  • Partition key columns used are added to the list

INNER-JOIN(<join-alias>)
  • Alias to join with. Required for all but the first table

JOIN-CLAUSES(<alias>.<column>
           = <alias>.<column>[, ..])

JOIN-CLAUSES(<column>)
  • Clause defining the relationship between <table-name> and <join-alias>

  • Expanded to syntax above assuming <column> exists in both tables

Note

Expansion of * projections excludes the OFFLOAD_BUCKET_ID partition column that Offload automatically creates when offloading to Hadoop-based backends. For performance reasons (such as supporting parallel queries), it might be desirable to include OFFLOAD_BUCKET_ID in the final Hadoop table or view for an Offload Join. This can be achieved by explicitly including it in the PROJECT() clause. For example PROJECT(*,offload_bucket_id) will include all table columns plus the OFFLOAD_BUCKET_ID column. This is not required for offloads to Google BigQuery.

Materialized Offload Joins

By default, results of an offloaded join are materialized in a backend table of the name provided by the -t option (SH.SALESCOUNTRY_JOIN in Example 14 above). Results do not have to be materialized: if using the --no-materialize-join option, a view for the join will be created in the backend instead of a populated table.

Offload Join and Partially-Offloaded Tables

Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Offload Join. When an Offload Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the backend join view that is created in place of a materialized resultset.

Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:

  • Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Offload Join specification allows for different column names

  • Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match

  • Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly

Refreshing Offload Joins

Offload Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded. The guidelines are as follows:

  • If the Offload Join is materialized, the join needs to be refreshed

  • If the Offload Join is not materialized but contains one or more partially-offloaded tables, the join needs to be refreshed

  • If the Offload Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed

Refreshing Materialized Offload Joins

Materialized Offload Joins always need to be refreshed when the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in an Offload Join has been subject to further offloads:

  • If a fully-offloaded table is re-offloaded, the dependent materialized Offload Joins require a full refresh. To do this, the original Offload Join offload command can be repeated with the --reset-backend-table option. Note that this will fully reset and re-populate the materialized Offload Join table with the new results of the join

  • If a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Offload Joins can be incrementally refreshed. To do this, simply repeat the original Offload Join offload command and Offload will append new data to the materialized join table

Refreshing Non-Materialized Offload Joins

Non-materialized Offload Joins only need to be refreshed when the offload boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Offload Join offload command can be repeated but must include the --force or -f option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.

Updating Offloaded Data

Gluent Data Platform supports the updating of offloaded data in several ways with Incremental Update. See the Incremental Update guide for details.

Offload Type Considerations

As described in the Concepts guide, there are three formats for a hybrid view (see Hybrid View Types) that determine where hybrid queries access their data:

  • 100-0: All offloaded data is accessed from the backend table and none is accessed from the source RDBMS table

  • 90-10: Partially offloaded data is accessed from the backend table and non-offloaded data is accessed from the source RDBMS table

  • 100-10: All data is offloaded but only a some of the offloaded data is accessed from the backend table and some is accessed from the source RDBMS (based on a given boundary)

In most cases, the hybrid view type is created as default for a given offload operation, but the --offload-type option can be used to provide greater flexibility over the initial and subsequent offloads of list-partitioned and range-partitioned tables. See Offload Type for a general description of offload type.

Default Offload Type Behavior

The following describes the default behavior for offload type when offloading a range-partitioned or list-partitioned table.

  • If no --offload-type is specified:

    • Use of any of the partition boundary options offloads the table as INCREMENTAL with a default 90-10 hybrid view

    • An offload without any of the partition boundary options offloads the table as FULL offload type with a default 100-0 hybrid view

  • If the --offload-type option is specified:

    • Specifying --offload-type=INCREMENTAL with or without any of the partition boundary options offloads the table to match the boundary partition and defines the table as an INCREMENTAL offload type with a supporting 90-10 hybrid view

    • Specifying --offload-type=FULL without any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a supporting 100-0 hybrid view

    • Specifying --offload-type=FULL with any of the partition boundary options offloads all data from the table and defines the table as a FULL offload type with a specialized 100-10 hybrid view

  • If a table has a DEFAULT or MAXVALUE partition:

    • A FULL offload will offload the DEFAULT/MAXVALUE partition

    • An INCREMENTAL offload will not offload any DEFAULT/MAXVALUE partition

  • For subsequent offloads:

    • The offload type of a table is automatically picked up on all future offloads, unless overridden

    • The offload type of a table can be changed on future offloads by specifying a new --offload-type value (see Changing Offload Type below)

Note

Predicate-Based Offload does not support the 100-10 offload type pattern unless it is Late-Arriving Predicate-Based Offload.

Changing Offload Type

It is possible to change the offload type of a table on one or more subsequent offloads of the table (subject to some limitations). As described above, the offload type of a table is set on initial offload and is maintained for all future offloads (unless it is overridden by a new --offload-type value on a later offload).

The rules and behavior for changing offload type from INCREMENTAL to FULL are as follows:

  • Any partitions not yet offloaded are offloaded (including any DEFAULT/MAXVALUE if applicable), the table is re-defined as a FULL offload type and the hybrid view is re-created accordingly

The rules and behavior for changing offload type from FULL to INCREMENTAL are as follows:

  • If a partition boundary option is used, any corresponding partitions are offloaded, the table is re-defined as an INCREMENTAL offload type and the hybrid view is re-created accordingly

  • If the table has a DEFAULT or MAXVALUE partition, it cannot be converted from FULL to INCREMENTAL

  • If the table has Incremental Update enabled, it cannot be converted from FULL to INCREMENTAL

Offload Data Types

When offloading data, Offload will map data types from the source RDBMS to the backend database automatically. However, as with most operations in Offload, it is possible to override many of the default data type mappings if required. See Tables 4 and 6 below for details.

Supported Data Types

Gluent Data Platform currently supports all of the Oracle Database data types listed in Table 4 below. Oracle Database tables that contain columns of any data type not listed cannot currently be offloaded.

Default Data Type Mappings

Table 4 lists the default data type mappings when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform) or Google BigQuery.

Table 4: Offload Default Data Type Mappings (Oracle Database)

Oracle Database

Impala

Google BigQuery

Comments

CHAR

STRING

STRING

NCHAR

STRING

STRING

CLOB

STRING

STRING

NCLOB

STRING

STRING

VARCHAR2

STRING

STRING

NVARCHAR2

STRING

STRING

RAW

STRING

BYTES

BLOB

STRING

BYTES

NUMBER(<=18, 0)

BIGINT

INT64

See Offloading Integral Data Types to Impala

NUMBER(>18, 0)

DECIMAL(38,0)

NUMERIC

See Offloading to Google BigQuery Numeric

NUMBER(*, *)

DECIMAL(38,*)

NUMERIC

See Offloading to Google BigQuery Numeric

FLOAT

DECIMAL

NUMERIC

BINARY_FLOAT

FLOAT

-

See Google BigQuery Floating Point Data Types

BINARY_DOUBLE

DOUBLE

FLOAT64

DATE

TIMESTAMP

DATETIME

TIMESTAMP

TIMESTAMP

DATETIME

See Offloading to Google BigQuery Timestamp

TIMESTAMP WITH TIME ZONE

TIMESTAMP

TIMESTAMP

See Offloading to Google BigQuery Timestamp




and Offloading Time Zoned Data

INTERVAL DAY TO SECOND

STRING

STRING

See Offloading Interval Data Types

INTERVAL YEAR TO MONTH

STRING

STRING

See Offloading Interval Data Types

Data Sampling During Offload

Gluent Offload Engine fine-tunes the default data mappings by sampling data for all columns in the source RDBMS table that are either date or timestamp-based or defined as a number without a precision and scale.

By sampling numeric columns, Offload can choose an appropriate precision and scale for Impala DECIMAL data and this can lead to performance improvements due to reduced CPU consumption when querying. This behavior can be disabled by adding the --no-auto-detect-numbers option to the offload command.

The Impala TIMESTAMP data type supports a minimum date of 1400-01-01. Offload samples source date and timestamp-based columns to ensure that the data to be offloaded satisfies the Impala specification. In cases where Offload discovers data below the minimum specification, the column is offloaded to an Impala STRING column. This behavior can be disabled with the --no-auto-detect-dates option, but in this case the source data needs to be corrected before it can be offloaded.

Offload bases the volume of data to be sampled on the size of the RDBMS table. This can be overridden by adding the --data-sample-percent option to the offload command (specifying a percentage between 0 and 100, where 0 disables sampling altogether). Without sampling, all Impala decimal columns are defaulted to DECIMAL(38,18) and no date or timestamp checking is performed.

Offloading Integral Data Types to Impala

When offloading a table to Impala, integer columns(i.e. NUMBER(*, 0)) with a precision in the range of an Impala BIGINT data type are automatically stored as BIGINT. No attempt is made by Offload to downscale the Impala integral type in case the data precision grows beyond its type specification in the future. Offload safely defaults to BIGINT because schema evolution of smaller integrals to BIGINT cannot be actioned by a simple ALTER TABLE statement in Impala (meaning a more complex data migration would be required to upscale the integral type if needed). However, it is possible to override this default behavior and users confident in a column’s future contents (for example, day of month or year) can override the mapping and safely utilize smaller integral types in Impala. See Override Data Type Mappings for details.

Source integer columns with a precision beyond the range of an Impala BIGINT are automatically stored as DECIMAL(38,0).

Offloading to Google BigQuery Numeric

The Google BigQuery NUMERIC data type has a specification of (38,9). This means that a BigQuery NUMERIC can have a maximum of 29 digits to the left of the decimal point and a maximum of 9 digits to the right. For numeric data that exceeds this specification, Offload offers two options:

  1. Use decimal rounding options during offload (see Decimal Scale Rounding below)

  2. Offload numeric data to a floating point type (see Converting Numeric Data to Double below)

Important

Both of these options result in some data change and it will be for users to decide if this is a valid course of action.

Google BigQuery Floating Point Data Types

Google BigQuery provides a single 64-bit floating point data type: FLOAT64. This means that the Oracle Database 32-bit BINARY_FLOAT data type does not have a corresponding type in BigQuery and cannot currently be offloaded. Support for an override mapping to FLOAT64 will be added in a future release of Gluent Data Platform (although this will be potentially lossy - see Lossy Data Operations below).

Offloading to Google BigQuery Timestamp

At the time of writing, Google BigQuery’s DATETIME and TIMESTAMP data types both support microsecond precision, whereas Oracle Database supports up to nanosecond precision with its TIMESTAMP and TIMESTAMP WITH TIME ZONE data types. When Offload detects a source column with a higher level of precision than that supported by the backend (e.g. when attempting to offload TIMESTAMP(7) to Google BigQuery), it will terminate. In such cases, the --allow-nanosecond-timestamp-columns option will allow the offload operation to continue, but should ideally be used only when the data to be offloaded is known to be at microsecond precision or lower.

Important

Allowing sub-microsecond timestamp columns to be offloaded to Google BigQuery is potentially lossy - see Lossy Data Operations below for the implications of this).

Offloading Time Zoned Data

For Oracle Database columns of TIMESTAMP WITH TIME ZONE data type, data is normalized to Coordinated Universal Time (UTC) during offload. For this reason, values returned from Smart Connector may appear different to the eye although they are temporally the same as those in the source RDBMS data. Offload converts numeric offsets, e.g. 2016-01-14 10:39:44 +5:30, and relies on backend functionality to normalize named time zones, e.g. 2016-01-14 22:39:44 US/Pacific.

Warning

Named timezones unknown to the backend distribution in use are silently lost when normalizing data to UTC. This can leave an offloaded timestamp value which is no longer temporally the same as the source. For this reason, Offload will, by default, reject tables with timezone-sensitive columns. If a user is confident that timezone data will not be lost, the --allow-time-zone-columns option can be used to allow the offload to proceed. See Lossy Data Operations for the potential implications of lossy offloads.

Offloading Interval Data Types

Oracle Database INTERVAL DAY TO SECOND and INTERVAL YEAR TO MONTH data types are offloaded as strings in a normalized format, as shown in Table 5 below.

Table 5: Interval Data Type String Formats

Interval Type

Sample Source Data

Impala String Format

Google BigQuery String Format

Day to second (+ve)

+000000001 02:15:30.123456000

1 02:15:30.123456

+1 02:15:30.123456

Day to second (-ve)

-000000001 02:15:30.123456000

-1 02:15:30.123456

-1 02:15:30.123456

Year to month (+ve)

+000000001-01

1-1

+1-1

Year to month (-ve)

-000000001-01

-1-1

-1-1

Smart Connector converts offloaded interval strings back to the correct interval type when executing hybrid queries from the RDBMS. Interval strings can also be presented back to Oracle Database interval types by using the appropriate Present option as required (see the Present guide for further details).

Override Data Type Mappings

When initially offloading a table, it is possible to override some of the default data type mappings to change the specification of the backend table. Table 6 lists the override mappings and associated options that are available when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform) or Google BigQuery.

Table 6: Offload Override Data Type Mappings (Oracle Database)

Oracle Database

Offload Option

Impala

BigQuery

Comments

DATE

--date-columns

DATE

DATE

Impala versions below 3.3 will use TIMESTAMP

--timestamp-tz-columns

TIMESTAMP

TIMESTAMP

--variable-string-columns

STRING

STRING

TIMESTAMP

--date-columns

DATE

DATE

Impala versions below 3.3 will use TIMESTAMP

--timestamp-tz-columns

TIMESTAMP

TIMESTAMP

--variable-string-columns

STRING

STRING

NUMBER

--integer-1-columns

TINYINT

INT64

Use for NUMBER([1-2], 0)

--integer-2-columns

SMALLINT

INT64

Use for NUMBER([3-4], 0)

--integer-4-columns

INT

INT64

Use for NUMBER([5-9], 0)

--integer-8-columns

BIGINT

INT64

Use for NUMBER([10-18], 0)

--integer-38-columns

DECIMAL(38, 0)

INT64

Use for NUMBER(>18, 0)

--decimal-columns

DECIMAL(*, *)

NUMERIC

Use for NUMBER(*, *)

--double-columns

DOUBLE

FLOAT64

See Converting Numeric Data to Double

FLOAT

--double-columns

DOUBLE

FLOAT64

See Converting Numeric Data to Double

BINARY_FLOAT

--double-columns

DOUBLE

-

See Converting Numeric Data to Double

Example 15: Overriding Data Types During an Offload

In the following example, the SH.SALES table is offloaded with several data type overrides.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
      --date-columns=TIME_ID \
      --integer-4-columns=PROD_ID,CHANNEL_ID \
      --decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD

Each override can be given one or more columns. The --decimal-columns override can be used to fine-tune the precision and scale of the Impala DECIMAL data type, which can lead to better space utilization and CPU reduction. This override option behaves differently to the others in that it must be paired with a --decimal-columns-type option to define the precision and scale for the columns listed. Multiple different names/type pairings can be provided in the same command (see --decimal-columns for further details).

Oracle Database number precision is specified in terms of the number of digits. Impala number precision is specified by the data type. Care should be taken if selecting an override target data type. The DECIMAL data type can be used as a catch-all numeric data type in Impala but users are encouraged to consider native integral data types for reasons of performance and space utilization.

Managing Exceptional Data

No two systems align perfectly with their data type specifications and implementations. There are several cases where data in the source Oracle Database table might not be easily offloaded to the backend system. Some have already been described, but include:

  • Source RDBMS numbers with a precision greater than that supported by the backend (for example, BigQuery’s NUMERIC limit of 29 digits to the left of the decimal point)

  • Source RDBMS numbers with a scale greater than that supported by the backend (for example, Impala’s DECIMAL scale limit of 38 digits or BigQuery’s NUMERIC scale limit of 9)

  • Source RDBMS dates or timestamps with a value below Impala’s lower bound of 1400-01-01

  • Source RDBMS timestamps with a higher level of precision than BigQuery’s microsecond limit

  • Source RDBMS 32-bit floating point numbers offloading to a backend system without a 32-bit floating point data type (e.g. BigQuery)

When exceptional data is detected, Offload will in most cases terminate with an exception (and in some cases provide advice on how to proceed). The only exception to this is with offloading outlying dates and timestamps to Impala, which will be automatically offloaded to a lossless string data type (unless this behavior is disabled by the user as described earlier).

Note

Offload will never attempt to automatically offload exceptional data when it would cause a potential data loss to do so. In all cases, the user must provide an explicit course of action by providing the appropriate options with the offload command.

Decimal Scale Rounding

The Oracle Database NUMBER data type is extremely flexible and caters for high-scale decimal data that exceeds the limits available in most target backend systems. To offload decimal data that exceeds the specification of the backend system, the --allow-decimal-scale-rounding option must be used in conjunction with the --decimal-columns-type and --decimal-columns options to define the target type for the data. By using the rounding option, the user acknowledges that some data loss due to decimal rounding is likely and acceptable.

Converting Numeric Data to Double

Using the --double-columns override option to offload high-precision/scale numeric data or 32-bit floating point data to a 64-bit double can lead to data loss and should only be considered when there is no alternative and the consequences are both understood and acceptable.

Lossy Data Operations

To summarize, the potentially-lossy offload operations and their corresponding options are as follows:

  • Rounding decimals: Using the --allow-decimal-scale-rounding option to round decimals that exceed the backend specification for decimal data

  • Offloading sub-microsecond timestamps: Using the --allow-nanosecond-timestamp-columns option to offload columns with a sub-microsecond specification to a backend system that doesn’t support the same level of precision

  • Offloading time zoned data: Using the --allow-time-zone-columns option to enable time zoned data to be offloaded

  • Converting data to a 64-bit floating point number: Using the --double-columns option to offload decimal or 32-bit floating-point data that would not otherwise be possible

Important

Enabling lossy offloads can cause wrong results with some hybrid queries when data loss has occurred; such as when Smart Connector pushes down an equality or inequality predicate containing values from the lossy column(s). It is not advised to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins unless the offload was known to be lossless.

Offload Transport

Offload Transport describes the part of the offload process that:

  • Copies the source RDBMS data to a staging location

  • Validates the staged data

  • Loads the staged data into the backend target table

The two-phased approach of staging and loading data provides two main benefits:

  1. Atomicity: The load from staging table to target table either fully succeeds or completely fails

  2. Partitioning: For backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table

The transport phase of an offload is split into four main operations:

Transport Data to Staging

Gluent Offload Engine uses one of a number of tools to extract data from the source RDBMS. The appropriate tool will be chosen automatically based on configuration preferences. Each tool reads the source data to offload and attempts to split the source data equally and in a non-overlapping way between concurrent reader processes.

Data is staged in Avro format, native data types are used where possible and Avro STRING is used where there is no identical match between Avro and the source RDBMS table. For example, Oracle Database DATE, TIMESTAMP and NUMBER data types have no direct equivalent in Avro; therefore data is staged as STRING and converted to the appropriate backend data type during the final loading phase.

The data extraction tools available to Offload are:

Apache Sqoop

Apache Sqoop is only considered for use when offloading to Hadoop-based backends. Offload will invoke Sqoop on the host defined by OFFLOAD_TRANSPORT_CMD_HOST. Depending on the source RDBMS table, Sqoop will be used in either its standard table-centric mode or using a free-form query defined by Offload. The number of Sqoop mappers used when extracting data is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. The SQOOP_QUEUE_NAME parameter can be defined to submit the Sqoop job a specific YARN queue.

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source table requires tuning of the following:

  • --offload-transport-fetch-size can be used on a per-offload basis to override OFFLOAD_TRANSPORT_FETCH_SIZE. This can be useful to manage memory requirements for tables with a large row size (such as when offloading LOBs), by reducing the fetch size

  • SQOOP_OVERRIDES or the per-offload option --offload-transport-jvm-overrides can be used to inject JVM parameters into the Sqoop command line. This can be useful to override JVM memory defaults, for example, --offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"

Apache Spark

Three interfaces to Spark are supported by Offload: Spark Submit, Spark Thrift Server and Apache Livy. The interface used by Offload is chosen automatically, based on configuration. If multiple interfaces are configured for use, the order of priority is Spark Thrift Server, Apache Livy and Spark Submit. SparkSQL is used in all cases (via a PySpark script for Apache Livy and Spark Submit, or as pure SQL for Spark Thrift Server) to extract the data to be offloaded from the source RDBMS table.

When offloading to Hadoop-based backends it is expected that the Spark cluster and interfaces are configured components of the target Hadoop cluster. When offloading to cloud warehouses such as Google BigQuery it is typical to use Spark Standalone (Gluent Data Platform includes a Transport package containing Spark Standalone components for this purpose), although an existing Spark cluster can be utilized if available.

Spark Submit is available for use by Offload if OFFLOAD_TRANSPORT_CMD_HOST and OFFLOAD_TRANSPORT_SPARK_SUBMIT_EXECUTABLE are defined. When using a Spark Standalone cluster jobs will be submitted to the cluster defined in OFFLOAD_TRANSPORT_SPARK_SUBMIT_MASTER_URL. In a Hadoop environment, YARN is used as the Spark master. OFFLOAD_TRANSPORT_SPARK_QUEUE_NAME or --offload-transport-queue-name can be used to submit the Spark job a specific YARN queue.

Spark Thrift Server is available for use by Offload if OFFLOAD_TRANSPORT_SPARK_THRIFT_HOST and OFFLOAD_TRANSPORT_SPARK_THRIFT_PORT are defined. The Spark Thrift server can be configured to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible.

Apache Livy is available for use by Offload when offloading to Hadoop-based backends if OFFLOAD_TRANSPORT_LIVY_API_URL is defined. Livy can be configured (using OFFLOAD_TRANSPORT_LIVY_IDLE_SESSION_TIMEOUT) to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible. The maximum number of Livy sessions that Offload will start and use concurrently is defined by OFFLOAD_TRANSPORT_LIVY_MAX_SESSIONS.

The number of tasks in an offload transport Spark job is defined by OFFLOAD_TRANSPORT_PARALLELISM or per offload with the --offload-transport-parallelism option. It should be noted that defining more tasks than there are available Spark executors will result in queuing. Therefore, OFFLOAD_TRANSPORT_PARALLELISM should ideally be no more than the number of available executors.

Default configuration is appropriate for the majority of offloads but occasionally the nature of the source RDBMS table requires tuning of the following:

Query Import

Query Import is used for for low-volume offloads. Data is extracted and staged by Offload itself and not using an external tool. This avoids delays incurred when invoking Sqoop or Spark Submit. Non-partitioned tables with a source RDBMS size smaller than --offload-transport-small-table-threshold are eligible for Query Import. Query Import does not run in parallel, i.e. OFFLOAD_TRANSPORT_PARALLELISM is ignored.

RDBMS Options

Connections to the source RDBMS are made to the address defined in OFFLOAD_TRANSPORT_DSN. This defaults to ORA_CONN but can be overridden if data needs to be extracted from an alternative address (such as from a replica database).

Concurrent offload transport processes open independent sessions in the source RDBMS. If a high value for OFFLOAD_TRANSPORT_PARALLELISM is required then consideration should be given to any session limits in the RDBMS. The OFFLOAD_TRANSPORT_CONSISTENT_READ parameter or the per offload option --offload-transport-consistent-read can be used to ensure that all concurrent extraction queries reference a specific point in time. When set to true, extraction queries will include an AS OF SCN clause. When the source RDBMS table or input partitions are known to be cold (i.e. not subject to any data modifications), this can be set to false to reduce resource consumption.

If there is a requirement to modify offload transport RDBMS session settings, such as setting an Oracle Database initialization parameter (under the guidance of Gluent Support), then modification of OFFLOAD_TRANSPORT_RDBMS_SESSION_PARAMETERS will be required.

Validate Staged Data

Once data has been staged, it is validated to ensure that the number of staged rows matches the number of rows read from the source RDBMS. While doing this Offload might also:

  • Check for NULL values in any custom partition scheme defined using the --partition-columns option. A positive match results in a warning

  • Check for any NaN (Not a Number) values if the source table has floating point numeric data types. This is because source RDBMS and target backend systems do not necessarily treat NaN values consistently. A positive match results in a warning

  • Check for lossy decimal rounding as described in Decimal Scale Rounding

  • Check for source data that is invalid for the chosen backend partition scheme, such as numeric data outside of any --partition-lower-value/--partition-upper-value range. A positive match results in an exception

Validate Type Conversions

Data types used for staging data will rarely match those of the backend target table. Data is converted to the correct type when it is loaded into the final target table. This stage therefore verifies that there will be no invalid conversions when loading. While this is a duplication of type conversions in the Load Staged Data phase, it provides the advantage of checking the data before the more compute-intensive data load and is able to report all columns with data issues in a single pass.

Example 16: Catching Invalid Data Type Conversions

In the following example, the SH.SALES table is offloaded to a Hadoop cluster with an invalid data type for two columns: the data in the PROD_ID and CUST_ID columns is not compatible with the user requested single-byte integer data type.

$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --integer-1-columns=PROD_ID,CUST_ID

This results in the following exception:

CAST() of load data will cause data loss due to lack of precision in target data type in 6887232 rows
Failing casts are:
    (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
    (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
The SQL below will assist identification of problem data:
SELECT PROD_ID
,      CUST_ID
FROM   `sh_load`.`sales`
WHERE  (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
OR     (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
LIMIT 50

The exception provides three important pieces of information:

  • The number of rows with issues

  • The columns/conversions with issues

  • A SQL statement to use offline to review a sample of the problematic data

Load Staged Data

In this phase of an offload staged data is converted to correct data types as described in Validate Type Conversions and inserted into the target backend table. Where available, performance metrics from the backend system are recorded in the Offload log file written to $OFFLOAD_HOME/log.

Offload Transport Chunks

The source RDBMS data is offloaded in “chunks”. A chunk can comprise an entire table or a set of one or more partitions. The sequence of four operations described above are executed for each chunk until the whole input set has been offloaded. A final task in this phase, for backend systems that support it, is to update statistics on the freshly offloaded data.

For partitioned RDBMS tables, data is offloaded in sets of partitions. Input partitions are grouped into chunks, based on either the number of partitions or cumulative partition size (including subpartitions where relevant). For non-partitioned tables the table itself is considered a single chunk.

The size limit for a partition chunk is defined by MAX_OFFLOAD_CHUNK_SIZE or per offload with the --max-offload-chunk-size option. The maximum number of partitions in a chunk is defined by MAX_OFFLOAD_CHUNK_COUNT or per offload with the --max-offload-chunk-count option. Both size and count thresholds are active at the same time, therefore the first threshold to be breached is the one that closes the chunk. If the first partition in a chunk breaches the size threshold then it will be included for offload and that chunk will be greater in size than the configured threshold.

Sets of partitions are an effective way of managing resource consumption. For example, if a source RDBMS typically has partitions of 8GB in size and the backend system can comfortably load 64GB of data at a time without impacting other users, MAX_OFFLOAD_CHUNK_SIZE can be set to 64G and roughly 8 partitions will be offloaded per chunk.

When calculating the cumulative size of partitions Offload uses RDBMS segment sizes. This is important to note because compressed data in the RDBMS may increase in size when extracted and staged.

Partitions are considered for chunking in the logical order they are defined in the source RDBMS. For example, partitions from an Oracle Database range-partitioned table are split into chunks while maintaining the order of oldest to most recent partition. Offload does not attempt to optimize chunking by shuffling the order of partitions because that would offload partitions out of sequence and be a risk to atomicity.

Data Storage Options

Gluent Data Platform supports several storage schemes for offloaded data, listed in Table 7 below.

Table 7: Offload Storage Options

Storage Scheme

Offload Staging Data

Offloaded Data

OFFLOAD_FS_SCHEME

HDFS

Hadoop only

Hadoop only

hdfs

Amazon Simple Storage Service (S3)

No

Hadoop only

s3a

Microsoft Azure Data Lake Storage Generation 1 (ADLS)

No

Hadoop only

adl

Microsoft Azure Data Lake Storage Generation 2 (ABFS)

No

Hadoop only

abfs, abfss

Google Cloud Storage (GCS)

BigQuery only

No

gcs

Storage preferences such as base HDFS file locations and/or cloud storage buckets can be specified as Offload command-line options, but it is more common to configure them once in the Gluent Data Platform environment file.

Gluent Offload Engine can offload data to cloud storage and requires a small amount of configuration, as follows:

When offloading to Hadoop-based backends:

  • A value of inherit can be used for OFFLOAD_FS_SCHEME. With this value all tables created by Offload will inherit the location from the parent database. If databases are created using the --create-backend-db option, a default location of HDFS will be used

  • Specifying a storage scheme in cloud storage will persist the offloaded data to that location, but data must first be staged in HDFS as part of the two-phase loading process

  • Offloaded data is compressed. The default compression can be overridden with the --storage-compression option

  • The default file format for all storage schemes is Parquet

When offloading to Google BigQuery:

  • The two-phase loading process stages data in Google Cloud Storage before persisting it to Google BigQuery

  • Google BigQuery has its own proprietary storage format

Regardless of the storage provider:

  • The default file format for temporary staging data is Avro

  • Staged data can be compressed with the --compress-load-table option

Managing Data Distribution

In addition to the variety of cloud storage options supported by Gluent Data Platform, Offload also provides options to manage the offloaded data distribution:

Managing Backend Partitioning

When offloading data, backend tables can be partitioned in several ways, depending on the backend platform and user preferences:

Surrogate Partitioning for Bucketing

When offloading to Hadoop-based backends, Offload always includes a surrogate partition key named OFFLOAD_BUCKET_ID, in addition to any other partitioning scheme that might be defined for the table. This surrogate partition key is used internally by Gluent Data Platform to enable Smart Connector to scan backend data in parallel when executing a parallel hybrid query. The number of partition key values defaults to the DEFAULT_BUCKETS configuration but can be overridden with the --num-buckets option.

A hash function is used by Offload to distribute data across the OFFLOAD_BUCKET_ID partitions. Offload will try to pick the most selective column to distribute data evenly across the surrogate partitions, but in the absence of a primary key or column statistics for the source RDBMS table, it will default to the first table column. Users can override the hashing column by including the --bucket-hash-column option with the initial offload command.

The surrogate partition column is not required when offloading to Google BigQuery.

Inherited Partitioning

Tables offloaded with Partition-Based Offload or Subpartition-Based Offload are automatically partitioned in the backend with the (sub)partition key of the source table, unless overridden by the user. The granularity of the inherited (sub)partitions can be different to the source RDBMS (sub)partitions if required. In some cases it is mandatory to specify the granularity of the backend partition scheme (see Partition Granularity below for details)

For Hadoop-based backends, the inherited partition key can comprise multiple columns and is used in addition to the surrogate OFFLOAD_BUCKET_ID partition key described above. Backend partition key columns are always implemented by the addition of synthetic partition key columns. See Synthetic Partitioning below for details.

For Google BigQuery, only the leading (sub)partition key column will be used. In some cases, the backend partitioning might be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.

User-Defined Partitioning

Tables can be offloaded with the --partition-columns option to define a custom partitioning scheme for the backend table. This enables non-partitioned RDBMS tables to be partitioned in the backend if required, in addition to allowing (sub)partitioned RDBMS tables to be offloaded with a different partitioning scheme in the backend. It is also possible (and in some cases mandatory) to specify the granularity of the backend partitions (see Partition Granularity for details below) and user-defined partitioning supports more RDBMS data types than inherited partitioning.

For Hadoop-based backends, the user-defined partition key can comprise multiple columns and is added to the surrogate OFFLOAD_BUCKET_ID partition key described above. The following example offloads a partitioned table to a Hadoop backend and subpartitions the offloaded data by year (derived from a date) and product ID (in steps of 1000).

Example 17: Offload with User-Defined Partitioning
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-columns=TIME_ID,PROD_ID --partition-granularity=Y,1000

When offloading to a Hadoop-based backend, user-defined partitioning can include date/timestamp, numeric or string columns.

When offloading to Google BigQuery, string-based or decimal (i.e. non-integral numeric) columns cannot be used for user-defined partition keys and only one partition key column can be defined.

Synthetic Partitioning

Depending on the data type and backend system, Offload usually implements inherited and user-defined partitioning schemes with additional synthetic columns (one per source (sub)partition column). Synthetic columns are used to partition backend tables instead of the corresponding natural columns. Synthetic partition keys are used internally by Gluent Query Engine to generate partition-pruning predicates in hybrid queries and by Offload to ensure that the backend partition columns remain consistent across multiple Partition-Based Offload or Subpartition-Based Offload operations.

Synthetic partition key columns are named as a derivative of the granularity of the data and the name of the corresponding RDBMS column. The naming convention for each synthetic column is GL_PART_<granularity>_<column name>. For example, a synthetic partition key column for an RDBMS table offloaded with a TIME_ID partition key and a monthly granularity will be named GL_PART_M_TIME_ID.

When a table is offloaded with partitioning to a Hadoop-based backend, synthetic partition column(s) are always generated by Offload. The resulting synthetic column(s) are always of type STRING.

When a table is offloaded with partitioning to Google BigQuery, a synthetic partition key will be generated when:

  • The natural partition column is of a DATE or TIMESTAMP BigQuery data type and the partition granularity is monthly or yearly. The synthetic partition key column will be of the same data type as its corresponding natural partition column

  • The natural partition column is of a BigQuery DATETIME data type

In all other cases (i.e. the BigQuery column is an INT64 or it is a DATE or TIMESTAMP with a daily granularity), a synthetic column will not be generated and the existing table column will be used with native BigQuery partitioning.

Offload populates synthetic partition key columns with generated data when offloading based on the type and granularity of the data. Gluent Query Engine will generate predicates for the synthetic partition columns when pushing down predicates for the corresponding natural columns.

Partition Granularity

Partition granularity defines the range or contents of a single backend partition. In some cases, granularity must be defined (using the --partition-granularity option) when offloading an RDBMS table for the first time, but in many cases, Offload will choose a default granularity based on several factors (including the type of RDBMS partitioning, the RDBMS (sub)partition key type and the backend platform).

Partition Granularity Behavior With Hadoop Backends

When offloading with partitioning to a Hadoop-based backend, the following granularity behavior applies.

  • If the RDBMS table is list-partitioned and the partition scheme is inherited:

    • Date/timestamp partition data defaults to a daily granularity

    • Numeric partition data defaults to a granularity of 1

    • String partition data defaults to the length of the RDBMS string partition column

  • In all other scenarios:

Partition Granularity Behavior With Google BigQuery

When offloading with partitioning to Google BigQuery, the following granularity behavior applies.

  • Date/datetime/timestamp partition data defaults to daily granularity

  • If the RDBMS table is list-partitioned and the partition scheme is inherited, numeric partitioning defaults to a granularity of 1. In all other scenarios, numeric partitioning requires the --partition-granularity option. In all scenarios, the --partition-lower-value and --partition-upper-value options must be specified

  • String partitioning is not supported in Google BigQuery

Backend Data Sorting/Clustering

In addition to partitioning, offloaded data can optionally be distributed according to sorting or clustering options provided by the backend. This can lead to more efficient data access (such as range scans) for hybrid queries and other users of the offloaded data. The following data distribution options are provided by the backends:

  • Impala: Data can be stored sorted by one or more columns

  • Google BigQuery: Tables can be clustered by up to four columns

To enable data distribution globally, set the OFFLOAD_SORT_ENABLED configuration parameter to true. Alternatively, this can be managed for individual offloads by adding the --offload-sort-enabled option to the offload command (set to true), along with the --sort-columns option to specify the columns for distributing the data in the backend.

Managing Hybrid Query Performance

In addition to creating the database objects and metadata required to enable Gluent Query Engine to execute hybrid queries, Offload also provides a solid foundation for good hybrid query performance. It does this through a combination of standard operations and configurable options, designed to prepare the hybrid environment for Smart Connector optimizations. Several of these are described below.

Managing Parallelism

By default, Offload prepares the hybrid environment for parallel query access in several ways, unless disabled by configuration:

Offload Buckets

As described earlier, in Hadoop-based environments, Gluent Offload Engine automatically partitions the backend table by a surrogate OFFLOAD_BUCKET_ID column to enable parallel access by multiple Smart Connector sessions. This can be configured globally as a default number of buckets (see DEFAULT_BUCKETS) and/or specified per table with the --num-buckets option. This should be set to the highest degree of parallelism required for tables that need to be accessed by parallel queries. It can be capped globally to avoid creating too many bucket partitions in the backend table. It can also be capped to limit the degree of parallel access (this applies to Impala only).

Small tables below a threshold (configured with the DEFAULT_BUCKETS_THRESHOLD global parameter) are automatically offloaded with a single bucket.

Google BigQuery does not require this Gluent Data Platform data access optimization. Smart Connector utilizes the BigQuery Storage API for natively-managed parallel data retrieval.

Hybrid External Table Location Files

Hybrid external tables can be created with multiple location files. When doing so, parallel hybrid queries spawn multiple Smart Connector processes for parallel data retrieval. The number of location files for each hybrid external table is determined by either the NUM_LOCATION_FILES global default parameter or per table with the --num-location-files option.

Note that when offloading to Impala, the number of external table location files is always aligned with the number of offload buckets.

To disable parallel data retrieval for an external table, the --num-location-files option should be added to the offload command and should be set to 1.

Hybrid External Table Degree

By default, Offload copies the degree of parallelism (DOP) from the source RDBMS table to the hybrid external table when offloading. This ensures that hybrid queries are executed with the same default DOP as the original application queries. This can be configured globally with the HYBRID_EXT_TABLE_DEGREE parameter or set per table with the --ext-table-degree option. In most cases, the default suffices.

This does not apply to hybrid external tables that are created for Advanced Aggregation Pushdown; these are always created with a degree of 1.

Managing Statistics

Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Offload manages statistics for both the hybrid external table and the backend target table (where applicable).

Hybrid Object Statistics

When offloading an RDBMS table for the first time, Offload copies the table and column statistics from the source table to the base hybrid external table. No prorating or modification of the statistics occurs, even if the source table is partially-offloaded. In cases where partially-offloaded tables continue to be offloaded (for example, a Partition-Based Offload of the latest few partitions), Offload will not overwrite the external table statistics unless the source RDBMS statistics indicate that the source table has grown. This prevents the statistics from being downgraded if the source RDBMS table shrinks due to offloaded partitions being dropped from the RDBMS.

When offloading to Hadoop-based backends, hybrid external tables that support general and custom Advanced Aggregation Pushdown have their statistics derived from the backend table statistics. When offloading to Google BigQuery, these external tables currently have no statistics.

Regardless of backend, hybrid external tables that support count-only Advanced Aggregation Pushdown have their statistics derived by Offload.

Backend Table Statistics

When offloading to Hadoop-based backends, Offload will compute statistics on the backend target table every time it is loaded.

For Impala, default behavior for partitioned tables is to compute incremental statistics. For very large tables, or tables with a large number of partitions, this can sometimes lead to pressure for Impala resources. In such cases (and to save time on statistics operations in general), Offload provides a mechanism to copy statistics from the source RDBMS table to the backend target table. This is a metadata-only operation and as such is quick to process. The backend statistics are derived from the source RDBMS values. To specify that statistics should be copied rather than computed, the --offload-stats option should be added with a value of COPY to the offload command.

Google BigQuery tables do not have statistics, so this step is skipped when offloading to the BigQuery platform.

Data Sorting/Clustering

See Backend Data Sorting/Clustering.

Resetting an Offloaded Table

Offload provides a --reset-backend-table option that can be added to an offload command to fully reset a previously offloaded table. This option is useful when refreshing small offloaded tables such as reference tables or dimensions (these are commonly offloaded to support Offload Join operations). This option will cause Offload to drop all offloaded data from the backend, so it should be used carefully.

Important

The --reset-backend-table option will drop the existing backend table and supporting objects. This option must only be used when it is certain that resetting the backend table will not cause data loss. Care must be taken to ensure that this option is never used if some of the previously-offloaded data cannot be replaced (for example, partitions that are dropped from the source RDBMS after offloading to the backend).

Offload Security

Gluent Data Platform interfaces with the security features of the RDBMS and backend data platforms, providing support for authentication, authorization and encryption features.

See the Security guide for details of supported security features and the Installation & Upgrade guide for implementation details.

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com