Offload¶
Table of Contents
Documentation Conventions¶
Commands and keywords are in this
font
.$OFFLOAD_HOME
is set when the environment file (offload.env
) is sourced, unless already set, and refers to the directory namedoffload
that is created when the software is unpacked. This is also referred to as<OFFLOAD_HOME>
in sections of this guide where the environment file has not been created/sourced.Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.
Introduction¶
Offload is a core component of Gluent Offload Engine that provides a simple and flexible way to copy data from an RDBMS (such as Oracle Database) to a variety of backend data platforms (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery). This guide describes how to use Offload in a number of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Offload guide, particularly the Gluent Offload Engine and Offload concepts.
Overview of the Offload Process¶
To understand the Offload software and its use-cases, options and operation, it is useful to start with a high-level understanding of an offload process in general. Offloading data from an RDBMS to a backend requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems. When running an offload, the various steps executed by Offload are visible both on the terminal and in a log file, but a broad description of what these steps are doing is provided below.
An offload process has four broad logical phases:
Phase 1: Setup¶
In the setup phase, Offload will gather a range of information about the data to be offloaded, such as the structure of the data, its data types, metadata from previous offloads and so on. With this information, Offload is able to prepare the target backend system ready for the Offload Transport phase. This includes the creation of target databases (if required), the creation of a staging location and table for the offloaded data and for first-time offloads, a final table in the target backend database.
Phase 2: Offload Transport¶
In the transport phase, Offload copies the source RDBMS data to the backend staging location. Once the staged data is validated, it is loaded into the backend target table. This two-phased approach provides two main benefits:
Atomicity: the load from staging table to target table either fully succeeds or completely fails
Partitioning: for backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table
Phase 3: Hybrid Objects¶
In the hybrid objects phase, Gluent Offload Engine creates a number of database objects in the RDBMS hybrid schema and saves metadata to Gluent Metadata Repository to support both the operation of Gluent Data Platform and provide an interface for users to transparently query offloaded data. Statistics, synonyms and privileges are also created as required. See Offload Objects below for an example of the hybrid objects that are created by a standard Offload.
Where applicable, Offload will also create hybrid copies of any views that are dependent on the offloaded RDBMS table. If any such dependent hybrid views already exist, Offload will also re-generate these with any changes that are required as a result of the latest offload operation.
Phase 4: Verification¶
The verification phase is the final operation in an offload process. In this phase, Offload queries both the hybrid view and the source RDBMS table to verify that the offloaded data set matches in both databases. This has a secondary benefit of validating the hybrid objects. Following successful verification of the offloaded data, the offload is complete. Any notices or warnings identified during the various phases are logged and Offload exits with success (i.e. an exit code of 0). See the --verify
option for available validation methods.
Controlling the Offload Process¶
The offload process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per offload. In addition to the remainder of this Offload guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.
Offload Objects¶
The following table lists the standard set of RDBMS and backend database objects that are created by an offload of SCHEMA.TABLE.
Table 1: Objects Created by Offload¶
Database |
Object Type |
Object Naming Format |
Hybrid Object Type |
---|---|---|---|
Backend |
Table |
SCHEMA_LOAD.TABLE |
Temporary staging table |
Backend |
Table |
SCHEMA.TABLE |
Offloaded table |
RDBMS |
View |
SCHEMA_H.TABLE |
Hybrid view |
RDBMS |
Table |
SCHEMA_H.TABLE_EXT |
Hybrid external table |
RDBMS |
View |
SCHEMA_H.TABLE_AGG |
Aggregate hybrid view |
RDBMS |
Table |
SCHEMA_H.TABLE_AGG_EXT |
Aggregate hybrid external table |
RDBMS |
Rewrite Equivalence |
SCHEMA_H.TABLE_AGG |
Aggregate rewrite rule |
RDBMS |
View |
SCHEMA_H.TABLE_CNT_AGG |
Aggregate hybrid view |
RDBMS |
Table |
SCHEMA_H.TABLE_CNT_AGG_EXT |
Aggregate hybrid external table |
RDBMS |
Rewrite Equivalence |
SCHEMA_H.TABLE_CNT_AGG |
Aggregate rewrite rule |
Google BigQuery identifier names are case-sensitive. Gluent Offload Engine allows a default case to be set for all Google BigQuery identifiers created by Offload to ensure consistency. See BACKEND_IDENTIFIER_CASE
for more details.
Offload adds a variety of suffixes to the offloaded table name when naming the hybrid objects. If the length of a hybrid object name exceeds the RDBMS limit, Offload will truncate the hybrid object name to below the limit and pad the remainder with a unique string to avoid potential clashes with other truncated identifiers.
See also
Glossary for definitions of the hybrid object types in addition to the Concepts and Performance guides for further information on their purpose and usage.
Offloading Data¶
Offload supports several scenarios for offloading data from the RDBMS:
Offloading Entire Tables¶
Heap, partitioned and index-organized tables can be fully-offloaded from the RDBMS as follows.
Example 1: Offload a Full Table¶
$ . $OFFLOAD_HOME/conf/offload.env
$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x
Note
This example includes the sourcing of the Gluent Data Platform environment file for an offload session. All other examples in this guide will assume this step has occurred.
A log file containing all of the steps in the offload process will be generated in $OFFLOAD_HOME/log
. Using the -v
or --vv
option will result in more screen output during the offload.
Backend tables can be optionally partitioned, even if the source RDBMS table is not partitioned. See Managing Backend Partitioning below for more details.
It is possible to rename the target backend table when offloading with the --target-name
option. This can be useful for a number of reasons, including when:
The source RDBMS table name includes a character that is not supported by the backend (for example, Oracle Database allows
$
to be used for identifiers but this is not supported by Impala)The naming standards used for applications in the backend system are different to the source RDBMS application (either for database names or object names)
Example 2: Change the Offload Target Name¶
The following example offloads a RDBMS table named SH.SALES$ to an Impala backend, meaning that the target table needs to be renamed to remove the $
.
$ $OFFLOAD_HOME/bin/offload -t 'SH.SALES$' -x --target-name=SH.SALES
See also
This guide describes a number of Offload options, but a full list is provided in the Offload Parameters section of the Reference manual.
Partition-Based Offload¶
Partition-Based Offload enables some or all partitions of a partitioned table to be offloaded across one or more offload sessions. A common use of this feature is to initially offload the historical partitions for a table and then periodically offload new partitions as they arrive in the RDBMS, appending the data to the backend version of the table.
Partition-Based Offload can be used for the following scenarios:
Offloading Range-Partitioned Tables¶
Tables that are range-partitioned on DATE
, TIMESTAMP
or NUMBER
columns can be fully or partially offloaded. Partitions can be offloaded contiguously up to a boundary by specifying a high water mark in the offload
command. Partition boundaries can be increased in subsequent offloads to append additional partitions without affecting historical partitions.
Partition Boundary Options¶
To offload a contiguous range of partitions from a range-partitioned table, one of the following boundary options must be used:
--older-than-date
: OffloadDATE
orTIMESTAMP
partitions with a high water mark less than this value. Synonym for--less-than-value
, e.g.
--older-than-date=2015-10-01
--older-than-days
: Offload partitions older than this number of days (exclusive, i.e. the boundary partition is not offloaded). This option is suitable for keeping data up to a certain age in the source table and is an alternative to the--older-than-date
option. If both are supplied,--older-than-date
will be used, e.g.
--older-than-days=90
--less-than-value
: Offload partitions with a high water mark less than this value. Will accept integer or date values and allows intraday date and timestamp values to be specified, e.g.
--less-than-value=100000000
--less-than-value=2015-01-01
--less-than-value="2015-01-01 12:00:00"
--partition-names
: Offload partitions with a high water mark matching that of the named partition, e.g.
--partition-names=P201509
Boundary options for range-partitioned tables are exclusive and the partition that contains data for the specified high water mark will not be offloaded. Furthermore, Offload will not offload partial partitions. For example, suppose a partition has data with a value range of 2000-01-01 to 2000-01-31. Using --older-than-date=2000-01-15
will not offload data from 2000-01-01 to 2000-01-14 because the partition has the potential for data up to and including 2000-01-31. Instead, data will be offloaded up to the boundary of the previous range partition. To offload the example partition, an --older-than-date=2000-02-01
value must be used.
The following is an example of offloading all partitions below a given date boundary.
Example 3: Offload a Range of Date Partitions¶
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-10-01
By default, Example 3 above will use the RDBMS partition key column as the backend partition column (they can be different if required; see User-Defined Partitioning for details). It is common to use a different granularity for the backend partitions, especially useful when the backend system better supports either larger or fewer partitions. For example, an RDBMS table partitioned by day might be better offloaded to monthly partitions in a Hadoop-based backend. See --partition-granularity
for details of the default granularities for various scenarios and backends.
For range-partitioned tables with numeric partition keys, the partition granularity of the backend table must be specified. The command syntax differs slightly according to the target backend, as the following examples demonstrate.
Example 4: Offload a Range of Numeric Partitions (Hadoop)¶
This example offloads all partitions with a partition key value of less than 5,000,000,000 and creates backend partitions with a range of 10,000,000. This applies when offloading to Hadoop-based platforms.
$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
--less-than-value=5000000000 \
--partition-granularity=10000000
Note
It is also possible to specify a limit to the size of numeric partition keys (see --partition-digits
).
Example 5: Offload a Range of Numeric Partitions (BigQuery)¶
For Google BigQuery, the full range of potential partition key values must be specified when creating a table with numeric partitions, hence the additional Offload options in the following example.
$ $OFFLOAD_HOME/bin/offload -t SH.NUMERIC_PARTITIONED_FACT -x \
--less-than-value=5000000000 \
--partition-granularity=10000000 \
--partition-lower-value=1 \
--partition-upper-value=99999999999999999
See also
Fully Offloading Range-Partitioned Tables with Offload Type¶
Range-partitioned tables are also supported for full offload, using either the --offload-type
option with a value of FULL
or by excluding any of the partition boundary options described above. The following option combinations will fully offload a range-partitioned table:
--offload-type=FULL
: Offloads all partitions and creates a simple hybrid view that queries all the data from the backend--offload-type=FULL
plus one of the boundary options: Offloads all partitions and data to the backend but creates a composite hybrid view that queries the data below the specified boundary from the backend and data above the specified boundary from the RDBMSNone of the offload type or partition boundary options described above: This is the same as specifying
--offload-type=FULL
See also
See Offload Type Considerations for more details on support for partial and full offloads of partitioned tables.
Moving the Range High Water Mark in a Hybrid View¶
The partition boundary in a hybrid view can be modified by running a subsequent offload with a boundary option matching a partition value that has already been offloaded. No data will be copied from the RDBMS to the backend but the associated hybrid view(s) will be reconfigured to reflect the new settings. If no partition is matched by the boundary option values then the partition boundary will be aligned with the data that has already been offloaded.
Moving the partition boundary does not affect any data that has already been offloaded. It is a metadata operation that simply changes where hybrid queries source their results from above and below the boundary. As described in Offload Type Considerations, this technique can be useful to get the best trade-off between making current data available in the backend system as soon as possible but having the most frequently accessed data queried from the RDBMS until it becomes less-used.
MAXVALUE Partition Considerations¶
Offloads for range-partitioned tables with a MAXVALUE
partition as the latest partition will behave differently depending on the offload options used, as follows:
If neither the
--offload-type
option nor any of the partition boundary options are specified, theMAXVALUE
partition will be offloadedIf
--offload-type=FULL
is specified, either with or without a partition boundary option, theMAXVALUE
partition will be offloadedIf
--offload-type=INCREMENTAL
is specified, either with or without a partition boundary option, theMAXVALUE
partition will not be offloadedIf a partition boundary option is specified but not
--offload-type
, theMAXVALUE
partition will not be offloaded
Note
Offloading the MAXVALUE
partition for a table will prevent any further partition-based offloads for that table.
Multi-Column Partitioning¶
Oracle Database supports multi-column partitioning, whereby a table is partitioned on a set of columns rather than a single partition key (for example, PARTITION BY RANGE (yr, mon)
). Offload supports this for range-partitioned tables. Any boundary condition must be provided via the --less-than-value
option as comma-separated-values. For example:
--less-than-value=2015,10
Or:
--less-than-value=2015-10-31,200
By default, when offloading a multi-column partitioned table to Hadoop-based backends, the generated Hadoop table will be partitioned and subpartitioned to match the columns used in the RDBMS partitioning scheme. For example, an Oracle Database table partitioned by YR, MON
will be partitioned in Hadoop by yr
and then subpartitioned by mon
. This can also be controlled using the --partition-columns
option.
When offloading a multi-column partitioned table to Google BigQuery, the generated BigQuery table will be partitioned according to the leading partition key column only. The --partition-columns
option will only allow one column to be specified when offloading to Google BigQuery.
Offloading Interval-Partitioned Tables¶
Offload supports interval-partitioned tables in exactly the same way as range-partitioned tables.
Offloading List-Partitioned Tables¶
Tables that are list-partitioned on DATE
, TIMESTAMP
, NUMBER
, [N]VARCHAR2
or [N]CHAR
columns can be partially or fully offloaded. Discrete sets of one or more partitions can be offloaded by specifying either partition key values or partition names in the offload
command. Additional partitions can be offloaded in subsequent offloads without affecting historical partitions.
List Partition Specification¶
Partitions in a list-partitioned table can be offloaded using either of the options below:
--equal-to-values
: Offload partitions by partition key value, accepting partition key literals that match the RDBMS partitions, e.g.
--equal-to-values=2015-01-01
--equal-to-values=US --equal-to-values=AP --equal-to-values=EU
--equal-to-values=201501,201502,201503
--partition-names
: Offload partitions by partition name, e.g.
--partition-names=TRANSACTIONS_P2015Q1,TRANSACTIONS_P2015Q2
When using --equal-to-values
, each partition must have its own option specification, as shown above and in Example 6 below.
Example 6: Offloading Partitions from a List-Partitioned Table¶
The following example offloads two partitions from a date list-partitioned table:
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --equal-to-values=2015-01-01 --equal-to-values=2015-01-02
Example 7: Offloading Multi-Valued Partitions from a List-Partitioned Table¶
When using --equal-to-values
, each option must equal the full high values specification for a single partition. The following example offloads two multi-valued list partitions where the partition key is a numeric representation of year-month:
$ $OFFLOAD_HOME/bin/offload -t SH.TRANSACTIONS -x --equal-to-values=201501,201502,201503 --equal-to-values=201504,201505,201506
Offloads for tables with numeric list-partition keys does not require the --partition-granularity
option, but for offloading to Google BigQuery the --partition-lower-value
and --partition-upper-value
options must still be used. See Managing Backend Partitioning for further details.
Example 8: Offloading List-Partitions by Partition Name¶
The --partition-names
option is an alternative way to specify multiple partitions to offload and can be useful if partition names are well-formed and known. The following example offloads four partitions from a table that is list-partitioned on financial quarter:
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-names=SALES_P2015Q1,SALES_P2015Q2,SALES_P2015Q3,SALES_P2015Q4
Resetting the List High Water Mark in a Hybrid View¶
List partition offloading is additive. A partition that has previously been offloaded will not be re-offloaded, even if it is specified in multiple offload
commands. If there is a need to modify the hybrid view so that offloaded partitions are accessed from the RDBMS rather than the backend system, the --reset-hybrid-view
option can be used with the next offload. Any previously offloaded partitions matched by partition identification options will not be re-offloaded but will be reflected in the hybrid view predicates.
DEFAULT Partition Considerations¶
Offloads for list-partitioned tables with a DEFAULT
partition will behave differently depending on the offload options used, as follows:
If neither the
--offload-type
nor any of the partition identification options (--equal-to-values
,--partition-names
) are specified, theDEFAULT
partition will be offloadedIf
--offload-type=FULL
is specified, either with or without a partition identification option, theDEFAULT
partition will be offloadedIf
--offload-type=INCREMENTAL
is specified, either with or without a partition identification option, theDEFAULT
partition will not be offloadedIf a partition identification option is specified but not
--offload-type
, theDEFAULT
partition will not be offloaded
Note
Offloading the DEFAULT
partition for a table will prevent any further partition-based offloads for that table.
Offloading List-Partitioned Tables as Range¶
In cases where a list-partitioned table has been structured to mimic range partitioning, the table can be offloaded exactly as described in Offloading Range-Partitioned Tables. To use this feature the table should adhere to the following:
Each partition must have a single literal as its high value
All new partitions must be added with higher/newer high values than those that have already been offloaded
Supported data types must match those referenced in Offloading Range-Partitioned Tables
In this scenario, a DEFAULT
list partition will be treated in the same way that a MAXVALUE
partition is treated for range partition offload (see MAXVALUE Partition Considerations for details).
Backend partition granularities differ between range-partitioned tables and list-partitioned tables offloaded as range. See Managing Backend Partitioning for details.
Example 9: Offload a Set of List Partitions Using a Date Range Partition Boundary¶
The following example offloads all list partitions with a partition key value of less than 2015-02-01.
An example using --older-than-date
:
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-02-01
Boundary options are exclusive and will not offload a partition with a key that equals the supplied value.
If using the --partition-names
option to specify the partition to use as an offload boundary, the --offload-predicate-type
option must also be specified with a value of LIST_AS_RANGE
, otherwise Offload will treat the table as a basic list-partitioned table rather than use range-partitioned table behavior.
Subpartition-Based Offload¶
Tables that are range-subpartitioned on DATE
, TIMESTAMP
or NUMBER
columns can be fully or partially offloaded. With Subpartition-Based Offload, partial offloading is supported even if the top-level partitioning scheme of a table is unsupported, because the subpartition high values are used to determine the offload boundary. To offload a range-subpartitioned table by subpartition, the --offload-by-subpartition
option must be used.
Subpartition-Based Offload is useful for tables that are organized with ranged data (such as a time series or business date) at subpartition level, rather than partition level. In such cases, the subpartition boundaries can be used to determine the data to be offloaded, even when the partition-level data is also ranged.
When offloading a subpartitioned table with Partition-Based Offload, all subpartitions within the identified range or list partition(s) are offloaded as standard.
Example 10: Offload a Set of Range Subpartitions Using a Date Range Boundary¶
The following example demonstrates offloading all subpartitions below a date threshold of 2015-07-01:
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2015-07-01 --offload-by-subpartition
Subpartition Boundary Options¶
The Offload options for specifying subpartition boundaries are the same as for range partition offloading. See Partition Boundary Options for details.
Subpartition-Based Offload has an additional restriction, such that the boundary option value must guarantee that no subpartitions have potential for more data to be added. For example, suppose every partition in a table contains two range subpartitions with high values of 90 and 100. In this case, the boundary option of --less-than-value=100
identifies a common subpartition boundary across the entire partition set and this guarantees that all subpartitions below this threshold are “complete” and ready for offloading. However, suppose just one of the partitions has range subpartitions with high values of 90 and 110. In this case, --less-than-value=100
option would not be able to offload all subpartitions below this threshold because the boundary is not common (i.e. the subpartition with a high value of 110 cannot be guaranteed to be “complete”).
In cases where the specified boundary is not valid, Offload will terminate with a warning and recommend that a common boundary is identified and used. In many cases it is possible to project the boundary value forward until a value is found that guarantees the safe offloading and appending of “complete” subpartitions.
Boundary options are exclusive and will not offload subpartitions with a key that equals the supplied value.
Other Subpartition-Based Offload Behavior¶
Subpartition-Based Offload behavior is the same as Partition-Based Offload in almost all cases. Considerations such as MAXVALUE
subpartitions, full offloading with --offload-type
, moving the high water mark in the hybrid view, using multi-column subpartition keys and so on are the same for range subpartitions as they are for range partitions. See Offloading Range-Partitioned Tables for details. Note that any examples would need the --offload-by-subpartition
option to apply the same principle to range-subpartitioned tables.
Predicate-Based Offload¶
Offload supports the offloading of subsets of data (or an entire table) by predicate. This feature applies to any table type (partitioned or non-partitioned) and is known as Predicate-Based Offload.
With this feature, three offloading patterns are possible:
Simple Predicate-Based Offload: Offload one or more non-overlapping subsets of data from a table
Late-Arriving Predicate-Based Offload: Offload one or more late-arriving subsets of data from a partition that has already been offloaded with Partition-Based Offload
Intra-Day Predicate-Based Offload: Offload a new partition in non-overlapping subsets as soon as each subset of data is loaded
Predicates are provided using a simple grammar as described in Predicate Grammar.
Simple Predicate-Based Offload¶
Example 11: Offload a Subset of Data from a Table Using a Predicate¶
The following example demonstrates a simple Predicate-Based Offload scenario by offloading all data for the ‘Electronics` category.
$ $OFFLOAD_HOME/bin/offload -t SH.PRODUCTS -x --offload-predicate='column(PROD_CATEGORY) = string("Electronics")'
All data matching this predicate will be offloaded and the predicate will be added to the offload boundary in the hybrid view. Predicates are additive and any predicate that has been offloaded will not be re-offloaded, even if specified in future offload
commands. If there is a need to modify the hybrid view so that offloaded data matching one or more predicates is accessed from the RDBMS rather than the backend system, the --reset-hybrid-view
option can be used with the next offload.
Backend tables can be optionally partitioned when using simple Predicate-Based Offload. See Managing Backend Partitioning for more details.
Late-Arriving Predicate-Based Offload¶
The late-arriving scenario caters for tables that are primarily offloaded by partition, but sometimes have small subsets of data arriving after the partition has been offloaded. This scenario benefits from the efficiencies of Partition-Based Offload but with the flexibility of Predicate-Based Offload for “topping-up” the offloaded data for one or more partitions. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.
Example 12: Offload a Late-Arriving Subset of Data from a Previously-Offloaded Range Partition¶
In the following example, the SH.SALES table has previously been offloaded using Partition-Based Offloading up to and including data for 2019-01-15
. The Predicate-Based Offload command below is used to additionally offload a small set of data for product 123`
that has arrived late and been loaded into a partition that has already been offloaded.
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
--offload-predicate='column(TIME_ID) = datetime(2019-01-15) and column(PROD_ID) = numeric(123)' \
--offload-predicate-type=RANGE
To offload late-arriving data by predicate, the --offload-predicate
option must include the current offloaded boundary value for the partition key (TIME_ID = 2019-01-15
in this example), in addition to the predicate that identifies the late-arriving data (PROD_ID = 123
). All data matching this predicate will be offloaded but the predicate itself will not be added to the offload boundary or metadata for the hybrid view (only the existing Partition-Based Offload boundary will be used).
The --offload-predicate-type
option must also be used for this scenario, but as stated above, the table must be offloaded as a RANGE
or LIST_AS_RANGE
offload predicate type.
Intra-Day Predicate-Based Offload¶
This offload pattern can be used to switch between Partition-Based Offload and Predicate-Based Offload as required. It can be useful as a means to offload the majority of a partitioned table by partition boundaries, but swap to offloading subsets of data for a new partition as soon as they arrive (rather than wait for the entire partition to be loaded before offloading). This can ensure that data is available in the target backend earlier than would otherwise be possible. This pattern can only be used with range-partitioned tables or list-partitioned tables offloaded with range partition semantics.
Example 13: Offload a Subset of Data for a New Range Partition (Intra-Day Offloading)¶
In the following example, the historical data for the SH.SALES table is offloaded by Partition-Based Offload (range partitioned by TIME_ID) at time T0 to begin the offload lifecycle for this table. For new data, rather than wait a full day for an entire partition of data to be ready, data is instead offloaded as soon as each product set is loaded into the new partition (3 separate loads at times T1, T2, T3). When all loads and offloads have completed, the table metadata is reset (at time T4) and the table is ready to repeat the same Predicate-Based Offload pattern for the next processing cycle.
Time |
Description |
Command |
---|---|---|
T0 |
Offload all history by range partition |
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-01
|
T1 |
Offload first set of products loaded at time T1 |
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
--offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(123), numeric(234))' \
--offload-predicate-type=RANGE_AND_PREDICATE
|
T2 |
Offload second set of products loaded at time T2 |
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
--offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(345), numeric(456))' \
--offload-predicate-type=RANGE_AND_PREDICATE
|
T3 |
Offload third set of products loaded at time T3 |
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
--offload-predicate='column(TIME_ID) = datetime(2020-07-01) and column(PROD_ID) in (numeric(567), numeric(678))' \
--offload-predicate-type=RANGE_AND_PREDICATE
|
T4 |
All products now loaded and offloaded, reset the metadata for the table |
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --older-than-date=2020-07-02
|
The Predicate-Based Offload commands for T1-T3 require that the --offload-predicate
option includes a predicate for the Partition-Based Offload partition key and that it matches the value of the new data (TIME_ID=2020-07-01
in this example). The --offload-predicate-type
option must also be used (see option reference for valid values).
When the metadata is reset after the end of a sequence of Predicate-Based Offload commands (as per T4 above), either the same pattern of offloading can continue for the next partition or, if preferred, Partition-Based Offload can resume (i.e. wait for the next partition to be fully-loaded before offloading in one command as described in Partition-Based Offload).
Predicate Grammar¶
At present, the offload predicate DSL or grammar used for the --offload-predicate
option covers a simple range of possibilities, based on a <column> <operator> <value>
model.
Predicate Form¶
Predicates can take several forms:
<column> <operator> <value>
<column> [NOT] IN (<value>[, <value>, ...])
<column> IS [NOT] NULL
The DSL allows specification of any number of predicates in a range of forms, which can be logically combined and nested with AND
/OR
, for example:
<predicate>) AND|OR <predicate>
(<predicate> AND|OR <predicate>) AND|OR (<predicate>)
Parentheses are required to form logical predicate groups, as shown in the second example above. Whitespace is insignificant.
Column Form¶
Predicates always contain a single column written with the column
keyword and a “function-call” syntax. Columns are not case-sensitive and can include aliases. The following examples are all valid:
column(CUST_ID)
column(SALES.CUST_ID)
column(cust_id)
Columns must exist in the table being offloaded and expressions/functions are not supported.
Operator Form¶
The basic SQL operators are supported and are written as in SQL. Member operators and null-value expressions are also supported. The full range is as follows:
=
,!=
,<
,<=
,>
,>=
[NOT] IN
IS [NOT] NULL
Value Form¶
Values must be qualified with a data type classification and have a “function-call” syntax as follows:
string("This is a string")
numeric(123)
datetime(2020-07-01)
String values use the string
keyword and double-quotes as shown above. Double-quote characters can be included in the string value by escaping with a backslash.
Numeric values use the numeric
keyword and both signed integral and signed decimal values are supported with the same keyword.
Date/time values use the datetime
keyword and can be specified at several levels of precision from dates to timestamps with nanosecond precision. The following are all valid examples of a datetime
value:
datetime(2020-07-01)
datetime(2020-07-01 13:01:01)
datetime(2020-07-01 13:01:01.123)
datetime(2020-07-01 13:01:01.123456789)
Value lists for use in the [NOT] IN
operator must be surrounded by parentheses and separated by commas. The values in a value list must be specified as normal, for example:
column(cust_id) IN (numeric(120), numeric(121), numeric(122))
The predicate DSL currently supports columns of the following Oracle Database data types:
Table 2: Predicate-Based Offload DSL Data Type Support¶
DSL Data Type Class |
Supported Oracle Database Data Types |
---|---|
String |
|
Numeric |
|
Datetime |
|
Offloading Joins¶
Offload enables joins between fully and/or partially-offloaded tables to be offloaded to the backend. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided using offload
with the --offload-join
option, based on a simple grammar as described in Offload Join Grammar.
Example 14: Offloading a Join¶
The following example offloads a three-table join to the backend system.
$ $OFFLOAD_HOME/bin/offload -t sh.salescountry_join -x \
--offload-join "table(sales) alias(s)" \
--offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
--offload-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"
In this example, the SH.SALES table is partially offloaded and the other tables are fully offloaded. The offload boundary for the SH.SALES table is used for the Offload Join to ensure that all offloaded data for this table is aligned to a common boundary.
A hybrid view named SALESCOUNTRY_JOIN will be created in the hybrid schema for this join. Hybrid views created by Offload Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Offload Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.
Offload Join Grammar¶
When offloading joins, each table in the join has its own --offload-join
specification, as demonstrated in Example 14 above. The full grammar for this option is provided in Table 3 below.
Table 3: Offload Join Grammar¶
Directive and Syntax |
Description |
---|---|
TABLE([<hybrid_schema>.]<table-name>)
|
|
ALIAS(<table-alias>)
|
|
PROJECT(<*|<column>[, ..]>)
|
|
INNER-JOIN(<join-alias>)
|
|
JOIN-CLAUSES(<alias>.<column>
= <alias>.<column>[, ..])
JOIN-CLAUSES(<column>)
|
|
Note
Expansion of *
projections excludes the OFFLOAD_BUCKET_ID partition column that Offload automatically creates when offloading to Hadoop-based backends. For performance reasons (such as supporting parallel queries), it might be desirable to include OFFLOAD_BUCKET_ID in the final Hadoop table or view for an Offload Join. This can be achieved by explicitly including it in the PROJECT()
clause. For example PROJECT(*,offload_bucket_id)
will include all table columns plus the OFFLOAD_BUCKET_ID column. This is not required for offloads to Google BigQuery.
Materialized Offload Joins¶
By default, results of an offloaded join are materialized in a backend table of the name provided by the -t
option (SH.SALESCOUNTRY_JOIN in Example 14 above). Results do not have to be materialized: if using the --no-materialize-join
option, a view for the join will be created in the backend instead of a populated table.
Offload Join and Partially-Offloaded Tables¶
Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Offload Join. When an Offload Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the backend join view that is created in place of a materialized resultset.
Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:
Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Offload Join specification allows for different column names
Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match
Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly
Refreshing Offload Joins¶
Offload Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded. The guidelines are as follows:
If the Offload Join is materialized, the join needs to be refreshed
If the Offload Join is not materialized but contains one or more partially-offloaded tables, the join needs to be refreshed
If the Offload Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed
Refreshing Materialized Offload Joins¶
Materialized Offload Joins always need to be refreshed when the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in an Offload Join has been subject to further offloads:
If a fully-offloaded table is re-offloaded, the dependent materialized Offload Joins require a full refresh. To do this, the original Offload Join
offload
command can be repeated with the--reset-backend-table
option. Note that this will fully reset and re-populate the materialized Offload Join table with the new results of the joinIf a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Offload Joins can be incrementally refreshed. To do this, simply repeat the original Offload Join
offload
command and Offload will append new data to the materialized join table
Refreshing Non-Materialized Offload Joins¶
Non-materialized Offload Joins only need to be refreshed when the offload boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Offload Join offload
command can be repeated but must include the --force
or -f
option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.
Updating Offloaded Data¶
Gluent Data Platform supports the updating of offloaded data in several ways with Incremental Update. See the Incremental Update guide for details.
Offload Type Considerations¶
As described in the Concepts guide, there are three formats for a hybrid view (see Hybrid View Types) that determine where hybrid queries access their data:
100-0: All offloaded data is accessed from the backend table and none is accessed from the source RDBMS table
90-10: Partially offloaded data is accessed from the backend table and non-offloaded data is accessed from the source RDBMS table
100-10: All data is offloaded but only a some of the offloaded data is accessed from the backend table and some is accessed from the source RDBMS (based on a given boundary)
In most cases, the hybrid view type is created as default for a given offload operation, but the --offload-type
option can be used to provide greater flexibility over the initial and subsequent offloads of list-partitioned and range-partitioned tables. See Offload Type for a general description of offload type.
Default Offload Type Behavior¶
The following describes the default behavior for offload type when offloading a range-partitioned or list-partitioned table.
If no
--offload-type
is specified:Use of any of the partition boundary options offloads the table as
INCREMENTAL
with a default 90-10 hybrid viewAn offload without any of the partition boundary options offloads the table as
FULL
offload type with a default 100-0 hybrid view
If the
--offload-type
option is specified:Specifying
--offload-type=INCREMENTAL
with or without any of the partition boundary options offloads the table to match the boundary partition and defines the table as anINCREMENTAL
offload type with a supporting 90-10 hybrid viewSpecifying
--offload-type=FULL
without any of the partition boundary options offloads all data from the table and defines the table as aFULL
offload type with a supporting 100-0 hybrid viewSpecifying
--offload-type=FULL
with any of the partition boundary options offloads all data from the table and defines the table as aFULL
offload type with a specialized 100-10 hybrid view
If a table has a
DEFAULT
orMAXVALUE
partition:A
FULL
offload will offload theDEFAULT
/MAXVALUE
partitionAn
INCREMENTAL
offload will not offload anyDEFAULT
/MAXVALUE
partition
For subsequent offloads:
The offload type of a table is automatically picked up on all future offloads, unless overridden
The offload type of a table can be changed on future offloads by specifying a new
--offload-type
value (see Changing Offload Type below)
Note
Predicate-Based Offload does not support the 100-10 offload type pattern unless it is Late-Arriving Predicate-Based Offload.
Changing Offload Type¶
It is possible to change the offload type of a table on one or more subsequent offloads of the table (subject to some limitations). As described above, the offload type of a table is set on initial offload and is maintained for all future offloads (unless it is overridden by a new --offload-type
value on a later offload).
The rules and behavior for changing offload type from INCREMENTAL
to FULL
are as follows:
Any partitions not yet offloaded are offloaded (including any
DEFAULT
/MAXVALUE
if applicable), the table is re-defined as aFULL
offload type and the hybrid view is re-created accordingly
The rules and behavior for changing offload type from FULL
to INCREMENTAL
are as follows:
If a partition boundary option is used, any corresponding partitions are offloaded, the table is re-defined as an
INCREMENTAL
offload type and the hybrid view is re-created accordinglyIf the table has a
DEFAULT
orMAXVALUE
partition, it cannot be converted fromFULL
toINCREMENTAL
If the table has Incremental Update enabled, it cannot be converted from
FULL
toINCREMENTAL
Offload Data Types¶
When offloading data, Offload will map data types from the source RDBMS to the backend database automatically. However, as with most operations in Offload, it is possible to override many of the default data type mappings if required. See Tables 4 and 6 below for details.
Supported Data Types¶
Gluent Data Platform currently supports all of the Oracle Database data types listed in Table 4 below. Oracle Database tables that contain columns of any data type not listed cannot currently be offloaded.
Default Data Type Mappings¶
Table 4 lists the default data type mappings when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform) or Google BigQuery.
Table 4: Offload Default Data Type Mappings (Oracle Database)¶
Oracle Database |
Impala |
Google BigQuery |
Comments |
---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
- |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Data Sampling During Offload¶
Gluent Offload Engine fine-tunes the default data mappings by sampling data for all columns in the source RDBMS table that are either date or timestamp-based or defined as a number without a precision and scale.
By sampling numeric columns, Offload can choose an appropriate precision and scale for Impala DECIMAL
data and this can lead to performance improvements due to reduced CPU consumption when querying. This behavior can be disabled by adding the --no-auto-detect-numbers
option to the offload
command.
The Impala TIMESTAMP
data type supports a minimum date of 1400-01-01. Offload samples source date and timestamp-based columns to ensure that the data to be offloaded satisfies the Impala specification. In cases where Offload discovers data below the minimum specification, the column is offloaded to an Impala STRING
column. This behavior can be disabled with the --no-auto-detect-dates
option, but in this case the source data needs to be corrected before it can be offloaded.
Offload bases the volume of data to be sampled on the size of the RDBMS table. This can be overridden by adding the --data-sample-percent
option to the offload
command (specifying a percentage between 0 and 100, where 0 disables sampling altogether). Without sampling, all Impala decimal columns are defaulted to DECIMAL(38,18)
and no date or timestamp checking is performed.
Offloading Integral Data Types to Impala¶
When offloading a table to Impala, integer columns(i.e. NUMBER(*, 0)
) with a precision in the range of an Impala BIGINT
data type are automatically stored as BIGINT
. No attempt is made by Offload to downscale the Impala integral type in case the data precision grows beyond its type specification in the future. Offload safely defaults to BIGINT
because schema evolution of smaller integrals to BIGINT
cannot be actioned by a simple ALTER TABLE
statement in Impala (meaning a more complex data migration would be required to upscale the integral type if needed). However, it is possible to override this default behavior and users confident in a column’s future contents (for example, day of month or year) can override the mapping and safely utilize smaller integral types in Impala. See Override Data Type Mappings for details.
Source integer columns with a precision beyond the range of an Impala BIGINT
are automatically stored as DECIMAL(38,0)
.
Offloading to Google BigQuery Numeric¶
The Google BigQuery NUMERIC
data type has a specification of (38,9)
. This means that a BigQuery NUMERIC
can have a maximum of 29 digits to the left of the decimal point and a maximum of 9 digits to the right. For numeric data that exceeds this specification, Offload offers two options:
Use decimal rounding options during offload (see Decimal Scale Rounding below)
Offload numeric data to a floating point type (see Converting Numeric Data to Double below)
Important
Both of these options result in some data change and it will be for users to decide if this is a valid course of action.
Google BigQuery Floating Point Data Types¶
Google BigQuery provides a single 64-bit floating point data type: FLOAT64
. This means that the Oracle Database 32-bit BINARY_FLOAT
data type does not have a corresponding type in BigQuery and cannot currently be offloaded. Support for an override mapping to FLOAT64
will be added in a future release of Gluent Data Platform (although this will be potentially lossy - see Lossy Data Operations below).
Offloading to Google BigQuery Timestamp¶
At the time of writing, Google BigQuery’s DATETIME
and TIMESTAMP
data types both support microsecond precision, whereas Oracle Database supports up to nanosecond precision with its TIMESTAMP
and TIMESTAMP WITH TIME ZONE
data types. When Offload detects a source column with a higher level of precision than that supported by the backend (e.g. when attempting to offload TIMESTAMP(7)
to Google BigQuery), it will terminate. In such cases, the --allow-nanosecond-timestamp-columns
option will allow the offload operation to continue, but should ideally be used only when the data to be offloaded is known to be at microsecond precision or lower.
Important
Allowing sub-microsecond timestamp columns to be offloaded to Google BigQuery is potentially lossy - see Lossy Data Operations below for the implications of this).
Offloading Time Zoned Data¶
For Oracle Database columns of TIMESTAMP WITH TIME ZONE
data type, data is normalized to Coordinated Universal Time (UTC) during offload. For this reason, values returned from Smart Connector may appear different to the eye although they are temporally the same as those in the source RDBMS data. Offload converts numeric offsets, e.g. 2016-01-14 10:39:44 +5:30
, and relies on backend functionality to normalize named time zones, e.g. 2016-01-14 22:39:44 US/Pacific
.
Warning
Named timezones unknown to the backend distribution in use are silently lost when normalizing data to UTC. This can leave an offloaded timestamp value which is no longer temporally the same as the source. For this reason, Offload will, by default, reject tables with timezone-sensitive columns. If a user is confident that timezone data will not be lost, the --allow-time-zone-columns
option can be used to allow the offload to proceed. See Lossy Data Operations for the potential implications of lossy offloads.
Offloading Interval Data Types¶
Oracle Database INTERVAL DAY TO SECOND
and INTERVAL YEAR TO MONTH
data types are offloaded as strings in a normalized format, as shown in Table 5 below.
Table 5: Interval Data Type String Formats¶
Interval Type |
Sample Source Data |
Impala String Format |
Google BigQuery String Format |
---|---|---|---|
Day to second (+ve) |
+000000001 02:15:30.123456000 |
1 02:15:30.123456 |
+1 02:15:30.123456 |
Day to second (-ve) |
-000000001 02:15:30.123456000 |
-1 02:15:30.123456 |
-1 02:15:30.123456 |
Year to month (+ve) |
+000000001-01 |
1-1 |
+1-1 |
Year to month (-ve) |
-000000001-01 |
-1-1 |
-1-1 |
Smart Connector converts offloaded interval strings back to the correct interval type when executing hybrid queries from the RDBMS. Interval strings can also be presented back to Oracle Database interval types by using the appropriate Present option as required (see the Present guide for further details).
Override Data Type Mappings¶
When initially offloading a table, it is possible to override some of the default data type mappings to change the specification of the backend table. Table 6 lists the override mappings and associated options that are available when offloading data from Oracle Database to Impala (Cloudera Data Hub, Cloudera Data Platform) or Google BigQuery.
Table 6: Offload Override Data Type Mappings (Oracle Database)¶
Oracle Database |
Offload Option |
Impala |
BigQuery |
Comments |
---|---|---|---|---|
|
|
|
|
Impala versions below 3.3 will use |
|
|
|
||
|
|
|
||
|
|
|
|
Impala versions below 3.3 will use |
|
|
|
||
|
|
|
||
|
|
|
|
Use for |
|
|
|
Use for |
|
|
|
|
Use for |
|
|
|
|
Use for |
|
|
|
|
Use for |
|
|
|
|
Use for |
|
|
|
|
||
|
|
|
|
|
|
|
|
- |
Example 15: Overriding Data Types During an Offload¶
In the following example, the SH.SALES table is offloaded with several data type overrides.
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x \
--date-columns=TIME_ID \
--integer-4-columns=PROD_ID,CHANNEL_ID \
--decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD
Each override can be given one or more columns. The --decimal-columns
override can be used to fine-tune the precision and scale of the Impala DECIMAL
data type, which can lead to better space utilization and CPU reduction. This override option behaves differently to the others in that it must be paired with a --decimal-columns-type
option to define the precision and scale for the columns listed. Multiple different names/type pairings can be provided in the same command (see --decimal-columns
for further details).
Oracle Database number precision is specified in terms of the number of digits. Impala number precision is specified by the data type. Care should be taken if selecting an override target data type. The DECIMAL
data type can be used as a catch-all numeric data type in Impala but users are encouraged to consider native integral data types for reasons of performance and space utilization.
Managing Exceptional Data¶
No two systems align perfectly with their data type specifications and implementations. There are several cases where data in the source Oracle Database table might not be easily offloaded to the backend system. Some have already been described, but include:
Source RDBMS numbers with a precision greater than that supported by the backend (for example, BigQuery’s
NUMERIC
limit of 29 digits to the left of the decimal point)Source RDBMS numbers with a scale greater than that supported by the backend (for example, Impala’s
DECIMAL
scale limit of 38 digits or BigQuery’sNUMERIC
scale limit of 9)Source RDBMS dates or timestamps with a value below Impala’s lower bound of 1400-01-01
Source RDBMS timestamps with a higher level of precision than BigQuery’s microsecond limit
Source RDBMS 32-bit floating point numbers offloading to a backend system without a 32-bit floating point data type (e.g. BigQuery)
When exceptional data is detected, Offload will in most cases terminate with an exception (and in some cases provide advice on how to proceed). The only exception to this is with offloading outlying dates and timestamps to Impala, which will be automatically offloaded to a lossless string data type (unless this behavior is disabled by the user as described earlier).
Note
Offload will never attempt to automatically offload exceptional data when it would cause a potential data loss to do so. In all cases, the user must provide an explicit course of action by providing the appropriate options with the offload
command.
Decimal Scale Rounding¶
The Oracle Database NUMBER
data type is extremely flexible and caters for high-scale decimal data that exceeds the limits available in most target backend systems. To offload decimal data that exceeds the specification of the backend system, the --allow-decimal-scale-rounding
option must be used in conjunction with the --decimal-columns-type
and --decimal-columns
options to define the target type for the data. By using the rounding option, the user acknowledges that some data loss due to decimal rounding is likely and acceptable.
Converting Numeric Data to Double¶
Using the --double-columns
override option to offload high-precision/scale numeric data or 32-bit floating point data to a 64-bit double can lead to data loss and should only be considered when there is no alternative and the consequences are both understood and acceptable.
Lossy Data Operations¶
To summarize, the potentially-lossy offload operations and their corresponding options are as follows:
Rounding decimals: Using the
--allow-decimal-scale-rounding
option to round decimals that exceed the backend specification for decimal dataOffloading sub-microsecond timestamps: Using the
--allow-nanosecond-timestamp-columns
option to offload columns with a sub-microsecond specification to a backend system that doesn’t support the same level of precisionOffloading time zoned data: Using the
--allow-time-zone-columns
option to enable time zoned data to be offloadedConverting data to a 64-bit floating point number: Using the
--double-columns
option to offload decimal or 32-bit floating-point data that would not otherwise be possible
Important
Enabling lossy offloads can cause wrong results with some hybrid queries when data loss has occurred; such as when Smart Connector pushes down an equality or inequality predicate containing values from the lossy column(s). It is not advised to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins unless the offload was known to be lossless.
Offload Transport¶
Offload Transport describes the part of the offload process that:
Copies the source RDBMS data to a staging location
Validates the staged data
Loads the staged data into the backend target table
The two-phased approach of staging and loading data provides two main benefits:
Atomicity: The load from staging table to target table either fully succeeds or completely fails
Partitioning: For backend tables that use Gluent Offload Engine’s synthetic partitioning scheme, Offload can synthesize partition key values when loading the table
The transport phase of an offload is split into four main operations:
Transport Data to Staging¶
Gluent Offload Engine uses one of a number of tools to extract data from the source RDBMS. The appropriate tool will be chosen automatically based on configuration preferences. Each tool reads the source data to offload and attempts to split the source data equally and in a non-overlapping way between concurrent reader processes.
Data is staged in Avro format, native data types are used where possible and Avro STRING
is used where there is no identical match between Avro and the source RDBMS table. For example, Oracle Database DATE
, TIMESTAMP
and NUMBER
data types have no direct equivalent in Avro; therefore data is staged as STRING
and converted to the appropriate backend data type during the final loading phase.
The data extraction tools available to Offload are:
Apache Sqoop¶
Apache Sqoop is only considered for use when offloading to Hadoop-based backends. Offload will invoke Sqoop on the host defined by OFFLOAD_TRANSPORT_CMD_HOST
. Depending on the source RDBMS table, Sqoop will be used in either its standard table-centric mode or using a free-form query defined by Offload. The number of Sqoop mappers used when extracting data is defined by OFFLOAD_TRANSPORT_PARALLELISM
or per offload with the --offload-transport-parallelism
option. The SQOOP_QUEUE_NAME
parameter can be defined to submit the Sqoop job a specific YARN queue.
Default configuration is appropriate for the majority of offloads but occasionally the nature of the source table requires tuning of the following:
--offload-transport-fetch-size
can be used on a per-offload basis to overrideOFFLOAD_TRANSPORT_FETCH_SIZE
. This can be useful to manage memory requirements for tables with a large row size (such as when offloading LOBs), by reducing the fetch sizeSQOOP_OVERRIDES
or the per-offload option--offload-transport-jvm-overrides
can be used to inject JVM parameters into the Sqoop command line. This can be useful to override JVM memory defaults, for example,--offload-transport-jvm-overrides="-Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.map.memory.mb=2048"
Apache Spark¶
Three interfaces to Spark are supported by Offload: Spark Submit, Spark Thrift Server and Apache Livy. The interface used by Offload is chosen automatically, based on configuration. If multiple interfaces are configured for use, the order of priority is Spark Thrift Server, Apache Livy and Spark Submit. SparkSQL is used in all cases (via a PySpark script for Apache Livy and Spark Submit, or as pure SQL for Spark Thrift Server) to extract the data to be offloaded from the source RDBMS table.
When offloading to Hadoop-based backends it is expected that the Spark cluster and interfaces are configured components of the target Hadoop cluster. When offloading to cloud warehouses such as Google BigQuery it is typical to use Spark Standalone (Gluent Data Platform includes a Transport package containing Spark Standalone components for this purpose), although an existing Spark cluster can be utilized if available.
Spark Submit is available for use by Offload if OFFLOAD_TRANSPORT_CMD_HOST
and OFFLOAD_TRANSPORT_SPARK_SUBMIT_EXECUTABLE
are defined. When using a Spark Standalone cluster jobs will be submitted to the cluster defined in OFFLOAD_TRANSPORT_SPARK_SUBMIT_MASTER_URL
. In a Hadoop environment, YARN is used as the Spark master. OFFLOAD_TRANSPORT_SPARK_QUEUE_NAME
or --offload-transport-queue-name
can be used to submit the Spark job a specific YARN queue.
Spark Thrift Server is available for use by Offload if OFFLOAD_TRANSPORT_SPARK_THRIFT_HOST
and OFFLOAD_TRANSPORT_SPARK_THRIFT_PORT
are defined. The Spark Thrift server can be configured to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible.
Apache Livy is available for use by Offload when offloading to Hadoop-based backends if OFFLOAD_TRANSPORT_LIVY_API_URL
is defined. Livy can be configured (using OFFLOAD_TRANSPORT_LIVY_IDLE_SESSION_TIMEOUT
) to keep Spark executors alive between offloads, thereby removing process startup costs and providing low-latency offloads. At higher volumes this benefit becomes negligible. The maximum number of Livy sessions that Offload will start and use concurrently is defined by OFFLOAD_TRANSPORT_LIVY_MAX_SESSIONS
.
The number of tasks in an offload transport Spark job is defined by OFFLOAD_TRANSPORT_PARALLELISM
or per offload with the --offload-transport-parallelism
option. It should be noted that defining more tasks than there are available Spark executors will result in queuing. Therefore, OFFLOAD_TRANSPORT_PARALLELISM
should ideally be no more than the number of available executors.
Default configuration is appropriate for the majority of offloads but occasionally the nature of the source RDBMS table requires tuning of the following:
--offload-transport-fetch-size
can be used on a per offload basis to overrideOFFLOAD_TRANSPORT_FETCH_SIZE
. This can be useful to manage memory requirements for tables with a large row size (such as when offloading LOBs), by reducing the fetch sizeOFFLOAD_TRANSPORT_SPARK_OVERRIDES
or the per offload option--offload-transport-jvm-overrides
can be used to inject JVM parameters into the Spark Submit command line. This has no effect for Spark Thrift Server or Apache Livy because their configuration is managed independentlyOFFLOAD_TRANSPORT_SPARK_PROPERTIES
or the per offload option--offload-transport-spark-properties
can be used to modify Spark attributes. Note that because Spark Thrift Server and Apache Livy configuration is managed independently, some attributes do not have any effect
Query Import¶
Query Import is used for for low-volume offloads. Data is extracted and staged by Offload itself and not using an external tool. This avoids delays incurred when invoking Sqoop or Spark Submit. Non-partitioned tables with a source RDBMS size smaller than --offload-transport-small-table-threshold
are eligible for Query Import. Query Import does not run in parallel, i.e. OFFLOAD_TRANSPORT_PARALLELISM
is ignored.
RDBMS Options¶
Connections to the source RDBMS are made to the address defined in OFFLOAD_TRANSPORT_DSN
. This defaults to ORA_CONN
but can be overridden if data needs to be extracted from an alternative address (such as from a replica database).
Concurrent offload transport processes open independent sessions in the source RDBMS. If a high value for OFFLOAD_TRANSPORT_PARALLELISM
is required then consideration should be given to any session limits in the RDBMS. The OFFLOAD_TRANSPORT_CONSISTENT_READ
parameter or the per offload option --offload-transport-consistent-read
can be used to ensure that all concurrent extraction queries reference a specific point in time. When set to true
, extraction queries will include an AS OF SCN
clause. When the source RDBMS table or input partitions are known to be cold (i.e. not subject to any data modifications), this can be set to false
to reduce resource consumption.
If there is a requirement to modify offload transport RDBMS session settings, such as setting an Oracle Database initialization parameter (under the guidance of Gluent Support), then modification of OFFLOAD_TRANSPORT_RDBMS_SESSION_PARAMETERS
will be required.
Validate Staged Data¶
Once data has been staged, it is validated to ensure that the number of staged rows matches the number of rows read from the source RDBMS. While doing this Offload might also:
Check for
NULL
values in any custom partition scheme defined using the--partition-columns
option. A positive match results in a warningCheck for any
NaN
(Not a Number) values if the source table has floating point numeric data types. This is because source RDBMS and target backend systems do not necessarily treat NaN values consistently. A positive match results in a warningCheck for lossy decimal rounding as described in Decimal Scale Rounding
Check for source data that is invalid for the chosen backend partition scheme, such as numeric data outside of any
--partition-lower-value
/--partition-upper-value
range. A positive match results in an exception
Validate Type Conversions¶
Data types used for staging data will rarely match those of the backend target table. Data is converted to the correct type when it is loaded into the final target table. This stage therefore verifies that there will be no invalid conversions when loading. While this is a duplication of type conversions in the Load Staged Data phase, it provides the advantage of checking the data before the more compute-intensive data load and is able to report all columns with data issues in a single pass.
Example 16: Catching Invalid Data Type Conversions¶
In the following example, the SH.SALES table is offloaded to a Hadoop cluster with an invalid data type for two columns: the data in the PROD_ID and CUST_ID columns is not compatible with the user requested single-byte integer data type.
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --integer-1-columns=PROD_ID,CUST_ID
This results in the following exception:
CAST() of load data will cause data loss due to lack of precision in target data type in 6887232 rows
Failing casts are:
(`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
(`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
The SQL below will assist identification of problem data:
SELECT PROD_ID
, CUST_ID
FROM `sh_load`.`sales`
WHERE (`prod_id` IS NOT NULL AND CAST(`prod_id` AS TINYINT) IS NULL)
OR (`cust_id` IS NOT NULL AND CAST(`cust_id` AS TINYINT) IS NULL)
LIMIT 50
The exception provides three important pieces of information:
The number of rows with issues
The columns/conversions with issues
A SQL statement to use offline to review a sample of the problematic data
Load Staged Data¶
In this phase of an offload staged data is converted to correct data types as described in Validate Type Conversions and inserted into the target backend table. Where available, performance metrics from the backend system are recorded in the Offload log file written to $OFFLOAD_HOME/log
.
Offload Transport Chunks¶
The source RDBMS data is offloaded in “chunks”. A chunk can comprise an entire table or a set of one or more partitions. The sequence of four operations described above are executed for each chunk until the whole input set has been offloaded. A final task in this phase, for backend systems that support it, is to update statistics on the freshly offloaded data.
For partitioned RDBMS tables, data is offloaded in sets of partitions. Input partitions are grouped into chunks, based on either the number of partitions or cumulative partition size (including subpartitions where relevant). For non-partitioned tables the table itself is considered a single chunk.
The size limit for a partition chunk is defined by MAX_OFFLOAD_CHUNK_SIZE
or per offload with the --max-offload-chunk-size
option. The maximum number of partitions in a chunk is defined by MAX_OFFLOAD_CHUNK_COUNT
or per offload with the --max-offload-chunk-count
option. Both size and count thresholds are active at the same time, therefore the first threshold to be breached is the one that closes the chunk. If the first partition in a chunk breaches the size threshold then it will be included for offload and that chunk will be greater in size than the configured threshold.
Sets of partitions are an effective way of managing resource consumption. For example, if a source RDBMS typically has partitions of 8GB in size and the backend system can comfortably load 64GB of data at a time without impacting other users, MAX_OFFLOAD_CHUNK_SIZE
can be set to 64G
and roughly 8 partitions will be offloaded per chunk.
When calculating the cumulative size of partitions Offload uses RDBMS segment sizes. This is important to note because compressed data in the RDBMS may increase in size when extracted and staged.
Partitions are considered for chunking in the logical order they are defined in the source RDBMS. For example, partitions from an Oracle Database range-partitioned table are split into chunks while maintaining the order of oldest to most recent partition. Offload does not attempt to optimize chunking by shuffling the order of partitions because that would offload partitions out of sequence and be a risk to atomicity.
Data Storage Options¶
Gluent Data Platform supports several storage schemes for offloaded data, listed in Table 7 below.
Table 7: Offload Storage Options¶
Storage Scheme |
Offload Staging Data |
Offloaded Data |
|
---|---|---|---|
HDFS |
Hadoop only |
Hadoop only |
|
Amazon Simple Storage Service (S3) |
No |
Hadoop only |
|
Microsoft Azure Data Lake Storage Generation 1 (ADLS) |
No |
Hadoop only |
|
Microsoft Azure Data Lake Storage Generation 2 (ABFS) |
No |
Hadoop only |
|
Google Cloud Storage (GCS) |
BigQuery only |
No |
|
Storage preferences such as base HDFS file locations and/or cloud storage buckets can be specified as Offload command-line options, but it is more common to configure them once in the Gluent Data Platform environment file.
Gluent Offload Engine can offload data to cloud storage and requires a small amount of configuration, as follows:
OFFLOAD_FS_SCHEME
: The storage scheme to which the offloaded data will be persisted for Hadoop-based backends or staged to for Google BigQuery. Refer to Table 7: Offload Storage Options. An ad hoc override is available with the--offload-fs-scheme
optionOFFLOAD_FS_CONTAINER
: The name of the bucket or container to be used for offloads. An ad hoc override is available with the--offload-fs-container
optionOFFLOAD_FS_PREFIX
: The storage subdirectory defined within the bucket/container (or can be an empty string if preferred). An ad hoc override is available with the--offload-fs-prefix
option
When offloading to Hadoop-based backends:
A value of
inherit
can be used forOFFLOAD_FS_SCHEME
. With this value all tables created by Offload will inherit the location from the parent database. If databases are created using the--create-backend-db
option, a default location of HDFS will be usedSpecifying a storage scheme in cloud storage will persist the offloaded data to that location, but data must first be staged in HDFS as part of the two-phase loading process
Offloaded data is compressed. The default compression can be overridden with the
--storage-compression
optionThe default file format for all storage schemes is Parquet
When offloading to Google BigQuery:
The two-phase loading process stages data in Google Cloud Storage before persisting it to Google BigQuery
Google BigQuery has its own proprietary storage format
Regardless of the storage provider:
The default file format for temporary staging data is Avro
Staged data can be compressed with the
--compress-load-table
option
Managing Data Distribution¶
In addition to the variety of cloud storage options supported by Gluent Data Platform, Offload also provides options to manage the offloaded data distribution:
Backend partitioning: See Managing Backend Partitioning
Data distribution: See Backend Data Sorting/Clustering
Managing Backend Partitioning¶
When offloading data, backend tables can be partitioned in several ways, depending on the backend platform and user preferences:
Surrogate Partitioning for Bucketing¶
When offloading to Hadoop-based backends, Offload always includes a surrogate partition key named OFFLOAD_BUCKET_ID, in addition to any other partitioning scheme that might be defined for the table. This surrogate partition key is used internally by Gluent Data Platform to enable Smart Connector to scan backend data in parallel when executing a parallel hybrid query. The number of partition key values defaults to the DEFAULT_BUCKETS
configuration but can be overridden with the --num-buckets
option.
A hash function is used by Offload to distribute data across the OFFLOAD_BUCKET_ID partitions. Offload will try to pick the most selective column to distribute data evenly across the surrogate partitions, but in the absence of a primary key or column statistics for the source RDBMS table, it will default to the first table column. Users can override the hashing column by including the --bucket-hash-column
option with the initial offload
command.
The surrogate partition column is not required when offloading to Google BigQuery.
Inherited Partitioning¶
Tables offloaded with Partition-Based Offload or Subpartition-Based Offload are automatically partitioned in the backend with the (sub)partition key of the source table, unless overridden by the user. The granularity of the inherited (sub)partitions can be different to the source RDBMS (sub)partitions if required. In some cases it is mandatory to specify the granularity of the backend partition scheme (see Partition Granularity below for details)
For Hadoop-based backends, the inherited partition key can comprise multiple columns and is used in addition to the surrogate OFFLOAD_BUCKET_ID partition key described above. Backend partition key columns are always implemented by the addition of synthetic partition key columns. See Synthetic Partitioning below for details.
For Google BigQuery, only the leading (sub)partition key column will be used. In some cases, the backend partitioning might be implemented by a synthetic partition key column, but not always. See Synthetic Partitioning below for details.
User-Defined Partitioning¶
Tables can be offloaded with the --partition-columns
option to define a custom partitioning scheme for the backend table. This enables non-partitioned RDBMS tables to be partitioned in the backend if required, in addition to allowing (sub)partitioned RDBMS tables to be offloaded with a different partitioning scheme in the backend. It is also possible (and in some cases mandatory) to specify the granularity of the backend partitions (see Partition Granularity for details below) and user-defined partitioning supports more RDBMS data types than inherited partitioning.
For Hadoop-based backends, the user-defined partition key can comprise multiple columns and is added to the surrogate OFFLOAD_BUCKET_ID partition key described above. The following example offloads a partitioned table to a Hadoop backend and subpartitions the offloaded data by year (derived from a date) and product ID (in steps of 1000).
Example 17: Offload with User-Defined Partitioning¶
$ $OFFLOAD_HOME/bin/offload -t SH.SALES -x --partition-columns=TIME_ID,PROD_ID --partition-granularity=Y,1000
When offloading to a Hadoop-based backend, user-defined partitioning can include date/timestamp, numeric or string columns.
When offloading to Google BigQuery, string-based or decimal (i.e. non-integral numeric) columns cannot be used for user-defined partition keys and only one partition key column can be defined.
Synthetic Partitioning¶
Depending on the data type and backend system, Offload usually implements inherited and user-defined partitioning schemes with additional synthetic columns (one per source (sub)partition column). Synthetic columns are used to partition backend tables instead of the corresponding natural columns. Synthetic partition keys are used internally by Gluent Query Engine to generate partition-pruning predicates in hybrid queries and by Offload to ensure that the backend partition columns remain consistent across multiple Partition-Based Offload or Subpartition-Based Offload operations.
Synthetic partition key columns are named as a derivative of the granularity of the data and the name of the corresponding RDBMS column. The naming convention for each synthetic column is GL_PART_<granularity>_<column name>
. For example, a synthetic partition key column for an RDBMS table offloaded with a TIME_ID
partition key and a monthly granularity will be named GL_PART_M_TIME_ID
.
When a table is offloaded with partitioning to a Hadoop-based backend, synthetic partition column(s) are always generated by Offload. The resulting synthetic column(s) are always of type STRING
.
When a table is offloaded with partitioning to Google BigQuery, a synthetic partition key will be generated when:
The natural partition column is of a
DATE
orTIMESTAMP
BigQuery data type and the partition granularity is monthly or yearly. The synthetic partition key column will be of the same data type as its corresponding natural partition columnThe natural partition column is of a BigQuery
DATETIME
data type
In all other cases (i.e. the BigQuery column is an INT64
or it is a DATE
or TIMESTAMP
with a daily granularity), a synthetic column will not be generated and the existing table column will be used with native BigQuery partitioning.
Offload populates synthetic partition key columns with generated data when offloading based on the type and granularity of the data. Gluent Query Engine will generate predicates for the synthetic partition columns when pushing down predicates for the corresponding natural columns.
Partition Granularity¶
Partition granularity defines the range or contents of a single backend partition. In some cases, granularity must be defined (using the --partition-granularity
option) when offloading an RDBMS table for the first time, but in many cases, Offload will choose a default granularity based on several factors (including the type of RDBMS partitioning, the RDBMS (sub)partition key type and the backend platform).
Partition Granularity Behavior With Hadoop Backends¶
When offloading with partitioning to a Hadoop-based backend, the following granularity behavior applies.
If the RDBMS table is list-partitioned and the partition scheme is inherited:
Date/timestamp partition data defaults to a daily granularity
Numeric partition data defaults to a granularity of 1
String partition data defaults to the length of the RDBMS string partition column
In all other scenarios:
Date/timestamp partition data defaults to a monthly granularity
Numeric partitioning requires the
--partition-granularity
optionString partitioning requires the
--partition-granularity
option
Partition Granularity Behavior With Google BigQuery¶
When offloading with partitioning to Google BigQuery, the following granularity behavior applies.
Date/datetime/timestamp partition data defaults to daily granularity
If the RDBMS table is list-partitioned and the partition scheme is inherited, numeric partitioning defaults to a granularity of 1. In all other scenarios, numeric partitioning requires the
--partition-granularity
option. In all scenarios, the--partition-lower-value
and--partition-upper-value
options must be specifiedString partitioning is not supported in Google BigQuery
Backend Data Sorting/Clustering¶
In addition to partitioning, offloaded data can optionally be distributed according to sorting or clustering options provided by the backend. This can lead to more efficient data access (such as range scans) for hybrid queries and other users of the offloaded data. The following data distribution options are provided by the backends:
Impala: Data can be stored sorted by one or more columns
Google BigQuery: Tables can be clustered by up to four columns
To enable data distribution globally, set the OFFLOAD_SORT_ENABLED
configuration parameter to true
. Alternatively, this can be managed for individual offloads by adding the --offload-sort-enabled
option to the offload
command (set to true
), along with the --sort-columns
option to specify the columns for distributing the data in the backend.
Managing Hybrid Query Performance¶
In addition to creating the database objects and metadata required to enable Gluent Query Engine to execute hybrid queries, Offload also provides a solid foundation for good hybrid query performance. It does this through a combination of standard operations and configurable options, designed to prepare the hybrid environment for Smart Connector optimizations. Several of these are described below.
Managing Parallelism¶
By default, Offload prepares the hybrid environment for parallel query access in several ways, unless disabled by configuration:
Offload Buckets¶
As described earlier, in Hadoop-based environments, Gluent Offload Engine automatically partitions the backend table by a surrogate OFFLOAD_BUCKET_ID column to enable parallel access by multiple Smart Connector sessions. This can be configured globally as a default number of buckets (see DEFAULT_BUCKETS
) and/or specified per table with the --num-buckets
option. This should be set to the highest degree of parallelism required for tables that need to be accessed by parallel queries. It can be capped globally to avoid creating too many bucket partitions in the backend table. It can also be capped to limit the degree of parallel access (this applies to Impala only).
Small tables below a threshold (configured with the DEFAULT_BUCKETS_THRESHOLD
global parameter) are automatically offloaded with a single bucket.
Google BigQuery does not require this Gluent Data Platform data access optimization. Smart Connector utilizes the BigQuery Storage API for natively-managed parallel data retrieval.
Hybrid External Table Location Files¶
Hybrid external tables can be created with multiple location files. When doing so, parallel hybrid queries spawn multiple Smart Connector processes for parallel data retrieval. The number of location files for each hybrid external table is determined by either the NUM_LOCATION_FILES
global default parameter or per table with the --num-location-files
option.
Note that when offloading to Impala, the number of external table location files is always aligned with the number of offload buckets.
To disable parallel data retrieval for an external table, the --num-location-files
option should be added to the offload
command and should be set to 1.
Hybrid External Table Degree¶
By default, Offload copies the degree of parallelism (DOP) from the source RDBMS table to the hybrid external table when offloading. This ensures that hybrid queries are executed with the same default DOP as the original application queries. This can be configured globally with the HYBRID_EXT_TABLE_DEGREE
parameter or set per table with the --ext-table-degree
option. In most cases, the default suffices.
This does not apply to hybrid external tables that are created for Advanced Aggregation Pushdown; these are always created with a degree of 1.
Managing Statistics¶
Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Offload manages statistics for both the hybrid external table and the backend target table (where applicable).
Hybrid Object Statistics¶
When offloading an RDBMS table for the first time, Offload copies the table and column statistics from the source table to the base hybrid external table. No prorating or modification of the statistics occurs, even if the source table is partially-offloaded. In cases where partially-offloaded tables continue to be offloaded (for example, a Partition-Based Offload of the latest few partitions), Offload will not overwrite the external table statistics unless the source RDBMS statistics indicate that the source table has grown. This prevents the statistics from being downgraded if the source RDBMS table shrinks due to offloaded partitions being dropped from the RDBMS.
When offloading to Hadoop-based backends, hybrid external tables that support general and custom Advanced Aggregation Pushdown have their statistics derived from the backend table statistics. When offloading to Google BigQuery, these external tables currently have no statistics.
Regardless of backend, hybrid external tables that support count-only Advanced Aggregation Pushdown have their statistics derived by Offload.
Backend Table Statistics¶
When offloading to Hadoop-based backends, Offload will compute statistics on the backend target table every time it is loaded.
For Impala, default behavior for partitioned tables is to compute incremental statistics. For very large tables, or tables with a large number of partitions, this can sometimes lead to pressure for Impala resources. In such cases (and to save time on statistics operations in general), Offload provides a mechanism to copy statistics from the source RDBMS table to the backend target table. This is a metadata-only operation and as such is quick to process. The backend statistics are derived from the source RDBMS values. To specify that statistics should be copied rather than computed, the --offload-stats
option should be added with a value of COPY
to the offload
command.
Google BigQuery tables do not have statistics, so this step is skipped when offloading to the BigQuery platform.
Data Sorting/Clustering¶
Resetting an Offloaded Table¶
Offload provides a --reset-backend-table
option that can be added to an offload
command to fully reset a previously offloaded table. This option is useful when refreshing small offloaded tables such as reference tables or dimensions (these are commonly offloaded to support Offload Join operations). This option will cause Offload to drop all offloaded data from the backend, so it should be used carefully.
Important
The --reset-backend-table
option will drop the existing backend table and supporting objects. This option must only be used when it is certain that resetting the backend table will not cause data loss. Care must be taken to ensure that this option is never used if some of the previously-offloaded data cannot be replaced (for example, partitions that are dropped from the source RDBMS after offloading to the backend).
Offload Security¶
Gluent Data Platform interfaces with the security features of the RDBMS and backend data platforms, providing support for authentication, authorization and encryption features.
See the Security guide for details of supported security features and the Installation & Upgrade guide for implementation details.