Present

Documentation Conventions

  • Commands and keywords are in this font.

  • $OFFLOAD_HOME is set when the environment file (offload.env) is sourced, unless already set, and refers to the directory named offload that is created when the software is unpacked. This is also referred to as <OFFLOAD_HOME> in sections of this guide where the environment file has not been created/sourced.

  • Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.

Introduction

Present is a core component of Gluent Offload Engine that provides a simple and flexible way to query data stored in a backend data platform (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery, Snowflake) from an RDBMS (such as Oracle Database). This guide describes how to use Present for a variety of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Present guide, particularly the Present section.

Overview of the Present Process

To understand the Present software and its use-cases, options and operation, it is useful to start with a high-level understanding of a present process in general. Presenting a table or view from a backend system to an RDBMS requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems.

In the majority of cases, the present process is limited to the creation of hybrid objects and metadata operations. No data is involved and a present operation is therefore quick and lightweight.

In the less-common case of a materialized Present-Join operation (described in Presenting Joins), the present process will begin by creating and populating a table in the backend based on the join specification, but will then finish with the standard phase of creating hybrid objects and metadata to enable hybrid queries of the backend data. Because data population is involved, this use-case will usually take longer than a regular present process.

The various steps executed by Present are visible both on the terminal and in a log file, and the level of verbosity can be configured (see the LOG_LEVEL configuration parameter and the -v and --vv options for details).

Controlling the Present Process

The present process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per present. In addition to the remainder of this Present guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.

Present Objects

The following table lists the standard set of RDBMS and backend database objects that are created by a basic present of a backend table named DB.TABLE.

Table 1: Standard Objects Created by Present

Database | Object Type         | Object Naming Format   | Hybrid Object Type
-------- | ------------------- | ---------------------- | -------------------------------
RDBMS    | View                | DB_H.TABLE             | Hybrid view
RDBMS    | Table               | DB_H.TABLE_EXT         | Hybrid external table
RDBMS    | View                | DB_H.TABLE_AGG         | Aggregate hybrid view
RDBMS    | Table               | DB_H.TABLE_AGG_EXT     | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | DB_H.TABLE_AGG         | Aggregate rewrite rule
RDBMS    | View                | DB_H.TABLE_CNT_AGG     | Aggregate hybrid view
RDBMS    | Table               | DB_H.TABLE_CNT_AGG_EXT | Aggregate hybrid external table
RDBMS    | Rewrite Equivalence | DB_H.TABLE_CNT_AGG     | Aggregate rewrite rule

All RDBMS hybrid objects are created in a hybrid schema, which itself is pre-created by Gluent Data Platform. In Table 1 above, the backend database is named “DB” and therefore, in the RDBMS, a hybrid schema named “DB_H” is pre-created to own the hybrid objects that Present generates. In most cases, the hybrid schema is created to complement an existing RDBMS application schema (often due to Offload), but Present can be used with hybrid schemas that have no relationship to other application schemas in the RDBMS.

Depending on the present operation, other objects might be created, such as a backend materialized join table or a backend conversion view. These are described later in this guide.

See also

Glossary for definitions of the hybrid object types, and the Concepts and Performance guides for further information on their purpose and usage.

Present Use-Cases

The common use-cases supported by Present are:

Create a Hybrid Query Interface to Backend-Resident Data

With Present, it is possible to create a hybrid query interface to tables or views that reside in backend platforms such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery or Snowflake. This feature enables hybrid queries that are executed in the RDBMS to access data that is stored in the backend system. Some examples follow.

Example 1: Presenting a Backend Table

The following example presents a table from a backend system to the RDBMS.

$ . $OFFLOAD_HOME/conf/offload.env

$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x

Note

This example includes the sourcing of the Gluent Data Platform environment file for a present session. All other examples in this guide will assume this step has occurred.

As described in Overview of the Present Process, a basic present operation such as Example 1 involves the creation of hybrid objects and metadata in the RDBMS. No data is involved, making it a very fast operation. The target hybrid schema name is derived from the source database name and in this case is SH_H (there are other ways to specify the target schema, described later). Example 1 will create several hybrid objects in the SH_H hybrid schema, including a hybrid view named SH_H.BACKEND_TABLE that will provide the main query interface for RDBMS queries to use. Additional hybrid objects are created to support Gluent Data Platform optimizations.
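For illustration, once Example 1 completes, an RDBMS session can query the hybrid view like any other view (a hypothetical SQL*Plus session; the user and query are examples only):

$ sqlplus app_reader@ORCL

SQL> SELECT COUNT(*) FROM sh_h.backend_table;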

A log file containing all of the steps in the present process will be generated in $OFFLOAD_HOME/log. Using the -v or --vv option will result in more screen output during the present.
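For example, a present can be run with maximum verbosity and the most recent log file located afterwards (a sketch; log file names are generated by Present and will differ per run):

$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x --vv

$ ls -t $OFFLOAD_HOME/log | head -1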

Source Backend Object Rules

The source backend object can be a table or view and is identified by the -t option. There are some simple rules to correctly specifying the source object name with this option:

  • The source name must always fully match the backend database/dataset/schema and object name. This might seem obvious, but it is possible in some scenarios that the name of the backend database could be different to the target hybrid schema

  • Identifiers are case-sensitive when presenting tables or views from Google BigQuery or Snowflake. This means that the -t option value must be provided in the same identifier case as the source object. This does not apply to Hadoop-based backends
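As a sketch of the case-sensitivity rule, assuming a Google BigQuery dataset named sales_ds containing a table created as daily_sales (both lowercase), the -t value must match that case exactly:

$ $OFFLOAD_HOME/bin/present -t sales_ds.daily_sales -x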

Source/Target Naming Mismatches

Object naming rules can vary between systems and it is possible that a backend table or view to be presented will contain name formats that are not supported by the target RDBMS. For example, Impala, Google BigQuery and Snowflake support long identifier names and/or characters that are not supported on older versions of Oracle Database. Present automatically handles these scenarios by creating an intermediate view in the backend database/dataset/schema with generated identifiers that are supported by the target RDBMS. This view, known as a conversion view, provides the interface between the hybrid external table (used by the hybrid view) and the presented backend table or view. Conversion views have a naming format of <PRESENTED_OBJECT>_CONV.

Overriding the Names of Presented Hybrid Objects

By default, the RDBMS hybrid objects created by Present are named according to the source table or view name and the RDBMS hybrid schema name is derived from the source backend database/dataset/schema name. For example, if a table named SH.SALES is presented, a SH_H.SALES hybrid view will be created in the RDBMS, together with a set of other hybrid objects named with SH_H.SALES as a prefix (e.g. SH_H.SALES_EXT, SH_H.SALES_AGG and so on). Present allows users to override the target hybrid schema and/or the default object naming prefix using the --target-name option. This should be used when:

  • The source object contains characters that are not supported by the naming rules of the target RDBMS

  • Gluent Data Platform is configured to use multitenant naming and hence the backend database has an additional prefix in its name that the hybrid schema doesn’t have (see Multitenant Naming Schemes for details)

  • The hybrid schema name is not derived from the backend database/dataset/schema name

  • The naming standards of the target RDBMS application are different to the source backend database/dataset/schema

  • The backend database/dataset/schema name is longer than the schema name limits imposed by the RDBMS and Gluent Offload Engine

  • An aggregation or join is being presented (see Custom Advanced Aggregation Pushdown Rules and Presenting Joins for details)

Examples of the --target-name option are provided in Examples 2 and 3 below and in other applicable use-cases later in this guide.

Example 2: Overriding the Names of Presented Hybrid Objects
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_NEW_NAME

This example presents a backend table or view named SH.SALES, but the RDBMS hybrid objects it creates are renamed with the SALES_NEW_NAME target specification (for example, the hybrid view will be named SH_H.SALES_NEW_NAME, the hybrid external table SH_H.SALES_NEW_NAME_EXT, and so on).

Example 3: Overriding Hybrid Schema Name

In the following example, a source backend view is being presented from a database named SH to an RDBMS hybrid schema named DATA_LAKE_H. The view name is not modified and will be used as the hybrid object naming prefix in the RDBMS.

$ $OFFLOAD_HOME/bin/present -t SH.SALES_VIEW -x --target-name=DATA_LAKE_H.SALES_VIEW

In this example, Present will create a base hybrid view named DATA_LAKE_H.SALES_VIEW, supported by other hybrid objects prefixed with DATA_LAKE_H.SALES_VIEW.

Note

When an object is presented with an overridden name, any further presents of this base object (such as an aggregation) require the override name to be provided with the --base-name option. See Custom Aggregations and Present Name Overrides for details.

Custom Advanced Aggregation Pushdown Rules

As described earlier, when a table or view is presented to an RDBMS, Present creates a range of hybrid objects, including a default pair of aggregation objects to support a Gluent Data Platform query optimization known as Advanced Aggregation Pushdown. This optimization enables Gluent Query Engine to push down aggregations from RDBMS hybrid queries to the backend platform. By default, Present creates a “general” aggregation rule that caters for every column and aggregation possible for the presented table and a simple “count” aggregation rule that optimizes COUNT(*) queries. The same is also true when a table is offloaded from an RDBMS to a backend platform.

With Present, it is possible to create additional Advanced Aggregation Pushdown rules by presenting custom combinations of columns and aggregations to suit the hybrid queries that can benefit from them. A custom aggregation rule can only be created if the source backend table or view has already been presented with a standard present operation, or if it is being created for an RDBMS table that has already been offloaded.

Presenting custom aggregation rules is useful when:

  • The base table has a large number of columns (and therefore the default general aggregation rule has many columns to project)

  • Critical queries require a much more focused aggregation pushdown that eliminates any potential for aggregating more columns than required

Custom aggregation rules are simple to present, as Example 4 below demonstrates.

Example 4: Present a Custom Aggregate View

The following example presents a simple custom aggregation rule on a previously-offloaded SH.SALES table.

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_BY_TIME_PRODUCT --aggregate-by=TIME_ID,PROD_ID

With this example, a custom aggregation is created that includes a hybrid view and aggregation rewrite rule named SH_H.SALES_BY_TIME_PRODUCT. Due to the limited aggregate column specification (provided by the --aggregate-by option), the SH_H.SALES_BY_TIME_PRODUCT hybrid aggregation view will project fewer columns than the general SH_H.SALES_AGG hybrid view created by a standard present.

Fine-Tuning Custom Aggregation Projection

When presenting a custom aggregation, the dimension (or GROUP BY) columns are specified by the --aggregate-by option (two columns in Example 4 above). By default, all table columns are included as aggregated columns (measures), with a range of aggregate functions applied to each measure as applicable to its data type. However, it is also possible to define the aggregated measure columns and the aggregation functions that are applied to each. The options are:

  • Aggregation columns (dimensions) are specified with the --aggregate-by option (shown in Example 4 above)

  • Aggregated columns (measures) are specified with the --columns option

  • Aggregate functions applied to measures are specified with the --numeric-fns, --date-fns and/or --string-fns options as applicable

Example 5: Present a Limited Projection Custom Aggregate View

The following example repeats Example 4, but in this case the aggregated columns (measures) are limited and the aggregate functions that are applied to the measures are reduced to create a more streamlined aggregation view.

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_BY_TIME_PRODUCT \
      --aggregate-by=TIME_ID,PROD_ID --columns=TIME_ID,PROD_ID,QUANTITY_SOLD \
      --numeric-fns=SUM,COUNT --string-fns=COUNT,MAX --date-fns=MIN,MAX

All available options have been used for demonstration purposes (although it’s more common to define a custom aggregation rule without the function overrides). See the Reference guide for a description of each of the options and their accepted values.

Custom Aggregations and Present Name Overrides

When presenting a custom aggregation rule for a table that was presented with an overridden name (see Example 3: Overriding Hybrid Schema Name for an example), the name of the base hybrid view must be provided. To do this, the --base-name option must be included in the present command. Example 6 below demonstrates this option.

Example 6: Presenting A Custom Aggregation with Base Name Override

The following example presents a custom aggregation of the backend SH.SALES table. Assuming that this table was first presented with a --target-name=SH_H.SALES_NEW_NAME option, a custom aggregation can be created as follows.

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x \
      --target-name=SH_H.SALES_BY_TIME_PRODUCT --base-name=SH_H.SALES_NEW_NAME \
      --aggregate-by=TIME_ID,PROD_ID

The base name option enables Present to use the correct base hybrid metadata when creating the hybrid objects and metadata for the custom aggregation rule.

Presenting Joins

Present enables joins between presented tables/views and/or offloaded tables to be presented as custom hybrid join objects to the RDBMS. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided using Present with the --present-join option, based on a simple grammar as described in Present Join Grammar.

Example 7: Presenting a Join

The following example presents a three-table join to the RDBMS. For this example, assume that the SH.SALES table is partially offloaded from the RDBMS to the backend system, but the other tables are presented from the backend to the RDBMS (i.e. their data resides only in the backend system).

$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x \
      --present-join "table(sales) alias(s)" \
      --present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"

This command will create two views named SALESCOUNTRY_JOIN:

  1. SH.SALESCOUNTRY_JOIN: a logical view of the join created in the backend SH database

  2. SH_H.SALESCOUNTRY_JOIN: a hybrid view created in the RDBMS hybrid schema to support the pushdown of equivalent joins in hybrid queries to the backend. Other supporting hybrid objects will also be created (such as a hybrid external table named SH_H.SALESCOUNTRY_JOIN_EXT)

Hybrid views created by Present Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Present Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.

When to Present a Join

When at least one table in the join resides only in the backend, use Present with the --present-join grammar option. By default, this will create a logical view of the join in the backend database (although this can be materialized as a table with the --materialize-join option). Note that any backend-only tables will have to be presented to the RDBMS before they can be used in a presented join.

When to Offload a Join

When all tables in the join have been offloaded from the RDBMS to the backend system, use Offload with the --offload-join grammar option. By default, this will materialize the results of the join in a backend table (this can be disabled with the --no-materialize-join option and a logical view will be created in the backend instead). See Offloading Joins for more details.
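For comparison, a hypothetical Offload Join command might look as follows (a sketch only; it assumes the offload utility in $OFFLOAD_HOME/bin and that both tables are fully offloaded. See Offloading Joins for the authoritative syntax):

$ $OFFLOAD_HOME/bin/offload -t sh.sales_customers_join -x \
      --offload-join "table(sales) alias(s)" \
      --offload-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)"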

Present Join Grammar

When presenting joins, each table or view in the join has its own --present-join specification, as demonstrated in Example 7 above. The full grammar for this option is provided in Table 2 below.

Table 2: Present Join Grammar

TABLE([<hybrid_schema>.]<table-name>)

  • The table name to be included in the join

  • This directive is mandatory

  • Can include the hybrid schema name if joining objects from different schemas

ALIAS(<table-alias>)

  • Alias to be used for the join

  • Defaults to <table-name> if not specified

PROJECT(<*|<column>[, ..]>)

  • List of columns selected from <table-name>. * is expanded to all columns

  • Columns in join clauses are added to the list

  • Partition key columns used are added to the list

INNER-JOIN(<join-alias>)

  • Alias to join with. Required for all but the first table

JOIN-CLAUSES(<alias>.<column> = <alias>.<column>[, ..])
JOIN-CLAUSES(<column>)

  • Clause defining the relationship between <table-name> and <join-alias>

  • The single-column form is expanded to the full syntax above, assuming <column> exists in both tables

Materialized Present Joins

By default, a presented join is represented in the backend by a view of the name provided by the -t option (SH.SALESCOUNTRY_JOIN in Example 7 above). Results can instead be materialized to a table by using the --materialize-join option.
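For example, the join from Example 7 could be materialized by repeating the command with the additional option (a sketch based on Example 7):

$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x \
      --present-join "table(sales) alias(s)" \
      --present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --materialize-join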

Present Join and Partially-Offloaded Tables

Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Present Join. When a Present Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the supporting backend join view.

Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:

  • Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Present Join specification allows for different column names

  • Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match

  • Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly

Refreshing Present Joins

Present Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded or the data in a presented member table changes. The guidelines are as follows:

  • If the Present Join is materialized, the join needs to be refreshed

  • If the Present Join is not materialized but contains one or more partially-offloaded tables that is further offloaded, the join needs to be refreshed

  • If the Present Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed

Refreshing Materialized Present Joins

Materialized Present Joins always need to be refreshed when either the data in a presented member table changes or the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in a Present Join has been subject to change:

  • If the data for a backend-only presented table changes, a dependent materialized Present Join table requires a full refresh. To do this, the original Present Join present command can be repeated with the --reset-backend-table option. Note that this will fully reset and re-populate the materialized Present Join table with the new results of the join (see the sketches after this list)

  • If a fully-offloaded table is re-offloaded, a dependent materialized Present Join table requires a full refresh. To do this, the original Present Join present command can be repeated with the --reset-backend-table option. Note that this will fully reset and re-populate the materialized Present Join table with the new results of the join

  • If a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Present Join table can be incrementally refreshed. To do this, simply repeat the original Present Join present command and the new data will be appended to the materialized join table
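Assuming the materialized variant of Example 7, the two refresh methods would look as follows (sketches only). A full refresh re-creates and re-populates the materialized join table:

$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x \
      --present-join "table(sales) alias(s)" \
      --present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --materialize-join --reset-backend-table

An incremental refresh repeats the original command unchanged, appending the newly-offloaded data:

$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x \
      --present-join "table(sales) alias(s)" \
      --present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)" \
      --materialize-join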

Refreshing Non-Materialized Present Joins

Non-materialized Present Joins only need to be refreshed when they include partially-offloaded tables and the boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Present Join present command can be repeated but must include the --force or -f option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.
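Continuing the Example 7 sketch, a non-materialized join view would be refreshed as follows:

$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x -f \
      --present-join "table(sales) alias(s)" \
      --present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
      --present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"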

Schema Evolution

Schema evolution is possible in environments that have been offloaded or presented using Gluent Offload Engine. There are two methods for schema evolution in environments that use Gluent Data Platform:

  1. Automated: Gluent Offload Engine includes a utility named Schema Sync that automates the propagation of additional columns on offloaded RDBMS tables

  2. Manual: Schema changes not automated by Schema Sync can be performed manually, including the use of Present (as described below)

Schema Evolution with Schema Sync

See Schema Sync.

Schema Evolution with Present

There are two phases to schema evolution with Present:

  1. Structural changes (DDL) to user objects, including presented backend tables/views and/or offloaded RDBMS tables, are managed by the user

  2. Once all DDL changes are complete, Present is used for each modified object to refresh the corresponding hybrid schema objects

Schema Evolution Process (Presented Table/View)

If a backend table or view that has been presented to an RDBMS is modified such that the columns change, the schema evolution process is as follows:

  1. Modify the backend table or view (user DDL)

  2. Compute statistics (not applicable to Google BigQuery, Snowflake or to a presented view)

  3. Present the table or view to refresh the hybrid objects in the RDBMS
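As an illustration for an Impala-based backend, the three steps might look as follows (a hypothetical sketch; the column name and type are examples only):

$ impala-shell -q "ALTER TABLE sh.backend_table ADD COLUMNS (discount_code STRING)"

$ impala-shell -q "COMPUTE INCREMENTAL STATS sh.backend_table"

$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x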

Schema Evolution Process (Offloaded Table)

If an offloaded RDBMS table is modified such that the columns change and the backend supports the equivalent column operation, the schema evolution process is as follows:

  1. Modify the RDBMS table (user DDL)

  2. Modify the equivalent table in the backend, ensuring data is updated if required (user DDL). The data type mappings used by Gluent Offload Engine should be used as a guide for defining any corresponding backend columns (see Offload Data Types)

  3. Compute statistics for the backend table (not applicable to Google BigQuery or Snowflake)

  4. Present the offloaded table to refresh the hybrid schema
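A hypothetical sketch of these steps for an Oracle Database to Impala environment (the column name and types are examples only):

SQL> ALTER TABLE sh.sales ADD (promo_code VARCHAR2(10));

$ impala-shell -q "ALTER TABLE sh.sales ADD COLUMNS (promo_code STRING)"

$ impala-shell -q "COMPUTE INCREMENTAL STATS sh.sales"

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x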

Supported Schema Evolution Operations

Schema evolution supports the following operations:

  • Add column(s) (also automated with Schema Sync)

  • Rename column(s)

  • Drop column(s)

Some operations might not be supported by the RDBMS or backend system that Gluent Data Platform is being used with. More complex schema changes, such as partitioning scheme changes, are not currently supported.

Impala Restrictions

There are a few restrictions to be aware of when manually applying schema evolution changes to an Impala-based backend (such as Cloudera Data Hub or Cloudera Data Platform):

  • Impala does not currently support default column values. If adding a new column to an offloaded RDBMS table causes historical data to be modified, then the RDBMS table would need to be re-offloaded to Impala or the backend data updated manually

  • New columns do not have statistics by default and adding columns in Impala can, in some cases, invalidate table statistics. For these reasons the user should compute statistics before refreshing the RDBMS hybrid schema with Present

  • A schema evolution operation that requires the Impala data type to change from INT to BIGINT will require data migration because this type change cannot be performed natively by ALTER TABLE

  • When dropping columns in an Impala cluster, the PARQUET_FALLBACK_SCHEMA_RESOLUTION query parameter must be set to name. This can be done via the HS2_SESSION_PARAMS parameter in the Gluent Data Platform environment file (see the sketch after this list)

  • On Cloudera Data Platform Private Cloud and Cloudera Data Platform Public Cloud, it is not possible to drop a column in an external table, even with the PARQUET_FALLBACK_SCHEMA_RESOLUTION query parameter set to name
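The session parameter setting described above might look as follows in the environment file (an assumed sketch; confirm the exact HS2_SESSION_PARAMS syntax against the Reference guide):

export HS2_SESSION_PARAMS="PARQUET_FALLBACK_SCHEMA_RESOLUTION=name"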

Google BigQuery Restrictions

At the time of writing, Google BigQuery does not support the following schema modifications:

  • Default column values

  • Rename column(s)

  • Drop column(s)

  • Changing a column’s data type

  • Changing a column’s mode (aside from relaxing REQUIRED columns to NULLABLE)

Common workarounds to these restrictions include:

  • SQL: Create and use a view with the desired column names or alias the column(s) that need to be renamed in SQL queries

  • Recreate the table: Use Create Table As Select (CTAS) or export/import to create a new table with all column and data changes as required

  • Use an update: If adding new column(s) to an offloaded RDBMS table causes historical data to be modified, the new column(s) in Google BigQuery would need to be updated accordingly. If this is not possible from the available data, re-offloading the table from the RDBMS could be considered
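As a sketch of the recreate-the-table workaround using the bq command-line tool (dataset, table and column names are hypothetical):

$ bq query --use_legacy_sql=false \
      'CREATE TABLE sales_ds.daily_sales_v2 AS
       SELECT * EXCEPT (old_name), old_name AS new_name
       FROM sales_ds.daily_sales'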

Snowflake Restrictions

At the time of writing, Snowflake does not support the following schema modifications:

  • Adding a default value to an existing column

  • Changing a default value on an existing column (other than sequence defaults)

  • Changing a column’s data type

  • Reducing a string column’s length or changing a numeric column’s scale

Common workarounds to these restrictions include:

  • Recreate the table: Use Create Table As Select (CTAS) or export/import to create a new table with all column and data changes as required

  • Use an update: If adding default values to existing column(s) of an offloaded RDBMS table causes historical data to be modified, the modified column(s) in Snowflake would need to be updated accordingly. If this is not possible from the available data, re-offloading the table from the RDBMS could be considered
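A comparable recreate-the-table sketch for Snowflake using the snowsql client (table and column names are hypothetical):

$ snowsql -q "CREATE TABLE sh.sales_v2 AS SELECT prod_id, cust_id, amount_sold AS amount FROM sh.sales"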

Present Data Types

When presenting data, Gluent Offload Engine will map data types from the backend database to the RDBMS automatically. However, as with most operations in Present, it is possible to override many of the default data type mappings if required. See Tables 3 and 4 below for default mappings and override options.

Supported Data Types

Gluent Offload Engine currently supports all of the Impala, Google BigQuery and Snowflake data types listed in Tables 3a-c below. For each backend system, tables that contain columns of any data type not listed cannot currently be presented.

Default Data Type Mappings

Table 3a lists the default data type mappings when presenting data from Impala (Cloudera Data Hub, Cloudera Data Platform) to Oracle Database.

Table 3a: Present Default Data Type Mappings (Impala)

Impala    | Impala Aliases | Oracle Database | Comments
--------- | -------------- | --------------- | --------------------------------
STRING    |                | VARCHAR2        | See Presenting Unbounded Strings
VARCHAR   |                | VARCHAR2        | See Presenting Unbounded Strings
CHAR      |                | CHAR            |
TINYINT   |                | NUMBER          |
SMALLINT  |                | NUMBER          |
INT       |                | NUMBER          |
BIGINT    |                | NUMBER          |
DECIMAL   |                | NUMBER          |
FLOAT     |                | BINARY_FLOAT    |
DOUBLE    | REAL           | BINARY_DOUBLE   |
DATE      |                | DATE            | Impala 3.3+
TIMESTAMP |                | TIMESTAMP       |

Table 3b lists the default data type mappings when presenting data from Google BigQuery to Oracle Database.

Table 3b: Present Default Data Type Mappings (Google BigQuery)

Google BigQuery | Google BigQuery Aliases | Oracle Database          | Comments
--------------- | ----------------------- | ------------------------ | ---------------------------------------------
STRING          |                         | VARCHAR2                 | See Presenting Unbounded Strings
BYTES           |                         | RAW                      |
INT64           |                         | NUMBER                   |
NUMERIC         | DECIMAL                 | NUMBER                   |
BIGNUMERIC      | BIGDECIMAL              | NUMBER                   | See Presenting Google BigQuery Bignumeric
FLOAT64         |                         | BINARY_DOUBLE            |
DATE            |                         | DATE                     |
DATETIME        |                         | TIMESTAMP                |
TIME            |                         | VARCHAR2                 |
TIMESTAMP       |                         | TIMESTAMP WITH TIME ZONE | Google BigQuery stores TIMESTAMP data as UTC

Table 3c lists the default data type mappings when presenting data from Snowflake to Oracle Database.

Table 3c: Present Default Data Type Mappings (Snowflake)

Snowflake     | Snowflake Aliases                        | Oracle Database          | Comments
------------- | ---------------------------------------- | ------------------------ | ---------------------------------
VARCHAR       | CHAR, CHARACTER, STRING, TEXT            | VARCHAR2, CLOB           | See Presenting Snowflake Varchar
BINARY        | VARBINARY                                | RAW, BLOB                | See Presenting Snowflake Binary
INTEGER       | INT, BYTEINT, TINYINT, SMALLINT, BIGINT  | NUMBER                   |
NUMBER        | DECIMAL, NUMERIC                         | NUMBER                   |
FLOAT         | FLOAT4, FLOAT8, DOUBLE [PRECISION], REAL | BINARY_DOUBLE            |
DATE          |                                          | DATE                     |
TIME          |                                          | VARCHAR2                 |
TIMESTAMP_NTZ | DATETIME                                 | TIMESTAMP                |
TIMESTAMP_TZ  |                                          | TIMESTAMP WITH TIME ZONE |

Presenting Unbounded Strings

Impala and Google BigQuery do not expose length metadata for STRING columns. By default, Present will define all corresponding Oracle VARCHAR2 columns as VARCHAR2(4000 CHAR), unless the --detect-sizes option is used. If this option is specified, Present will sample the data in the backend STRING columns to determine the column character lengths to use when defining the corresponding hybrid external table in Oracle.
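For example (a sketch):

$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x --detect-sizes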

Presenting Google BigQuery Bignumeric

The Google BigQuery BIGNUMERIC data type supports a higher combined precision and scale than the Oracle NUMBER data type. When querying presented BIGNUMERIC data with a scale that exceeds the limits of the Oracle NUMBER data type, some data accuracy will be lost as decimals are rounded to the number of digits available for scale in Oracle. This can also affect the results of aggregate functions such as SUM and AVG, where the order of rounded data inputs can affect the final results.

Presenting Snowflake Varchar

By default, Snowflake VARCHAR columns with length <= 4000 will be presented to Oracle as VARCHAR2 (unless overridden; see Presenting Large Strings) and Snowflake VARCHAR columns with length > 4000 will be presented as CLOB. Snowflake VARCHAR columns are presented to Oracle Database using character length semantics. For Snowflake tables that store multibyte characters, it is important that the character set of the Oracle database supports the characters that are to be presented (i.e. to ensure parity between the data and column lengths).

Presenting Snowflake Binary

By default, Snowflake BINARY columns with length <= 2000 will be presented to Oracle as RAW (unless overridden; see Presenting Large Binary Values) and Snowflake columns with lengths > 2000 will be presented as BLOB.

Override Data Type Mappings

When presenting a table, it is possible to override some of the default data type mappings to change the specification of the target RDBMS hybrid external table and hybrid view. Table 4a lists the override mappings and associated options that are available when presenting data from Impala (Cloudera Data Hub, Cloudera Data Platform) to Oracle Database.

Table 4a: Present Override Data Type Mappings (Impala)

Impala    | Impala Aliases | Present Option         | Oracle Database        | Comments
--------- | -------------- | ---------------------- | ---------------------- | -----------------------------------
STRING    |                | --binary-columns       | RAW                    |
          |                | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
          |                | --large-string-columns | CLOB                   | See Presenting Large Strings
          |                | --interval-ds-columns  | INTERVAL DAY TO SECOND | See Presenting Intervals
          |                | --interval-ym-columns  | INTERVAL YEAR TO MONTH | See Presenting Intervals
VARCHAR   |                | --binary-columns       | RAW                    |
          |                | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
          |                | --large-string-columns | CLOB                   | See Presenting Large Strings
DATE      |                | --timestamp-columns    | TIMESTAMP              | Impala 3.3+
TIMESTAMP |                | --date-columns         | DATE                   | See Lossy Data Conversions
DECIMAL   |                | --integer-1-columns    | NUMBER(38)             |
          |                | --integer-2-columns    | NUMBER(38)             |
          |                | --integer-4-columns    | NUMBER(38)             |
          |                | --integer-8-columns    | NUMBER(38)             |
          |                | --integer-38-columns   | NUMBER(38)             |
FLOAT     |                | --decimal-columns      | NUMBER(38,18)          | See Lossy Data Conversions
DOUBLE    | REAL           | --decimal-columns      | NUMBER(38,18)          | See Lossy Data Conversions

Table 4b lists the override mappings and associated options that are available when presenting data from Google BigQuery to Oracle Database.

Table 4b: Present Override Data Type Mappings (Google BigQuery)

Google BigQuery | Google BigQuery Aliases | Present Option         | Oracle Database        | Comments
--------------- | ----------------------- | ---------------------- | ---------------------- | -----------------------------------
STRING          |                         | --binary-columns       | RAW                    |
                |                         | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
                |                         | --large-string-columns | CLOB                   | See Presenting Large Strings
                |                         | --interval-ds-columns  | INTERVAL DAY TO SECOND | See Presenting Intervals
                |                         | --interval-ym-columns  | INTERVAL YEAR TO MONTH | See Presenting Intervals
BYTES           |                         | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
DATE            |                         | --timestamp-columns    | TIMESTAMP              |
DATETIME        |                         | --date-columns         | DATE                   | See Lossy Data Conversions
TIMESTAMP       |                         | --date-columns         | DATE                   | See Lossy Data Conversions
                |                         | --timestamp-columns    | TIMESTAMP              |
NUMERIC         | DECIMAL                 | --integer-1-columns    | NUMBER(38)             |
                |                         | --integer-2-columns    | NUMBER(38)             |
                |                         | --integer-4-columns    | NUMBER(38)             |
                |                         | --integer-8-columns    | NUMBER(38)             |
                |                         | --integer-38-columns   | NUMBER(38)             |
BIGNUMERIC      | BIGDECIMAL              | --integer-1-columns    | NUMBER(38)             |
                |                         | --integer-2-columns    | NUMBER(38)             |
                |                         | --integer-4-columns    | NUMBER(38)             |
                |                         | --integer-8-columns    | NUMBER(38)             |
                |                         | --integer-38-columns   | NUMBER(38)             |
FLOAT64         |                         | --decimal-columns      | NUMBER(38,9)           | See Lossy Data Conversions

Table 4c lists the override mappings and associated options that are available when presenting data from Snowflake to Oracle Database.

Table 4c: Present Override Data Type Mappings (Snowflake)

Snowflake     | Snowflake Aliases                        | Present Option         | Oracle Database        | Comments
------------- | ---------------------------------------- | ---------------------- | ---------------------- | -----------------------------------
VARCHAR       | CHAR, CHARACTER, STRING, TEXT            | --binary-columns       | RAW                    |
              |                                          | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
              |                                          | --large-string-columns | CLOB                   | See Presenting Large Strings
              |                                          | --interval-ds-columns  | INTERVAL DAY TO SECOND | See Presenting Intervals
              |                                          | --interval-ym-columns  | INTERVAL YEAR TO MONTH | See Presenting Intervals
BINARY        | VARBINARY                                | --large-binary-columns | BLOB                   | See Presenting Large Binary Values
DATE          |                                          | --timestamp-columns    | TIMESTAMP              |
TIMESTAMP_NTZ | DATETIME                                 | --date-columns         | DATE                   | See Lossy Data Conversions
TIMESTAMP_TZ  |                                          | --date-columns         | DATE                   | See Lossy Data Conversions
              |                                          | --timestamp-columns    | TIMESTAMP              |
NUMBER        | DECIMAL, NUMERIC                         | --integer-1-columns    | NUMBER(38)             |
              |                                          | --integer-2-columns    | NUMBER(38)             |
              |                                          | --integer-4-columns    | NUMBER(38)             |
              |                                          | --integer-8-columns    | NUMBER(38)             |
              |                                          | --integer-38-columns   | NUMBER(38)             |
FLOAT         | FLOAT4, FLOAT8, DOUBLE [PRECISION], REAL | --decimal-columns      | NUMBER(38,9)           | See Lossy Data Conversions

Example 8: Overriding Data Types During a Present

In the following example, the SH.SALES table is presented with several data type overrides.

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x \
      --date-columns=TIME_ID \
      --integer-4-columns=PROD_ID,CHANNEL_ID \
      --decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD

Each override can be given one or more columns. The --decimal-columns override can be used to fine-tune the precision and scale of the presented NUMBER data type. This override option behaves differently to the others in that it must be paired with a --decimal-columns-type option to define the precision and scale for the columns listed. Multiple different column/type pairings can be provided in the same command (see --decimal-columns for further details).

Data type override options support use of the wildcard character *. This is useful when a logical model has column naming rules indicating the data they contain. The example above could use --decimal-columns=AMOUNT* to match multiple columns containing monetary values.
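A sketch of the wildcard form, matching every column whose name begins with AMOUNT:

$ $OFFLOAD_HOME/bin/present -t SH.SALES -x \
      --decimal-columns-type=10,2 --decimal-columns=AMOUNT*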

Presenting Large Strings

String data that exceeds the byte or character limits of the CHAR or VARCHAR2 Oracle Database data types can be presented as CLOB data using the --large-string-columns option. By default, Present creates the hybrid external table with a maximum size of 32768 bytes for each CLOB column. If larger data values need to be presented, the --lob-data-length option can be used to increase the size of the presented CLOB columns.

Presenting Large Binary Values

Binary data that exceeds the byte limit of the RAW Oracle Database data type can be presented as BLOB data using the --large-binary-columns option. By default, Present creates the hybrid external table with a maximum size of 32768 bytes for each BLOB column. If larger data values need to be presented, the --lob-data-length option can be used to increase the size of the presented BLOB columns.
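A combined sketch of these two options (the table and column names are hypothetical, and the --lob-data-length value format shown is an assumption; see the Reference guide for accepted values):

$ $OFFLOAD_HOME/bin/present -t SH.DOCUMENTS -x \
      --large-string-columns=DOC_TEXT --large-binary-columns=DOC_IMAGE \
      --lob-data-length=131072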

Presenting Intervals

Intervals stored as strings can be presented using the appropriate interval option as described in Tables 4a-c. Present supports intervals stored in either ISO 8601 format or the datetime format supported by Oracle Database. See Table 5: Interval Data Type String Formats for a small sample of datetime formats and the ISO 8601 specification for full details of ISO interval formats.
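A sketch, assuming a backend string column named LEAD_TIME that stores ISO 8601 day-to-second interval values:

$ $OFFLOAD_HOME/bin/present -t SH.SHIPMENTS -x --interval-ds-columns=LEAD_TIME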

Lossy Data Conversions

Depending on the source data, some options can lead to a loss of precision or data in hybrid queries and should be used with caution:

  • Using the --decimal-columns override option to present 32-bit or 64-bit floating point data to a decimal number can lead to a loss of data precision when querying the presented column

  • Using the --date-columns option with datetime types such as Impala TIMESTAMP, Google BigQuery DATETIME or Snowflake TIMESTAMP_NTZ can lead to the loss of fractional seconds when presented to an Oracle Database. This option can be safely used when the precision or granularity of the backend source data is at or below that of the target RDBMS data type

  • When the --timestamp-columns option is used to present a Google BigQuery TIMESTAMP or Snowflake TIMESTAMP_TZ column, any time zone information is lost when querying the presented column

Important

Using lossy Present options such as the above can cause wrong results in some hybrid queries, such as when Gluent Query Engine pushes down a predicate containing values from the lossy column(s). It is not advisable to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins.

Managing Statistics

Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Gluent Offload Engine manages statistics for both the hybrid external table and the backend target table (where applicable).

Hybrid Object Statistics

When presenting a table to an RDBMS, Present copies any available table statistics from the source backend table to the base hybrid external table. In some cases, Present will generate its own statistics for backend objects before copying them to the RDBMS. In general, the behavior is as follows:

  • Presenting tables:

    • If the backend table has statistics, these will be copied to the RDBMS

    • If the backend table does not have statistics, Present will generate its own statistics by scanning the table and these will be copied to the RDBMS

  • Presenting views:

    • Present will generate its own statistics by scanning the view and these will be copied to the RDBMS

    • This includes views that support non-materialized Present Join

  • Presenting materialized joins:

    • Present will compute statistics for the materialized backend table and these will be copied to the RDBMS

Hybrid external tables that support general or custom Advanced Aggregation Pushdown have their statistics derived from the base external table statistics by Gluent Offload Engine. Statistics for the count-only Advanced Aggregation Pushdown hybrid external table are generated from heuristics by Gluent Offload Engine.

Disabling Statistics Gathering

Statistics generation can be disabled by adding the --no-gather-stats option to the present command. This will work for any Present operation that requires statistics to be generated or natively computed by the backend (i.e. all of those described above). When disabling statistics, all hybrid external tables except those that support count-only Advanced Aggregation Pushdown will have no statistics.
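For example (a sketch):

$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x --no-gather-stats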

Overriding the Statistics Sampling Rate

By default, Present generates statistics by fully scanning the backend view or table being presented. It is possible to override the sampling level by adding the --sample-stats option with a value between 0 and 100 to the present command, for example:

--sample-stats=50

Note

This option specifies the percentage of partitions to sample, not rows.

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com