Present¶
Table of Contents
Documentation Conventions¶
Commands and keywords are in this
font
.$OFFLOAD_HOME
is set when the environment file (offload.env
) is sourced, unless already set, and refers to the directory namedoffload
that is created when the software is unpacked. This is also referred to as<OFFLOAD_HOME>
in sections of this guide where the environment file has not been created/sourced.Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.
Introduction¶
Present is a core component of Gluent Offload Engine that provides a simple and flexible way to query data stored in a backend data platform (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery, Snowflake) from an RDBMS (such as Oracle Database). This guide describes how to use Present for a variety of common and advanced use-cases. It is assumed that readers are familiar with Gluent Data Platform Concepts prior to reading this Present guide, particularly the Present section.
Overview of the Present Process¶
To understand the Present software and its use-cases, options and operation, it is useful to start with a high-level understanding of a present process in general. Presenting a table or view from a backend system to an RDBMS requires a number of sequential operations to be orchestrated, involving a variety of components, interfaces and parts of the source and target systems.
In the majority of cases, the present process is limited to the creation of hybrid objects and metadata operations. No data is involved and a present operation is therefore quick and lightweight.
In the less-common case of a materialized Present-Join operation (described in Presenting Joins), the present process will begin by creating and populating a table in the backend based on the join specification, but will then finish with the standard phase of creating hybrid objects and metadata to enable hybrid queries of the backend data. Because data population is involved, this use-case will usually take longer than a regular present process.
The various steps executed by Present are visible both on the terminal and in a log file and the level of verbosity can be configured (see the LOG_LEVEL
configuration parameter and -v
and --vv
options for details).
Controlling the Present Process¶
The present process is highly configurable. Many steps in the process can be modified by options or configuration parameters and a number of steps are also optional and can be disabled per present. In addition to the remainder of this Present guide (which will describe many of the options), see the Reference guide for a full list of options and configuration parameters and their descriptions.
Present Objects¶
The following table lists the standard set of RDBMS and backend database objects that are created by a basic present of a backend table named DB.TABLE
.
Table 1: Standard Objects Created by Present¶
Database |
Object Type |
Object Naming Format |
Hybrid Object Type |
---|---|---|---|
RDBMS |
View |
DB_H.TABLE |
Hybrid view |
RDBMS |
Table |
DB_H.TABLE_EXT |
Hybrid external table |
RDBMS |
View |
DB_H.TABLE_AGG |
Aggregate hybrid view |
RDBMS |
Table |
DB_H.TABLE_AGG_EXT |
Aggregate hybrid external table |
RDBMS |
Rewrite Equivalence |
DB_H.TABLE_AGG |
Aggregate rewrite rule |
RDBMS |
View |
DB_H.TABLE_CNT_AGG |
Aggregate hybrid view |
RDBMS |
Table |
DB_H.TABLE_CNT_AGG_EXT |
Aggregate hybrid external table |
RDBMS |
Rewrite Equivalence |
DB_H.TABLE_CNT_AGG |
Aggregate rewrite rule |
All RDBMS hybrid objects are created in a hybrid schema, which itself is pre-created by Gluent Data Platform. In Table 1 above, the backend database is named “DB” and therefore, in the RDBMS, a hybrid schema named “DB_H” is pre-created to own the hybrid objects that Present generates. In most cases, the hybrid schema is created to complement an existing RDBMS application schema (often due to Offload), but Present can be used with hybrid schemas that have no relationship to other application schemas in the RDBMS.
Depending on the present operation, other objects might be created, such as a backend materialized join table or a backend conversion view. These are described later in this guide.
See also
Glossary for definitions of the hybrid object types in addition to the Concepts and Performance guides for further information on their purpose and usage.
Present Use-Cases¶
The common use-cases supported by Present are:
Create a Hybrid Query Interface to Backend-Resident Data¶
With Present, it is possible to create a hybrid query interface to tables or views that reside in backend platforms such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery or Snowflake. This feature enables hybrid queries that are executed in the RDBMS to access data that is stored in the backend system. Some examples follow.
Example 1: Presenting a Backend Table¶
The following example presents a table from a backend system to the RDBMS.
$ . $OFFLOAD_HOME/conf/offload.env
$ $OFFLOAD_HOME/bin/present -t SH.BACKEND_TABLE -x
Note
This example includes the sourcing of the Gluent Data Platform environment file for a present session. All other examples in this guide will assume this step has occurred.
As described in Overview of the Present Process, a basic present operation such as Example 1 involves the creation of hybrid objects and metadata in the RDBMS. No data is involved, making it a very fast operation. The target hybrid schema name is derived from the source database name and in this case is SH_H (there are other ways to specify the target schema, described later). Example 1 will create several hybrid objects in the SH_H hybrid schema, including a hybrid view named SH_H.BACKEND_TABLE that will provide the main query interface for RDBMS queries to use. Additional hybrid objects are created to support Gluent Data Platform optimizations.
A log file containing all of the steps in the present process will be generated in $OFFLOAD_HOME/log
. Using the -v
or --vv
option will result in more screen output during the present.
Source Backend Object Rules¶
The source backend object can be a table or view and is identified by the -t
option. There are some simple rules to correctly specifying the source object name with this option:
The source name must always fully match the backend database/dataset/schema and object name. This might seem obvious, but it is possible in some scenarios that the name of the backend database could be different to the target hybrid schema
Identifiers are case-sensitive when presenting tables or views from Google BigQuery or Snowflake. This means that the
-t
option value must be provided in the same identifier case as the source object. This does not apply to Hadoop-based backends
Source/Target Naming Mismatches¶
Object naming rules can vary between systems and it is possible that a backend table or view to be presented will contain name formats that are not supported by the target RDBMS. For example, Impala, Google BigQuery and Snowflake support long identifier names and/or characters that are not supported on older versions of Oracle Database. Present automatically handles these scenarios by creating an intermediate view in the backend database/dataset/schema with generated identifiers that are supported by the target RDBMS. This view, known as a conversion view, provides the interface between the hybrid external table (used by the hybrid view) and the presented backend table or view. Conversion views have a naming format of <PRESENTED_OBJECT>_CONV
.
Overriding the Names of Presented Hybrid Object¶
By default, the RDBMS hybrid objects created by Present are named according to the source table or view name and the RDBMS hybrid schema name is derived from the source backend database/dataset/schema name. For example, if a table named SH.SALES is presented, a SH_H.SALES hybrid view will be created in the RDBMS, together with a set of other hybrid objects named with SH_H.SALES as a prefix (e.g. SH_H.SALES_EXT, SH_H.SALES_AGG and so on). Present allows users to override the target hybrid schema and/or the default object naming prefix using the --target-name
option. This should be used when:
The source object contains characters that are not supported by the naming rules of the target RDBMS
Gluent Data Platform is configured to use multitenant naming and hence the backend database has an additional prefix in its name that the hybrid schema doesn’t have (see Multitenant Naming Schemes for details)
The hybrid schema name is not derived from the backend database/dataset/schema name
The naming standards of the target RDBMS application are different to the source backend database/dataset/schema
An aggregation or join is being presented (see Custom Advanced Aggregation Pushdown Rules and Presenting Joins for details)
Examples of the --target-name
option are provided in Examples 2 and 3 below and in other applicable use-cases later in this guide.
Example 2: Overriding the Names of Presented Hybrid Objects¶
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_NEW_NAME
This example presents a backend table or view named SH.SALES, but the RDBMS hybrid objects it creates are renamed with the SALES_NEW_NAME target specification (for example, the hybrid view will be named SH_H.SALES_NEW_NAME and the hybrid external table will be named SH_H.SALES_NEW_NAME_EXT and so on).
Example 3: Overriding Hybrid Schema Name¶
In the following example, a source backend view is being presented from a database named SH to an RDBMS hybrid schema named DATA_LAKE_H. The view name is not modified and will be used as the hybrid object naming prefix in the RDBMS.
$ $OFFLOAD_HOME/bin/present -t SH.SALES_VIEW -x --target-name=DATA_LAKE_H.SALES_VIEW
In this example, Present will create a base hybrid view named DATA_LAKE_H.SALES_VIEW, supported by other hybrid objects prefixed with DATA_LAKE_H.SALES_VIEW.
Note
When an object is presented with an overridden name, any further presents of this base object (such as an aggregation) require the override name to be provided with the --base-name
option. See Custom Aggregations and Present Name Overrides for details.
Custom Advanced Aggregation Pushdown Rules¶
As described earlier, when a table or view is presented to an RDBMS, Present creates a range of hybrid objects, including a default pair of aggregation objects to support a Gluent Data Platform query optimization known as Advanced Aggregation Pushdown. This optimization enables Gluent Query Engine to pushdown aggregations from RDBMS hybrid queries to the backend platform. By default, Present creates a “general” aggregation rule that caters for every column and aggregation possible for the presented table and a simple “count” aggregation rule that optimizes COUNT(*)
queries. The same is also true when a table is offloaded from an RDBMS to a backend platform.
With Present, it is possible to create additional Advanced Aggregation Pushdown rules by presenting custom combinations of columns and aggregations to suit the hybrid queries that can benefit from them. Custom aggregation rules can only be created if the source backend table or view has already been presented with a standard present operation or it is being created for an RDBMS table that has already been offloaded.
Presenting custom aggregation rules are useful when:
The base table has a large number of columns (and therefore the default general aggregation rule has many columns to project)
Critical queries require a much more focused aggregation pushdown that eliminates any potential for aggregating more columns than required
Custom aggregation rules are simple to present, as Example 4 below demonstrates.
Example 4: Present a Custom Aggregate View¶
The following example presents a simple custom aggregation rule on a previously-offloaded SH.SALES table.
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_BY_TIME_PRODUCT --aggregate-by=TIME_ID,PROD_ID
With this example, a custom aggregation is created that includes a hybrid view and aggregation rewrite rule named SH_H.SALES_BY_TIME_PRODUCT. Due to the limited aggregate column specification (provided by the --aggregate-by
option), the SH_H.SALES_BY_TIME_PRODUCT hybrid aggregation view will project fewer columns than the general SH_H.SALES_AGG hybrid view created by a standard present.
Fine-Tuning Custom Aggregation Projection¶
When presenting a custom aggregation, the dimension (or GROUP BY
) columns are specified by the --aggregate-by
option (2 columns in the Example 4 above). By default, all table columns are included as aggregated columns (measures) with a range of aggregate functions applied to each measure (applicable to the data type of each column). However, it is also possible to define the aggregated measure columns and aggregation functions that are applied to each. The options are:
Aggregation columns (dimensions) are specified with the
--aggregate-by
option (shown in Example 4 above)Aggregated columns (measures) are specified with the
--columns
optionAggregate functions applied to measures are specified with the
--numeric-fns
,--date-fns
and/or--string-fns
options as applicable
Example 5: Present a Limited Projection Custom Aggregate View¶
The following example repeats Example 4, but in this case the aggregated columns (measures) are limited and the aggregate functions that are applied to the measures are reduced to create a more streamlined aggregation view.
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x --target-name=SH_H.SALES_BY_TIME_PRODUCT \
--aggregate-by=TIME_ID,PROD_ID --columns=TIME_ID,PROD_ID,QUANTITY_SOLD \
--numeric-fns=SUM,COUNT --string-fns==COUNT,MAX --date-fns=MIN,MAX
All available options have been used for demonstration purposes (although it’s more common to define a custom aggregation rule without the function overrides). See the Reference guide for a description of each of the options and their accepted values.
Custom Aggregations and Present Name Overrides¶
When presenting a custom aggregation rule for a table that was presented with an overridden name (see Example 3: Overriding Hybrid Schema Name for an example), the name of the base hybrid view must be provided. To do this, the --base-name
option must be included in the present
command. Example 6 below demonstrates this option.
Example 6: Presenting A Custom Aggregation with Base Name Override¶
The following example presents a custom aggregation of the backend SH.SALES table. Assuming that this table was first presented with a --target-name=SH_H.SALES_NEW_NAME
option, a custom aggregation can be created as follows.
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x \
--target-name=SH_H.SALES_BY_TIME_PRODUCT --base-name=SH_H.SALES_NEW_NAME \
--aggregate-by=TIME_ID,PROD_ID
The base name option enables Present to use the correct base hybrid metadata when creating the hybrid objects and metadata for the custom aggregation rule.
Presenting Joins¶
Present enables joins between presented tables/views and/or offloaded tables to be presented as custom hybrid join objects to the RDBMS. This feature allows the processing of joins in a hybrid query to be pushed down from the RDBMS to the backend system where applicable. The join specification is provided using Present with the --present-join
option, based on a simple grammar as described in Present Join Grammar.
Example 7: Presenting a Join¶
The following example presents a three-table join to the RDBMS. For this example, assume that the SH.SALES table is partially offloaded from the RDBMS to the backend system, but the other tables are presented from the backend to the RDBMS (i.e. their data resides only in the backend system).
$ $OFFLOAD_HOME/bin/present -t sh.salescountry_join -x \
--present-join "table(sales) alias(s)" \
--present-join "table(customers) alias(cu) inner-join(s) join-clauses(cust_id)" \
--present-join "table(countries) alias(co) inner-join(cu) join-clauses(country_id)"
This command will create two views named SALESCOUNTRY_JOIN:
SH.SALESCOUNTRY_JOIN: a logical view of the join created in the backend SH database
SH_H.SALESCOUNTRY_JOIN: a hybrid view created in the RDBMS hybrid schema to support the pushdown of equivalent joins in hybrid queries to the backend. Other supporting hybrid objects will also be created (such as a hybrid external table named SH_H.SALESCOUNTRY_JOIN_EXT)
Hybrid views created by Present Join operations are not intended to be queried directly, because they are designed for transparent query rewrite. This feature enables users to benefit from the performance of having the joins processed in the backend system, but without having to rewrite any SQL. However, Present Join hybrid views can be queried directly if required (by replacing the corresponding joins in existing SQL statements with the hybrid view or writing new queries using the hybrid view), but in most cases it should not be necessary.
When to Present a Join¶
When at least one table in the join resides only in the backend, use Present with the --present-join
grammar option. By default, this will create a logical view of the join in the backend database (although this can be materialized as a table with the --materialize-join
option). Note that any backend-only tables will have to be presented to the RDBMS before they can be used in a presented join.
When to Offload a Join¶
When all tables in the join have been offloaded from the RDBMS to the backend system, use Offload with the --offload-join
grammar option. By default, this will materialize the results of the join in a backend table (this can be disabled with the --no-materialize-join
option and a logical view will be created in the backend instead). See Offloading Joins for more details.
Present Join Grammar¶
When presenting joins, each table or view in the join has its own --present-join
specification, as demonstrated in Example 7 above. The full grammar for this option is provided in Table 2 below.
Table 2: Present Join Grammar¶
Directive and Syntax |
Description |
---|---|
TABLE([<hybrid_schema>.]<table-name>)
|
|
ALIAS(<table-alias>)
|
|
PROJECT(<*|<column>[, ..]>)
|
|
INNER-JOIN(<join-alias>)
|
|
JOIN-CLAUSES(<alias>.<column>
= <alias>.<column>[, ..])
JOIN-CLAUSES(<column>)
|
|
Materialized Present Joins¶
By default, a presented join is represented in the backend by a view of the name provided by the -t
option (SH.SALESCOUNTRY_JOIN in Example 7 above). Results can instead be materialized to a table by using the --materialize-join
option.
Present Join and Partially-Offloaded Tables¶
Tables offloaded with Partition-Based Offload, Subpartition-Based Offload or Predicate-Based Offload are all supported by Present Join. When a Present Join operation includes a partially-offloaded table, the boundary predicates are included in the hybrid view and, if the join is not materialized to a table, the boundary predicates will also be included in the supporting backend join view.
Joins can include multiple partially-offloaded tables, but all tables must have the same incremental key type (i.e. be offloaded in the same way) and must have been offloaded to the same partition or data threshold. This means:
Partition-Based Offload tables must have the same partitioning type, partitioning key and offloaded high value. Join column names do not have to match as the Present Join specification allows for different column names
Subpartition-Based Offload tables must have the same subpartitioning key and offloaded high value. As with Partition-Based Offload, join column names do not need to match
Predicate-Based Offload tables must have the same predicate type and same set of predicates. The predicates for each table must match exactly
Refreshing Present Joins¶
Present Join objects are not automatically maintained and under some scenarios will become stale when a member table is further offloaded or the data in a presented member table changes. The guidelines are as follows:
If the Present Join is materialized, the join needs to be refreshed
If the Present Join is not materialized but contains one or more partially-offloaded tables that is further offloaded, the join needs to be refreshed
If the Present Join is not materialized and does not contain any partially-offloaded tables, the join does not need to be refreshed
Refreshing Materialized Present Joins¶
Materialized Present Joins always need to be refreshed when either the data in a presented member table changes or the offloaded data for a member table has changed. The method for doing this depends on which of the member tables in a Present Join has been subject to change:
If the data for a backend-only presented table changes, a dependent materialized Present Join table requires a full refresh. To do this, the original Present Join
present
command can be repeated with the--reset-backend-table
option. Note that this will fully reset and re-populate the materialized Present Join table with the new results of the joinIf a fully-offloaded table is re-offloaded, a dependent materialized Present Join table requires a full refresh. To do this, the original Present Join
present
command can be repeated with the--reset-backend-table
option. Note that this will fully reset and re-populate the materialized Present Join table with the new results of the joinIf a partially-offloaded table has a further incremental offload of one or more partitions, subpartitions or predicates, the dependent materialized Present Join table can be incrementally refreshed. To do this, simply repeat the original Present Join
present
command and the new data will be appended to the materialized join table
Refreshing Non-Materialized Present Joins¶
Non-materialized Present Joins only need to be refreshed when they include partially-offloaded tables and the boundaries for the partially-offloaded member tables have changed. To refresh the join, the original Present Join present
command can be repeated but must include the --force
or -f
option. This ensures that the backend view is re-created with the new boundary predicates of the partially-offloaded tables in the join.
Schema Evolution¶
Schema evolution is possible in environments that have been offloaded or presented using Gluent Offload Engine. There are two methods for schema evolution in environments that use Gluent Data Platform:
Automated: Gluent Offload Engine includes a utility named Schema Sync that automates the propagation of additional columns on offloaded RDBMS tables
Manual: Schema changes not automated by Schema Sync can be performed manually, including the use of Present (as described below)
Schema Evolution with Schema Sync¶
See Schema Sync.
Schema Evolution with Present¶
There are two phases to schema evolution with Present:
Structural changes (DDL) to user objects, including presented backend tables/views and/or offloaded RDBMS tables, are managed by the user
Once all DDL changes are complete, Present is used for each modified object to refresh the corresponding hybrid schema objects
Schema Evolution Process (Presented Table/View)¶
If a backend table or view that has been presented to an RDBMS is modified such that the columns change, the schema evolution process is as follows:
Modify the backend table or view (user DDL)
Compute statistics (not applicable to Google BigQuery, Snowflake or to a presented view)
Present the table or view to refresh the hybrid objects in the RDBMS
Schema Evolution Process (Offloaded Table)¶
If an offloaded RDBMS table is modified such that the columns change and the backend supports the equivalent column operation, the schema evolution process is as follows:
Modify the RDBMS table (user DDL)
Modify the equivalent table in the backend, ensuring data is updated if required (user DDL). The data type mappings used by Gluent Offload Engine should be used as a guide for defining any corresponding backend columns (see Offload Data Types)
Compute statistics for the backend table (not applicable to Google BigQuery or Snowflake)
Present the offloaded table to refresh the hybrid schema
Supported Schema Evolution Operations¶
Schema evolution supports the following operations:
Add column(s) (also automated with Schema Sync)
Rename column(s)
Drop column(s)
Some operations might not be supported by the RDBMS or backend system that Gluent Data Platform is being used with. More complex schema changes, such as partitioning scheme changes, are not currently supported.
Impala Restrictions¶
There are a few restrictions to be aware of when manually applying schema evolution changes to an Impala-based backend (such as Cloudera Data Hub or Cloudera Data Platform):
At present Impala does not support default column values. If adding a new column to an offloaded RDBMS table causes historical data to be modified, then the RDBMS table would need to be re-offloaded to Impala or the backend data updated manually
New columns do not have statistics by default and adding columns in Impala can, in some cases, invalidate table statistics. For these reasons the user should compute statistics before refreshing the RDBMS hybrid schema with Present
A schema evolution operation that requires the Impala data type to change from
INT
toBIGINT
will require data migration because this type change cannot be performed natively byALTER TABLE
When dropping columns in an Impala cluster, the
PARQUET_FALLBACK_SCHEMA_RESOLUTION
query parameter must be set toname
. This can be done via theHS2_SESSION_PARAMS
parameter in the Gluent Data Platform environment fileOn Cloudera Data Platform Private Cloud and Cloudera Data Platform Public Cloud even with the
PARQUET_FALLBACK_SCHEMA_RESOLUTION
query parameter set toname
it is not possible to drop a column in an external table
Google BigQuery Restrictions¶
At the time of writing, Google BigQuery does not support the following schema modifications:
Default column values
Rename column(s)
Drop column(s)
Changing a column’s data type
Changing a column’s mode (aside from relaxing
REQUIRED
columns toNULLABLE
)
Common workarounds to these restrictions include:
SQL: Create and use a view with the desired column names or alias the column(s) that need to be renamed in SQL queries
Recreate the table: Use Create Table As Select (CTAS) or export/import to create a new table with all column and data changes as required
Use an update: If adding new column(s) to an offloaded RDBMS table causes historical data to be modified, the new column(s) in Google BigQuery would need to be updated accordingly. If this is not possible from the available data, re-offloading the table from the RDBMS could be considered
Snowflake Restrictions¶
At the time of writing, Snowflake does not support the following schema modifications:
Adding a default value to an existing column
Changing a default value on an existing column (other than sequence defaults)
Changing a column’s data type to another type
Reducing a string column’s length or changing a numeric column’s scale
Common workarounds to these restrictions include:
Recreate the table: Use Create Table As Select (CTAS) or export/import to create a new table with all column and data changes as required
Use an update: If adding default values to existing column(s) of an offloaded RDBMS table causes historical data to be modified, the modified column(s) in Snowflake would need to be updated accordingly. If this is not possible from the available data, re-offloading the table from the RDBMS could be considered
Present Data Types¶
When presenting data, Gluent Offload Engine will map data types from the backend database to the RDBMS automatically. However, as with most operations in Present, it is possible to override many of the default data type mappings if required. See Tables 3 and 4 below for default mappings and override options.
Supported Data Types¶
Gluent Offload Engine currently supports all of the Impala, Google BigQuery and Snowflake data types listed in Tables 3a-c below. For each backend system, tables that contain columns of any data type not listed cannot currently be presented.
Default Data Type Mappings¶
Table 3a lists the default data type mappings when presenting data from Impala (Cloudera Data Hub, Cloudera Data Platform) to Oracle Database.
Table 3a: Present Default Data Type Mappings (Impala)¶
Impala |
Impala Aliases |
Oracle Database |
Comments |
---|---|---|---|
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
|
|
|
|
Impala 3.3+ |
|
|
|
Table 3b lists the default data type mappings when presenting data from Google BigQuery to Oracle Database.
Table 3b: Present Default Data Type Mappings (Google BigQuery)¶
Google BigQuery |
Google BigQuery Aliases |
Oracle Database |
Comments |
---|---|---|---|
|
|
||
|
|
||
|
|
||
|
|
|
|
|
|
||
|
|
||
|
|
||
|
|
||
|
|
Google BigQuery stores |
Table 3c lists the default data type mappings when presenting data from Snowflake to Oracle Database.
Table 3c: Present Default Data Type Mappings (Snowflake)¶
Snowflake |
Snowflake Aliases |
Oracle Database |
Comments |
---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
||
|
|
|
|
|
|
Presenting Unbounded Strings¶
Impala and Google BigQuery do not expose length metadata for STRING
columns. By default, Present will define all corresponding Oracle VARCHAR2
columns with the maximum supported length of 4000, unless the --detect-sizes
option is used. If this option is specified, Present will sample the data in the backend STRING
columns to determine the column lengths to use when defining the corresponding hybrid external table in Oracle.
Presenting Snowflake Varchar¶
By default, Snowflake VARCHAR
columns with length <= 4000 will be presented to Oracle as VARCHAR2
(unless overridden; see Presenting Large Strings) and Snowflake columns with lengths > 4000 will be presented as CLOB
. Snowflake VARCHAR
columns are presented to Oracle Database using character length semantics. For Snowflake tables that store multibyte characters, it is important that the characterset of the Oracle database supports the characters that are to be presented (i.e. to ensure parity between the data and column lengths).
Presenting Snowflake Binary¶
By default, Snowflake BINARY
columns with length <= 2000 will be presented to Oracle as RAW
(unless overridden; see Presenting Large Binary Values) and Snowflake columns with lengths > 2000 will be presented as BLOB
.
Override Data Type Mappings¶
When presenting a table, it is possible to override some of the default data type mappings to change the specification of the target RDBMS hybrid external table and hybrid view. Table 4a lists the override mappings and associated options that are available when presenting data from Impala (Cloudera Data Hub, Cloudera Data Platform) to Oracle Database.
Table 4a: Present Override Data Type Mappings (Impala)¶
Impala |
Impala Alias |
Present Option |
Oracle Database |
Comments |
---|---|---|---|---|
|
|
|
||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
||
|
|
|||
|
|
|||
|
|
|
Impala 3.3+ |
|
|
|
|
||
|
|
|
||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
||
|
|
|
|
Table 4b lists the override mappings and associated options that are available when presenting data from Google BigQuery to Oracle Database.
Table 4b: Present Override Data Type Mappings (Google BigQuery)¶
Google BigQuery |
Google BigQuery Alias |
Present Option |
Oracle Database |
Comments |
---|---|---|---|---|
|
|
|
||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
||
|
|
|
||
|
|
|
||
|
|
|
||
|
|
|||
|
|
|
|
|
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
Table 4c lists the override mappings and associated options that are available when presenting data from Snowflake to Oracle Database.
Table 4c: Present Override Data Type Mappings (Snowflake)¶
Snowflake |
Snowflake Alias |
Present Option |
Oracle Database |
Comments |
---|---|---|---|---|
|
|
|
|
|
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
||
|
|
|||
|
|
|
|
|
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|
|
Example 8: Overriding Data Types During a Present¶
In the following example, the SH.SALES table is presented with several data type overrides.
$ $OFFLOAD_HOME/bin/present -t SH.SALES -x \
--date-columns=TIME_ID \
--integer-4-columns=PROD_ID,CHANNEL_ID \
--decimal-columns-type=10,2 --decimal-columns=AMOUNT_SOLD
Each override can be given one or more columns. The --decimal-columns
override can be used to fine-tune the precision and scale of the presented NUMBER
data type. This override option behaves differently to the others in that it must be paired with a --decimal-columns-type
option to define the precision and scale for the columns listed. Multiple different names/type pairings can be provided in the same command (see --decimal-columns
for further details).
Presenting Large Strings¶
String data that exceeds the byte or characters limits of the CHAR
or VARCHAR2
Oracle Database data types can be presented as CLOB
data using the --large-string-columns
option. By default, Present creates the hybrid external table with a maximum size of 32768 bytes for each CLOB
column. If larger data values need to be presented, the --lob-data-length
option can be used to increase the size of the presented CLOB
columns.
Presenting Large Binary Values¶
Binary data that exceeds the byte or characters limits of the RAW
Oracle Database data type can be presented as BLOB
data using the --large-binary-columns
option. By default, Present creates the hybrid external table with a maximum size of 32768 bytes for each BLOB
column. If larger data values need to be presented, the --lob-data-length
option can be used to increase the size of the presented BLOB
columns.
Presenting Intervals¶
Intervals stored as strings can be presented using the appropriate interval option as described in Tables 4a-c. Present supports intervals stored in either ISO 8601 format or the datetime format supported by Oracle Database. See Table 5: Interval Data Type String Formats for a small sample of datetime formats and the ISO 8601 specification for full details of ISO interval formats.
Lossy Data Conversions¶
Depending on the source data, some options can lead to a loss of precision or data in hybrid queries and should be used with caution:
Using the
--decimal-columns
override option to present 32-bit or 64-bit floating point data to a decimal number can lead to a loss of data precision when querying the presented columnUsing the
--date-columns
with datetime types such as an ImpalaTIMESTAMP
, Google BigQueryDATETIME
or SnowflakeTIMESTAMP_NTZ
can lead to the loss of fractional seconds when presented to an Oracle Database. This option can be safely used when the precision or granularity of the backend source data is at or below that of the target RDBMS datatypeWhen the
--timestamp-columns
option is used to present a Google BigQueryTIMESTAMP
or SnowflakeTIMESTAMP_TZ
column, any time zone information is lost when querying the presented column
Important
Using lossy Present options such as the above can cause wrong results with some hybrid queries; such as when Gluent Query Engine pushes down a predicate containing values from the lossy column(s). It is not advised to enable lossy data type overrides for columns that are likely to be used in hybrid query predicates or joins.
Managing Statistics¶
Database statistics enable query optimizers to determine the most efficient way to execute queries. To ensure that hybrid queries benefit from efficient execution plans, Gluent Offload Engine manages statistics for both the hybrid external table and the backend target table (where applicable).
Hybrid Object Statistics¶
When presenting a table to an RDBMS, Present copies any available table statistics from the source backend table to the base hybrid external table. In some cases, Present will generate its own statistics for backend objects before copying them to the RDBMS. In general, the behavior is as follows:
Presenting tables:
If the backend table has statistics, these will be copied to the RDBMS
If the backend table does not have statistics, Present will generate its own statistics by scanning the table and these will be copied to the RDBMS
Presenting views:
Present will generate its own statistics by scanning the view and these will be copied to the RDBMS
This includes views that support non-materialized Present Join
Presenting materialized joins:
Present will compute statistics for the materialized backend table and these will be copied to the RDBMS
Hybrid external tables that support general or custom Advanced Aggregation Pushdown have their statistics derived from the base external table statistics by Gluent Offload Engine. Statistics for the count-only Advanced Aggregation Pushdown hybrid external table are generated from heuristics by Gluent Offload Engine.
Disabling Statistics Gathering¶
Statistics generation can be disabled by adding the --no-gather-stats
option to the present
command. This will work for any Present operation that requires statistics to be generated or natively computed by the backend (i.e. all of those described above). When disabling statistics, all hybrid external tables except those that support count-only Advanced Aggregation Pushdown will have no statistics.
Overriding the Statistics Sampling Rate¶
By default, Present generates statistics by fully scanning the backend view or table being presented. It is possible to override the sampling level by adding the --sample-stats
option with a value between 0 and 100 to the present
command, for example:
--sample-stats=50
Note
This option specifies the percentage of partitions to sample, not rows.