Concepts

Documentation Conventions

  • Commands and keywords are in this font.

  • $OFFLOAD_HOME is set when the environment file (offload.env) is sourced, unless already set, and refers to the directory named offload that is created when the software is unpacked. This is also referred to as <OFFLOAD_HOME> in sections of this guide where the environment file has not been created/sourced.

  • Third party vendor product names might be aliased or shortened for simplicity. See Third Party Vendor Products for cross-references to full product names and trademarks.

Introduction

This guide describes Gluent Data Platform, Gluent’s transparent data virtualization software. Reading this guide will provide a conceptual overview of the software, its architecture, components and implementation. Links are provided to further details and technical information where applicable.

Transparent Data Virtualization

For an overview of transparent data virtualization, visit Gluent’s website at https://gluent.com/gdp/data-virtualization/.

Gluent Data Platform

Gluent Data Platform Functions

Gluent Data Platform enables transparent data virtualization with software covering three main functional areas:

Orchestration

Gluent Data Platform orchestrates all of the processes required to offload data from an RDBMS to a modern data platform (such as Cloudera Data Platform, Cloudera Data Hub, Google BigQuery or Snowflake); present data from a modern data platform to an RDBMS (a metadata operation) and manage schema changes between the source and target systems

Transparent Query

Gluent Data Platform enables virtualized data to be queried from existing applications without changing any application or user SQL

Advisors

Gluent Data Platform includes advisors to help users to determine which data to offload and which common query aggregations to potentially optimize

Gluent Data Platform Software Components

Gluent Data Platform’s three main functions are implemented by the following software components:

  • Gluent Offload Engine (Orchestration)

  • Gluent Query Engine (Transparent Query)

  • Gluent Advisor (Advisors)

Table 1 below summarizes the functional areas, their implementing software engine and components.

Table 1: Gluent Data Platform Components

Functional Area

Software Engine

Software Component

Purpose

Orchestration

Gluent Offload Engine

Offload

Offload data from RDBMS to backend platform

Present

Present data from backend platform to RDBMS

Incremental Update

Propagate data updates from RDBMS to offloaded data

Schema Sync

Propagate schema changes from RDBMS to backend platform

Offload Status Report

Describe offloaded tables and their offload characteristics

Connect

Validate Gluent Data Platform installation and configuration

Diagnose

Generate diagnostics for Gluent Support

Others

Various other small but useful utilities

Transparent Query

Gluent Query Engine

Smart Connector

Manage and optimize RDBMS queries involving virtualized data

Metadata Daemon

Build queries to execute on the backend system

Data Daemon

Execute queries on the backend system and return results to the RDBMS

Hybrid Query Report

Report performance characteristics of virtualized SQL queries

Advisors

Gluent Advisor

Offload Advisor

Recommend candidate data to offload

Aggregation Advisor (BETA)

Recommend potential aggregate query optimizations

Gluent Data Platform Software Packages

Gluent Data Platform is delivered in the following packages:

Mandatory Packages

  • Gluent Data Platform: A single package containing all Gluent Offload Engine and Gluent Query Engine components

Optional Packages

  • Gluent Transport: An optional package containing Spark Standalone and dependencies, sometimes required when offloading data to a cloud warehouse such as Google BigQuery or Snowflake

  • Gluent Advisor Data Extractor: A package containing the software to extract information from the RDBMS to enable an Offload Advisor report to be generated

Gluent Data Platform Architecture

Gluent Data Platform has a simple architecture. A hybrid environment always includes an RDBMS and a backend data platform (such as a Hadoop cluster or a cloud data warehouse), with Gluent Data Platform as the virtualization interface between them. In some hybrid environments, an additional server or cluster is utilized for some of the interface functions of Gluent Data Platform. As such, Gluent Data Platform offers great flexibility in how it is installed and configured.

Installation and Topology

Depending on the specific requirements and nature of the RDBMS and backend environment, Gluent Data Platform packages may be installed on the following servers/clusters:

  • RDBMS Server: Mandatory

  • Hadoop Cluster (if applicable): Optional

  • Gluent Node (where applicable): Optional

In addition to this, some database objects are created during installation where applicable (described below).

RDBMS Server

The Gluent Data Platform package is always installed on the RDBMS server(s). This is mandatory for the operation of Smart Connector and Metadata Daemon. Installation includes unpacking the software binaries to a location on the server(s) and installing some RDBMS schemas and objects (described below).

Gluent Offload Engine and/or Data Daemon can optionally be installed on the RDBMS server(s), but this is less common for the following reasons:

  • Resource utilization: Most users choose to utilize the resources of other servers for these components, rather than use RDBMS server resources

  • Security: Gluent Offload Engine is usually installed on another server or cluster for security and segregation of duty

Software Location

Gluent Data Platform software is unpacked to a location on the RDBMS server(s) known as OFFLOAD_HOME. This application area contains binaries, executables, installation scripts (for supporting RDBMS objects), configuration and environment files and so on. It also includes a location for all log and trace files generated by Gluent Data Platform components.

RDBMS Standard Schemas

Gluent Data Platform creates three standard RDBMS schemas for the operation of the software:

  • An owner or administrator schema that includes database objects required for the operation of Gluent Data Platform

  • An application schema that is used for connections from Smart Connector

  • A repository schema that is used to store Gluent Metadata Repository data for the operation of Gluent Data Platform

RDBMS Hybrid Schemas

For each RDBMS application schema that contains tables to be offloaded, an additional schema must be created in advance of the first offload (Gluent Data Platform provides a simple installation utility to enable this). This is known as a hybrid schema and is required for the operation of Gluent Data Platform. Gluent Offload Engine creates database objects related to each offload in the accompanying hybrid schema and never modifies the source application schema.

An RDBMS hybrid schema is also required to enable tables or views in a backend database or dataset to be presented to the RDBMS. If the intended present operations are unrelated to an existing set of offloads, one or more new hybrid schemas can be created without requiring any corresponding application schemas. Alternatively, it is possible to use an existing hybrid schema that was created to support offloads.

Hadoop Cluster

For virtualized environments that include a Hadoop-based backend such as Cloudera Data Hub or Cloudera Data Platform, it is common to install and operate Gluent Offload Engine on a Hadoop edge node. Software installation is simple and requires unpacking the software to an OFFLOAD_HOME location, which contains (among other things) Gluent Offload Engine binaries, executables and log file location.

It is also possible to install Data Daemon on the edge node. Data Daemon is included in the standard OFFLOAD_HOME location and Gluent Data Platform provides several methods to manage the daemon processes.

Installing Gluent Offload Engine and/or Data Daemon on the Hadoop edge node reduces the resource footprint of Gluent Data Platform on the RDBMS server(s) and relocates the processing to the Hadoop cluster.

User-Defined Functions (UDFs)

For Impala-supported backends, Gluent Data Platform optionally installs UDFs, either in the default Impala database or in an alternative specified by the user. These are used for performance optimization reasons and are recommended, but not mandatory. Additional database objects can optionally be installed for Impala performance optimizations on older versions, but these are not generally required for more recent releases.

Gluent Node

Gluent Node is the logical name given to a server or cluster that hosts the following functions:

  • Data Daemon: This is included in the Gluent Data Platform package

  • Gluent Transport: This package contains Spark Standalone, required for offloading RDBMS data to cloud storage buckets (for temporary staging)

The configuration of Gluent Node is very flexible. For example:

  • Gluent Node can be either a single server or a cluster

  • Gluent Node can be an on-premises physical or virtual server or cluster

  • Gluent Node can be a cloud server or cluster (for example, Google Cloud Platform VMs for operating with Google BigQuery)

  • Gluent Node can be an edge node of a Hadoop cluster (for example, for Cloudera Data Hub or Cloudera Data Platform environments)

  • Both packages can be installed on the same or separate servers or clusters and have different resource allocations

  • The Gluent Transport package can be omitted if an existing Spark cluster (or Hadoop cluster with Spark) is available

  • Gluent Node could be the RDBMS server or cluster (not recommended)

Backend Datasets/Databases/Schemas

For each RDBMS application schema that contains tables to be offloaded, two datasets/databases/schemas are required in the backend platform:

  • Staging: A load database used for temporary staging data

  • Final: A database used to store offloaded data

Each pair of staging and final datasets/databases/schemas can be:

  • Created by the very first use of Offload

  • Created ahead of time in preparation for using Offload

  • Existing databases containing data unrelated to the use of Gluent Data Platform (with requisite operational permissions/grants)

Multitenant Naming Schemes

When backend datasets or databases are created specifically for offloading from an RDBMS application schema, they are usually created with the same naming scheme as the source. However, Gluent Data Platform also supports a multitenant naming scheme, which is useful to avoid namespace clashes in situations where schemas are offloaded from many RDBMSs to a centralized backend platform.

Data Model Virtualization

Gluent Data Platform provides several opportunities to define alternative data model elements for offloaded or presented data. For example, Offload users might choose a different partition key or granularity for offloaded data. Offloaded and presented data can be defined with alternative data types and joins can be offloaded or presented (either materialized or as views) to provide an alternative virtualized data model for a group of tables. Aggregations can be presented for specific dimensions and measures for an optimized view with subsets of columns. All data model virtualization options are easily defined as arguments to Offload and Present commands.

Gluent Advisors

Offload Advisor

Offload Advisor is not installed with the Gluent Data Platform package. It is managed via contact with Gluent Sales. A typical Offload Advisor process will be:

  1. Download and run Gluent Advisor Data Extractor package

  2. Share results with Gluent Sales

  3. Gluent Sales will generate and share a rich Offload Advisor report

See https://gluent.com/gdp/gluent-advisor/ for more details.

Aggregation Advisor (BETA)

Aggregation Advisor (BETA) is included in the Gluent Data Platform package. It is designed to be run on Oracle Database instances only. For further details, contact Gluent Support.

See also

For further details of Gluent Data Platform’s architecture and installation, see the Installation & Upgrade guide.

Gluent Offload Engine

Gluent Offload Engine orchestrates the processes required to virtualize data between RDBMSs and backend data platforms. It includes a number of software components such as Offload, Present, Incremental Update, Schema Sync, Offload Status Report, Connect, Diagnose and several other utilities that cover a wide range of functional and operational requirements for maintaining hybrid environments. Gluent Offload Engine software components have a similar look and feel and each has a range of configuration parameters or command-line options.

Offload

Offload enables data to be copied from an RDBMS (such as Oracle Database) to a modern data platform (such as Cloudera Data Platform, Cloudera Data Hub, Google BigQuery or Snowflake).

Backend Databases and Tables

When data is offloaded from an RDBMS table, Offload creates a backend version of the table and copies the source data to it. The backend database (or dataset or schema, depending on the terminology of the backend platform) for the table can be:

  • Created automatically by the very first use of Offload

  • Created ahead of time in preparation for using Offload

  • An existing database containing data unrelated to the use of Gluent Offload Engine

Backend databases are often named the same as the source RDBMS schema, but Offload also caters for multitenant naming schemes or target-name overrides.

Backend tables created by Offload are usually named the same as the source RDBMS table, but it is possible to override this if required.

Gluent Data Platform supports several storage formats for backend tables. For example, the default storage scheme for a Hadoop-based backend table is HDFS, but Amazon S3 and various Microsoft Azure storage schemes are also supported. Google BigQuery tables are stored internally in a proprietary format, but Google Cloud Storage is utilized by Offload during the data transportation phase of an offload. Snowflake tables are also stored in a proprietary format, but either Google Cloud Storage, Amazon S3 or various Microsoft Azure storage schemes are supported for the data transportation phase of an offload.

Backend Supporting Database Objects

Depending on the backend platform and the nature of the offload, other supporting objects may be created by Gluent Offload Engine. For example, when offloading to a Snowflake schema for the first time, Offload will create supporting stage and file format objects to facilitate loading for all offloads to that schema.

Offload Hybrid Schemas

When data is copied to the target backend by Offload, Gluent Offload Engine creates a set of objects (known as hybrid objects) and metadata in the RDBMS to enable the offloaded data to be queried. Gluent Offload Engine never modifies a users’ source RDBMS schema and instead creates all of the associated hybrid objects in a shadow hybrid schema. Each source schema that has tables offloaded must have a hybrid schema prepared in advance of the first offload. Gluent Offload Engine includes a simple utility to create hybrid schemas. A hybrid schema is first created with private synonyms for all tables and views in the corresponding application schema. As tables are offloaded, the associated synonym is replaced by a hybrid object of the same name.

Offload Scenarios

Offload supports a wide range of table, partition and data offload scenarios and is highly-configurable. The most common scenarios are full table offloads (usually for copying tables such as dimension or reference tables to a backend) or partition-based offloading (for example, to offload the majority of partitions from a table to the backend platform and keep the hottest partitions in the frontend RDBMS until they are also ready to be offloaded). It is also possible to offload data by predicate or to offload joins of tables. The following Offload features are supported and described in detail in the Offload guide:

  • Full Offload

  • Partition-Based Offload

  • Subpartition-Based Offload

  • Predicate-Based Offload

  • Offload Join

Note

The Partition/Subpartition-Based Offload features are also known as Incremental Partition Append. The Predicate-Based Offload feature is also known as Incremental Data Append. As their alternative names suggest, each of these features allow additional data to be offloaded and appended to the backend table over multiple Offload operations.

Offload Type

Despite the wide range of offload options and scenarios, offloads generally fall into one of two offload type categories: either all data is offloaded or a partial set of data is offloaded. When all data is offloaded for a table or a join, this is known as having an offload type of FULL. When a table or join is partially offloaded, this is known as having an offload type of INCREMENTAL. In some cases, it is possible to combine the behavior of both (see Hybrid View Types).

Backend Table Partitioning

Depending on several factors such as the structure of the source RDBMS table, the type of backend system and users’ chosen data virtualization options, the backend version of the table might be partitioned according to one or more of the following:

  • Gluent Offload Engine’s standard partitioning scheme for sharding offloaded data (known as offload buckets), applicable only to Hadoop-based systems and used to enable parallel query access to offloaded data. Google BigQuery provides native parallel-read capabilities so does not require this partitioning scheme

  • The same partition column(s) as the source RDBMS table, either at the same granularity or a different granularity

  • Different partition column(s) to the source RDBMS table, chosen by the user for the initial offload and supporting a range of granularities

Offload Hybrid Objects

The primary hybrid objects created by Offload are a hybrid external table and a hybrid view. The hybrid external table provides the mechanism for Gluent Data Platform to access the offloaded data, but the hybrid view is the query interface for users. The hybrid view has the same name and columns as the original offloaded table and enables users to query the offloaded data using their original RDBMS queries. For partially-offloaded tables, the corresponding hybrid view will include boundary predicates to ensure that data is retrieved consistently from either the RDBMS or backend as appropriate.

Other hybrid objects are created by Offload to support various query optimizations such as the pushing down of aggregations or joins from the RDBMS to the backend platform. These optimizations are described later in this guide, but in terms of hybrid objects the principle is the same, i.e. the creation of a hybrid external table and hybrid view.

Hybrid View Types

Hybrid view formats can differ according to the offload scenario.

When an RDBMS table has been fully offloaded, the hybrid view is used to access all data from the target backend system. This type of view is known as a 100-0 hybrid view (where 100% of data is accessed from the backend and 0% is accessed from the source RDBMS).

When an RDBMS table has been partially offloaded, the hybrid view is a composite of both the hybrid external table (for accessing the offloaded data) and the source RDBMS table (for accessing the data retained in the RDBMS), and includes predicates based on the offload boundaries. This enables database queries to access data from the appropriate system and, depending on the nature of the users’ queries, allows optimizations such as table or partition elimination to occur. This type of hybrid view is known as a 90-10 hybrid view (denoting that the majority of data is accessed from the backend and a smaller proportion is accessed from the source RDBMS).

For tables that support partial offloading, it is also possible to combine the 100-0 and 90-10 behavior by offloading all of the table data but specifying a boundary to include in a composite hybrid view. This ensures that all data is made available to direct users of the backend system but the hybrid view enables RDBMS queries to continue to access some of the data (usually the most frequently-accessed) from the source RDBMS. This type of hybrid view is known as a 100-10 hybrid view (denoting that 100% of the data is available in the backend system, but a small proportion is accessed from the source RDBMS). This scenario supports the continued offloading of additional sets of data or partitions in the same way as a 90-10 offload.

In addition to the base hybrid external table and hybrid view, Gluent Offload Engine also creates hybrid external tables, hybrid views and metadata to support aggregation pushdown for aggregate queries. When joins are offloaded, Gluent Offload Engine creates hybrid external tables, hybrid views and metadata to support join pushdown for queries that include joins of offloaded tables. In both aggregation and join pushdown scenarios, the hybrid views are not intended to be queried by users: they are created to support transparent query rewrite by the RDBMS and Gluent Data Platform.

See also

For details of how to use Offload, see the Offload guide.

Present

Present enables users and applications to query data stored in a backend data platform (such as Cloudera Data Hub, Cloudera Data Platform, Google BigQuery or Snowflake) from an RDBMS (such as Oracle Database). Present facilitates data-sharing and the enrichment of existing RDBMS application data with information from other sources.

Present Scenarios

Present supports a range of scenarios and is highly-configurable. The following Present features are supported and described in detail in the Present guide:

  • Present a backend table or view to an RDBMS

  • Present a custom aggregation view of a backend table to an RDBMS

  • Present a join of backend tables or views to an RDBMS

  • Present a range of schema modifications to synchronize backend and RDBMS data definitions

Present and Data Sharing

Unlike Offload, Present does not move data between the backend and RDBMS systems. In the majority of cases, Present will simply create hybrid objects and metadata to facilitate the querying of presented data from the RDBMS. This makes Present both lightweight and quick. It is also possible to materialize the results of a Present-Join operation if required. In this case, Gluent Offload Engine will create and populate a new table in the specified backend database.

Present Hybrid Objects

To enable a Presented table or view to be queried from the RDBMS, Gluent Offload Engine creates a set of hybrid objects and metadata in the RDBMS. The primary hybrid objects created by Present are a hybrid external table and a hybrid view. The hybrid external table provides the mechanism for Gluent Data Platform to access the presented data, but the hybrid view is the query interface for users. The hybrid view and hybrid external table have the same columns as the original presented table/view. The hybrid view usually inherits the same name, but it is possible to specify a different name if required.

Other hybrid objects are created by Present to support various query optimizations such as the pushing down of aggregations or joins from the RDBMS to the backend platform. These optimizations are described later in this guide, but in terms of hybrid objects the principle is the same, i.e. the creation of a hybrid external table and hybrid view.

Present Hybrid Schemas

Hybrid objects are always created in hybrid schemas (Gluent Offload Engine never modifies existing application schemas or databases in the RDBMS). At least one hybrid schema must be prepared in advance of the first present (Gluent Offload Engine provides a simple utility to create hybrid schemas), but there are many hybrid schema configurations that Present supports.

Most users of Gluent Data Platform begin by offloading data from RDBMS application schemas to backend systems, usually before presenting backend objects to the RDBMS. The most common configuration, therefore, is to have a hybrid schema per application schema that is being offloaded and to use these for Present also. It is also possible to couple hybrid schemas to RDBMS application schemas for Present-only users. However, hybrid schemas do not need to be coupled to an existing RDBMS application schema. Hybrid schemas can be created as their own namespace and it is therefore possible to present backend objects to the RDBMS without any corresponding RDBMS application schema in scope.

A hybrid schema for Present might therefore be:

  • Coupled to an existing RDBMS application schema and used for both Offload and Present or used for only Present

  • Coupled (by name) to the source backend database/dataset/schema for the Present operation

  • A general, de-coupled hybrid schema, used to Present from one or more backend databases/datasets/schemas to the RDBMS

See also

For details of how to use Present, see the Present guide.

Incremental Update

Incremental Update is a data synchronization feature of Gluent Offload Engine that enables offloaded data to be updated with any data modifications (DML) that occur in the RDBMS after the initial offload. It is primarily designed to enable users to fully-offload tables such as dimensions that typically continue to be modified after offload. Incremental Update can also be used to make updates to offloaded data that no longer exists in the RDBMS (for example, to modify data in partitions that were offloaded and then dropped from the RDBMS).

Incremental Update is highly-configurable and can be enabled for a wide range of tables. It has three main processes:

  1. Capture: A variety of change data capture methods are supported to identify data modifications that need to be synchronized

  2. Extraction: An extraction process propagates modified data to the backend platform

  3. Compaction: A variety of compaction methods are available to merge data modifications to provide optimal performance when querying data

These processes commonly occur at different frequencies according to the user’s latency requirements and the volume of data modifications.

See also

See the Incremental Update guide for more details on how and when to use this feature.

Schema Sync

Schema Sync automates the detection and propagation of source RDBMS schema changes to offloaded tables. It is currently limited to the addition of columns. Schema Sync can be scheduled to run periodically and will detect changes in one, several or all RDBMS schemas that have offloaded tables. It can be further refined to detect changes for one, some or all offloaded tables by name or wildcard. Once detected, changes will be propagated to all related backend tables and views and all dependent hybrid objects in the RDBMS.

For schema changes that are not currently supported by Schema Sync, a manual schema evolution process is possible. Some changes can be propagated to dependent hybrid objects by the simple use of Present, whereas others might require manual schema changes to a corresponding backend table before running Present to refresh the dependent hybrid objects.

See also

See Schema Sync for further details on automated schema evolution and Schema Evolution for details of manual schema evolution with Present.

Offload Status Report

Offload Status Report provides a database, schema and table-level breakdown of offloaded data and categorizes it as follows:

  • Offloaded: The size, partitions and rows of the backend table of offloaded data

  • Retained: The size, partitions/tables and rows of the source RDBMS table that cannot be dropped

  • Reclaimable: The size, partitions/tables and rows of the source RDBMS table that can be dropped if desired

Offload Status Report can output either summary or detailed information and can generate a report in a variety of formats, including HTML, text, CSV and JSON. The first two formats are useful for generating static reports. The last two formats enable thirdparty programs to consume the Offload Status Report data and process it further as required.

See also

See Offload Status Report for further details.

Other Gluent Offload Engine Utilities

Gluent Offload Engine includes several other utilities for supporting and managing the virtualized environment. For example, Diagnose enables Gluent Data Platform users to generate a package of diagnostics information for sharing with Gluent Support. Connect enables administrators to validate the installation and configuration of a Gluent Data Platform environment.

See also

See the Utilities guide for more examples and details of the various Gluent Offload Engine utilities not described earlier in this guide.

Gluent Query Engine

Gluent Query Engine is Gluent Data Platform’s transparent query engine. When an RDBMS user or application executes a SQL query that includes virtualized data (known as a hybrid query), Gluent Query Engine co-ordinates all of the work necessary to return the virtualized data to the RDBMS.

Gluent Query Engine Components

Gluent Query Engine comprises the following components:

Smart Connector

Smart Connector (a C++ binary) is the primary co-ordinator of a hybrid query that requires virtualized data to be accessed, either in addition to RDBMS data or exclusively. One or more Smart Connector processes are instantiated every time a hybrid external table is accessed by a node of an RDBMS execution plan (multiple processes are started when parallel query execution is in scope). The Smart Connector process (or co-ordinator process in the event of a parallel hybrid query) determines the context for the hybrid query, including database session information, execution plan information and metadata for the associated hybrid object. Smart Connector passes control to Metadata Daemon and ultimately Data Daemon to retrieve the virtualized data from the backend and stream it for ingestion by the hybrid external table.

Note

A hybrid external table is accessed by its associated hybrid view. Gluent strongly recommends that the hybrid view is utilized by RDBMS users, rather than the hybrid external table. This provides several benefits, including the fact that in almost all cases, application/user SQL will not have to change to accommodate the fact that some or all data required by the query has been virtualized.

Metadata Daemon

Metadata Daemon is a multi-threaded C++ daemon process that runs on the RDBMS server(s). It is responsible for receiving hybrid query context from the Smart Connector process and building an appropriate SQL statement (in the dialect of the backend platform) that can be used by Data Daemon to retrieve the required virtualized data.

Metadata Daemon can either be managed as a service or will automatically start when a hybrid query first accesses a hybrid external table. It will restart whenever a configuration change is detected. Gluent Data Platform provides and supports a variety of operational controls for Metadata Daemon.

Data Daemon

Data Daemon is a multi-threaded Java daemon that is responsible for retrieving virtualized data from the backend system and returning it to the RDBMS. It receives the SQL provided by Metadata Daemon and executes it on the backend data platform. Depending on factors such as the backend system, backend SQL engine and the degree of parallelism, the SQL will run one or more sessions to extract the data and one or more reader processes to return the data to Smart Connector.

Data Daemon is flexible both in terms of installation and operation. It can run on one or more servers, physical or virtual, on-premises or on a cloud platform. The server or cluster that hosts Data Daemon is typically known as Gluent Node, described earlier in this guide.

Gluent Data Platform provides administrators with tools to manage Data Daemon, including a basic browser-based UI to view the range of backend SQL queries being issued by the daemon.

Hybrid Queries

Hybrid queries are executed on the RDBMS but include virtualized data (either exclusively or in combination with RDBMS data). Hybrid queries can include one or more hybrid views, including multiple references to the same hybrid view if required. The presence of a hybrid view in a query does not always mean that the virtualized backend data will be accessed. Depending on the structure of the hybrid view (described in Hybrid View Types) and elements such as query predicates and join conditions, the underlying hybrid external table(s) might be skipped or eliminated from the RDBMS execution plan. In such cases, data is returned from the RDBMS only with no Gluent Query Engine overheads.

Transparent Query

Hybrid views that are created by Offload will in most cases have the same name as the source offloaded table, but be located in the corresponding hybrid schema. RDBMS tables or views in the source application schema that have not been offloaded will be referenced by synonyms in the hybrid schema. This means that existing user or application SQL can transparently access virtualized data with no changes to the SQL itself by using one of the following common methods to access the hybrid schema objects:

  • Set the current schema for the RDBMS session to the hybrid schema (either manually or with a logon trigger) before executing any SQL

  • Change database connection properties to login to the hybrid schema to execute the SQL

  • Create synonyms for hybrid views in the schema that executes the SQL

Of course, it is also possible to add hybrid schema references to the SQL itself if minor code changes are acceptable, but the techniques in the list above enable transparent hybrid queries to execute without any SQL changes.

Hybrid views that are created by Present are most likely to be new to the RDBMS and would need to be incorporated into new or existing application SQL. The same hybrid query principles apply, however, and these presented hybrid views can be included in user or application SQL to access new sources of virtualized data.

Parallel Hybrid Queries

Hybrid queries can be executed serially or in parallel. When the hybrid external table is accessed in parallel, multiple Smart Connector processes are invoked and the virtualized data is returned concurrently. Depending on the backend platform, it is sometimes more efficient to execute multiple sessions against the backend SQL engine to retrieve the data. Gluent Offload Engine distributes offloaded data across a configurable number of buckets for this purpose and this method is the default when offloading to Hadoop-based backends that support Impala. For cloud warehouses such as Google BigQuery and Snowflake, it is usually more efficient (in terms of resources and hence cost) to execute a single co-ordinator session to create a result cache and read and return the data to the RDBMS in parallel.

Hybrid Query Optimizations

To make the access and ingestion of virtualized data as efficient as possible, Smart Connector utilizes one or more of a number of optimizations as applicable:

  • Projection Pushdown: Smart Connector requests only the columns required by the hybrid query to minimize the amount of data to be transferred by Data Daemon

  • Predicate Pushdown: Smart Connector is able to push down a variety of predicates and expressions to access the backend data as efficiently as possible (including partition elimination) and to minimize the number of rows returned to the RDBMS

  • Join Filter Pulldown (JFPD): This is a patented optimization that generates additional predicates from filters on joined tables and pushes them to the backend, thereby reducing the number of rows that are returned to the RDBMS

  • Advanced Aggregation Pushdown (AAPD): Aggregations can be pushed down to the backend platform to utilize its processing power and to reduce the number of rows returned (pre-aggregated) to the RDBMS

  • Join Pushdown: Joins can be pushed down to the backend to utilize its processing resources, eliminate joins in the RDBMS and in many cases to reduce the number of rows returned to the RDBMS

  • Data type formatting: Data is returned in an optimized format internal to the RDBMS to reduce CPU consumption while ingesting the data. In some cases, data types are converted on the backend system by UDFs, which are designed to utilize backend CPU resources and in some cases they are converted by Data Daemon

See also

For more information on performance optimizations, see the Performance guide.

Hybrid Query Report

Gluent Data Platform includes the Hybrid Query Report utility to visualize and quantify the performance characteristics of a hybrid query. Hybrid Query Report is provided in two execution formats and can be used to report on a hybrid query in several ways. A generated report includes the original RDBMS execution plan together with a variety of information for each hybrid external table in the hybrid query, including the backend query generated by Gluent Query Engine, the number of rows scanned and filtered in the backend, the backend query profile and Smart Connector query optimizations.

See also

See the Utilities guide for more information on Hybrid Query Report.

Documentation Feedback

Send feedback on this documentation to: feedback@gluent.com