Oracle9i Data Warehousing Guide
Release 2 (9.2)
March 2002
Part No. A96520-01
Oracle9i Data Warehousing Guide, Release 2 (9.2)
Contributors: Patrick Amor, Hermann Baer, Subhransu Basu, Srikanth Bellamkonda, Randy Bello,
Tolga Bozkaya, Benoit Dageville, John Haydu, Lilian Hobbs, Hakan Jakobsson, George Lumpkin, Cetin
Ozbutun, Jack Raitto, Ray Roccaforte, Sankar Subramanian, Gregory Smith, Ashish Thusoo,
Jean-Francois Verrier, Gary Vincent, Andy Witkowski, Zia Ziauddin
The Programs (which include both the software and documentation) contain proprietary information of
Oracle Corporation; they are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright, patent and other intellectual and industrial property
laws. Reverse engineering, disassembly or decompilation of the Programs, except to the extent required
to obtain interoperability with other independently created software or as specified by law, is prohibited.
The information contained in this document is subject to change without notice. If you find any problems
in the documentation, please report them to us in writing. Oracle Corporation does not warrant that this
document is error-free. Except as may be expressly permitted in your license agreement for these
Programs, no part of these Programs may be reproduced or transmitted in any form or by any means,
electronic or mechanical, for any purpose, without the express written permission of Oracle Corporation.
If the Programs are delivered to the U.S. Government or anyone licensing or using the programs on
behalf of the U.S. Government, the following notice is applicable:
Restricted Rights Notice Programs delivered subject to the DOD FAR Supplement are "commercial
computer software" and use, duplication, and disclosure of the Programs, including documentation,
shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement.
Otherwise, Programs delivered subject to the Federal Acquisition Regulations are "restricted computer
software" and use, duplication, and disclosure of the Programs shall be subject to the restrictions in FAR
52.227-19, Commercial Computer Software - Restricted Rights (June, 1987). Oracle Corporation, 500
Oracle Parkway, Redwood City, CA 94065.
The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently
dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup,
redundancy, and other measures to ensure the safe use of such applications if the Programs are used for
such purposes, and Oracle Corporation disclaims liability for any damages caused by such use of the
Programs.
Oracle is a registered trademark, and Express, Oracle Expert, Oracle Store, Oracle7, Oracle8, Oracle8i,
Oracle9i, PL/SQL, Pro*C, and SQL*Plus are trademarks or registered trademarks of Oracle
Corporation. Other names may be trademarks of their respective owners.
Contents
Preface.......................................................................................................................................................... xxi
Part I Concepts
Creating a Logical Design ................................................................................................................. 2-2
Data Warehousing Schemas.............................................................................................................. 2-3
Star Schemas .................................................................................................................................. 2-4
Other Schemas............................................................................................................................... 2-5
Data Warehousing Objects................................................................................................................ 2-5
Fact Tables...................................................................................................................................... 2-5
Dimension Tables ......................................................................................................................... 2-6
Unique Identifiers ......................................................................................................................... 2-8
Relationships ................................................................................................................................. 2-8
Example of Data Warehousing Objects and Their Relationships.......................................... 2-8
RAID 1 (Mirroring)..................................................................................................................... 4-10
RAID 0+1 (Striping and Mirroring) ......................................................................................... 4-10
Striping, Mirroring, and Media Recovery............................................................................... 4-10
RAID 5.......................................................................................................................................... 4-11
The Importance of Specific Analysis........................................................................................ 4-12
6 Indexes
Bitmap Indexes.................................................................................................................................... 6-2
Bitmap Join Indexes...................................................................................................................... 6-6
B-tree Indexes .................................................................................................................................... 6-10
Local Indexes Versus Global Indexes ........................................................................................... 6-10
7 Integrity Constraints
Why Integrity Constraints are Useful in a Data Warehouse ...................................................... 7-2
Overview of Constraint States.......................................................................................................... 7-3
Typical Data Warehouse Integrity Constraints ............................................................................. 7-4
UNIQUE Constraints in a Data Warehouse ............................................................................. 7-4
FOREIGN KEY Constraints in a Data Warehouse................................................................... 7-5
RELY Constraints.......................................................................................................................... 7-6
Integrity Constraints and Parallelism ........................................................................................ 7-7
Integrity Constraints and Partitioning....................................................................................... 7-7
View Constraints........................................................................................................................... 7-7
8 Materialized Views
Overview of Data Warehousing with Materialized Views......................................................... 8-2
Materialized Views for Data Warehouses................................................................................. 8-2
Materialized Views for Distributed Computing ...................................................................... 8-3
Materialized Views for Mobile Computing .............................................................................. 8-3
The Need for Materialized Views .............................................................................................. 8-3
Components of Summary Management ................................................................................... 8-5
Data Warehousing Terminology ................................................................................................ 8-7
Materialized View Schema Design ............................................................................................ 8-8
Loading Data ............................................................................................................................... 8-10
Overview of Materialized View Management Tasks ............................................................ 8-11
Types of Materialized Views .......................................................................................................... 8-12
Materialized Views with Aggregates....................................................................................... 8-13
Materialized Views Containing Only Joins ............................................................................ 8-16
Nested Materialized Views ....................................................................................................... 8-18
Creating Materialized Views .......................................................................................................... 8-21
Naming Materialized Views ..................................................................................................... 8-22
Storage And Data Segment Compression............................................................................... 8-23
Build Methods ............................................................................................................................. 8-23
Enabling Query Rewrite ............................................................................................................ 8-24
Query Rewrite Restrictions ....................................................................................................... 8-24
Refresh Options........................................................................................................................... 8-25
ORDER BY Clause ...................................................................................................................... 8-31
Materialized View Logs ............................................................................................................. 8-31
Using Oracle Enterprise Manager ............................................................................................ 8-32
Using Materialized Views with NLS Parameters .................................................................. 8-32
Registering Existing Materialized Views..................................................................................... 8-33
Partitioning and Materialized Views............................................................................................ 8-35
Partition Change Tracking ........................................................................................................ 8-35
Partitioning a Materialized View ............................................................................................. 8-39
Partitioning a Prebuilt Table ..................................................................................................... 8-40
Rolling Materialized Views....................................................................................................... 8-41
Materialized Views in OLAP Environments............................................................................... 8-41
OLAP Cubes ................................................................................................................................ 8-41
Specifying OLAP Cubes in SQL ............................................................................................... 8-42
Querying OLAP Cubes in SQL................................................................................................. 8-43
Partitioning Materialized Views for OLAP ............................................................................ 8-47
Compressing Materialized Views for OLAP.......................................................................... 8-47
Materialized Views with Set Operators .................................................................................. 8-47
Choosing Indexes for Materialized Views................................................................................... 8-49
Invalidating Materialized Views ................................................................................................... 8-50
Security Issues with Materialized Views..................................................................................... 8-50
Altering Materialized Views .......................................................................................................... 8-51
Dropping Materialized Views........................................................................................................ 8-52
Analyzing Materialized View Capabilities ................................................................................. 8-52
Using the DBMS_MVIEW.EXPLAIN_MVIEW Procedure................................................... 8-53
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details ............................................... 8-56
MV_CAPABILITIES_TABLE Column Details ....................................................................... 8-58
9 Dimensions
What are Dimensions? ....................................................................................................................... 9-2
Creating Dimensions ......................................................................................................................... 9-4
Multiple Hierarchies .................................................................................................................... 9-7
Using Normalized Dimension Tables ....................................................................................... 9-9
Viewing Dimensions........................................................................................................................ 9-10
Using The DEMO_DIM Package.............................................................................................. 9-10
Using Oracle Enterprise Manager............................................................................................ 9-11
Using Dimensions with Constraints............................................................................................. 9-11
Validating Dimensions .................................................................................................................... 9-12
Altering Dimensions........................................................................................................................ 9-13
Deleting Dimensions ....................................................................................................................... 9-14
Using the Dimension Wizard ......................................................................................................... 9-14
Managing the Dimension Object .............................................................................................. 9-14
Creating a Dimension................................................................................................................. 9-17
External Tables............................................................................................................................ 13-6
OCI and Direct-Path APIs ......................................................................................................... 13-8
Export/Import ............................................................................................................................ 13-8
Transformation Mechanisms .......................................................................................................... 13-9
Transformation Using SQL ....................................................................................................... 13-9
Transformation Using PL/SQL .............................................................................................. 13-15
Transformation Using Table Functions................................................................................. 13-16
Loading and Transformation Scenarios...................................................................................... 13-25
Parallel Load Scenario.............................................................................................................. 13-25
Key Lookup Scenario ............................................................................................................... 13-33
Exception Handling Scenario ................................................................................................. 13-34
Pivoting Scenarios .................................................................................................................... 13-35
Tips for Refreshing Materialized Views Without Aggregates ........................................... 14-22
Tips for Refreshing Nested Materialized Views .................................................................. 14-23
Tips for Fast Refresh with UNION ALL ............................................................................... 14-25
Tips After Refreshing Materialized Views............................................................................ 14-25
Using Materialized Views with Partitioned Tables ................................................................. 14-26
Fast Refresh with Partition Change Tracking....................................................................... 14-26
Fast Refresh with CONSIDER FRESH................................................................................... 14-30
16 Summary Advisor
Overview of the Summary Advisor in the DBMS_OLAP Package ........................................ 16-2
Using the Summary Advisor .......................................................................................................... 16-6
Identifier Numbers ..................................................................................................................... 16-7
Workload Management ............................................................................................................. 16-7
Loading a User-Defined Workload.......................................................................................... 16-9
Loading a Trace Workload ...................................................................................................... 16-12
Loading a SQL Cache Workload ............................................................................................ 16-15
Validating a Workload............................................................................................................. 16-17
Removing a Workload ............................................................................................................. 16-18
Using Filters with the Summary Advisor ............................................................................. 16-18
Removing a Filter ..................................................................................................................... 16-22
Recommending Materialized Views...................................................................................... 16-23
SQL Script Generation ............................................................................................................. 16-27
Summary Data Report ............................................................................................................. 16-29
When Recommendations are No Longer Required............................................................. 16-31
Stopping the Recommendation Process................................................................................ 16-32
Summary Advisor Sample Sessions ...................................................................................... 16-32
Summary Advisor and Missing Statistics ............................................................................. 16-37
Summary Advisor Privileges and ORA-30446..................................................................... 16-38
Estimating Materialized View Size............................................................................................. 16-38
ESTIMATE_MVIEW_SIZE Parameters ................................................................................. 16-38
Is a Materialized View Being Used? ........................................................................................... 16-39
DBMS_OLAP.EVALUATE_MVIEW_STRATEGY Procedure ........................................... 16-39
Summary Advisor Wizard............................................................................................................. 16-40
Summary Advisor Steps.......................................................................................................... 16-41
Analyzing Across Multiple Dimensions ................................................................................. 18-3
Optimized Performance............................................................................................................. 18-4
An Aggregate Scenario .............................................................................................................. 18-5
Interpreting NULLs in Examples ............................................................................................. 18-6
ROLLUP Extension to GROUP BY................................................................................................ 18-6
When to Use ROLLUP ............................................................................................................... 18-7
ROLLUP Syntax .......................................................................................................................... 18-7
Partial Rollup............................................................................................................................... 18-8
CUBE Extension to GROUP BY ................................................................................................... 18-10
When to Use CUBE................................................................................................................... 18-10
CUBE Syntax ............................................................................................................................. 18-11
Partial CUBE.............................................................................................................................. 18-12
Calculating Subtotals Without CUBE .................................................................................... 18-13
GROUPING Functions .................................................................................................................. 18-13
GROUPING Function .............................................................................................................. 18-14
When to Use GROUPING ....................................................................................................... 18-16
GROUPING_ID Function ........................................................................................................ 18-17
GROUP_ID Function................................................................................................................ 18-17
GROUPING SETS Expression ..................................................................................................... 18-19
Composite Columns ....................................................................................................................... 18-21
Concatenated Groupings............................................................................................................... 18-24
Concatenated Groupings and Hierarchical Data Cubes..................................................... 18-26
Considerations when Using Aggregation .................................................................................. 18-28
Hierarchy Handling in ROLLUP and CUBE ........................................................................ 18-28
Column Capacity in ROLLUP and CUBE............................................................................. 18-29
HAVING Clause Used with GROUP BY Extensions .......................................................... 18-29
ORDER BY Clause Used with GROUP BY Extensions ....................................................... 18-30
Using Other Aggregate Functions with ROLLUP and CUBE............................................ 18-30
Computation Using the WITH Clause........................................................................................ 18-30
Bottom N Ranking.................................................................................................................... 19-12
CUME_DIST .............................................................................................................................. 19-13
PERCENT_RANK..................................................................................................................... 19-14
NTILE ......................................................................................................................................... 19-14
ROW_NUMBER........................................................................................................................ 19-16
Windowing Aggregate Functions ................................................................................................ 19-17
Treatment of NULLs as Input to Window Functions ......................................................... 19-18
Windowing Functions with Logical Offset........................................................................... 19-18
Cumulative Aggregate Function Example ........................................................................... 19-18
Moving Aggregate Function Example .................................................................................. 19-19
Centered Aggregate Function................................................................................................. 19-20
Windowing Aggregate Functions in the Presence of Duplicates ...................................... 19-21
Varying Window Size for Each Row ..................................................................................... 19-22
Windowing Aggregate Functions with Physical Offsets.................................................... 19-23
FIRST_VALUE and LAST_VALUE ....................................................................................... 19-24
Reporting Aggregate Functions ................................................................................................... 19-24
Reporting Aggregate Example ............................................................................................... 19-26
RATIO_TO_REPORT ............................................................................................................... 19-27
LAG/LEAD Functions.................................................................................................................... 19-27
LAG/LEAD Syntax .................................................................................................................. 19-28
FIRST/LAST Functions.................................................................................................................. 19-28
FIRST/LAST Syntax................................................................................................................. 19-29
FIRST/LAST As Regular Aggregates.................................................................................... 19-29
FIRST/LAST As Reporting Aggregates ................................................................................ 19-30
Linear Regression Functions ........................................................................................................ 19-31
REGR_COUNT ......................................................................................................................... 19-32
REGR_AVGY and REGR_AVGX ........................................................................................... 19-32
REGR_SLOPE and REGR_INTERCEPT................................................................................ 19-32
REGR_R2.................................................................................................................................... 19-32
REGR_SXX, REGR_SYY, and REGR_SXY............................................................................. 19-33
Linear Regression Statistics Examples................................................................................... 19-33
Sample Linear Regression Calculation.................................................................................. 19-34
Inverse Percentile Functions......................................................................................................... 19-34
Normal Aggregate Syntax....................................................................................................... 19-35
Inverse Percentile Restrictions................................................................................................ 19-38
Hypothetical Rank and Distribution Functions ....................................................................... 19-38
Hypothetical Rank and Distribution Syntax......................................................................... 19-38
WIDTH_BUCKET Function.......................................................................................................... 19-40
WIDTH_BUCKET Syntax........................................................................................................ 19-40
User-Defined Aggregate Functions ............................................................................................. 19-43
CASE Expressions........................................................................................................................... 19-44
CASE Example .......................................................................................................................... 19-44
Creating Histograms With User-Defined Buckets............................................................... 19-45
How Oracle Determines the Degree of Parallelism for Operations.................................. 21-34
Balancing the Workload .......................................................................................................... 21-37
Parallelization Rules for SQL Statements.............................................................................. 21-38
Enabling Parallelism for Tables and Queries ....................................................................... 21-46
Degree of Parallelism and Adaptive Multiuser: How They Interact ................................ 21-47
Forcing Parallel Execution for a Session ............................................................................... 21-48
Controlling Performance with the Degree of Parallelism .................................................. 21-48
Tuning General Parameters for Parallel Execution .................................................................. 21-49
Parameters Establishing Resource Limits for Parallel Operations.................................... 21-49
Parameters Affecting Resource Consumption ..................................................................... 21-58
Parameters Related to I/O ...................................................................................................... 21-63
Monitoring and Diagnosing Parallel Execution Performance............................................... 21-64
Is There Regression?................................................................................................................. 21-66
Is There a Plan Change?........................................................................................................... 21-66
Is There a Parallel Plan?........................................................................................................... 21-66
Is There a Serial Plan? .............................................................................................................. 21-66
Is There Parallel Execution? .................................................................................................... 21-67
Is the Workload Evenly Distributed? .................................................................................... 21-67
Monitoring Parallel Execution Performance with Dynamic Performance Views .......... 21-68
Monitoring Session Statistics .................................................................................................. 21-71
Monitoring System Statistics................................................................................................... 21-73
Monitoring Operating System Statistics................................................................................ 21-74
Affinity and Parallel Operations.................................................................................................. 21-75
Affinity and Parallel Queries .................................................................................................. 21-75
Affinity and Parallel DML....................................................................................................... 21-76
Miscellaneous Parallel Execution Tuning Tips......................................................................... 21-76
Setting Buffer Cache Size for Parallel Operations ............................................................... 21-77
Overriding the Default Degree of Parallelism...................................................................... 21-77
Rewriting SQL Statements ...................................................................................................... 21-78
Creating and Populating Tables in Parallel .......................................................................... 21-78
Creating Temporary Tablespaces for Parallel Sort and Hash Join.................................... 21-80
Executing Parallel SQL Statements ........................................................................................ 21-81
Using EXPLAIN PLAN to Show Parallel Operations Plans .............................................. 21-81
Additional Considerations for Parallel DML ....................................................................... 21-82
Creating Indexes in Parallel .................................................................................................... 21-85
Parallel DML Tips..................................................................................................................... 21-87
Incremental Data Loading in Parallel .................................................................................... 21-90
Using Hints with Cost-Based Optimization ......................................................................... 21-92
FIRST_ROWS(n) Hint .............................................................................................................. 21-93
Enabling Dynamic Statistic Sampling.................................................................................... 21-93
22 Query Rewrite
Overview of Query Rewrite............................................................................................................ 22-2
Cost-Based Rewrite..................................................................................................................... 22-3
When Does Oracle Rewrite a Query? ...................................................................................... 22-4
Enabling Query Rewrite.................................................................................................................. 22-7
Initialization Parameters for Query Rewrite .......................................................................... 22-8
Controlling Query Rewrite........................................................................................................ 22-8
Privileges for Enabling Query Rewrite.................................................................................... 22-9
Accuracy of Query Rewrite ..................................................................................................... 22-10
How Oracle Rewrites Queries ...................................................................................................... 22-11
Text Match Rewrite Methods.................................................................................................. 22-12
General Query Rewrite Methods............................................................................................ 22-13
When are Constraints and Dimensions Needed? ................................................................ 22-14
Special Cases for Query Rewrite ................................................................................................. 22-45
Query Rewrite Using Partially Stale Materialized Views................................................... 22-45
Query Rewrite Using Complex Materialized Views ........................................................... 22-49
Query Rewrite Using Nested Materialized Views............................................................... 22-50
Query Rewrite When Using GROUP BY Extensions .......................................................... 22-51
Did Query Rewrite Occur?............................................................................................................ 22-56
Explain Plan............................................................................................................................... 22-56
DBMS_MVIEW.EXPLAIN_REWRITE Procedure ............................................................... 22-57
Design Considerations for Improving Query Rewrite Capabilities..................................... 22-63
Query Rewrite Considerations: Constraints......................................................................... 22-63
Query Rewrite Considerations: Dimensions ........................................................................ 22-63
Query Rewrite Considerations: Outer Joins ......................................................................... 22-63
Query Rewrite Considerations: Text Match ......................................................................... 22-63
Query Rewrite Considerations: Aggregates ......................................................................... 22-64
Query Rewrite Considerations: Grouping Conditions ....................................................... 22-64
Query Rewrite Considerations: Expression Matching ........................................................ 22-64
Query Rewrite Considerations: Date Folding ...................................................................... 22-65
Query Rewrite Considerations: Statistics.............................................................................. 22-65
Glossary
Index
Send Us Your Comments
Oracle9i Data Warehousing Guide, Release 2 (9.2)
Part No. A96520-01
Oracle Corporation welcomes your comments and suggestions on the quality and usefulness of this
document. Your input is an important part of the information used for revision.
■ Did you find any errors?
■ Is the information clearly presented?
■ Do you need more information? If so, where?
■ Are the examples correct? Do you need more examples?
■ What features did you like most?
If you find any errors or have any other suggestions for improvement, please indicate the document
title and part number, and the chapter, section, and page number (if available). You can send
comments to us in the following ways:
■ Electronic mail: [email protected]
■ FAX: (650) 506-7227 Attn: Server Technologies Documentation Manager
■ Postal service:
Oracle Corporation
Server Technologies Documentation
500 Oracle Parkway, Mailstop 4op11
Redwood Shores, CA 94065
USA
If you would like a reply, please give your name, address, telephone number, and (optionally)
electronic mail address.
If you have problems with the software, please contact your local Oracle Support Services.
Preface
Audience
Oracle9i Data Warehousing Guide is intended for database administrators, system
administrators, and database application developers who design, maintain, and use
data warehouses.
To use this document, you need to be familiar with relational database concepts,
basic Oracle server concepts, and the operating system environment under which
you are running Oracle.
Organization
This document contains:
Part 1: Concepts
Chapter 6, Indexes
This chapter describes how to use indexes in data warehouses.
Chapter 7, Integrity Constraints
This chapter describes some issues involving constraints.
Chapter 9, Dimensions
This chapter describes how to use dimensions in data warehouses.
Part 5: Warehouse Performance
Glossary
Related Documentation
For more information, see these Oracle resources:
■ Oracle9i Database Performance Tuning Guide and Reference
Many of the examples in this book use the sample schemas of the seed database,
which is installed by default when you install Oracle. Refer to Oracle9i Sample
Schemas for information on how these schemas were created and how you can use
them yourself.
In North America, printed documentation is available for sale in the Oracle Store at
http://oraclestore.oracle.com/
Customers in Europe, the Middle East, and Africa (EMEA) can purchase
documentation from
http://www.oraclebookshop.com/
If you already have a username and password for OTN, then you can go directly to
the documentation section of the OTN Web site at
http://otn.oracle.com/docs/index.htm
Conventions
This section describes the conventions used in the text and code examples of this
documentation set. It describes:
■ Conventions in Text
■ Conventions in Code Examples
■ Conventions for Windows Operating Systems
Conventions in Text
We use various conventions in text to help you more quickly identify special terms.
The following table describes those conventions and provides examples of their use.
Conventions in Code Examples
Code examples illustrate SQL, PL/SQL, SQL*Plus, or other command-line
statements. They are displayed in a monospace (fixed-width) font and separated
from normal text as shown in this example:
SELECT username FROM dba_users WHERE username = 'MIGRATE';
The following table describes typographic conventions used in code examples and
provides examples of their use.
Convention: Italics
Meaning: Italicized text indicates placeholders or variables for which you must supply particular values.
Example:
CONNECT SYSTEM/system_password
DB_NAME = database_name

Convention: UPPERCASE
Meaning: Uppercase typeface indicates elements supplied by the system. We show these terms in uppercase in order to distinguish them from terms you define. Unless terms appear in brackets, enter them in the order and with the spelling shown. However, because these terms are not case sensitive, you can enter them in lowercase.
Example:
SELECT last_name, employee_id FROM employees;
SELECT * FROM USER_TABLES;
DROP TABLE hr.employees;

Convention: lowercase
Meaning: Lowercase typeface indicates programmatic elements that you supply. For example, lowercase indicates names of tables, columns, or files. Note: Some programmatic elements use a mixture of UPPERCASE and lowercase. Enter these elements as shown.
Example:
SELECT last_name, employee_id FROM employees;
sqlplus hr/hr
CREATE USER mjones IDENTIFIED BY ty3MU9;
Conventions for Windows Operating Systems

Convention: C:\>
Meaning: Represents the Windows command prompt of the current hard disk drive. The escape character in a command prompt is the caret (^). Your prompt reflects the subdirectory in which you are working. Referred to as the command prompt in this manual.
Example:
C:\oracle\oradata>

Convention: Special characters
Meaning: The backslash (\) special character is sometimes required as an escape character for the double quotation mark (") special character at the Windows command prompt. Parentheses and the single quotation mark (') do not require an escape character. Refer to your Windows operating system documentation for more information on escape and special characters.
Example:
C:\>exp scott/tiger TABLES=emp QUERY=\"WHERE job='SALESMAN' and sal<1600\"
C:\>imp SYSTEM/password FROMUSER=scott TABLES=(emp, dept)

Convention: HOME_NAME
Meaning: Represents the Oracle home name. The home name can be up to 16 alphanumeric characters. The only special character allowed in the home name is the underscore.
Example:
C:\> net start OracleHOME_NAMETNSListener

Convention: ORACLE_HOME and ORACLE_BASE
Meaning: In releases prior to Oracle8i release 8.1.3, when you installed Oracle components, all subdirectories were located under a top level ORACLE_HOME directory that by default used one of the following names:
■ C:\orant for Windows NT
■ C:\orawin98 for Windows 98
This release complies with Optimal Flexible Architecture (OFA) guidelines. Subdirectories are no longer located under a top level ORACLE_HOME directory. There is a top level directory called ORACLE_BASE that by default is C:\oracle. If you install the latest Oracle release on a computer with no other Oracle software installed, then the default setting for the first Oracle home directory is C:\oracle\orann, where nn is the latest release number. The Oracle home directory is located directly under ORACLE_BASE.
All directory path examples in this guide follow OFA conventions.
Refer to Oracle9i Database Getting Started for Windows for additional information about OFA compliance and for information about installing Oracle products in non-OFA compliant directories.
Example:
Go to the ORACLE_BASE\ORACLE_HOME\rdbms\admin directory.
Documentation Accessibility
Our goal is to make Oracle products, services, and supporting documentation
accessible, with good usability, to the disabled community. To that end, our
documentation includes features that make information available to users of
assistive technology. This documentation is available in HTML format, and contains
markup to facilitate access by the disabled community. Standards will continue to
evolve over time, and Oracle Corporation is actively engaged with other
market-leading technology vendors to address technical obstacles so that our
documentation can be accessible to all of our customers. For additional information,
visit the Oracle Accessibility Program Web site at
http://www.oracle.com/accessibility/
What’s New in Data Warehousing?
This section describes new features of Oracle9i release 2 (9.2) and provides pointers
to additional information. New features information from previous releases is also
retained to help those users migrating to the current release.
The following sections describe the new features in Oracle Data Warehousing:
■ Oracle9i Release 2 (9.2) New Features in Data Warehousing
■ Oracle9i Release 1 (9.0.1) New Features in Data Warehousing
Oracle9i Release 2 (9.2) New Features in Data Warehousing
■ Data Segment Compression
You can compress data segments in heap-organized tables; partitioned tables are typical
candidates for data segment compression. Data segment compression is also useful for
highly redundant data, such as tables with many foreign keys and materialized views
created with the ROLLUP clause. Avoid compressing tables that are subject to frequent
updates or other DML.
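As a simplified sketch (the table, column, and partition names are illustrative only), segment compression can be specified when a table is created, and an existing partition can be compressed by rebuilding it:

-- Create a range-partitioned table with data segment compression enabled
CREATE TABLE sales_history
 (prod_id  NUMBER,
  time_id  DATE,
  amount   NUMBER)
COMPRESS
PARTITION BY RANGE (time_id)
 (PARTITION sales_2001 VALUES LESS THAN (TO_DATE('01-JAN-2002', 'DD-MON-YYYY')),
  PARTITION sales_2002 VALUES LESS THAN (MAXVALUE));

-- Compress an existing partition by rebuilding it
ALTER TABLE sales_history MOVE PARTITION sales_2001 COMPRESS;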
■ Partitioning Enhancements
You can now simplify SQL syntax by using a DEFAULT partition or a
subpartition template. You can implement SPLIT operations more easily.
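For example, a list-partitioned table can use a DEFAULT partition to catch rows whose values are not named in any other partition. A minimal sketch; the table and partition names are illustrative:

CREATE TABLE sales_by_state
 (sale_id  NUMBER,
  state    VARCHAR2(2),
  amount   NUMBER)
PARTITION BY LIST (state)
 (PARTITION p_east  VALUES ('NY', 'NJ', 'CT'),
  PARTITION p_west  VALUES ('CA', 'OR', 'WA'),
  PARTITION p_other VALUES (DEFAULT));   -- any state not listed above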
■ Query Rewrite Enhancements
Text match processing and join equivalence recognition have been improved.
Materialized views containing the UNION ALL operator can now use query
rewrite.
■ Range-List Partitioning
You can now subpartition range-partitioned tables by list.
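For example, a table can be partitioned by range on a date column and subpartitioned by list on a state column. The following sketch also uses a subpartition template so that every range partition receives the same list subpartitions; all names are illustrative:

CREATE TABLE quarterly_sales
 (sale_id    NUMBER,
  sale_date  DATE,
  state      VARCHAR2(2),
  amount     NUMBER)
PARTITION BY RANGE (sale_date)
SUBPARTITION BY LIST (state)
SUBPARTITION TEMPLATE
 (SUBPARTITION east VALUES ('NY', 'NJ', 'CT'),
  SUBPARTITION west VALUES ('CA', 'OR', 'WA'))
 (PARTITION q1_2002 VALUES LESS THAN (TO_DATE('01-APR-2002', 'DD-MON-YYYY')),
  PARTITION q2_2002 VALUES LESS THAN (TO_DATE('01-JUL-2002', 'DD-MON-YYYY')));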
Oracle9i Release 1 (9.0.1) New Features in Data Warehousing
■ ETL Enhancements
Oracle’s extraction, transformation, and loading capabilities have been
improved with a MERGE statement, multi-table inserts, and table functions.
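As a brief sketch, MERGE updates a matching row and inserts the row when no match exists, replacing a separate UPDATE followed by an INSERT (the sales_summary and new_sales tables are assumed here for illustration):

MERGE INTO sales_summary s
USING new_sales n
ON (s.prod_id = n.prod_id AND s.time_id = n.time_id)
WHEN MATCHED THEN
  -- existing row: accumulate the new amount
  UPDATE SET s.amount_sold = s.amount_sold + n.amount_sold
WHEN NOT MATCHED THEN
  -- no existing row: insert it
  INSERT (prod_id, time_id, amount_sold)
  VALUES (n.prod_id, n.time_id, n.amount_sold);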
■ Full Outer Joins
Oracle added full support for full outer joins so that you can more easily
express certain complex queries.
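For example, a full outer join returns customers that have no sales as well as sales rows that have no matching customer, a result that previously required combining two outer joins. A minimal sketch against the sample sh schema:

SELECT c.cust_last_name, s.amount_sold
FROM   customers c
       FULL OUTER JOIN sales s
       ON (c.cust_id = s.cust_id);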
■ Grouping Sets
You can now selectively specify the set of groups that you want to create using
a GROUPING SETS expression within a GROUP BY clause. This allows precise
specification across multiple dimensions without computing the whole CUBE.
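For example, the following sketch (based on the sample sh schema) computes just two groupings, channel by month and month by country, instead of the full cube over all three columns:

SELECT   ch.channel_desc, t.calendar_month_desc, c.country_id,
         SUM(s.amount_sold) AS sales_amount
FROM     sales s, customers c, times t, channels ch
WHERE    s.cust_id = c.cust_id
AND      s.time_id = t.time_id
AND      s.channel_id = ch.channel_id
GROUP BY GROUPING SETS
         ((ch.channel_desc, t.calendar_month_desc),
          (t.calendar_month_desc, c.country_id));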
■ List Partitioning
List partitioning offers you precise control over which data belongs in a
particular partition.
■ Summary Advisor Enhancements
The Summary Advisor tool and its related DBMS_OLAP package were improved
so you can specify workloads. In addition, a broader class of schemas is now
supported.
■ WITH Clause
The WITH clause enables you to reuse a query block in a SELECT statement
when it occurs more than once within a complex query.
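For example, the following sketch (based on the sample sh schema) defines a summary query block once and reuses it both as the source of the main query and inside the subquery that computes the comparison value:

WITH channel_summary AS
 (SELECT ch.channel_desc, SUM(s.amount_sold) AS channel_total
  FROM   sales s, channels ch
  WHERE  s.channel_id = ch.channel_id
  GROUP BY ch.channel_desc)
SELECT channel_desc, channel_total
FROM   channel_summary
WHERE  channel_total >
       (SELECT SUM(channel_total) / 3 FROM channel_summary);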
Part I
Concepts
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company’s sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they achieve
this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse’s focus on change over time is what is meant by the term time
variant.
Figure 1–1 contrasts OLTP and data warehousing environments: OLTP systems use complex data structures (3NF databases), while data warehouses use multidimensional data structures.
One major difference between the two types of systems is that data warehouses are not
usually in third normal form (3NF), a type of data normalization common in OLTP
environments.
Data warehouses and OLTP systems have very different requirements. Here are
some examples of differences between typical data warehouses and OLTP systems:
■ Workload
Data warehouses are designed to accommodate ad hoc queries. You might not
know the workload of your data warehouse in advance, so a data warehouse
should be optimized to perform well for a wide variety of possible query
operations.
OLTP systems support only predefined operations. Your applications might be
specifically tuned or designed to support only these operations.
■ Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly
or weekly) using bulk data modification techniques. The end users of a data
warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and reflects
the current state of each business transaction.
■ Schema design
Data warehouses often use denormalized or partially denormalized schemas
(such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency.
■ Typical operations
A typical data warehouse query scans thousands or millions of rows. For
example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example,
"Retrieve the current order for this customer."
■ Historical data
Data warehouses usually store many months or years of data. This is to support
historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP
system stores only historical data as needed to successfully meet the
requirements of the current transaction.
Figure 1–2 Architecture of a Data Warehouse: operational source systems supply data to the warehouse, which stores metadata, summary data, and raw data and serves analysis users.
In Figure 1–2, the metadata and raw data of a traditional OLTP system are present, as
is an additional type of data, summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. For example, a
typical data warehouse query is to retrieve something like August sales. A
summary in Oracle is called a materialized view.
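For example, monthly sales totals can be computed once and stored as a materialized view, so queries for a given month read the small summary instead of scanning the detail data. A simplified sketch based on the sample sh schema (refresh and query rewrite options are discussed in Chapter 8):

CREATE MATERIALIZED VIEW cal_month_sales_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars
FROM   sales s, times t
WHERE  s.time_id = t.time_id
GROUP BY t.calendar_month_desc;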
Figure 1–3 Architecture of a Data Warehouse with a Staging Area: data flows from operational source systems through a staging area into the warehouse (metadata, summary data, and raw data) and on to analysis users.
Figure 1–4 Architecture of a Data Warehouse with a Staging Area and Data Marts
This section deals with the issues in logical design in a data warehouse.
It contains the following chapter:
■ Logical Design in Data Warehouses
2
Logical Design in Data Warehouses
This chapter tells you how to design a data warehousing environment and includes
the following topics:
■ Logical Versus Physical Design in Data Warehouses
■ Creating a Logical Design
■ Data Warehousing Schemas
■ Data Warehousing Objects
Your logical design should result in (1) a set of entities and attributes corresponding
to fact tables and dimension tables and (2) a model of operational data from your
source into subject-oriented information in your target data warehouse schema.
You can create the logical design using a pen and paper, or you can use a design
tool such as Oracle Warehouse Builder (specifically designed to support modeling
the ETL process) or Oracle Designer (a general purpose modeling tool).
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema
because the diagram resembles a star, with points radiating from a center. The
center of the star consists of one or more fact tables and the points of the star are the
dimension tables, as shown in Figure 2–1.
Figure 2–1 Star Schema: the sales fact table (amount_sold, quantity_sold) sits at the center, with the products, times, customers, and channels dimension tables as the points of the star.
The most natural way to model a data warehouse is as a star schema, in which only one
join establishes the relationship between the fact table and any one of the dimension
tables.
A star schema optimizes performance by keeping queries simple and providing fast
response time. All the information about each level is stored in one row.
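For example, a typical star query joins the sales fact table to two of its dimension tables, filters on dimension attributes, and aggregates the facts. The following sketch is based on the sample sh schema:

SELECT c.cust_city, t.calendar_quarter_desc,
       SUM(s.amount_sold) AS sales_amount
FROM   sales s, times t, customers c
WHERE  s.time_id = t.time_id
AND    s.cust_id = c.cust_id
AND    c.cust_state_province = 'CA'
AND    t.calendar_quarter_desc IN ('1999-Q1', '1999-Q2')
GROUP BY c.cust_city, t.calendar_quarter_desc;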
Other Schemas
Some schemas in data warehousing environments use third normal form rather
than star schemas. Another schema that is sometimes useful is the snowflake
schema, which is a star schema with normalized dimensions in a tree structure.
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables. A
fact table contains either detail-level facts or facts that have been aggregated. Fact
tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation. Though most facts are
additive, they can also be semi-additive or non-additive. Additive facts can be
aggregated by simple arithmetical addition. A common example of this is sales.
Non-additive facts cannot be added at all. An example of this is averages.
Semi-additive facts can be aggregated along some of the dimensions and not along
others. An example of this is inventory levels, where you cannot tell what a level
means simply by looking at it.
Dimension Tables
A dimension is a structure, often composed of one or more hierarchies, that
categorizes data. Dimensional attributes help to describe the dimensional value.
They are normally descriptive, textual values. Several distinct dimensions,
combined with facts, enable you to answer business questions. Commonly used
dimensions are customers, products, and time.
Dimension data is typically collected at the lowest level of detail and then
aggregated into higher level totals that are more useful for analysis. These natural
rollups or aggregations within a dimension table are called hierarchies.
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the quarter
level to the year level. A hierarchy can also be used to define a navigational drill
path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it.
Data values at lower levels aggregate into the data values at higher levels. A
dimension can be composed of more than one hierarchy. For example, in the
product dimension, there might be two hierarchies—one for product categories
and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view different levels of
granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business
structures; for example, a divisional, multilevel sales organization typically requires a
matching multilevel hierarchy.
Hierarchies impose a family structure on dimension values. For a particular level
value, a value at the next higher level is its parent, and values at the next lower level
are its children. These familial relationships enable analysts to access data quickly.
[A figure here shows the levels of such a hierarchy: customer at the lowest level, rolling up
through country_name and subregion to region.]
Unique Identifiers
Unique identifiers are specified for one distinct record in a dimension table.
Artificial unique identifiers are often used to avoid the potential problem of unique
identifiers changing. Unique identifiers are represented with the # character. For
example, #customer_id.
Relationships
Relationships guarantee business integrity. An example is that if a business sells
something, there is obviously a customer and a product. Designing a relationship
between the sales information in the fact table and the dimension tables products
and customers enforces the business rules in databases.
[A figure here illustrates these concepts: the sales fact table, with cust_id and prod_id foreign
key columns, is related to the customers dimension table (#cust_id, cust_last_name, cust_city,
cust_state_province, which form a hierarchy) and the products dimension table (#prod_id); times,
channels, and promotions are further dimension tables.]
This chapter describes the physical design of a data warehousing environment, and
includes the following topics:
■ Moving from Logical to Physical Design
■ Physical Design
See Also:
■ Chapter 5, "Parallelism and Partitioning in Data Warehouses"
for further information regarding partitioning
■ Oracle9i Database Concepts for further conceptual material
regarding all design matters
Physical Design
During the logical design phase, you defined a model for your data warehouse
consisting of entities, attributes, and relationships. The entities are linked together
using relationships. Attributes are used to describe the entities. The unique
identifier (UID) distinguishes between one instance of an entity and another.
Figure 3–1 offers you a graphical way of looking at the different ways of thinking
about logical and physical designs.
[Figure 3–1 maps logical design constructs to physical ones: entities become tables, relationships
become integrity constraints (primary key, foreign key, not null), attributes become columns, and
unique identifiers become unique or primary key constraints; materialized views and dimensions
appear only in the physical design.]
During the physical design process, you translate the expected schemas into actual
database structures. At this time, you have to map:
■ Entities to tables
■ Relationships to foreign key constraints
■ Attributes to columns
■ Primary unique identifiers to primary key constraints
■ Unique identifiers to unique key constraints
Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within
the operating system you are using. A datafile is associated with only one
tablespace. From a design perspective, tablespaces are containers for physical
design structures.
Separate tablespaces by differences in their contents and usage: for example, keep tables
apart from their indexes, and small tables apart from large tables.
Tablespaces should also represent logical business units if possible. Because a
tablespace is the coarsest granularity for backup and recovery or the transportable
tablespaces mechanism, the logical business design affects availability and
maintenance operations.
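As a sketch (the file name and size are hypothetical), a tablespace dedicated to fact data
might be created with its own datafile:
CREATE TABLESPACE sales_data
  DATAFILE '/u02/oradata/dw/sales_data01.dbf' SIZE 2000M;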
Views
A view is a tailored presentation of the data contained in one or more tables or
other views. A view takes the output of a query and treats it as a table. Views do not
require any space in the database.
Integrity Constraints
Integrity constraints are used to enforce business rules associated with your
database and to prevent having invalid information in the tables. Integrity
constraints in data warehousing differ from constraints in OLTP environments. In
OLTP environments, they primarily prevent the insertion of invalid data into a
record, which is not a big problem in data warehousing environments because
accuracy has already been guaranteed. In data warehousing environments,
constraints are only used for query rewrite. NOT NULL constraints are particularly
common in data warehouses. Under some specific circumstances, constraints need
space in the database. These constraints are in the form of the underlying unique
index.
Materialized Views
Materialized views are query results that have been stored in advance so
long-running calculations are not necessary when you actually execute your SQL
statements. From a physical design point of view, materialized views resemble
tables or partitioned tables and behave like indexes.
Dimensions
A dimension is a schema object that defines hierarchical relationships between
columns or column sets. A hierarchical relationship is a functional dependency
from one level of a hierarchy to the next one. A dimension is a container of logical
relationships and does not require any space in the database. A typical dimension is
city, state (or province), region, and country.
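For example, a dimension over a customers table could be declared as follows. This is a
sketch; the column names are assumptions and the hierarchy is kept to a single table for
brevity:
CREATE DIMENSION customers_dim
  LEVEL customer IS (customers.cust_id)
  LEVEL city     IS (customers.cust_city)
  LEVEL state    IS (customers.cust_state_province)
  HIERARCHY geog_rollup (
    customer CHILD OF
    city     CHILD OF
    state);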
This chapter explains some of the hardware and I/O issues in a data warehousing
environment and includes the following topics:
■ Overview of Hardware and I/O Considerations in Data Warehouses
■ RAID Configurations
[A figure here illustrates striping: tablespaces 1 through 5 are each striped across four devices,
with the devices spread over two controllers.]
See Also: Oracle9i Database Concepts for further details about disk
striping
You should stripe tablespaces for tables, indexes, rollback segments, and temporary
tablespaces. You must also spread the devices over controllers, I/O channels, and
internal buses. To make striping effective, you must make sure that enough
controllers and other I/O components are available to support the bandwidth of
parallel data movement into and out of the striped tablespaces.
You can use RAID systems or you can perform striping manually through careful
data file allocation to tablespaces.
The striping of data across physical drives has several consequences besides
balancing I/O. One additional advantage is that logical files can be created that are
larger than the maximum size usually supported by an operating system. There are
disadvantages however. Striping means that it is no longer possible to locate a
single datafile on a specific physical drive. This can cause the loss of some
application tuning capabilities. Also, it can cause database recovery to be more
time-consuming. If a single physical disk in a RAID array needs recovery, all the
disks that are part of that logical RAID device must be involved in the recovery.
Automatic Striping
Automatic striping is usually flexible and easy to manage. It supports many
scenarios such as multiple users running sequentially or as single users running in
parallel. Two main advantages make automatic striping preferable to manual
striping, unless the system is very small or availability is the main concern:
■ For parallel scan operations (such as full table scan or fast full scan), operating
system striping increases the number of disk seeks. Nevertheless, this is largely
offset by the large I/O size (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT),
which should enable this operation to reach the maximum I/O throughput for
your platform. This maximum is in general limited by the number of controllers
or I/O buses of the platform, not by the number of disks (unless you have a
small configuration or are using large disks).
■ For index probes (for example, within a nested loop join or parallel index range
scan), operating system striping enables you to avoid hot spots by evenly
distributing I/O across the disks.
Oracle Corporation recommends using a large stripe size of at least 64 KB. Stripe
size must be at least as large as the I/O size. If stripe size is larger than I/O size by a
factor of two or four, then trade-offs may arise. The large stripe size can be
advantageous because it lets the system perform more sequential operations on
each disk; it decreases the number of seeks on disk. Another advantage of large
stripe sizes is that more users can work on the system without affecting each other.
The disadvantage is that large stripes reduce the I/O parallelism, so fewer disks are
simultaneously active. If you encounter problems, increase the I/O size of scan
operations (for example, from 64 KB to 128 KB), instead of changing the stripe size.
The maximum I/O size is platform-specific (in a range, for example, of 64 KB to 1
MB).
With automatic striping, from a performance standpoint, the best layout is to stripe
data, indexes, and temporary tablespaces across all the disks of your platform. This
layout is also appropriate when you have little information about system usage. To
increase availability, it may be more practical to stripe over fewer disks to prevent the
failure of a single disk from affecting the entire data warehouse. However, for better
performance, it is crucial to stripe all objects over multiple disks. In this way,
maximum I/O performance (both in terms of throughput and in number of I/Os
per second) can be reached when one object is accessed by a parallel operation. If
multiple objects are accessed at the same time (as in a multiuser configuration),
striping automatically limits the contention.
Manual Striping
You can use manual striping on all platforms. To do this, add multiple files to each
tablespace, with each file on a separate disk. If you use manual striping correctly,
your system’s performance improves significantly. However, you should be aware
of several drawbacks that can adversely affect performance if you do not stripe
correctly.
When using manual striping, the degree of parallelism (DOP) is more a function of
the number of disks than of the number of CPUs. First, it is necessary to have one
server process for each datafile to drive all the disks and limit the risk of
experiencing I/O bottlenecks. Second, manual striping is very sensitive to datafile
size skew, which can affect the scalability of parallel scan operations. Third, manual
striping requires more planning and set-up effort than automatic striping.
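A sketch of manual striping (the disk paths are hypothetical) simply places each datafile of
a tablespace on a different disk:
CREATE TABLESPACE stripe_ts
  DATAFILE '/disk1/oradata/dw/stripe_ts01.dbf' SIZE 1000M,
           '/disk2/oradata/dw/stripe_ts02.dbf' SIZE 1000M,
           '/disk3/oradata/dw/stripe_ts03.dbf' SIZE 1000M,
           '/disk4/oradata/dw/stripe_ts04.dbf' SIZE 1000M;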
[A figure here illustrates local striping: Partition 1 is striped over its own disks (stripes 1 and
2), and Partition 2 is striped over a separate set of disks (stripes 3 and 4).]
Global striping, illustrated in Figure 4–3, entails overlapping disks and partitions.
[Figure 4–3 illustrates global striping: stripes 1 and 2 span the disks underlying both Partition 1
and Partition 2.]
Global striping is advantageous if you have partition pruning and need to access
data in only one partition. Spreading the data in that partition across many disks
improves performance for parallel execution operations. A disadvantage of global
striping is that if one disk fails, all partitions are affected if the disks are not
mirrored.
Analyzing Striping
Two considerations arise when analyzing striping issues for your applications. First,
consider the cardinality of the relationships among the objects in a storage system.
Second, consider what you can optimize in your striping effort: full table scans,
general tablespace availability, partition scans, or some combinations of these goals.
Cardinality and optimization are discussed in the following section.
[Figure 4–4 diagrams these cardinalities: one table has p partitions, s partitions map to one
tablespace, one tablespace has f files, and m files map to n devices.]
Figure 4–4 shows the cardinality of the relationships among objects in a typical
Oracle storage system. For every table there may be:
■ p partitions, shown in Figure 4–4 as a one-to-many relationship
■ s partitions for every tablespace, shown in Figure 4–4 as a many-to-one
relationship
■ f files for every tablespace, shown in Figure 4–4 as a one-to-many relationship
■ m files to n devices, shown in Figure 4–4 as a many-to-many relationship
Striping Goals
You can stripe an object across devices to achieve one of three goals:
■ Goal 1: To optimize full table scans, place a table on many devices.
■ Goal 2: To optimize availability, restrict the tablespace to a few devices.
■ Goal 3: To optimize partition scans, achieve intra-partition parallelism by
placing each partition on many devices.
To attain both Goals 1 and 2 (having the table reside on many devices, with the
highest possible availability), maximize the number of partitions p and minimize
the number of partitions for each tablespace s.
To maximize Goal 1 but with minimal intra-partition parallelism, place each
partition in its own tablespace, do not use striped files, and use one file for each
tablespace.
To minimize Goal 2 and thereby minimize availability, set f and n equal to 1. When
you minimize availability, you maximize intra-partition parallelism. Goal 3 conflicts
with Goal 2 because you cannot simultaneously maximize the formula for Goal 3
and minimize the formula for Goal 2. You must compromise to achieve some of the
benefits of both goals.
Striping Goal 1: Optimize Full Table Scans
Having a table reside on many devices ensures scalable full table scans.
To calculate the number of devices over which a table is spread, combine the cardinalities
shown in Figure 4–4: multiply the number of tablespaces used by the table (p divided by s) by
the number of files for each tablespace (f) and the number of devices over which each file is
striped (n).
You can achieve this by having t partitions, with every partition in its own tablespace,
every tablespace having one file, and these files not striped:
t x 1 x 1, up to t devices
If the table is not partitioned, but is in one tablespace in one file, stripe it over n
devices:
1 x 1 x n devices
There can be a maximum of t partitions, every partition in its own tablespace, f files in
each tablespace, each tablespace on a striped device:
t x f x n devices
Partitions can reside in a tablespace that can have many files. You can have either a
striped file or many files for each tablespace.
RAID Configurations
RAID systems, also called disk arrays, can be hardware- or software-based systems.
The difference between the two is how CPU processing of I/O requests is handled.
In software-based RAID systems, the operating system or an application level
handles the I/O request, while in hardware-based RAID systems, disk controllers
handle I/O requests. RAID usage is transparent to Oracle. All the features specific
to a given RAID configuration are handled by the operating system and Oracle does
not need to worry about them.
Primary logical database structures have different access patterns during read and
write operations. Therefore, different RAID implementations will be better suited
for these structures. The purpose of this chapter is to discuss some of the basic
decisions you must make when designing the physical layout of your data
warehouse implementation. It is not meant as a replacement for operating system
and storage documentation or a consultant’s analysis of your I/O requirements.
There are advantages and disadvantages to using RAID, and those depend on the
RAID level under consideration and the specific system in question. The most
common configurations in data warehouses are:
■ RAID 0 (Striping)
■ RAID 1 (Mirroring)
■ RAID 0+1 (Striping and Mirroring)
■ RAID 5
RAID 0 (Striping)
RAID 0 is a non-redundant disk array, so there will be data loss with any disk
failure. If something on the disk becomes corrupted, you cannot restore or
recalculate that data. RAID 0 provides the best write throughput performance
because it never updates redundant information. Read throughput is also quite
good, but you can improve it by combining RAID 0 with RAID 1.
Oracle does not recommend using RAID 0 systems without RAID 1 because the loss
of one disk in the array will affect the complete system and make it unavailable.
RAID 0 systems are used mainly in environments where performance and capacity
are the primary concerns rather than availability.
RAID 1 (Mirroring)
RAID 1 provides full data redundancy by complete mirroring of all files. If a disk
failure occurs, the mirrored copy is used to transparently service the request. RAID
1 mirroring requires twice as much disk space as there is data. In general, RAID 1 is
most useful for systems where complete redundancy of data is required and disk
space is not an issue. For large datafiles or systems with less disk space, RAID 1
may not be feasible, because it requires twice as much disk space as there is data.
Writes under RAID 1 are no faster and no slower than usual. Reading data can be
faster than on a single disk because the system can choose to read the data from the
disk that can respond faster.
Mirroring complements, but is not a substitute for, backups and log archives. Mirroring can
help your system recover from disk failures more quickly than using a backup, but mirroring
is not as robust.
Mirroring does not protect against software faults and other problems against
which an independent backup would protect your system.
You can effectively use mirroring if you are able to reload read-only data from the
original source tapes. If you have a disk failure, restoring data from backups can
involve lengthy downtime, whereas restoring from a mirrored disk enables your
system to get back online quickly or even stay online while the crashed disk is
replaced and resynchronized.
RAID 5
RAID 5 systems provide redundancy for the original data while storing parity
information as well. The parity information is striped over all disks in the system to
avoid a single disk as a bottleneck during write operations. The I/O throughput of
RAID 5 systems depends upon the implementation and the striping size. For a
typical RAID 5 system, the throughput is normally lower than RAID 0 + 1
configurations. In particular, the performance for high concurrent write operations
such as parallel load can be poor.
Many vendors use memory (as battery-backed cache) in front of the disks to
increase throughput and to become comparable to RAID 0+1. Contact your disk
array vendor for specific details.
Data warehouses often contain large tables and require techniques both for
managing these large tables and for providing good query performance across these
large tables. This chapter discusses two key methodologies for addressing these
needs: parallelism and partitioning.
These topics are discussed:
■ Overview of Parallel Execution
■ Granules of Parallelism
■ Partitioning Design Considerations
■ Miscellaneous Partition Operations
Granules of Parallelism
Different parallel operations use different types of parallelism. The optimal physical
database layout depends on the parallel operations that are most prevalent in your
application or even of the necessity of using partitions.
The basic unit of work in parallelism is called a granule. Oracle divides the
operation being parallelized (for example, a table scan, table update, or index
creation) into granules. Parallel execution processes execute the operation one
granule at a time. The number of granules and their size correlates with the degree
of parallelism (DOP). It also affects how well the work is balanced across query
server processes. There is no way you can enforce a specific granule strategy as
Oracle makes this decision internally.
Administrative operations (such as recovery or deleting portions of data) might influence
partition layout more than performance considerations.
Partition Granules
When Oracle uses partition granules, a query server process works on an entire
partition or subpartition of a table or index. Because partition granules are statically
determined by the structure of the table or index when a table or index is created,
partition granules do not give you the flexibility in parallelizing an operation that
block granules do. The maximum allowable DOP is the number of partitions. This
might limit the utilization of the system and the load balancing across parallel
execution servers.
When Oracle uses partition granules for parallel access to a table or index, you
should use a relatively large number of partitions (ideally, three times the DOP), so
that Oracle can effectively balance work across the query server processes.
Partition granules are the basic unit of parallel index range scans and of parallel
operations that modify multiple partitions of a partitioned table or index. These
operations include parallel creation of partitioned indexes, and parallel creation of
partitioned tables.
Types of Partitioning
This section describes the partitioning features that significantly enhance data
access and improve overall application performance. This is especially true for
applications that access tables and indexes with millions of rows and many
gigabytes of data.
Partitioning Methods
Oracle offers four partitioning methods:
■ Range Partitioning
■ Hash Partitioning
■ List Partitioning
■ Composite Partitioning
Each partitioning method has different advantages and design considerations.
Thus, each method is more appropriate for a particular situation.
Note: This table was created with the COMPRESS keyword, thus
all partitions inherit this attribute.
See Also: Oracle9i SQL Reference for partitioning syntax and the
Oracle9i Database Administrator’s Guide for more examples
List Partitioning List partitioning enables you to explicitly control how rows map to
partitions. You do this by specifying a list of discrete values for the partitioning
column in the description for each partition. This is different from range
partitioning, where a range of values is associated with a partition and with hash
partitioning, where you have no control of the row-to-partition mapping. The
advantage of list partitioning is that you can group and organize unordered and
unrelated sets of data in a natural way. The following example creates a list
partitioned table grouping states according to their sales regions:
CREATE TABLE sales_list
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_state VARCHAR2(20),
sales_amount NUMBER(10),
sales_date DATE)
PARTITION BY LIST(sales_state)
(
PARTITION sales_west VALUES('California', 'Hawaii') COMPRESS,
PARTITION sales_east VALUES('New York', 'Virginia', 'Florida'),
PARTITION sales_central VALUES('Texas', 'Illinois')
);
Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying
tables. You can create both local and global indexes on a table partitioned by range,
hash, or composite methods. Local indexes inherit the partitioning attributes of
their related tables. For example, if you create a local index on a composite table,
Oracle automatically partitions the local index using the composite method.
Oracle supports only range partitioning for global partitioned indexes. You cannot
partition global indexes using the hash or composite partitioning methods.
Range partitioning is also ideal when you periodically load new data and purge old
data. It is easy to add or drop partitions.
It is common to keep a rolling window of data, for example keeping the past 36
months’ worth of data online. Range partitioning simplifies this process. To add
data from a new month, you load it into a separate table, clean it, index it, and then
add it to the range-partitioned table using the EXCHANGE PARTITION statement, all
while the original table remains online. Once you add the new partition, you can
drop the trailing month with the DROP PARTITION statement. The alternative to
using the DROP PARTITION statement can be to archive the partition and make it
read only, but this works only when your partitions are in separate tablespaces.
In conclusion, consider using range partitioning when:
■ Very large tables are frequently scanned by a range predicate on a good
partitioning column, such as ORDER_DATE or PURCHASE_DATE. Partitioning
the table on that column enables partition pruning.
■ You want to maintain a rolling window of data.
■ You cannot complete administrative operations, such as backup and restore, on
large tables in an allotted time frame, but you can divide them into smaller
logical pieces based on the partition range column.
The following example creates the table sales for a period of two years, 1999 and
2000, and partitions it by range according to the column s_saledate to separate
the data into eight quarters, each corresponding to a partition.
CREATE TABLE sales
(s_productid NUMBER,
s_saledate DATE,
s_custid NUMBER,
s_totalprice NUMBER)
PARTITION BY RANGE(s_saledate)
(PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
PARTITION sal00q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
PARTITION sal00q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
PARTITION sal00q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
PARTITION sal00q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY')));
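The rolling window maintenance described earlier could then be sketched as follows; the
staging table name is hypothetical and must match the structure of sales:
ALTER TABLE sales ADD PARTITION sal01q1
  VALUES LESS THAN (TO_DATE('01-APR-2001', 'DD-MON-YYYY'));
ALTER TABLE sales EXCHANGE PARTITION sal01q1
  WITH TABLE sales_01q1_staging
  INCLUDING INDEXES WITHOUT VALIDATION;
ALTER TABLE sales DROP PARTITION sal99q1;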
When to Use Hash Partitioning The way Oracle distributes data in hash partitions does
not correspond to a business or a logical view of the data, as it does in range
partitioning. Consequently, hash partitioning is not an effective way to manage historical data.
If you add or merge a hashed partition, Oracle automatically rearranges the rows to
reflect the change in the number of partitions and subpartitions. The hash function
that Oracle uses is especially designed to limit the cost of this reorganization.
Instead of reshuffling all the rows in the table, Oracle uses an "add partition" logic
that splits one and only one of the existing hashed partitions. Conversely, Oracle
coalesces a partition by merging two existing hashed partitions.
Although the hash function’s use of "add partition" logic dramatically improves the
manageability of hash partitioned tables, it means that the hash function can cause a
skew if the number of partitions of a hash partitioned table, or the number of
subpartitions in each partition of a composite table, is not a power of two. In the
worst case, the largest partition can be twice the size of the smallest. So for optimal
performance, create a number of partitions and subpartitions for each partition that
is a power of two. For example, 2, 4, 8, 16, 32, 64, 128, and so on.
The following example creates four hashed partitions for the table sales_hash
using the column s_productid as the partition key:
CREATE TABLE sales_hash
(s_productid NUMBER,
s_saledate DATE,
s_custid NUMBER,
s_totalprice NUMBER)
PARTITION BY HASH(s_productid)
PARTITIONS 4;
Specify partition names if you want to choose the names of the partitions.
Otherwise, Oracle automatically generates internal names for the partitions. Also,
you can use the STORE IN clause to assign hash partitions to tablespaces in a
round-robin manner.
See Also: Oracle9i SQL Reference for partitioning syntax and the
Oracle9i Database Administrator’s Guide for more examples
When to Use List Partitioning You should use list partitioning when you want to
specifically map rows to partitions based on discrete values.
Unlike range and hash partitioning, multi-column partition keys are not supported
for list partitioning. If a table is partitioned by list, the partitioning key can only
consist of a single column of the table.
■ Are eligible for partition pruning and partition-wise joins on the range and hash
dimensions
Each hashed subpartition contains sales data for a single quarter ordered by
product code. The total number of subpartitions is 4x8 or 32.
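The statement that created sales_range_hash was not reproduced here; a sketch consistent
with this description (range on s_saledate, eight hash subpartitions in each of the four
quarterly partitions) follows:
CREATE TABLE sales_range_hash
  (s_productid  NUMBER,
   s_saledate   DATE,
   s_custid     NUMBER,
   s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid) SUBPARTITIONS 8
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));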
In this example, every partition has the same number of subpartitions. A sample
mapping for sal99q1 is illustrated in Table 5–1. Similar mappings exist for
sal99q2 through sal99q4.
See Also: Oracle9i SQL Reference for details regarding syntax and
restrictions
If you use the MOVE statement, the local indexes for partition sales_q1_1998
become unusable. You have to rebuild them afterward, as follows:
ALTER TABLE sales
MODIFY PARTITION sales_q1_1998 REBUILD UNUSABLE LOCAL INDEXES;
The following statement merges two existing partitions into a new, compressed
partition residing in a separate tablespace; the local bitmap indexes have to be
rebuilt afterward:
ALTER TABLE sales MERGE PARTITIONS sales_q1_1998, sales_q2_1998
INTO PARTITION sales_1_1998 TABLESPACE ts_arch_1_1998
COMPRESS UPDATE GLOBAL INDEXES;
Partition Pruning
Partition pruning is an essential performance feature for data warehouses. In
partition pruning, the cost-based optimizer analyzes FROM and WHERE clauses in
SQL statements to eliminate unneeded partitions when building the partition access
list. This enables Oracle to perform operations only on those partitions that are
relevant to the SQL statement. Oracle prunes partitions when you use range, LIKE,
equality, and IN-list predicates on the range or list partitioning columns, and when
you use equality and IN-list predicates on the hash partitioning columns.
Partition pruning dramatically reduces the amount of data retrieved from disk and
shortens the use of processing time, improving query performance and resource
utilization. If you partition the index and table on different columns (with a global,
partitioned index), partition pruning also eliminates index partitions even when the
partitions of the underlying table cannot be eliminated.
On composite partitioned objects, Oracle can prune at both the range partition level
and at the hash or list subpartition level using the relevant predicates. Refer to the
table sales_range_hash earlier, partitioned by range on the column s_saledate
and subpartitioned by hash on column s_productid, and consider
the following example:
SELECT * FROM sales_range_hash
WHERE s_saledate BETWEEN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')) AND
(TO_DATE('01-OCT-1999', 'DD-MON-YYYY')) AND s_productid = 1200;
Oracle uses the predicate on the partitioning columns to perform partition pruning
as follows:
■ When using range partitioning, Oracle accesses only partitions sal99q2 and
sal99q3.
■ When using hash subpartitioning, Oracle accesses only the one subpartition in
each partition that stores the rows with s_productid=1200. The mapping
between the subpartition and the predicate is calculated based on Oracle’s
internal hash distribution function.
Even if the date predicate uses the DD-MON-RR format, which is not the same format
as that used for the partition bounds, the optimizer can still prune properly.
If you execute an EXPLAIN PLAN statement on the query, the PARTITION_START
and PARTITION_STOP columns of the output table do not specify which partitions
Oracle is accessing. Instead, you see the keyword KEY for both columns. The
keyword KEY for both columns means that partition pruning occurs at run-time. It
can also affect the execution plan because the information about the pruned
partitions is missing compared to the same statement using the same TO_DATE
function as the partition table definition.
Partition-Wise Joins
Partition-wise joins reduce query response time by minimizing the amount of data
exchanged among parallel execution servers when joins execute in parallel. This
significantly reduces response time and improves the use of both CPU and memory
resources. In Oracle Real Application Clusters environments, partition-wise joins
also avoid or at least limit the data traffic over the interconnect, which is the key to
achieving good scalability for massive join operations.
Partition-wise joins can be full or partial. Oracle decides which type of join to use.
This large join is typical in data warehousing environments. The entire customer
table is joined with one quarter of the sales data. In large data warehouse
applications, this might mean joining millions of rows. The join method to use in
that case is obviously a hash join. You can reduce the processing time for this hash
join even more if both tables are equipartitioned on the customerid column. This
enables a full partition-wise join.
When you execute a full partition-wise join in parallel, the granule of parallelism, as
described under "Granules of Parallelism" on page 5-3, is a partition. As a result, the
degree of parallelism is limited to the number of partitions. For example, you
require at least 16 partitions to set the degree of parallelism of the query to 16.
You can use various partitioning methods to equipartition both tables on the
column customerid with 16 partitions. These methods are described in these
subsections.
Hash-Hash This is the simplest method: the customers and sales tables are both
partitioned by hash into 16 partitions, on the s_customerid and c_customerid
columns. This partitioning method enables full partition-wise join when the tables
are joined on s_customerid and c_customerid, both representing the same
customer identification number. Because you are using the same hash function to
distribute the same information (customer ID) into the same number of hash
partitions, you can join the equivalent partitions. They are storing the same values.
In serial, this join is performed between pairs of matching hash partitions, one at a
time. When one partition pair has been joined, the join of another partition pair
begins. The join completes when the 16 partition pairs have been processed.
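A sketch of such an equipartitioning follows; the column lists are abbreviated and the table
names are varied slightly so as not to conflict with the earlier examples:
CREATE TABLE customers_hh
  (c_customerid NUMBER,
   c_name       VARCHAR2(50))
PARTITION BY HASH (c_customerid) PARTITIONS 16;

CREATE TABLE sales_hh
  (s_customerid NUMBER,
   s_totalprice NUMBER)
PARTITION BY HASH (s_customerid) PARTITIONS 16;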
[Figure 5–1 illustrates this join executed in parallel: partitions P1 through P16 of sales are
joined with the matching partitions P1 through P16 of customers, with each pair handled by one
parallel execution server.]
In Figure 5–1, assume that the degree of parallelism and the number of partitions
are the same, in other words, 16 for both. Defining more partitions than the degree
of parallelism may improve load balancing and limit possible skew in the
execution. If you have more partitions than query servers, when one query server
completes the join of one pair of partitions, it requests that the query coordinator
give it another pair to join. This process repeats until all pairs have been processed.
This method enables the load to be balanced dynamically when the number of
partition pairs is greater than the degree of parallelism, for example, 64 partitions
with a degree of parallelism of 16.
[Figure 5–2 illustrates the composite case: sales is range partitioned on salesdate into quarters
1999-Q1 through 2000-Q4 and hash subpartitioned on customerid; hash partition #9 consists of the
corresponding hash subpartition from each of the eight quarters.]
■ The rules for data placement on MPP systems apply here. The only difference is
that a hash partition is now a collection of subpartitions. You must ensure that
all these subpartitions are placed on the same node as the matching hash
partition from the other table. For example, in Figure 5–2, store hash partition 9
of the sales table shown by the eight circled subpartitions, on the same node
as hash partition 9 of the customers table.
Range-Range and List-List You can also join range partitioned tables with range
partitioned tables and list partitioned tables with list partitioned tables in a
partition-wise manner, but this is relatively uncommon. This is more complex to
implement because you must know the distribution of the data before performing
the join. Furthermore, if you do not correctly identify the partition bounds so that
you have partitions of equal size, data skew during the execution may result.
The basic principle for using range-range and list-list is the same as for using
hash-hash: you must equipartition both tables. This means that the number of
partitions must be the same and the partition bounds must be identical. For
example, assume that you know in advance that you have 10 million customers,
and that the values for customerid vary from 1 to 10,000,000. In other words, you
have 10 million possible different values. To create 16 partitions, you can range
partition both tables, sales on s_customerid and customers on c_
customerid. You should define partition bounds for both tables in order to
generate partitions of the same size. In this example, partition bounds should be
defined as 625001, 1250001, 1875001, ... 10000001, so that each partition contains
625000 rows.
example, all rows in customers that could have matching rows in partition P1 of
sales are sent to query server 1 in the second set. Rows received by the second set
of query servers are joined with the rows from the corresponding partitions in
sales. Query server number 1 in the second set joins all customers rows that it
receives with partition P1 of sales.
[Figure 5–3 illustrates a partial partition-wise join: parallel execution server set 1 scans
customers and redistributes its rows by hash(c_customerid) to parallel execution server set 2,
which joins the rows it receives with the matching partitions of sales.]
Considerations for full partition-wise joins also apply to partial partition-wise joins:
■ The degree of parallelism does not need to equal the number of partitions. In
Figure 5–3, the query executes with two sets of 16 query servers. In this case,
Oracle assigns 1 partition to each query server of the second set. Again, the
number of partitions should always be a multiple of the degree of parallelism.
Composite As with full partition-wise joins, the prime partitioning method for the
sales table is to use the range method on column s_salesdate. This is because
sales is a typical example of a table that stores historical data. To enable a partial
partition-wise join while preserving this range partitioning, subpartition sales by
hash on column s_customerid using 16 subpartitions for each partition. Pruning
and partial partition-wise joins can be used together if a query joins customers
and sales and if the query has a selection predicate on s_salesdate.
When sales is composite, the granule of parallelism for a partial partition-wise
join is a hash partition and not a subpartition. Refer to Figure 5–2 for an illustration
of a hash partition in a composite table. Again, the number of hash partitions
should be a multiple of the degree of parallelism. Also, on an MPP system, ensure
that each hash partition has affinity to a single node. In the previous example, the
eight subpartitions composing a hash partition should have affinity to the same
node.
Range Finally, you can use range partitioning on s_customerid to enable a partial
partition-wise join. This works similarly to the hash method, but a side effect of
range partitioning is that the resulting data distribution could be skewed if the size
of the partitions differs. Moreover, this method is more complex to implement
because it requires prior knowledge of the values of the partitioning column that is
also a join key.
Reduction of Memory Requirements Partition-wise joins require less memory than the
equivalent join operation of the complete data set of the tables being joined.
In the case of serial joins, the join is performed on one pair of matching partitions at a
time. If data is evenly distributed across partitions, the memory requirement is divided by
the number of partitions and there is no skew.
In the parallel case, memory requirements depend on the number of partition pairs
that are joined in parallel. For example, if the degree of parallelism is 20 and the
number of partitions is 100, 5 times less memory is required because only 20 joins of
two partitions are performed at the same time. The fact that partition-wise joins
require less memory has a direct effect on performance. For example, the join
probably does not need to write blocks to disk during the build phase of a hash join.
Adding Partitions
Different types of partitions require slightly different syntax when being added.
Basic topics are:
■ Adding a Partition to a Range-Partitioned Table
■ Adding a Partition to a Hash-Partitioned Table
■ Adding a Partition to a List-Partitioned Table
Any value in the set of literal values that describe the partition being added must
not exist in any of the other partitions of the table.
You cannot add a partition to a list-partitioned table that has a default partition, but
you can split the default partition. By doing so, you effectively create a new
partition defined by the values that you specify, and a second partition that remains
the default partition.
Local and global indexes associated with the list-partitioned table remain usable.
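For example, using the sales_list table created earlier in this chapter (the new partition
name and its values are hypothetical and do not appear in any existing partition):
ALTER TABLE sales_list
  ADD PARTITION sales_southeast VALUES ('Georgia', 'Tennessee');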
Dropping Partitions
You can drop partitions from range, composite, list, or composite range-list
partitioned tables. For hash-partitioned tables, or hash subpartitions of range-hash
partitioned tables, you must perform a coalesce operation instead.
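The statements that this paragraph refers to were not reproduced here; a minimal sketch of
the pattern, with hypothetical partition and constraint names that follow the truncation
example later in this chapter, might be:
ALTER TABLE sales
  DISABLE CONSTRAINT dname_sales1;
ALTER TABLE sales DROP PARTITION dec98;
ALTER TABLE sales
  ENABLE CONSTRAINT dname_sales1;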
In this example, you disable the integrity constraints, issue the ALTER TABLE ...
DROP PARTITION statement, then enable the integrity constraints. This method is
most appropriate for large tables where the partition being dropped contains a
significant percentage of the total data in the table.
Exchanging Partitions
You can convert a partition (or subpartition) into a nonpartitioned table, and a
nonpartitioned table into a partition (or subpartition) of a partitioned table by
exchanging their data segments. You can also convert a hash-partitioned table into a
partition of a range-hash partitioned table, or convert the partition of the
range-hash partitioned table into a hash-partitioned table. Similarly, you can
convert a list-partitioned table into a partition of a range-list partitioned table, or
convert the partition of the range-list partitioned table into a list-partitioned table.
A typical example of exchanging into a nonpartitioned table follows. In this
example, table stocks can be range, hash, or list partitioned.
ALTER TABLE stocks
EXCHANGE PARTITION p3 WITH TABLE stock_table_3;
Moving Partitions
Use the MOVE PARTITION clause to move a partition. For example, to move the
most active partition to a tablespace that resides on its own disk (in order to balance
I/O) and to not log the action, issue the following statement:
ALTER TABLE parts MOVE PARTITION depot2
TABLESPACE ts094 NOLOGGING;
This statement always drops the partition’s old segment and creates a new segment,
even if you do not specify a new tablespace.
Merging Partitions
Use the ALTER TABLE ... MERGE PARTITIONS statement to merge the contents
of two partitions into one partition. The two original partitions are dropped, as are
any corresponding local indexes.
You cannot use this statement for a hash-partitioned table or for hash subpartitions
of a range-hash partitioned table.
The following statement merges two subpartitions of a table partitioned using
range-list method into a new subpartition located in tablespace tbs_west:
ALTER TABLE quarterly_regional_sales
MERGE SUBPARTITIONS q1_1999_northwest, q1_1999_southwest
INTO SUBPARTITION q1_1999_west
TABLESPACE tbs_west;
Truncating Partitions
Use the ALTER TABLE ... TRUNCATE PARTITION statement to remove all rows
from a table partition. Truncating a partition is similar to dropping a partition,
except that the partition is emptied of its data, but not physically dropped.
You cannot truncate an index partition. However, if there are local indexes defined
for the table, the ALTER TABLE TRUNCATE PARTITION statement truncates the
matching partition in each local index.
The following example illustrates a partition that contains data and has referential
integrity constraints:
ALTER TABLE sales
DISABLE CONSTRAINT dname_sales1;
ALTER TABLE sales TRUNCATE PARTITION dec94;
ALTER TABLE sales
ENABLE CONSTRAINT dname_sales1;
In this example, you disable the integrity constraints, issue the ALTER TABLE ...
TRUNCATE PARTITION statement, then re-enable the integrity constraints.
This method is most appropriate for large tables where the partition being
truncated contains a significant percentage of the total data in the table.
Coalescing Partitions
Coalescing partitions is a way of reducing the number of partitions in a
hash-partitioned table, or the number of subpartitions in a range-hash partitioned
table. When a hash partition is coalesced, its contents are redistributed into one or
more remaining partitions determined by the hash function. The specific partition
that is coalesced is selected by Oracle, and is dropped after its contents have been
redistributed.
The following statement illustrates a typical case of reducing by one the number of
partitions in a table:
ALTER TABLE ouu1
COALESCE PARTITION;
This chapter describes how to use indexes in a data warehousing environment and
discusses the following types of index:
■ Bitmap Indexes
■ B-tree Indexes
■ Local Indexes Versus Global Indexes
Bitmap Indexes
Bitmap indexes are widely used in data warehousing environments. The
environments typically have large amounts of data and ad hoc queries, but a low
level of concurrent DML transactions. For such applications, bitmap indexing
provides:
■ Reduced response time for large classes of ad hoc queries
■ Reduced storage requirements compared to other indexing techniques
■ Dramatic performance gains even on hardware with a relatively small number
of CPUs or a small amount of memory
■ Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively
expensive in terms of space because the indexes can be several times larger than the
data in the table. Bitmap indexes are typically only a fraction of the size of the
indexed data in the table.
Note: Bitmap indexes are available only if you have purchased the
Oracle9i Enterprise Edition. See Oracle9i Database New Features for
more information about the features available in Oracle9i and the
Oracle9i Enterprise Edition.
An index provides pointers to the rows in a table that contain a given key value. A
regular index stores a list of rowids for each key corresponding to the rows with
that key value. In a bitmap index, a bitmap for each key value replaces a list of
rowids.
Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means
that the row with the corresponding rowid contains the key value. A mapping
function converts the bit position to an actual rowid, so that the bitmap index
provides the same functionality as a regular index. If the number of different key
values is small, bitmap indexes save space.
Bitmap indexes are most effective for queries that contain multiple conditions in the
WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before
the table itself is accessed. This improves response time, often dramatically.
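For reference, a bitmap index is created with the CREATE BITMAP INDEX statement. A minimal
sketch on the customers table discussed later in this chapter (the index name is
hypothetical):
CREATE BITMAP INDEX customers_marital_bix
  ON customers (cust_marital_status);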
Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio
of the number of distinct values to the number of rows in the table is under 1%. We
refer to this ratio as the degree of cardinality. A gender column, which has only
two distinct values (male and female), is ideal for a bitmap index. However, data
warehouse administrators also build bitmap indexes on columns with higher
cardinalities.
For example, on a table with one million rows, a column with 10,000 distinct values
is a candidate for a bitmap index. A bitmap index on this column can outperform a
B-tree index, particularly when this column is often queried in conjunction with
other indexed columns. In fact, in a typical data warehouse environment, a bitmap
index can be considered for any non-unique column.
B-tree indexes are most effective for high-cardinality data: that is, for data with
many possible values, such as customer_name or phone_number. In a data
warehouse, B-tree indexes should be used only for unique columns or other
columns with very high cardinalities (that is, columns that are almost unique). The
majority of indexes in a data warehouse should be bitmap indexes.
In ad hoc queries and similar situations, bitmap indexes can dramatically improve
query performance. AND and OR conditions in the WHERE clause of a query can be
resolved quickly by performing the corresponding Boolean operations directly on
the bitmaps before converting the resulting bitmap to rowids. If the resulting
number of rows is small, the query can be answered quickly without resorting to a
full table scan.
Each entry (or bit) in the bitmap corresponds to a single row of the customers
table. The value of each bit depends upon the values of the corresponding row in
the table. For instance, the bitmap cust_gender='F' contains a one as its first bit
because the gender is F in the first row of the customers table. The bitmap
cust_gender='F' has a zero for its third bit because the gender of the third row
is not F.
An analyst investigating demographic trends of the company's customers might
ask, "How many of our married customers have an income level of G or H?" This
corresponds to the following SQL query:
SELECT COUNT(*) FROM customers
WHERE cust_marital_status = 'married'
AND cust_income_level IN ('H: 150,000 - 169,999', 'G: 130,000 - 149,999');
Bitmap indexes can efficiently process this query by merely counting the number of
ones in the bitmap illustrated in Figure 6–1. The result set will be found by using
bitmap or merge operations without the necessity of a conversion to rowids. To
identify additional specific customer attributes that satisfy the criteria, use the
resulting bitmap to access the table after a bitmap to rowid conversion.
[Figure 6–1 illustrates the bitmap operations for this query: the cust_marital_status = 'married'
bitmap is ANDed with the result of ORing the two cust_income_level bitmaps, producing the bitmap
of qualifying rows.]
This query uses a bitmap index on cust_marital_status. Note that this query
would not be able to use a B-tree index.
SELECT COUNT(*) FROM employees;
Any bitmap index can be used for this query because all table rows are indexed,
including those that have NULL data. If nulls were not indexed, the optimizer would
be able to use indexes only on columns with NOT NULL constraints.
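The statement that created the bitmap join index used in the next query was not reproduced
here; it would take roughly the following form (the index name is hypothetical, and it
assumes that cust_id is the primary key of customers, which bitmap join indexes require):
CREATE BITMAP INDEX sales_cust_gender_bjix
  ON sales (customers.cust_gender)
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id;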
The following query shows how to use this bitmap join index and illustrates its
bitmap pattern:
SELECT sales.time_id, customers.cust_gender, sales.amount
FROM sales, customers
WHERE sales.cust_id = customers.cust_id;
TIME_ID C AMOUNT
--------- - ----------
01-JAN-98 M 2291
01-JAN-98 F 114
01-JAN-98 M 553
01-JAN-98 M 0
01-JAN-98 M 195
01-JAN-98 M 280
01-JAN-98 M 32
...
You can create other bitmap join indexes using more than one column or more than
one table, as shown in these examples.
B-tree Indexes
A B-tree index is organized like an upside-down tree. The bottom level of the index
holds the actual data values and pointers to the corresponding rows, much as the
index in a book has a page number associated with each index entry.
In general, use B-tree indexes when you know that your typical query refers to the
indexed column and retrieves a few rows. In these queries, it is faster to find the
rows by looking at the index. However, using the book index analogy, if you plan to
look at every single topic in a book, you might not want to look in the index for the
topic and then look up the page. It might be faster to read through every chapter in
the book. Similarly, if you are retrieving most of the rows in a table, it might not
make sense to look up the index to find the table rows. Instead, you might want to
read or scan the table.
B-tree indexes are most commonly used in a data warehouse to index unique or
near-unique keys. In many cases, it may not be necessary to index these columns in
a data warehouse, because unique constraints can be maintained without an index,
and because typical data warehouse queries may not work better with such indexes.
Bitmap indexes should be more common than B-tree indexes in most data
warehouse environments.
Many significant constraint features have been introduced for data warehousing.
Readers familiar with Oracle's constraint functionality in Oracle7 and Oracle8
should take special note of the functionality described in this chapter. In fact, many
Oracle7-based and Oracle8-based data warehouses lacked constraints because of
concerns about constraint performance. Newer constraint functionality addresses
these concerns.
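The statement this discussion refers to was not reproduced here; it would be of this
general form (a sketch):
ALTER TABLE sales ADD CONSTRAINT sales_unique
  UNIQUE (sales_id);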
By default, this constraint is both enabled and validated. Oracle implicitly creates a
unique index on sales_id to support this constraint. However, this index can be
problematic in a data warehouse for three reasons:
■ The unique index can be very large, because the sales table can easily have
millions or even billions of rows.
■ The unique index is rarely used for query execution. Most data warehousing
queries do not have predicates on unique keys, so creating this index will
probably not improve performance.
■ If sales is partitioned along a column other than sales_id, the unique index
must be global. This can detrimentally affect all maintenance operations on the
sales table.
A unique index is required for unique constraints to ensure that each individual
row modified in the sales table satisfies the UNIQUE constraint.
For data warehousing tables, an alternative mechanism for unique constraints is
illustrated in the following statement:
ALTER TABLE sales ADD CONSTRAINT sales_unique
UNIQUE (sales_id) DISABLE VALIDATE;
This statement creates a unique constraint, but, because the constraint is disabled, a
unique index is not required. This approach can be advantageous for many data
warehousing environments because the constraint now ensures uniqueness without
the cost of a unique index.
However, there are trade-offs for the data warehouse administrator to consider with
DISABLE VALIDATE constraints. Because this constraint is disabled, no DML
statements that modify the unique column are permitted against the sales table.
You can use one of two strategies for modifying this table in the presence of a
constraint:
■ Use DDL to add data to this table (such as exchanging partitions). See the
example in Chapter 14, "Maintaining the Data Warehouse".
■ Before modifying this table, drop the constraint. Then, make all necessary data
modifications. Finally, re-create the disabled constraint. Re-creating the
constraint is more efficient than re-creating an enabled constraint. However, this
approach does not guarantee that data added to the sales table while the
constraint has been dropped is unique.
However, in some situations, you may choose to use a different state for the
FOREIGN KEY constraints, in particular, the ENABLE NOVALIDATE state. A data
warehouse administrator might use an ENABLE NOVALIDATE constraint when
either:
■ The tables contain data that currently disobeys the constraint, but the data
warehouse administrator wishes to create a constraint for future enforcement.
■ An enforced constraint is required immediately.
Suppose that the data warehouse loaded new data into the fact tables every day, but
refreshed the dimension tables only on the weekend. During the week, the
dimension tables and fact tables may in fact disobey the FOREIGN KEY constraints.
Nevertheless, the data warehouse administrator might wish to maintain the
enforcement of this constraint to prevent any changes that might affect the
FOREIGN KEY constraint outside of the ETL process. Thus, you can create the
FOREIGN KEY constraints every night, after performing the ETL process, as shown
here:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (sales_time_id) REFERENCES time (time_id)
ENABLE NOVALIDATE;
ENABLE NOVALIDATE can quickly create an enforced constraint, even when the
constraint is believed to be true. Suppose that the ETL process verifies that a
FOREIGN KEY constraint is true. Rather than have the database re-verify this
FOREIGN KEY constraint, which would require time and database resources, the
data warehouse administrator could instead create a FOREIGN KEY constraint using
ENABLE NOVALIDATE.
RELY Constraints
The ETL process commonly verifies that certain constraints are true. For example, it
can validate all of the foreign keys in the data coming into the fact table. This means
that you can trust it to provide clean data, instead of implementing constraints in
the data warehouse. You create a RELY constraint as follows:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (sales_time_id) REFERENCES time (time_id)
RELY DISABLE NOVALIDATE;
RELY constraints, even though they are not used for data validation, can:
■ Enable more sophisticated query rewrites for materialized views. See
Chapter 22, "Query Rewrite" for further details.
■ Enable other data warehousing tools to retrieve information regarding
constraints directly from the Oracle data dictionary.
Creating a RELY constraint is inexpensive and does not impose any overhead
during DML or load. Because the constraint is not being validated, no data
processing is necessary to create it.
View Constraints
You can create constraints on views. The only type of constraint supported on a
view is a RELY constraint.
This type of constraint is useful when queries typically access views instead of base
tables, and the DBA thus needs to define the data relationships between views
rather than tables. View constraints are particularly useful in OLAP environments,
where they may enable more sophisticated rewrites for materialized views.
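A sketch of such a view constraint follows; the view name and column are assumptions rather
than an example from this guide:
ALTER VIEW sales_view ADD CONSTRAINT sales_view_uk
  UNIQUE (sales_id) RELY DISABLE NOVALIDATE;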
This chapter introduces you to the use of materialized views and discusses:
■ Overview of Data Warehousing with Materialized Views
■ Types of Materialized Views
■ Creating Materialized Views
■ Registering Existing Materialized Views
■ Partitioning and Materialized Views
■ Materialized Views in OLAP Environments
■ Choosing Indexes for Materialized Views
■ Invalidating Materialized Views
■ Security Issues with Materialized Views
■ Altering Materialized Views
■ Dropping Materialized Views
■ Analyzing Materialized View Capabilities
Materialized views are often referred to as summaries, because they store summarized data. They can
also be used to precompute joins with or without aggregations. A materialized view
eliminates the overhead associated with expensive joins and aggregations for a
large or important class of queries.
Figure: query rewrite — Oracle9i generates the execution plan and strategy and returns the query results.
When using query rewrite, create materialized views that satisfy the largest number
of queries. For example, if you identify 20 queries that are commonly applied to the
detail or fact tables, then you might be able to satisfy them with five or six
well-written materialized views. A materialized view definition can include any
number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG,
VARIANCE, STDDEV, MIN, and MAX). It can also include any number of joins. If you
are unsure of which materialized views to create, Oracle provides a set of advisory
procedures in the DBMS_OLAP package to help in designing and evaluating
materialized views for query rewrite. These functions are also known as the
Summary Advisor or the Advisor. Note that the OLAP Summary Advisor is
different. See Oracle9i OLAP User’s Guide for further details regarding the OLAP
Summary Advisor.
Many large decision support system (DSS) databases have schemas that do not
closely resemble a conventional data warehouse schema, but that still require joins
and aggregates. The use of summary management features imposes no schema
restrictions, and can enable some existing DSS database applications to improve
performance without the need to redesign the database or the application.
Figure 8–2 illustrates the use of summary management in the warehousing cycle.
After the data has been transformed, staged, and loaded into the detail data in the
warehouse, you can invoke the summary management process. First, use the
Advisor to plan how you will use summaries. Then, create summaries and design
how queries will be rewritten.
Figure 8–2: summary management in the warehousing cycle — data is extracted from operational databases into a staging file, incrementally transformed, and loaded as detail data in the data warehouse; summary management, guided by workload statistics, maintains summaries that support query rewrite and feeds data marts and MDDBs through incremental extract, load, and refresh programs.
Understanding the summary management process during the earliest stages of data
warehouse design can yield large dividends later in the form of higher
performance, lower summary administration costs, and reduced storage
requirements.
■ Fact tables describe the business transactions of an enterprise. Fact tables are
sometimes called detail tables.
The vast majority of data in a data warehouse is stored in a few very large fact
tables that are updated periodically with data from one or more operational
OLTP databases.
Fact tables include facts (also called measures) such as sales, units, and
inventory.
– A simple measure is a numeric or character column of one table such as
fact.sales.
– A computed measure is an expression involving measures of one table, for
example, fact.revenues - fact.expenses.
– A multitable measure is a computed measure defined on multiple tables,
for example, fact_a.revenues - fact_b.expenses.
Fact tables also contain one or more foreign keys that organize the business
transactions by the relevant business entities such as time, product, and market.
In most cases, these foreign keys are non-null, form a unique compound key of
the fact table, and each foreign key joins with exactly one row of a dimension
table.
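As a simplified sketch, loosely following the sh sample schema and abbreviating the column list, such a fact table might be declared as follows:

CREATE TABLE sales
( prod_id        NUMBER NOT NULL,   -- joins to the products dimension
  cust_id        NUMBER NOT NULL,   -- joins to the customers dimension
  time_id        DATE   NOT NULL,   -- joins to the times dimension
  quantity_sold  NUMBER NOT NULL,   -- simple measure
  amount_sold    NUMBER NOT NULL,   -- simple measure
  CONSTRAINT sales_prod_fk FOREIGN KEY (prod_id) REFERENCES products (prod_id),
  CONSTRAINT sales_cust_fk FOREIGN KEY (cust_id) REFERENCES customers (cust_id),
  CONSTRAINT sales_time_fk FOREIGN KEY (time_id) REFERENCES times (time_id));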
Guideline 6 (Dimensions): After each load and before refreshing your materialized
view, use the VALIDATE_DIMENSION procedure of the DBMS_OLAP package to
incrementally verify dimensional integrity.
Guideline 7 (Time Dimensions): If a time dimension appears in the materialized view
as a time column, partition and index the materialized view in the same manner as
you have the fact tables.
If you are concerned with the time required to enable constraints and whether any
constraints might be violated, use the ENABLE NOVALIDATE with the RELY clause
to turn on constraint checking without validating any of the existing constraints.
The risk with this approach is that incorrect query results could occur if any
constraints are broken. Therefore, as the designer, you must determine how clean
the data is and whether the risk of wrong results is too great.
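For example, the sales_time_fk constraint created earlier could be placed in this state as follows:

ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk
RELY ENABLE NOVALIDATE;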
Loading Data
A popular and efficient way to load data into a warehouse or data mart is to use
SQL*Loader with the DIRECT or PARALLEL option or to use another loader tool
that uses the Oracle direct-path API.
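For example, a direct-path, parallel SQL*Loader invocation might look like the following sketch; the control and data file names are illustrative:

sqlldr userid=sh/sh control=sales.ctl data=sales1.dat direct=true parallel=true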
load time are minimized. The DML that may be required after one-phase loading
causes multitable aggregate materialized views to become unusable in the safest
rewrite integrity level.
In a two-phase loading process:
■ Data is first loaded into a temporary table in the warehouse.
■ Quality assurance procedures are applied to the data.
■ Referential integrity constraints on the target table are disabled, and the local
index in the target partition is marked unusable.
■ The data is copied from the temporary area into the appropriate partition of the
target table using INSERT AS SELECT with the PARALLEL or APPEND hint, as in
the sketch following this list.
■ The temporary table is dropped.
■ The constraints are enabled, usually with the NOVALIDATE option.
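The following sketch shows that copy step, using a hypothetical staging table (sales_temp_jan) and partition (sales_q1_2000):

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(sales, 4) */ INTO sales PARTITION (sales_q1_2000)
SELECT * FROM sales_temp_jan;

COMMIT;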
Immediately after loading the detail data and updating the indexes on the detail
data, the database can be opened for operation, if desired. You can disable query
rewrite at the system level by issuing an ALTER SYSTEM SET QUERY_REWRITE_
ENABLED = false statement until all the materialized views are refreshed.
If QUERY_REWRITE_INTEGRITY is set to stale_tolerated, access to the
materialized view can be allowed at the session level to any users who do not
require the materialized views to reflect the data from the latest load by issuing an
ALTER SESSION SET QUERY_REWRITE_ENABLED = true statement. This
scenario does not apply when QUERY_REWRITE_INTEGRITY is either enforced
or trusted because the system ensures in these modes that only materialized
views with updated data participate in a query rewrite.
Views, inline views (subqueries in the FROM clause of a SELECT statement),
subqueries, and materialized views can all be joined or referenced in the
SELECT clause.
The types of materialized views are:
■ Materialized Views with Aggregates
■ Materialized Views Containing Only Joins
■ Nested Materialized Views
Fast refresh for a materialized view containing joins and aggregates is possible after
any type of DML to the base tables (direct load or conventional INSERT, UPDATE, or
DELETE). It can be defined to be refreshed ON COMMIT or ON DEMAND. A REFRESH
ON COMMIT materialized view is refreshed automatically when a transaction
that performs DML on one of the materialized view's detail tables commits. The time
taken to complete the commit may be slightly longer than usual when this method
is chosen. This is because the refresh operation is performed as part of the commit
process. Therefore, this method may not be suitable if many users are concurrently
changing the tables upon which the materialized view is based.
Here are some examples of materialized views with aggregates. Note that
materialized view logs are created only because these materialized views will be
fast refreshed.
Example 8–3 creates a materialized view that contains aggregates on a single table.
Because the materialized view log has been created, the materialized view is fast
refreshable. If DML is applied against the sales table, then the changes will be
reflected in the materialized view when the commit is issued.
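A minimal sketch of such a materialized view and its log, using the sh sales table, is shown here; the materialized view and log definitions are illustrative:

CREATE MATERIALIZED VIEW LOG ON sales
  WITH SEQUENCE, ROWID (prod_id, quantity_sold, amount_sold)
  INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW prod_sales_sum_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
  AS
  SELECT prod_id, COUNT(*) AS cnt_all,
         SUM(amount_sold) AS sum_amount, COUNT(amount_sold) AS cnt_amount
  FROM sales
  GROUP BY prod_id;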
Note that COUNT(*) must always be present. Oracle recommends that you also include
the optional aggregates (for example, a COUNT(expr) for each SUM(expr)) in the
materialized view in order to obtain the most efficient and accurate fast refresh of
the aggregates.
■ If there are outer joins, unique constraints must exist on the join columns of the
inner table. For example, if you are joining the fact table and a dimension table
and the join is an outer join with the fact table being the outer table, there must
exist unique constraints on the join columns of the dimension table.
If some of these restrictions are not met, you can create the materialized view as
REFRESH FORCE to take advantage of fast refresh when it is possible. If one of the
tables did not meet all of the criteria, but the other tables did, the materialized view
would still be fast refreshable with respect to the other tables for which all the
criteria are met.
A materialized view log should contain the rowid of the master table. It is not
necessary to add other columns.
To speed up refresh, you should create indexes on the materialized view's columns
that store the rowids of the fact table.
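For instance, for the detail_sales_mv materialized view shown next, which stores the sales rowids in a column named sales_rid, such an index might be created as follows (the index name is illustrative):

CREATE INDEX detail_sales_mv_srid_ix ON detail_sales_mv ("sales_rid");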
Alternatively, if the previous example did not include the columns times_rid and
customers_rid, and if the refresh method was REFRESH FORCE, then this
materialized view would be fast refreshable only if the sales table was updated but
not if the tables times or customers were updated.
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE
AS
SELECT
s.rowid "sales_rid",
c.cust_id, c.cust_last_name, s.amount_sold,
s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND
s.time_id = t.time_id(+);
All the underlying objects (materialized views or tables) on which the materialized
view is defined must have a materialized view log. All the underlying objects are
treated as if they were tables. All the existing options for materialized views can be
used, with the exception of ON COMMIT REFRESH, which is not supported for
nested materialized views that contain joins and aggregates.
Using the tables and their columns from the sh sample schema, the following
materialized views illustrate how nested materialized views can be created.
/* create the materialized view logs */
CREATE MATERIALIZED VIEW LOG ON sales
WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers
WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON times
WITH ROWID;
/* create the single-table aggregate materialized view on join_sales_cust_time */
CREATE MATERIALIZED VIEW sum_sales_cust_time
REFRESH FAST ON COMMIT
AS
SELECT cust_last_name, day_number_in_week, COUNT(*) cnt_all,
       SUM(amount_sold) sum_sales, COUNT(amount_sold) cnt_sales
FROM join_sales_cust_time
GROUP BY cust_last_name, day_number_in_week;
In this example, the materialized views sum_sales_cust_time and
join_sales_cust_time_prod are both nested on the materialized view
join_sales_cust_time.
If you already have a naming convention for tables and indexes, you might consider
extending this naming scheme to the materialized views so that they are easily
identifiable. For example, instead of naming the materialized view sum_of_sales,
it could be called sum_of_sales_mv to denote that this is a materialized view and
not a table or view.
Build Methods
Two build methods are available for creating the materialized view, as shown in
Table 8–2. If you select BUILD IMMEDIATE, the materialized view definition is
added to the schema objects in the data dictionary, and then the fact or detail tables
are scanned according to the SELECT expression and the results are stored in the
materialized view. Depending on the size of the tables to be scanned, this build
process can take a considerable amount of time.
An alternative approach is to use the BUILD DEFERRED clause, which creates the
materialized view without data, thereby enabling it to be populated at a later date
using a complete refresh.
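For example, a materialized view could be defined with BUILD DEFERRED and populated later with a complete refresh through DBMS_MVIEW.REFRESH; the names here are illustrative:

CREATE MATERIALIZED VIEW cust_sales_sum_mv
  BUILD DEFERRED
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
  AS
  SELECT cust_id, SUM(amount_sold) AS sum_amount
  FROM sales
  GROUP BY cust_id;

-- populate the materialized view later with a complete refresh
EXECUTE DBMS_MVIEW.REFRESH('CUST_SALES_SUM_MV', 'C');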
■ The query cannot contain any references to RAW or LONG RAW datatypes or
object REFs.
■ If the defining query of the materialized view contains set operators (UNION,
MINUS, and so on), rewrite will use them for full text match rewrite only.
■ If the materialized view was registered as PREBUILT, the precision of the
columns must agree with the precision of the corresponding SELECT
expressions unless overridden by the WITH REDUCED PRECISION clause.
■ If the materialized view contains the same table more than once, it is possible to
do a general rewrite, provided the query has the same aliases for the duplicate
tables as the materialized view.
Refresh Options
When you define a materialized view, you can specify two refresh options: how to
refresh and what type of refresh. If unspecified, the defaults are assumed as ON
DEMAND and FORCE.
The two refresh execution modes are: ON COMMIT and ON DEMAND. Depending on
the materialized view you create, some of the options may not be available.
Table 8–3 describes the refresh modes.
When a materialized view is maintained using the ON COMMIT method, the time
required to complete the commit may be slightly longer than usual. This is because
the refresh operation is performed as part of the commit process. Therefore this
method may not be suitable if many users are concurrently changing the tables
upon which the materialized view is based.
If you anticipate performing insert, update or delete operations on tables referenced
by a materialized view concurrently with the refresh of that materialized view, and
that materialized view includes joins and aggregation, Oracle recommends you use
ON COMMIT fast refresh rather than ON DEMAND fast refresh.
If you think the materialized view did not refresh, check the alert log or trace file.
If a materialized view fails during refresh at COMMIT time, you must explicitly
invoke the refresh procedure using the DBMS_MVIEW package after addressing the
errors specified in the trace files. Until this is done, the materialized view will no
longer be refreshed automatically at commit time.
You can specify how you want your materialized views to be refreshed from the
detail tables by selecting one of four options: COMPLETE, FAST, FORCE, and NEVER.
Table 8–4 describes the refresh options.
Whether the fast refresh option is available depends upon the type of materialized
view. You can call the procedure DBMS_MVIEW.EXPLAIN_MVIEW to determine
whether fast refresh is possible.
■ Materialized views with named views or subqueries in the FROM clause can be
fast refreshed provided the views can be completely merged. For information
on which views will merge, refer to the Oracle9i Database Performance Tuning
Guide and Reference.
■ If there are no outer joins, you may have arbitrary selections and joins in the
WHERE clause.
■ Materialized aggregate views with outer joins are fast refreshable after
conventional DML and direct loads, provided only the outer table has been
modified. Also, unique constraints must exist on the join columns of the inner
join table. If there are outer joins, all the joins must be connected by ANDs and
must use the equality (=) operator.
■ For materialized views with CUBE, ROLLUP, Grouping Sets, or concatenation of
them, the following restrictions apply (a sketch follows this list):
■ The SELECT list should contain a grouping distinguisher, which can either be a
GROUPING_ID function on all GROUP BY expressions or GROUPING
functions, one for each GROUP BY expression. For example, if the GROUP BY
clause of the materialized view is "GROUP BY CUBE(a, b)", then the
SELECT list should contain either "GROUPING_ID(a, b)" or
"GROUPING(a) AND GROUPING(b)" for the materialized view to be fast
refreshable.
■ GROUP BY should not result in any duplicate groupings. For example,
"GROUP BY a, ROLLUP(a, b)" is not fast refreshable because it results
in duplicate groupings "(a), (a, b), AND (a)".
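The following sketch illustrates the first restriction: a hierarchical-cube materialized view whose SELECT list carries a GROUPING_ID grouping distinguisher. It assumes suitable materialized view logs exist on sales, times, and products, and uses the usual sh columns:

CREATE MATERIALIZED VIEW sales_cube_mv
  REFRESH FAST ON DEMAND
  AS
  SELECT t.calendar_year, p.prod_category,
         GROUPING_ID(t.calendar_year, p.prod_category) AS gid,
         COUNT(*) AS cnt_all,
         SUM(s.amount_sold) AS sum_amount, COUNT(s.amount_sold) AS cnt_amount
  FROM sales s, times t, products p
  WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
  GROUP BY CUBE(t.calendar_year, p.prod_category);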
ORDER BY Clause
An ORDER BY clause is allowed in the CREATE MATERIALIZED VIEW statement. It
is used only during the initial creation of the materialized view. It is not used
during a full refresh or a fast refresh.
To improve the performance of queries against large materialized views, store the
rows in the materialized view in the order specified in the ORDER BY clause. This
initial ordering provides physical clustering of the data. If indexes are built on the
columns by which the materialized view is ordered, accessing the rows of the
materialized view using the index often reduces the time for disk I/O due to the
physical clustering.
The ORDER BY clause is not considered part of the materialized view definition. As a
result, there is no difference in the manner in which Oracle detects the various types
of materialized views (for example, materialized join views with no aggregates). For
the same reason, query rewrite is not affected by the ORDER BY clause. This feature
is similar to the CREATE TABLE ... ORDER BY capability that exists in Oracle.
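For example, the following sketch clusters an illustrative materialized view by time and indexes the ordering column:

CREATE MATERIALIZED VIEW sales_by_time_mv
  BUILD IMMEDIATE
  AS
  SELECT time_id, prod_id, SUM(amount_sold) AS sum_amount
  FROM sales
  GROUP BY time_id, prod_id
  ORDER BY time_id;

CREATE INDEX sales_by_time_mv_ix ON sales_by_time_mv (time_id);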
■ For ON COMMIT, the mixed DML statements occur within the same transaction
because the refresh of the materialized view will occur upon commit of this
transaction.
■ For ON DEMAND, the mixed DML statements occur between refreshes. The
following example creates a materialized view log on the table sales that
includes the SEQUENCE keyword:
CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id,
quantity_sold, amount_sold)
INCLUDING NEW VALUES;
See Also: Chapter 22, "Query Rewrite" for details about integrity
levels
When you drop a materialized view that was created on a prebuilt table, the table
still exists—only the materialized view is dropped.
When a prebuilt table is registered as a materialized view and query rewrite is
desired, the parameter QUERY_REWRITE_INTEGRITY must be set to trusted or
stale_tolerated, because the database cannot verify that the data in the prebuilt
table matches the materialized view's defining query.
You could have compressed this table to save space. See "Storage And Data
Segment Compression" on page 8-23 for details regarding data segment
compression.
In some cases, user-defined materialized views are refreshed on a schedule that is
longer than the update cycle. For example, a monthly materialized view might be
updated only at the end of each month, and the materialized view values always
refer to complete time periods. Reports written directly against these materialized
views implicitly select only data that is not in the current (incomplete) time period.
If a user-defined materialized view already contains a time dimension:
■ It should be registered and then fast refreshed each update cycle.
■ You can create a view that selects the complete time period of interest.
■ The reports should be modified to refer to the view instead of referring directly
to the user-defined materialized view.
If the user-defined materialized view does not contain a time dimension, then:
■ Create a new materialized view that does include the time dimension (if
possible).
■ The view should aggregate over the time column in the new materialized view.
Partition Marker
In many cases, the advantages of PCT will be offset by this restriction for highly
aggregated materialized views. The DBMS_MVIEW.PMARKER function is designed to
significantly reduce the cardinality of the materialized view (see Example 8–7 on
page 8-37 for an example). The function returns a partition identifier that uniquely
identifies the partition for a specified row within a specified partition table. The
DBMS_MVIEW.PMARKER function is used instead of the partition key column in the
SELECT and GROUP BY clauses.
Unlike the general case of a PL/SQL function in a materialized view, use of the
DBMS_MVIEW.PMARKER does not prevent rewrite with that materialized view even
when the rewrite mode is QUERY_REWRITE_INTEGRITY=enforced.
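A sketch of this usage follows; the column choices are illustrative:

CREATE MATERIALIZED VIEW cust_sales_marker_mv
  BUILD IMMEDIATE
  REFRESH FAST ON DEMAND
  ENABLE QUERY REWRITE
  AS
  SELECT DBMS_MVIEW.PMARKER(s.rowid) AS s_marker,
         SUM(s.quantity_sold) AS sum_qty, SUM(s.amount_sold) AS sum_amount,
         COUNT(*) AS cnt_all, p.prod_name, t.calendar_month_name
  FROM sales s, products p, times t
  WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
  GROUP BY DBMS_MVIEW.PMARKER(s.rowid), p.prod_name, t.calendar_month_name;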
cust_mth_sales_mv includes the partition key column from table sales (time_
id) in both its SELECT and GROUP BY lists. This enables PCT on table sales for
materialized view cust_mth_sales_mv. However, the GROUP BY and SELECT
lists include PRODUCTS.PROD_ID rather than the partition key column (PROD_
CATEGORY) of the products table. Therefore, PCT is not enabled on table
products for this materialized view. In other words, any partition maintenance
operation to the sales table will allow a PCT fast refresh of cust_mth_sales_mv.
However, PCT fast refresh is not possible after any kind of modification to the
products table. To correct this, the GROUP BY and SELECT lists must include
column PRODUCTS.PROD_CATEGORY. Following a partition maintenance
operation, such as a drop partition, a PCT fast refresh should be performed on any
materialized view that is referencing the table upon which the partition operations
are undertaken.
This would generally be significantly less than the cardinality impact of including
the respective partition key columns.
A subsequent INSERT statement adds a new row to the sales_part3 partition of
table sales. At this point, because cust_mth_sales_mv and prod_yr_sales_
mv have partition change tracking available on table sales, Oracle can determine
that those rows in these materialized views corresponding to sales_part3 are
stale, while all other rows in these materialized views are unchanged in their
freshness state. An INSERT INTO products statement is not tracked for
materialized view cust_mth_sales_mv. Therefore, cust_mth_sales_mv
becomes completely stale when the products table is modified in this way.
In this example, the table part_sales_tab has been partitioned over three
months and then the materialized view was registered to use the prebuilt table. This
materialized view is eligible for query rewrite because the ENABLE QUERY
REWRITE clause has been included.
OLAP Cubes
While data warehouse environments typically view data in the form of a star
schema, OLAP environments view data in the form of a hierarchical cube. A
hierarchical cube includes both detail data and aggregated data: it is a data set
where the data is aggregated along the rollup hierarchy of each of its dimensions
and these aggregations are combined across dimensions. It includes the typical set
of aggregations needed for business intelligence queries.
Note that as you increase the number of dimensions and levels, the number of
groups to calculate increases dramatically. This example involves 16 groups, but if
you were to add just two more dimensions with the same number of levels, you
would have 4 x 4 x 4 x 4 = 256 different groups. Also, consider that a similar
increase in groups occurs if you have multiple hierarchies in your dimensions. For
example, the time dimension might have an additional hierarchy of fiscal month
rolling up to fiscal quarter and then fiscal year. Handling the explosion of groups
has historically been the major challenge in data storage for OLAP systems.
Typical OLAP queries slice and dice different parts of the cube, comparing
aggregations from one level to aggregations from another level. For instance, a query
might find sales of the grocery division for the month of January, 2002 and compare
them with total sales of the grocery division for all of 2001.
You can use one of Oracle’s new extensions to the GROUP BY clause, concatenated
grouping sets, to generate the aggregates needed for a hierarchical cube of data. By
using concatenated rollup (rolling up along the hierarchy of each dimension and
then concatenating them across multiple dimensions), you can generate all the
aggregations needed by a hierarchical cube. These extensions are discussed in detail
in Chapter 18, "SQL for Aggregation in Data Warehouses".
This concatenated rollup takes the ROLLUP aggregations listed in the table of the
prior section and performs a cross-product on them. The cross-product creates
the 16 (4x4) aggregate groups needed for a hierarchical cube of the data.
The inner hierarchical cube specified defines a simple cube, with two dimensions
and four levels in each dimension. It would generate 16 groups (4 Time levels * 4
Product levels). The GROUPING_ID function in the query identifies the specific
group each row belongs to, based on the aggregation level of the grouping-columns
in its argument.
The outer query applies the constraints needed for our specific query, limiting
Division to a value of 25 and Month to a value of 200201 (representing January 2002
in this case). In conceptual terms, it slices a small chunk of data from the cube. The
outer query's constraint on the GID column, indicated in the query by
gid-for-division-month, would be the value of a key indicating that the data is
grouped as a combination of division and month. The GID constraint selects only
those rows that are aggregated at the level of a GROUP BY month, division clause.
Oracle removes unneeded aggregation groups from query processing based on the
outer query conditions. The outer conditions of the previous query limit the result
set to a single group aggregating division and month. Any other groups
involving year, month, brand, and item are unnecessary here. The group pruning
optimization recognizes this and transforms the query into:
SELECT month, division, sum_sales
FROM
(SELECT null, null, month, division,
null, null, SUM(sales) sum_sales,
GROUPING_ID(grouping-columns) gid
FROM sales, products, time
WHERE join-condition
GROUP BY
month, division)
WHERE division = 25
AND month = 200201
AND gid = gid-for-Division-Month;
The key changes are in the inner query, which now has a simple
GROUP BY clause of month, division. The columns year, quarter, brand and
item have been converted to null to match the simplified GROUP BY clause. Because
the query now requests just one group, fifteen out of sixteen groups are removed
from the processing, greatly reducing the work. For a cube with more dimensions
and more levels, the savings possible through group pruning can be far greater.
Note that the group pruning transformation works with all the GROUP BY
extensions: ROLLUP, CUBE, and GROUPING SETS.
While the Oracle optimizer has simplified the previous query to a simple GROUP BY,
faster response times can be achieved if the group is precomputed and stored in a
materialized view. Because OLAP queries can ask for any slice of the cube, many
groups may need to be precomputed and stored in a materialized view. This is
discussed in the next section.
See Also: Oracle9i SQL Reference for data compression syntax and
restrictions and "Storage And Data Segment Compression" on
page 8-23 for details regarding compression
Fast refresh of a materialized view containing the UNION ALL operator requires that
each query block include a constant UNION ALL marker column in its SELECT list,
with a distinct value in each query block, which, in the following example, is
columns 1 marker and 2 marker.
See "Restrictions on Fast Refresh on Materialized Views With the UNION ALL
Operator" on page 8-29 for detailed restrictions on fast refresh for materialized
views with UNION ALL.
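A sketch of a fast-refreshable UNION ALL materialized view with such marker columns follows. It assumes a materialized view log with ROWID already exists on sales, and the split predicate is illustrative:

CREATE MATERIALIZED VIEW sales_split_mv
BUILD IMMEDIATE
REFRESH FAST ON DEMAND
AS
SELECT 1 marker, s.rowid rid, s.cust_id, s.amount_sold
FROM sales s
WHERE s.amount_sold > 100
UNION ALL
SELECT 2 marker, s.rowid rid, s.cust_id, s.amount_sold
FROM sales s
WHERE s.amount_sold <= 100;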
In the case of materialized views containing only joins using fast refresh, Oracle
recommends that indexes be created on the columns that contain the rowids to
improve the performance of the refresh operation.
If a materialized view using aggregates is fast refreshable, then an index is
automatically created unless USING NO INDEX is specified in the CREATE
MATERIALIZED VIEW statement.
The state of a materialized view can be checked by querying the data dictionary
views USER_MVIEWS or ALL_MVIEWS. The column STALENESS will show one of
the values FRESH, STALE, UNUSABLE, UNKNOWN, or UNDEFINED to indicate whether
the materialized view can be used. The state is maintained automatically, but it can
be manually updated by issuing an ALTER MATERIALIZED VIEW name
COMPILE statement.
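For example, to check and then recompile the detail_sales_mv materialized view created earlier:

SELECT mview_name, staleness, compile_state
FROM   user_mviews
WHERE  mview_name = 'DETAIL_SALES_MV';

ALTER MATERIALIZED VIEW detail_sales_mv COMPILE;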
The owner of a materialized view needs SELECT privileges on the tables referenced
if they are from another schema.
Moreover, if you enable query rewrite on a materialized view that references tables
outside your schema, you must have the GLOBAL QUERY REWRITE privilege or the
QUERY REWRITE object privilege on each table outside your schema.
If the materialized view is on a prebuilt container, the creator, if different from the
owner, must have SELECT WITH GRANT privilege on the container table.
If you continue to get a privilege error while trying to create a materialized view
and you believe that all the required privileges have been granted, then the problem
is most likely due to a privilege not being granted explicitly and trying to inherit the
privilege from a role instead. The owner of the materialized view must have
explicitly been granted SELECT access to the referenced tables if the tables are in a
different schema.
If the materialized view is being created with ON COMMIT REFRESH specified, then
the owner of the materialized view requires an additional privilege if any of the
tables in the defining query are outside the owner's schema. In that case, the owner
requires the ON COMMIT REFRESH system privilege or the ON COMMIT REFRESH
object privilege on each table outside the owner's schema.
See Also: Oracle9i SQL Reference for further information about the
ALTER MATERIALIZED VIEW statement and "Invalidating
Materialized Views" on page 8-50
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
further information about the DBMS_MVIEW package
DBMS_MVIEW.EXPLAIN_MVIEW Declarations
The following PL/SQL declarations that are made for you in the DBMS_MVIEW
package show the order and datatypes of these parameters for explaining an
existing materialized view and a potential materialized view with output to a table
and to a VARRAY.
Explain an existing or potential materialized view with output to MV_
CAPABILITIES_TABLE:
DBMS_MVIEW.EXPLAIN_MVIEW
(mv IN VARCHAR2,
stmt_id IN VARCHAR2:= NULL);
Using MV_CAPABILITIES_TABLE
One of the simplest ways to use DBMS_MVIEW.EXPLAIN_MVIEW is with the MV_
CAPABILITIES_TABLE, which has the following structure:
CREATE TABLE MV_CAPABILITIES_TABLE
(
STMT_ID VARCHAR(30), -- client-supplied unique statement identifier
MV VARCHAR(30), -- NULL for SELECT based EXPLAIN_MVIEW
CAPABILITY_NAME VARCHAR(30), -- A descriptive name of particular
-- capabilities, such as REWRITE.
-- See Table 8–6
POSSIBLE CHARACTER(1), -- Y = capability is possible
-- N = capability is not possible
RELATED_TEXT VARCHAR(2000), -- owner.table.column, and so on related to
-- this message
RELATED_NUM NUMBER, -- When there is a numeric value
-- associated with a row, it goes here.
MSGNO INTEGER, -- When available, message # explaining
-- why disabled or more details when
-- enabled.
MSGTXT VARCHAR(2000), -- Text associated with MSGNO
SEQ NUMBER); -- Useful in ORDER BY clause when
-- selecting from this table.
You can use the utlxmv.sql script found in the admin directory to create MV_
CAPABILITIES_TABLE.
Example of DBMS_MVIEW.EXPLAIN_MVIEW
First, create the materialized view. Alternatively, you can use EXPLAIN_MVIEW on a
potential materialized view using its SELECT statement.
CREATE MATERIALIZED VIEW cal_month_sales_mv
BUILD IMMEDIATE
REFRESH FORCE
ENABLE QUERY REWRITE
AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Then, you invoke EXPLAIN_MVIEW with the materialized view to explain. You need
to use the SEQ column in an ORDER BY clause so the rows will display in a logical
order. If a capability is not possible, N will appear in the P column and an
explanation in the MSGTXT column. If a capability is not possible for more than one
reason, a row is displayed for each reason.
EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW ('SH.CAL_MONTH_SALES_MV');
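A query such as the following displays the results; the column formatting is illustrative:

SELECT capability_name, possible P, SUBSTR(related_text, 1, 8) rel_text,
       SUBSTR(msgtxt, 1, 60) msgtxt
FROM mv_capabilities_table
ORDER BY seq;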
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details
Table 8–6 lists explanations for values in the CAPABILITY_NAME column.
The following sections will help you create and manage a data warehouse:
■ What are Dimensions?
■ Creating Dimensions
■ Viewing Dimensions
■ Using Dimensions with Constraints
■ Validating Dimensions
■ Altering Dimensions
■ Deleting Dimensions
■ Using the Dimension Wizard
What are Dimensions?
Figure 9–1 shows a typical dimension hierarchy: customer rolls up to city, state, country, subregion, and region.
Data analysis typically starts at higher levels in the dimensional hierarchy and
gradually drills down if the situation warrants such analysis.
Dimensions do not have to be defined, but spending time creating them can yield
significant benefits, because they help query rewrite perform more complex types of
rewrite. They are mandatory if you use the Summary Advisor (a GUI tool for
materialized view management) to recommend which materialized views to create,
drop, or retain.
You must not create dimensions in any schema that does not satisfy these
relationships. Incorrect results can be returned from queries otherwise.
Creating Dimensions
Before you can create a dimension object, the dimension tables must exist in the
database, containing the dimension data. For example, if you create a customer
dimension, one or more tables must exist that contain the city, state, and country
information. In a star schema data warehouse, these dimension tables already exist.
It is therefore a simple task to identify which ones will be used.
Now you can draw the hierarchies of a dimension as shown in Figure 9–1. For
example, city is a child of state (because you can aggregate city-level data up to
state), and state is a child of country. This hierarchical information will be stored in the database
object dimension.
In the case of normalized or partially normalized dimension representation (a
dimension that is stored in more than one table), identify how these tables are
joined. Note whether the joins between the dimension tables can guarantee that
each child-side row joins with one and only one parent-side row. In the case of
denormalized dimensions, determine whether the child-side columns uniquely
determine the parent-side (or attribute) columns. These constraints can be enabled
with the NOVALIDATE and RELY clauses if the relationships represented by the
constraints are guaranteed by other means.
You create a dimension using either the CREATE DIMENSION statement or the
Dimension Wizard in Oracle Enterprise Manager. Within the CREATE DIMENSION
statement, use the LEVEL clause to identify the names of the dimension levels.
Each level in the dimension must correspond to one or more columns in a table in
the database. Thus, level product is identified by the column prod_id in the
products table and level subcategory is identified by a column called prod_
subcategory in the same table.
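A sketch of the corresponding LEVEL clauses follows; the hierarchy and attribute clauses, discussed next, complete the statement:

CREATE DIMENSION products_dim
  LEVEL product     IS products.prod_id
  LEVEL subcategory IS products.prod_subcategory
  LEVEL category    IS products.prod_category
  ...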
In this example, the database tables are denormalized and all the columns exist in
the same table. However, this is not a prerequisite for creating dimensions. "Using
Normalized Dimension Tables" on page 9-9 shows how to create a dimension
customers_dim that has a normalized schema design using the JOIN KEY clause.
The next step is to declare the relationship between the levels with the HIERARCHY
statement and give that hierarchy a name. A hierarchical relationship is a functional
dependency from one level of a hierarchy to the next level in the hierarchy. Using
the level names defined previously, the CHILD OF relationship denotes that each
child's level value is associated with one and only one parent level value. The
following statements declare a hierarchy prod_rollup and define the relationship
between products, subcategory, and category.
HIERARCHY prod_rollup
(product CHILD OF
subcategory CHILD OF
category)
See Also: Chapter 22, "Query Rewrite" for further details of using
dimensional information
The design, creation, and maintenance of dimensions is part of the design, creation,
and maintenance of your data warehouse schema. Once the dimension has been
created, check that it meets these requirements:
■ There must be a 1:n relationship between a parent and children. A parent can
have one or more children, but a child can have only one parent.
■ There must be a 1:1 attribute relationship between hierarchy levels and their
dependent dimension attributes. For example, if there is a column fiscal_
month_desc, then a possible attribute relationship would be fiscal_month_
desc to fiscal_month_name.
■ If the columns of a parent level and child level are in different relations, then the
connection between them also requires a 1:n join relationship. Each row of the
child table must join with one and only one row of the parent table. This
relationship is stronger than referential integrity alone, because it requires that
the child join key must be non-null, that referential integrity must be
maintained from the child join key to the parent join key, and that the parent
join key must be unique.
■ You must ensure (using database constraints if necessary) that the columns of
each hierarchy level are non-null and that hierarchical integrity is maintained.
■ The hierarchies of a dimension can overlap or be disconnected from each other.
However, the columns of a hierarchy level cannot be associated with more than
one dimension.
■ Join relationships that form cycles in the dimension graph are not supported.
For example, a hierarchy level cannot be joined to itself either directly or
indirectly.
Multiple Hierarchies
A single dimension definition can contain multiple hierarchies. Suppose our retailer
wants to track the sales of certain items over time. The first step is to define the time
dimension over which sales will be tracked. Figure 9–2 illustrates a dimension
times_dim with two time hierarchies.
The calendar hierarchy is day, month, quarter, year; the fiscal hierarchy is day, fis_week, fis_month, fis_quarter, fis_year.
From the illustration, you can construct the hierarchy of the denormalized
times_dim dimension's CREATE DIMENSION statement as follows. The complete CREATE
DIMENSION statement as well as the CREATE TABLE statement are shown in
Oracle9i Sample Schemas.
CREATE DIMENSION times_dim
LEVEL day IS TIMES.TIME_ID
LEVEL month IS TIMES.CALENDAR_MONTH_DESC
LEVEL quarter IS TIMES.CALENDAR_QUARTER_DESC
LEVEL year IS TIMES.CALENDAR_YEAR
LEVEL fis_week IS TIMES.WEEK_ENDING_DAY
LEVEL fis_month IS TIMES.FISCAL_MONTH_DESC
LEVEL fis_quarter IS TIMES.FISCAL_QUARTER_DESC
LEVEL fis_year IS TIMES.FISCAL_YEAR
HIERARCHY cal_rollup (
day CHILD OF
month CHILD OF
quarter CHILD OF
year
)
HIERARCHY fis_rollup (
day CHILD OF
fis_week CHILD OF
fis_month CHILD OF
fis_quarter CHILD OF
fis_year
) <attribute determination clauses>...
Viewing Dimensions
Dimensions can be viewed through one of two methods:
■ Using The DEMO_DIM Package
■ Using Oracle Enterprise Manager
To display all of the dimensions that have been defined, call the procedure
DEMO_DIM.PRINT_ALLDIMS without any parameters, as follows:
EXECUTE DBMS_OUTPUT.ENABLE(10000);
EXECUTE DEMO_DIM.PRINT_ALLDIMS;
Validating Dimensions
The information of a dimension object is declarative only and not enforced by the
database. If the relationships described by the dimensions are incorrect, incorrect
results could occur. Therefore, you should verify the relationships specified by
CREATE DIMENSION using the DBMS_OLAP.VALIDATE_DIMENSION procedure
periodically.
This procedure is easy to use and has only five parameters:
■ Dimension name
■ Owner name
■ Set to TRUE to check only the new rows for tables of this dimension
■ Set to TRUE to verify that all columns are not null
■ Unique run ID obtained by calling the DBMS_OLAP.CREATE_ID procedure.
The ID is used to identify the result of each run
The following example validates the dimension TIME_FN in the grocery schema:
VARIABLE RID NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:RID);
EXECUTE DBMS_OLAP.VALIDATE_DIMENSION ('TIME_FN', 'GROCERY', -
FALSE, TRUE, :RID);
However, rather than query this view, it may be better to query the rowid of the
invalid row to retrieve the actual row that has violated the constraint. In this
example, the dimension TIME_FN is checking a table called month. It has found a
row that violates the constraints. Using the rowid, you can see exactly which row in
the month table is causing the problem, as in the following:
Finally, to remove results from the system table for the current run:
EXECUTE DBMS_OLAP.PURGE_RESULTS(:RID);
Altering Dimensions
You can modify the dimension using the ALTER DIMENSION statement. You can
add or drop a level, hierarchy, or attribute from the dimension using this command.
Referring to the time dimension in Figure 9–2 on page 9-8, you can remove the
attribute fis_year, drop the hierarchy fis_rollup, or remove the level
fis_year. In addition, you can add a new level called f_year, as in the
following:
ALTER DIMENSION times_dim DROP ATTRIBUTE fis_year;
ALTER DIMENSION times_dim DROP HIERARCHY fis_rollup;
ALTER DIMENSION times_dim DROP LEVEL fis_year;
ALTER DIMENSION times_dim ADD LEVEL f_year IS times.fiscal_year;
If you try to remove anything with further dependencies inside the dimension,
Oracle rejects the altering of the dimension. A dimension becomes invalid if you
change any schema object that the dimension is referencing. For example, if the
table on which the dimension is defined is altered, the dimension becomes invalid.
To check the status of a dimension, view the contents of the column invalid in the
ALL_DIMENSIONS data dictionary view.
To revalidate the dimension, use the COMPILE option as follows:
ALTER DIMENSION times_dim COMPILE;
Deleting Dimensions
A dimension is removed using the DROP DIMENSION statement. For example:
DROP DIMENSION times_dim;
The levels in the dimension can be shown on the General Property sheet. Alternatively,
by selecting the Levels property sheet, you can display or delete levels, or define new
ones for this dimension, as illustrated in Figure 9–4.
By selecting the level name from the list on the left of the property sheet, the
columns used for this level are displayed in the Selected Columns window in the
lower half of the property sheet.
Levels can be added or removed by pressing the New or Delete buttons but they
cannot be modified.
A similar property sheet to that for Levels is provided for the attributes in the
dimension and is selected by clicking on the Attributes tab.
One of the main advantages of using Oracle Enterprise Manager to define the
dimension is that the hierarchies can be easily displayed. Figure 9–5 illustrates the
Hierarchy property sheet.
In Figure 9–5, you can see that the hierarchy called CAL_ROLLUP contains four
levels where the top level is year, followed by quarter, month, and day.
You can add or remove hierarchies by pressing the New or Delete buttons but they
cannot be modified.
Creating a Dimension
An alternative to writing the CREATE DIMENSION statement is to invoke the
Dimension wizard, which guides you through 6 steps to create a dimension.
Step 1
First, you must define which type of dimension object is to be defined. If a time
dimension is required, selecting the time dimension type ensures that your
dimension is recognized as a time dimension that has specific types of hierarchies
and attributes.
Step 2
Specify the name of your dimension and the schema in which it should reside by
selecting from the drop-down list of schemas.
Step 3
The levels in the dimension are defined in Step 3 as shown in Figure 9–6.
First, give the level a name and then select the table that contains the columns that
define this level. Now, select one or more columns from the available list
and using the > key move them into the Selected Columns area. Your level will
now appear in the list on the left side of the property sheet.
To define another level, click the New button, or, if all the levels have been defined,
click the Next button to proceed to the next step. If a mistake is made when defining
a level, simply click the Delete button to remove it and start again.
Step 4
The levels in the dimension can also have attributes. Give the attribute a name and
then select the level on which this attribute is to be defined and using the > button
move it into the Selected Levels column. Now choose the column from the drop
down list for this attribute.
Levels can be added or removed by pressing the New or Delete buttons but they
cannot be modified.
Step 5
A hierarchy is defined as illustrated in Figure 9–7.
First, give the hierarchy a name and then select the levels to be used in this
hierarchy and move them to the Selected Levels column using the > button.
The level name at the top of the list defines the top of the hierarchy. Use the up and
down buttons to move the levels into the required order. Note that each level will
indent so you can see the relationships between the levels.
Step 6
Finally, the Summary screen is displayed as shown in Figure 9–8 where a graphical
representation of the dimension is shown on the left side of the property sheet and
on the right side the CREATE DIMENSION statement is shown. Clicking on the
Finish button will create the dimension.
This section deals with the tasks for managing a data warehouse.
It contains the following chapters:
■ Overview of Extraction, Transformation, and Loading
■ Extraction in Data Warehouses
■ Transportation in Data Warehouses
■ Loading and Transformation
■ Maintaining the Data Warehouse
■ Change Data Capture
■ Summary Advisor
10
Overview of Extraction, Transformation, and Loading
Overview of ETL
You need to load your data warehouse regularly so that it can serve its purpose of
facilitating business analysis. To do this, data from one or more operational systems
needs to be extracted and copied into the warehouse. The process of extracting data
from source systems and bringing it into the data warehouse is commonly called
ETL, which stands for extraction, transformation, and loading. The acronym ETL is
perhaps too simplistic, because it omits the transportation phase and implies that
each of the other phases of the process is distinct. We refer to the entire process,
including data loading, as ETL. You should understand that ETL refers to a broad
process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are
not necessarily unique to data warehouse environments: a wide variety of
proprietary applications and database systems are the IT backbone of any
enterprise. Data has to be shared between applications or systems in an effort to
integrate them, giving at least two applications the same picture of the world. This
data sharing was mostly addressed by mechanisms similar to what we now call
ETL.
Data warehouse environments face the same challenge with the additional burden
that they not only have to exchange data, but also to integrate, rearrange, and consolidate it
across many systems, thereby providing a new unified information base for business
intelligence. Additionally, the data volume in data warehouse environments tends
to be very large.
What happens during the ETL process? During extraction, the desired data is
identified and extracted from many different sources, including database systems
and applications. Very often, it is not possible to identify the specific subset of
interest, so more data than necessary has to be extracted, and the identification
of the relevant data is done at a later point in time. Depending on the source
system's capabilities (for example, operating system resources), some
transformations may take place during this extraction process. The size of the
extracted data varies from hundreds of kilobytes up to gigabytes, depending on the
source system and the business situation. The same is true for the time delta
between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files for example can
easily become hundreds of megabytes in a very short period of time.
ETL Tools
Designing and maintaining the ETL process is often considered one of the most
difficult and resource-intensive portions of a data warehouse project. Many data
warehousing projects use ETL tools to manage this process. Oracle Warehouse
Builder (OWB), for example, provides ETL capabilities and takes advantage of
inherent database abilities. Other data warehouse builders create their own ETL
tools and processes, either inside or outside the database.
Besides the support of extraction, transformation, and loading, there are some other
tasks that are important for a successful ETL implementation as part of the daily
operations of the data warehouse and its support for further enhancements. Besides
the support for designing a data warehouse and the data flow, these tasks are
typically addressed by ETL tools such as OWB.
Oracle9i is not an ETL tool and does not provide a complete solution for ETL.
However, Oracle9i does provide a rich set of capabilities that can be used by both
ETL tools and customized ETL solutions. Oracle9i offers techniques for transporting
data between Oracle databases, for transforming large volumes of data, and for
quickly loading new data into a data warehouse.
Daily Operations
The successive loads and transformations must be scheduled and processed in a
specific order. Depending on the success or failure of the operation or parts of it, the
result must be tracked and subsequent, alternative processes might be started. The
control of the progress as well as the definition of a business workflow of the
operations are typically addressed by ETL tools such as OWB.
This chapter discusses extraction, which is the process of taking data from an
operational system and moving it to your warehouse or staging system. The chapter
discusses:
■ Overview of Extraction in Data Warehouses
■ Introduction to Extraction Methods in Data Warehouses
■ Data Warehousing Extraction Examples
Full Extraction
The data is extracted completely from the source system. Since this extraction
reflects all the data currently available on the source system, there’s no need to keep
track of changes to the data source since the last successful extraction. The source
data will be provided as-is and no additional logical information (for example,
timestamps) is necessary on the source site. An example of a full extraction is an
export file of a single table or a remote SQL statement scanning the complete
source table.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event
back in history will be extracted. This event may be the last time of extraction or a
more complex business event like the last booking day of a fiscal period. To identify
this delta change, there must be a way to identify all the information that has changed
since this specific time event. This information can be provided either by the source
data itself, such as an application column reflecting the last-changed timestamp, or by a
change table in which an appropriate additional mechanism keeps track of the
changes in addition to the originating transactions. In most cases, using the latter method
means adding extraction logic to the source system.
Many data warehouses do not use any change-capture techniques as part of the
extraction process. Instead, entire tables from the source systems are extracted to the
data warehouse or staging area, and these tables are compared with a previous
extract from the source system to identify the changed data. This approach may not
have significant impact on the source systems, but it clearly can place a considerable
burden on the data warehouse processes, particularly if the data volumes are large.
Oracle’s Change Data Capture mechanism can extract and maintain such delta
information.
See Also: Chapter 15, "Change Data Capture" for further details
about the Change Data Capture framework
Online Extraction
The data is extracted directly from the source system itself. The extraction process
can connect directly to the source system to access the source tables themselves or to
an intermediate system that stores the data in a preconfigured manner (for example,
snapshot logs or change tables). Note that the intermediate system is not necessarily
physically different from the source system.
With online extractions, you need to consider whether the distributed transactions
are using original source objects or prepared source objects.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly
outside the original source system. The data already has an existing structure (for
example, redo logs, archive logs or transportable tablespaces) or was created by an
extraction routine.
Because change data capture is often desirable as part of the extraction process and
it might not be possible to use Oracle’s Change Data Capture mechanism, this
section describes several techniques for implementing a self-developed change
capture on Oracle source systems:
■ Timestamps
■ Partitioning
■ Triggers
These techniques are based upon the characteristics of the source systems, or may
require modifications to the source systems. Thus, each of these techniques must be
carefully evaluated by the owners of the source system prior to implementation.
Each of these techniques can work in conjunction with the data extraction technique
discussed previously. For example, timestamps can be used whether the data is
being unloaded to a file or accessed through a distributed query.
See Also: Chapter 15, "Change Data Capture" for further details
Timestamps
The tables in some operational systems have timestamp columns. The timestamp
specifies the time and date that a given row was last modified. If the tables in an
operational system have columns containing timestamps, then the latest data can
easily be identified using the timestamp columns. For example, the following query
might be useful for extracting today's data from an orders table:
SELECT * FROM orders
WHERE TRUNC(CAST(order_date AS DATE), 'dd') = TRUNC(SYSDATE, 'dd');
Partitioning
Some source systems might use Oracle range partitioning, such that the source
tables are partitioned along a date key, which allows for easy identification of new
data. For example, if you are extracting from an orders table, and the orders
table is partitioned by week, then it is easy to identify the current week's data.
Triggers
Triggers can be created in operational systems to keep track of recently updated
records. They can then be used in conjunction with timestamp columns to identify
the exact time and date when a given row was last modified. You do this by creating
a trigger on each source table that requires change data capture. Following each
DML statement that is executed on the source table, this trigger updates the
timestamp column with the current time. Thus, the timestamp column provides the
exact time and date when a given row was last modified.
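A minimal sketch of such a trigger, assuming the source table has a last_modified timestamp column, is:

CREATE OR REPLACE TRIGGER orders_set_timestamp
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;   -- record when this row was last changed
END;
/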
A similar internalized trigger-based technique is used for Oracle materialized view
logs. These logs are used by materialized views to identify changed data, and these
logs are accessible to end users. A materialized view log can be created on each
source table requiring change data capture. Then, whenever any modifications are
made to the source table, a record is inserted into the materialized view log
indicating which rows were modified. If you want to use a trigger-based
mechanism, use change data capture.
Materialized view logs rely on triggers, but they provide an advantage in that the
creation and maintenance of this change-data system is largely managed by Oracle.
However, Oracle recommends the usage of synchronous Change Data Capture for
trigger-based change capture, because CDC provides an externalized interface for
accessing the change information and provides a framework for maintaining the
distribution of this information to various clients.
Trigger-based techniques affect performance on the source systems, and this impact
should be carefully considered prior to implementation on a production source
system.
The exact format of the output file can be specified using SQL*Plus system
variables.
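For example, a simple join between customers and countries could be spooled to a flat file as in the following sketch; the file name and delimiter are illustrative:

SET echo off
SET heading off
SET feedback off
SET pagesize 0
SPOOL cust_country_extract.dat
SELECT c.cust_id ||'|'|| c.cust_city ||'|'|| co.country_name
FROM   customers c, countries co
WHERE  c.country_id = co.country_id;
SPOOL off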
This extraction technique offers the advantage of being able to extract the output of
any SQL statement. The previous example extracts the results of a join.
This extraction technique can be parallelized by initiating multiple, concurrent
SQL*Plus sessions, each session running a separate query representing a different
portion of the data to be extracted. For example, suppose that you wish to extract
data from an orders table, and that the orders table has been range partitioned
by month, with partitions orders_jan1998, orders_feb1998, and so on. To
extract a single year of data from the orders table, you could initiate 12 concurrent
SQL*Plus sessions, each extracting a single partition. The SQL script for one such
session could be:
SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL OFF
The physical method is based on a range of values. By viewing the data dictionary,
it is possible to identify the Oracle data blocks that make up the orders table.
Using this information, you could then derive a set of rowid-range queries for
extracting data from the orders table:
SELECT * FROM orders WHERE rowid BETWEEN value1 and value2;
Note: All parallel techniques can use considerably more CPU and
I/O resources on the source system, and the impact on the source
system should be evaluated before parallelizing any extraction
technique.
Oracle provides a direct-path export, which is quite efficient for extracting data.
However, in Oracle8i, there is no direct-path import, which should be considered
when evaluating the overall performance of an export-based extraction strategy.
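Data can also be extracted through a distributed query over a database link. A sketch of such a statement follows; the database link name source_db is illustrative:

CREATE TABLE country_city
AS SELECT DISTINCT co.country_name, c.cust_city
   FROM countries@source_db co, customers@source_db c
   WHERE co.country_id = c.country_id;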
This statement creates a local table in a data mart, country_city, and populates it
with data from the countries and customers tables on the source system.
This technique is ideal for moving small volumes of data. However, the data is
transported from the source system to the data warehouse through a single Oracle
Net connection. Thus, the scalability of this technique is limited. For larger data
volumes, file-based data extraction and transportation techniques are often more
scalable and thus more appropriate.
The following topics provide information about transporting data into a data
warehouse:
■ Overview of Transportation in Data Warehouses
■ Introduction to Transportation Mechanisms in Data Warehouses
Step 1: Place the Data to be Transported into its own Tablespace The current month's data
must be placed into a separate tablespace in order to be transported. In this
example, you have a tablespace ts_temp_sales, which will hold a copy of the
current month's data. Using the CREATE TABLE ... AS SELECT statement, the
current month's data can be efficiently copied to this tablespace:
CREATE TABLE temp_sales_jan
NOLOGGING
TABLESPACE ts_temp_sales
AS
SELECT * FROM sales
WHERE time_id BETWEEN '31-DEC-1999' AND '01-FEB-2000';
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
detailed information about the DBMS_TTS package
In this step, we have copied the January sales data into a separate tablespace;
however, in some cases, it may be possible to leverage the transportable tablespace
feature without even moving data to a separate tablespace. If the sales table has
been partitioned by month in the data warehouse and if each partition is in its own
tablespace, then it may be possible to directly transport the tablespace containing
the January data. Suppose the January partition, sales_jan2000, is located in the
tablespace ts_sales_jan2000. Then the tablespace ts_sales_jan2000 could
potentially be transported, rather than creating a temporary copy of the January
sales data in the ts_temp_sales tablespace.
However, the same conditions must be satisfied in order to transport the tablespace
ts_sales_jan2000 as are required for the specially created tablespace. First, this
tablespace must be set to READ ONLY. Second, because a single partition of a
partitioned table cannot be transported without the remainder of the partitioned
table also being transported, it is necessary to exchange the January partition into a
separate table (using the ALTER TABLE statement) to transport the January data.
The EXCHANGE operation is very quick, but the January data will no longer be a
part of the underlying sales table, and thus may be unavailable to users until this
data is exchanged back into the sales table after the export of the metadata. The
January data can be exchanged back into the sales table after you complete step 3.
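A sketch of this preparation, assuming an empty staging table jan_sales_stage with the same column structure as sales (the staging table name is illustrative):
ALTER TABLE sales EXCHANGE PARTITION sales_jan2000 WITH TABLE jan_sales_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
ALTER TABLESPACE ts_sales_jan2000 READ ONLY;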
Step 2: Export the Metadata The Export utility is used to export the metadata
describing the objects contained in the transported tablespace. For our example
scenario, the Export command could be:
EXP TRANSPORT_TABLESPACE=y
TABLESPACES=ts_temp_sales
FILE=jan_sales.dmp
This operation will generate an export file, jan_sales.dmp. The export file will be
small, because it contains only metadata. In this case, the export file will contain
information describing the table temp_sales_jan, such as the column names,
column datatype, and all other information that the target Oracle database will
need in order to access the objects in ts_temp_sales.
Step 3: Copy the Datafiles and Export File to the Target System Copy the data files that
make up ts_temp_sales, as well as the export file jan_sales.dmp to the data
mart platform, using any transportation mechanism for flat files.
Once the datafiles have been copied, the tablespace ts_temp_sales can be set to
READ WRITE mode if desired.
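The corresponding statement is:
ALTER TABLESPACE ts_temp_sales READ WRITE;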
Step 4: Import the Metadata Once the files have been copied to the data mart, the
metadata should be imported into the data mart:
IMP TRANSPORT_TABLESPACE=y DATAFILES='/db/tempjan.f'
TABLESPACES=ts_temp_sales
FILE=jan_sales.dmp
At this point, the tablespace ts_temp_sales and the table temp_sales_jan are
accessible in the data mart. You can incorporate this new data into the data mart's
tables.
You can insert the data from the temp_sales_jan table into the data mart's sales
table in one of two ways:
INSERT /*+ APPEND */ INTO sales SELECT * FROM temp_sales_jan;
Following this operation, you can drop the temp_sales_jan table (and even the
entire ts_temp_sales tablespace).
Alternatively, if the data mart's sales table is partitioned by month, then the new
transported tablespace and the temp_sales_jan table can become a permanent
part of the data mart. The temp_sales_jan table can become a partition of the
data mart's sales table:
ALTER TABLE sales ADD PARTITION sales_00jan VALUES
LESS THAN (TO_DATE('01-feb-2000','dd-mon-yyyy'));
ALTER TABLE sales EXCHANGE PARTITION sales_00jan
WITH TABLE temp_sales_jan
INCLUDING INDEXES WITH VALIDATION;
This chapter describes how to load and transform data for a data warehouse, and discusses:
■ Overview of Loading and Transformation in Data Warehouses
■ Loading Mechanisms
■ Transformation Mechanisms
■ Loading and Transformation Scenarios
Transformation Flow
From an architectural perspective, you can transform your data in two ways:
■ Multistage Data Transformation
■ Pipelined Data Transformation
[Figure: Multistage Data Transformation — data moves from a source table through intermediate staging tables before being inserted into the sales warehouse table]
[Figure: Pipelined Data Transformation — data from flat files is transformed and inserted into the sales warehouse table in a single, uninterrupted flow]
Loading Mechanisms
You can use the following mechanisms for loading a warehouse:
■ SQL*Loader
■ External Tables
■ OCI and Direct-Path APIs
■ Export/Import
SQL*Loader
Before any data transformations can occur within the database, the raw data must
become accessible for the database. One approach is to load it into the database.
Chapter 12, "Transportation in Data Warehouses", discusses several techniques for
transporting data to an Oracle data warehouse. Perhaps the most common
technique for transporting data is by way of flat files.
SQL*Loader is used to move data from flat files into an Oracle data warehouse.
During this data load, SQL*Loader can also be used to implement basic data
transformations. When using direct-path SQL*Loader, basic data manipulation,
such as datatype conversion and simple NULL handling, can be automatically
resolved during the data load. Most data warehouses use direct-path loading for
performance reasons.
Oracle's conventional-path loader provides broader capabilities for data
transformation than a direct-path loader: SQL functions can be applied to any
column as those values are being loaded. This provides a rich capability for
transformations during the data load. However, the conventional-path loader is
slower than direct-path loader. For these reasons, the conventional-path loader
should be considered primarily for loading and transforming smaller amounts of
data.
The following is a simple example of a SQL*Loader controlfile to load data into the
sales table of the sh sample schema from an external file sh_sales.dat. The
external flat file sh_sales.dat consists of sales transaction data, aggregated on a
daily level. Not all columns of this external file are loaded into sales. This external
file will also be used as the source for loading the second fact table of the sh sample
schema, which is done using an external table.
The following shows the control file (sh_sales.ctl) used to load the sales table:
LOAD DATA
INFILE sh_sales.dat
APPEND INTO TABLE sales
FIELDS TERMINATED BY "|"
( PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID,
QUANTITY_SOLD, AMOUNT_SOLD)
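A load using this control file might be invoked as follows (the user name and the choice of a direct-path load are illustrative):
sqlldr sh/sh CONTROL=sh_sales.ctl DIRECT=TRUE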
External Tables
Another approach for handling external data sources is using external tables.
Oracle9i’s external table feature enables you to use external data as a virtual table
that can be queried and joined directly and in parallel without requiring the
external data to be first loaded in the database. You can then use SQL, PL/SQL, and
Java to access the external data.
External tables enable the pipelining of the loading phase with the transformation
phase. The transformation process can be merged with the loading process without
any interruption of the data streaming. It is no longer necessary to stage the data
inside the database for further processing, such as comparison or transformation.
For example, the conversion functionality of a conventional load
or transformation. For example, the conversion functionality of a conventional load
can be used for a direct-path INSERT AS SELECT statement in conjunction with the
SELECT from an external table.
The main difference between external tables and regular tables is that externally
organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE)
are possible and no indexes can be created on them.
Oracle9i’s external tables are a complement to the existing SQL*Loader
functionality, and are especially useful for environments where the complete
external source has to be joined with existing database objects and transformed in a
complex manner, or where the external data volume is large and used only once.
SQL*Loader, on the other hand, might still be the better choice for loading data
where additional indexing of the staging table is necessary. This is true for
operations where the data is used in independent complex transformations or the
data is only partially used in further processing.
The following statement creates the external table sales_transactions_ext over
the flat file sh_sales.dat; the directory objects and the column datatypes shown
here are representative:
CREATE TABLE sales_transactions_ext
( prod_id NUMBER, cust_id NUMBER, time_id DATE, channel_id CHAR,
  promo_id NUMBER, quantity_sold NUMBER, amount_sold NUMBER(10,2),
  unit_cost NUMBER(10,2), unit_price NUMBER(10,2) )
ORGANIZATION EXTERNAL
( TYPE oracle_loader
  DEFAULT DIRECTORY data_file_dir
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    BADFILE log_file_dir:'sh_sales.bad_xt'
    LOGFILE log_file_dir:'sh_sales.log_xt'
    FIELDS TERMINATED BY "|" LDRTRIM )
  LOCATION ('sh_sales.dat')
) REJECT LIMIT UNLIMITED;
The external table can now be used from within the database, accessing some
columns of the external data only, grouping the data, and inserting it into the
costs fact table:
INSERT /*+ APPEND */ INTO COSTS
(
TIME_ID,
PROD_ID,
UNIT_COST,
UNIT_PRICE
)
SELECT
TIME_ID,
PROD_ID,
SUM(UNIT_COST),
SUM(UNIT_PRICE)
FROM sales_transactions_ext
GROUP BY time_id, prod_id;
Export/Import
Export and import are used when the data is inserted as is into the target system.
This approach is not suited to large data volumes, and no complex extractions are
possible.
Transformation Mechanisms
You have the following choices for transforming data inside the database:
■ Transformation Using SQL
■ Transformation Using PL/SQL
■ Transformation Using Table Functions
value. For example, you can do this efficiently using a SQL function as part of the
INSERT statement that loads the target sales table.
The structure of source table sales_activity_direct is as follows:
SQL> DESC sales_activity_direct
Name Null? Type
------------ ----- ----------------
SALES_DATE DATE
PRODUCT_ID NUMBER
CUSTOMER_ID NUMBER
PROMOTION_ID NUMBER
AMOUNT NUMBER
QUANTITY NUMBER
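A sketch of such an insertion, assuming that time_id is derived by truncating sales_date to a day and that a constant channel_id of 'S' marks direct sales (both mappings are illustrative):
INSERT /*+ APPEND */ INTO sales
  (prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
SELECT product_id, customer_id, TRUNC(sales_date), 'S',
       promotion_id, quantity, amount
FROM   sales_activity_direct;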
When to Use Merge There are several benefits of the new MERGE statement as
compared with the two other existing approaches.
■ The entire operation can be expressed much more simply as a single SQL
statement.
■ You can parallelize statements transparently.
■ You can use bulk DML.
■ Performance will improve because your statements will require fewer scans of
the source table.
The advantage of this approach is its simplicity and lack of new language
extensions. The disadvantage is its performance. It requires an extra scan and a join
of both the products_delta and the products tables.
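For comparison, a MERGE-based version of this operation might look like the following sketch (products_delta comes from the text; the column subset that is updated is an illustrative choice):
MERGE INTO products t
USING products_delta s
ON (t.prod_id = s.prod_id)
WHEN MATCHED THEN UPDATE SET
  t.prod_list_price = s.prod_list_price,
  t.prod_min_price  = s.prod_min_price
WHEN NOT MATCHED THEN INSERT
  (prod_id, prod_name, prod_desc, prod_subcategory, prod_subcat_desc,
   prod_category, prod_cat_desc, prod_status, prod_list_price, prod_min_price)
VALUES
  (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory, s.prod_subcat_desc,
   s.prod_category, s.prod_cat_desc, s.prod_status, s.prod_list_price,
   s.prod_min_price);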
IF SQL%notfound THEN
INSERT INTO products
(prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcat_desc, prod_category,
prod_cat_desc, prod_status, prod_list_price, prod_min_price)
VALUES
(crec.prod_id, crec.prod_name, crec.prod_desc, crec.prod_subcategory,
crec.prod_subcat_desc, crec.prod_category, crec.prod_cat_desc,
crec.prod_status, crec.prod_list_price, crec.prod_min_price);
END IF;
Figure 13–3 illustrates a typical aggregation where you input a set of rows and
output a set of rows, in that case, after performing a SUM operation.
In                            Out
Region   Sales                Region   Sum of Sales
North       10                North        35
South       20    (Table      South        30
North       25    Function)   West         10
East         5                East          5
West        10
South       10
...        ...
The table function takes the result of the SELECT on In as input and delivers a set
of records in a different format as output for a direct insertion into Out.
Additionally, a table function can fan out data within the scope of an atomic
transaction. This can be used in many situations, such as an efficient logging
mechanism or fanning out data to other independent transformations. In such a
scenario, a single staging table is needed.
[Figure: Pipelined Data Transformation with Fanout — data flows from Source through table functions tf1 and tf2 into Target; tf1 also writes to Stage Table 1, which is read by table function tf3]
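A statement of this shape could drive the pipeline shown in the figure (a sketch; the function and table names follow the figure):
INSERT INTO target
SELECT * FROM TABLE(tf2(CURSOR(SELECT * FROM TABLE(tf1(CURSOR(SELECT * FROM source))))));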
This will insert into target and, as part of tf1, into Stage Table 1 within the
scope of an atomic transaction.
INSERT INTO target SELECT * FROM TABLE(tf3(CURSOR(SELECT * FROM stage_table1)));
sales NUMBER:=0;
objset product_t_table := product_t_table();
i NUMBER := 0;
BEGIN
LOOP
-- Fetch from cursor variable
FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcat_desc, prod_category, prod_cat_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price;
EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched
IF prod_status='obsolete' AND prod_category != 'Boys' THEN
-- append to collection
i:=i+1;
objset.extend;
objset(i):=product_t(prod_id, prod_name, prod_desc, prod_subcategory,
  prod_subcat_desc, prod_category, prod_cat_desc, prod_weight_class,
  prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
  prod_list_price, prod_min_price);
END IF;
END LOOP;
CLOSE cur;
RETURN objset;
END;
/
You can use the table function in a SQL statement to show the results. Here we use
additional SQL functionality for the output.
SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products(CURSOR(SELECT * FROM products)));
UPPER(PROD_CATEGORY) PROD_STATUS
-------------------- -----------
GIRLS obsolete
MEN obsolete
2 rows selected.
The following example implements the same filtering as the first one. The main
differences between the two are:
■ This example uses a strongly typed REF cursor as input and can be parallelized
based on the objects of the strongly typed cursor, as shown in one of the following
examples.
■ The table function returns the result set incrementally as soon as records are
created.
REM Same example, pipelined implementation
REM strong ref cursor (input type is defined)
REM a table function without a strongly typed input REF cursor cannot be parallelized
REM
CREATE OR REPLACE FUNCTION obsolete_products_pipe(cur cursor_pkg.strong_refcur_t)
RETURN product_t_table
PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY) IS
prod_id NUMBER(6);
prod_name VARCHAR2(50);
prod_desc VARCHAR2(4000);
prod_subcategory VARCHAR2(50);
prod_subcat_desc VARCHAR2(2000);
prod_category VARCHAR2(50);
prod_cat_desc VARCHAR2(2000);
prod_weight_class NUMBER(2);
prod_unit_of_measure VARCHAR2(20);
prod_pack_size VARCHAR2(30);
supplier_id NUMBER(6);
prod_status VARCHAR2(20);
prod_list_price NUMBER(8,2);
prod_min_price NUMBER(8,2);
sales NUMBER:=0;
BEGIN
LOOP
-- Fetch from cursor variable
FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory, prod_subcat_desc,
  prod_category, prod_cat_desc, prod_weight_class, prod_unit_of_measure,
  prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price;
EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched
IF prod_status='obsolete' AND prod_category !='Boys' THEN
PIPE ROW (product_t(prod_id, prod_name, prod_desc, prod_subcategory,
  prod_subcat_desc, prod_category, prod_cat_desc, prod_weight_class,
  prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
  prod_list_price, prod_min_price));
END IF;
END LOOP;
CLOSE cur;
RETURN;
END;
/
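The output shown next could be produced by a query of the following shape (a sketch; the DECODE mapping is inferred from the column headings):
SELECT DISTINCT prod_category,
       DECODE(prod_status, 'obsolete', 'NO LONGER AVAILABLE', 'N/A')
FROM   TABLE(obsolete_products_pipe(CURSOR(SELECT * FROM products)));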
PROD_CATEGORY DECODE(PROD_STATUS,
------------- -------------------
Girls NO LONGER AVAILABLE
Men NO LONGER AVAILABLE
2 rows selected.
We now change the degree of parallelism for the input table products and issue the
same statement again:
ALTER TABLE products PARALLEL 4;
The session statistics show that the statement has been parallelized:
SELECT * FROM V$PQ_SESSTAT WHERE statistic='Queries Parallelized';
1 row selected.
Table functions are also capable of fanning out results into persistent table
structures. This is demonstrated in the next example. The function filters and
returns all obsolete products except those of a specific prod_category (default
Men), which was set to status obsolete by mistake. The prod_ids detected as
wrong are stored in a separate table structure. The function's result set consists of
all other obsolete product categories. It furthermore demonstrates how normal
variables can be used in conjunction with table functions:
CREATE OR REPLACE FUNCTION obsolete_products_dml(cur cursor_pkg.strong_refcur_t,
prod_cat VARCHAR2 DEFAULT 'Men') RETURN product_t_table
PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY) IS
PRAGMA AUTONOMOUS_TRANSACTION;
prod_id NUMBER(6);
prod_name VARCHAR2(50);
prod_desc VARCHAR2(4000);
prod_subcategory VARCHAR2(50);
prod_subcat_desc VARCHAR2(2000);
prod_category VARCHAR2(50);
prod_cat_desc VARCHAR2(2000);
prod_weight_class NUMBER(2);
prod_unit_of_measure VARCHAR2(20);
prod_pack_size VARCHAR2(30);
supplier_id NUMBER(6);
prod_status VARCHAR2(20);
prod_list_price NUMBER(8,2);
prod_min_price NUMBER(8,2);
sales NUMBER:=0;
BEGIN
LOOP
-- Fetch from cursor variable
FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory, prod_subcat_desc,
  prod_category, prod_cat_desc, prod_weight_class, prod_unit_of_measure,
  prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price;
EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched
IF prod_status='obsolete' THEN
IF prod_category=prod_cat THEN
INSERT INTO obsolete_products_errors VALUES
(prod_id, 'correction: category '||UPPER(prod_cat)||' still available');
ELSE
PIPE ROW (product_t(prod_id, prod_name, prod_desc, prod_subcategory,
  prod_subcat_desc, prod_category, prod_cat_desc, prod_weight_class,
  prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
  prod_list_price, prod_min_price));
END IF;
END IF;
END LOOP;
COMMIT;
CLOSE cur;
RETURN;
END;
/
The following query shows all obsolete product groups except the prod_category
Men, which was wrongly set to status obsolete.
SELECT DISTINCT prod_category, prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT * FROM products)));
PROD_CATEGORY PROD_STATUS
------------- -----------
Boys obsolete
Girls obsolete
2 rows selected.
As you can see, there are some products of the prod_category Men that were
obsoleted by accident:
SELECT DISTINCT msg FROM obsolete_products_errors;
MSG
----------------------------------------
correction: category MEN still available
1 row selected.
Taking advantage of the second input variable changes the result set as follows:
SELECT DISTINCT prod_category, prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT * FROM products), 'Boys'));
PROD_CATEGORY PROD_STATUS
------------- -----------
Girls obsolete
Men obsolete
2 rows selected.
MSG
-----------------------------------------
correction: category BOYS still available
1 row selected.
Because table functions can be used like a normal table, they can be nested, as
shown in the following:
SELECT DISTINCT prod_category, prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT *
FROM TABLE(obsolete_products_pipe(CURSOR(SELECT * FROM products))))));
PROD_CATEGORY PROD_STATUS
------------- -----------
Girls obsolete
1 row selected.
The biggest advantage of Oracle9i ETL is its toolkit functionality, where you can
combine any of the previously discussed functionality to improve and speed up
your ETL processing. For example, you can take an external table as input, join it
with an existing table, and use the result as input for a parallelized table function
that processes complex business logic. This table function can in turn be used as the
input source for a MERGE operation, thus streaming the new information for the
data warehouse, provided in a flat file, through the complete ETL process within a
single statement.
datafile when it is added to a tablespace), it is best if each of the four tablespaces
on each group of 10 disks has its first datafile on a different disk. Thus the first
tablespace has /dev/D1.1 as its first datafile, the second tablespace has
/dev/D4.2 as its first datafile, and so on, as illustrated in Figure 13–5.
Figure 13–5 Datafile Layout for Parallel Load Example
TSfacts1   /dev/D1.1    /dev/D2.1    ...   /dev/D10.1
TSfacts2   /dev/D1.2    /dev/D2.2    ...   /dev/D10.2
TSfacts3   /dev/D1.3    /dev/D2.3    ...   /dev/D10.3
TSfacts4   /dev/D1.4    /dev/D2.4    ...   /dev/D10.4
TSfacts5   /dev/D11.1   /dev/D12.1   ...   /dev/D20.1
TSfacts6   /dev/D11.2   /dev/D12.2   ...   /dev/D20.2
TSfacts7   /dev/D11.3   /dev/D12.3   ...   /dev/D20.3
TSfacts8   /dev/D11.4   /dev/D12.4   ...   /dev/D20.4
Extent sizes in the STORAGE clause should be multiples of the multiblock read size,
where blocksize * MULTIBLOCK_READ_COUNT = multiblock read size.
INITIAL and NEXT should normally be set to the same value. In the case of parallel
load, make the extent size large enough to keep the number of extents reasonable,
and to avoid excessive overhead and serialization due to bottlenecks in the data
dictionary. When PARALLEL=TRUE is used for parallel loader, the INITIAL extent
is not used. In this case you can override the INITIAL extent size specified in the
tablespace default storage clause with the value specified in the loader control file,
for example, 64KB.
Tables or indexes can have an unlimited number of extents, provided you have set
the COMPATIBLE initialization parameter to match the current release number and
use the MAXEXTENTS UNLIMITED keyword on the CREATE or ALTER statement for
the tablespace or object. In practice, however, a limit of 10,000 extents for each
object is reasonable. If a table or index has an unlimited number of extents, set the
PCTINCREASE storage parameter to zero so that all extents are of equal size.
However, regardless of the setting of this keyword, if you have one loader process
for each partition, you are still effectively loading into the table in parallel.
The advantage of this approach is that local indexes are maintained by SQL*Loader.
You still get parallel loading, but on a partition level—without the restrictions of the
PARALLEL keyword.
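A sketch of such a partition-level load, assuming one data file and one control file per partition, where each control file names a single partition (for example, APPEND INTO TABLE sales PARTITION (sales_jan2000)); all names are illustrative:
sqlldr sh/sh CONTROL=jan.ctl DATA=jan.dat DIRECT=TRUE
sqlldr sh/sh CONTROL=feb.ctl DATA=feb.dat DIRECT=TRUE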
A disadvantage is that you must manually partition the input data prior to loading.
Oracle partitions the input data so that it goes into the correct partitions. In this case
all the loader sessions can share the same control file, so there is no need to mention
it in the statement.
The keyword PARALLEL=TRUE must be used, because each of the seven loader
sessions can write into every partition. In Case 1, every loader session would write
into only one partition, because the data was partitioned prior to loading. Hence all
the PARALLEL keyword restrictions are in effect.
In this case, Oracle attempts to spread the data evenly across all the files in each of
the 12 tablespaces—however an even spread of data is not guaranteed. Moreover,
there could be I/O contention during the load when the loader processes are
attempting to write to the same device simultaneously.
For Oracle Real Application Clusters, divide the loader session evenly among the
nodes. The datafile being read should always reside on the same node as the loader
session.
The keyword PARALLEL=TRUE must be used, because multiple loader sessions can
write into the same partition. Hence all the restrictions entailed by the PARALLEL
keyword are in effect. An advantage of this approach, however, is that it guarantees
that all of the data is precisely balanced, exactly reflecting your partitioning.
in parallel. The statement starting up the first session would be similar to the
following:
SQLLDR DATA=file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1
. . .
SQLLDR DATA=file30.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30
The advantage of this approach is that as in Case 3, you have control over the exact
placement of datafiles because you use the FILE keyword. However, you are not
required to partition the input data by value because Oracle does that for you.
A disadvantage is that this approach requires all the partitions to be in the same
tablespace, which reduces availability.
In this example, stage_dir is a directory where the external flat files reside.
Note that loading data in parallel can be performed in Oracle9i by using
SQL*Loader. But external tables are probably easier to use and the parallel load is
automatically coordinated. Unlike SQL*Loader, dynamic load balancing between
parallel execution servers will be performed as well because there will be intra-file
parallelism. The latter implies that the user will not have to manually split input
files before starting the parallel load. This will be accomplished dynamically.
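A key-lookup transformation of the kind discussed next can be written as a single CTAS; a sketch, assuming the raw records are staged in temp_sales_step1 and that product carries a upc_code column (the column names are illustrative):
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS
SELECT t.sales_transaction_id, p.product_id sales_product_id,
       t.sales_customer_id, t.sales_time_id, t.sales_channel_id,
       t.sales_quantity_sold, t.sales_dollar_amount
FROM   temp_sales_step1 t, product p
WHERE  t.upc_code = p.upc_code;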
This CTAS statement will convert each valid UPC code to a valid product_id
value. If the ETL process has guaranteed that each UPC code is valid, then this
statement alone may be sufficient to implement the entire transformation.
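If some UPC codes might be invalid, the same statement can be written with an outer join so that unmatched rows are kept; a sketch under the same assumptions:
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS
SELECT t.sales_transaction_id, p.product_id sales_product_id,
       t.sales_customer_id, t.sales_time_id, t.sales_channel_id,
       t.sales_quantity_sold, t.sales_dollar_amount
FROM   temp_sales_step1 t, product p
WHERE  t.upc_code = p.upc_code (+);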
Using this outer join, the sales transactions that originally contained invalidated
UPC codes will be assigned a product_id of NULL. These transactions can be
handled later.
Additional approaches to handling invalid UPC codes exist. Some data warehouses
may choose to insert null-valued product_id values into their sales table, while
other data warehouses may not allow any new data from the entire batch to be
inserted into the sales table until all invalid UPC codes have been addressed. The
correct approach is determined by the business requirements of the data warehouse.
Regardless of the specific requirements, exception handling can be addressed by the
same basic SQL techniques as transformations.
Pivoting Scenarios
A data warehouse can receive data from many different sources. Some of these
source systems may not be relational databases and may store data in very different
formats from the data warehouse. For example, suppose that you receive a set of
sales records from a nonrelational database having the form:
product_id, customer_id, weekly_start_date, sales_sun, sales_mon, sales_tue,
sales_wed, sales_thu, sales_fri, sales_sat
PRODUCT_ID CUSTOMER_ID WEEKLY_ST SALES_SUN SALES_MON SALES_TUE SALES_WED SALES_THU SALES_FRI SALES_SAT
---------- ----------- --------- --------- --------- --------- --------- --------- --------- ---------
111 222 01-OCT-00 100 200 300 400 500 600 700
222 333 08-OCT-00 200 300 400 500 600 700 800
333 444 15-OCT-00 300 400 500 600 700 800 900
In your data warehouse, you would want to store the records in a more typical
relational form in a fact table sales of the Sales History sample schema:
prod_id, cust_id, time_id, amount_sold
Thus, you need to build a transformation such that each record in the input stream
must be converted into seven records for the data warehouse's sales table. This
operation is commonly referred to as pivoting, and Oracle offers several ways to do
this.
The result of the previous example will resemble the following:
SELECT prod_id, cust_id, time_id, amount_sold FROM sales;
Like all CTAS operations, this operation can be fully parallelized. However, the
CTAS approach also requires seven separate scans of the data, one for each day of
the week. Even with parallelism, the CTAS approach can be time-consuming.
This PL/SQL procedure can be modified to provide even better performance. Array
inserts can accelerate the insertion phase of the procedure. Further performance can
be gained by parallelizing this transformation operation, particularly if the temp_
sales_step1 table is partitioned, using techniques similar to the parallelization of
data unloading described in Chapter 11, "Extraction in Data Warehouses". The
primary advantage of this PL/SQL procedure over a CTAS approach is that it
requires only a single scan of the data.
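The multitable INSERT referred to next might look like the following sketch, assuming the weekly source rows are staged in a table sales_input_table with the columns listed earlier (the table name is illustrative):
INSERT ALL
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date,     sales_sun)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 1, sales_mon)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 2, sales_tue)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 3, sales_wed)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 4, sales_thu)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 5, sales_fri)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date + 6, sales_sat)
SELECT product_id, customer_id, weekly_start_date, sales_sun, sales_mon,
       sales_tue, sales_wed, sales_thu, sales_fri, sales_sat
FROM sales_input_table;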
This statement only scans the source table once and then inserts the appropriate
data for each day.
This chapter describes how to load and refresh a data warehouse, and discusses:
■ Using Partitioning to Improve Data Warehouse Refresh
■ Optimizing DML Operations During Refresh
■ Refreshing Materialized Views
■ Using Materialized Views with Partitioned Tables
Apply all constraints to the sales_01_2001 table that are present on the
sales table. This includes referential integrity constraints. A typical constraint
would be:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_customer_id
FOREIGN KEY (customer_id) REFERENCES customer(customer_id) ENABLE NOVALIDATE;
If the partitioned table sales has a primary or unique key that is enforced with
a global index structure, ensure that the constraint on sales_pk_jan01 is
validated without the creation of an index structure, as in the following:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_pk_jan01
PRIMARY KEY (sales_transaction_id) DISABLE VALIDATE;
The creation of the constraint with ENABLE clause would cause the creation of a
unique index, which does not match a local index structure of the partitioned
table. You must not have any index structure built on the nonpartitioned table
to be exchanged for existing global indexes of the partitioned table. The
exchange command would fail.
3. Add the sales_01_2001 table to the sales table.
In order to add this new data to the sales table, we need to do two things.
First, we need to add a new partition to the sales table. We will use the ALTER
TABLE ... ADD PARTITION statement. This will add an empty partition to the
sales table:
ALTER TABLE sales ADD PARTITION sales_01_2001
VALUES LESS THAN (TO_DATE('01-FEB-2001', 'DD-MON-YYYY'));
Then, we can add our newly created table to this partition using the EXCHANGE
PARTITION operation. This will exchange the new, empty partition with the
newly loaded table.
ALTER TABLE sales EXCHANGE PARTITION sales_01_2001 WITH TABLE sales_01_2001
INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
The EXCHANGE operation will preserve the indexes and constraints that were
already present on the sales_01_2001 table. For unique constraints (such as
the unique constraint on sales_transaction_id), you can use the UPDATE
GLOBAL INDEXES clause, as shown previously. This will automatically
maintain your global index structures as part of the partition maintenance
operation and keep them accessible throughout the whole process. If there were
only foreign-key constraints, the exchange operation would be instantaneous.
The benefits of this partitioning technique are significant. First, the new data is
loaded with minimal resource utilization. The new data is loaded into an entirely
separate table, and the index processing and constraint processing are applied only
to the new partition. If the sales table was 50 GB and had 12 partitions, then a new
month's worth of data contains approximately 4 GB. Only the new month's worth of
data needs to be indexed. None of the indexes on the remaining 46 GB of data needs
to be modified at all. This partitioning scheme additionally ensures that the load
processing time is directly proportional to the amount of new data being loaded,
not to the total size of the sales table.
Second, the new data is loaded with minimal impact on concurrent queries. All of
the operations associated with data loading are occurring on a separate sales_01_
2001 table. Therefore, none of the existing data or indexes of the sales table is
affected during this data refresh process. The sales table and its indexes remain
entirely untouched throughout this refresh process.
Third, in case of the existence of any global indexes, those are incrementally
maintained as part of the exchange command. This maintenance does not affect the
availability of the existing global index structures.
The exchange operation can be viewed as a publishing mechanism. Until the data
warehouse administrator exchanges the sales_01_2001 table into the sales
table, end users cannot see the new data. Once the exchange has occurred, then any
end user query accessing the sales table will immediately be able to see the
sales_01_2001 data.
Partitioning is useful not only for adding new data but also for removing and
archiving data. Many data warehouses maintain a rolling window of data. For
example, the data warehouse stores the most recent 36 months of sales data. Just
as a new partition can be added to the sales table (as described earlier), an old
partition can be quickly (and independently) removed from the sales table. These
two benefits (reduced resources utilization and minimal end-user impact) are just as
pertinent to removing a partition as they are to adding a partition.
Removing data from a partitioned table does not necessarily mean that the old data
is physically deleted from the database. There are two alternatives for removing old
data from a partitioned table:
You can physically delete all data from the database by dropping the partition
containing the old data, thus freeing the allocated space:
ALTER TABLE sales DROP PARTITION sales_01_1998;
You can exchange the old partition with an empty table of the same structure; this
empty table is created in the same way as steps 1 and 2 of the load process described
earlier. Note that the old data still exists in the exchanged, nonpartitioned table
sales_archive_01_1998, as shown in the following statement.
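A sketch of this exchange:
ALTER TABLE sales EXCHANGE PARTITION sales_01_1998 WITH TABLE sales_archive_01_1998
  INCLUDING INDEXES WITHOUT VALIDATION;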
If the partitioned table was set up so that every partition is stored in a separate
tablespace, you can archive (or transport) this table using Oracle’s
transportable tablespace framework before dropping the actual data (the
tablespace). See "Transportation Using Transportable Tablespaces" on page 12-3 for
further details regarding transportable tablespaces.
In some situations, you might not want to drop the old data immediately, but keep
it as part of the partitioned table; although the data is no longer of main interest,
there are still potential queries accessing this old, read-only data. You can use
Oracle’s data compression to minimize the space usage of the old data. We also
assume that at least one compressed partition is already part of the partitioned
table.
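A sketch of compressing such an old partition in place (moving the segment makes the local indexes of that partition unusable, so they are rebuilt afterwards):
ALTER TABLE sales MOVE PARTITION sales_01_1998 COMPRESS;
ALTER TABLE sales MODIFY PARTITION sales_01_1998 REBUILD UNUSABLE LOCAL INDEXES;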
Refresh Scenarios
A typical scenario might not only need to compress old data, but also to merge
several old partitions to reflect the granularity of a later backup of several merged
partitions. Let’s assume that the backup (partition) granularity is on a quarterly
basis for any quarter where the oldest month is more than 36 months behind the
most recent month. In this case, we are therefore compressing and merging
sales_01_1998, sales_02_1998, and sales_03_1998 into a new, compressed
partition sales_q1_1998.
1. Create the new merged partition in parallel in another tablespace. The partition
will be compressed as part of the MERGE operation:
ALTER TABLE sales MERGE PARTITIONS sales_01_1998, sales_02_1998, sales_03_1998
INTO PARTITION sales_q1_1998 TABLESPACE archive_q1_1998
COMPRESS UPDATE GLOBAL INDEXES PARALLEL 4;
2. The partition MERGE operation invalidates the local indexes for the new merged
partition. We therefore have to rebuild them:
ALTER TABLE sales MODIFY PARTITION sales_q1_1998 REBUILD UNUSABLE LOCAL
INDEXES;
Alternatively, you can choose to create the new compressed data segment outside
the partitioned table and exchange it back. The performance and the temporary
space consumption is identical for both methods:
1. Create an intermediate table to hold the new merged information. The
following statement inherits all NOT NULL constraints from the origin table by
default:
CREATE TABLE sales_q1_1998_out TABLESPACE archive_q1_1998 NOLOGGING COMPRESS
PARALLEL 4 AS SELECT * FROM sales
WHERE time_id >= TO_DATE('01-JAN-1998','dd-mon-yyyy')
AND time_id < TO_DATE('01-APR-1998','dd-mon-yyyy');
2. Create the equivalent index structure for table sales_q1_1998_out as for
the existing table sales.
3. Prepare the existing table sales for the exchange with the new compressed table
sales_q1_1998_out. Because the table to be exchanged contains data actually
covered by three partitions, we have to create one matching partition that has
the range boundaries we are looking for. You simply have to drop two of the
existing partitions. Note that you have to drop the lower two partitions,
sales_01_1998 and sales_02_1998; the lower boundary of a range
partition is always defined by the upper (exclusive) boundary of the previous
partition:
ALTER TABLE sales DROP PARTITION sales_01_1998;
ALTER TABLE sales DROP PARTITION sales_02_1998;
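4. Exchange the remaining partition, which now covers the whole quarter, with the compressed table; a sketch:
ALTER TABLE sales EXCHANGE PARTITION sales_03_1998 WITH TABLE sales_q1_1998_out
  INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;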
Both methods apply to slightly different business scenarios: Using the MERGE
PARTITION approach invalidates the local index structures for the affected
partition, but it keeps all data accessible all the time. Any attempt to access the
affected partition through one of the unusable index structures raises an error. The
limited availability time is approximately the time for re-creating the local bitmap
index structures. In most cases this can be neglected, since this part of the
partitioned table shouldn’t be touched too often.
The CTAS approach, however, minimizes the unavailability of any index structures
to close to zero, but there is a specific time window where the partitioned table does
not have all the data, because we dropped two partitions. The limited availability
time is approximately the time for exchanging the table. Depending on the existence
and number of global indexes, this time window varies. Without any existing global
indexes, this time window is a matter of a fraction of a second to a few seconds.
Refresh Scenario 1
Data is loaded daily. However, the data warehouse contains two years of data, so
that partitioning by day might not be desired.
Solution: Partition by week or month (as appropriate). Use INSERT to add the new
data to an existing partition. The INSERT operation only affects a single partition,
so the benefits described previously remain intact. The INSERT operation could
occur while the partition remains a part of the table. Inserts into a single partition
can be parallelized:
INSERT /*+ APPEND*/ INTO sales PARTITION (sales_01_2001)
SELECT * FROM new_sales;
Refresh Scenario 2
New data feeds, although consisting primarily of data for the most recent day,
week, and month, also contain some data from previous time periods.
Solution 1: Use parallel SQL operations (such as CREATE TABLE ... AS SELECT) to
separate the new data from the data in previous time periods. Process the old data
separately using other techniques.
New data feeds are not solely time based. You can also feed new data into a data
warehouse with data from multiple operational systems on a business need basis.
For example, the sales data from direct channels may come into the data warehouse
separately from the data from indirect channels. For business reasons, it may
furthermore make sense to keep the direct and indirect data in separate partitions.
Solution 2: Oracle supports composite range-list partitioning. The primary
partitioning strategy of the sales table could be range partitioning based on time_
id as shown in the example. However, the subpartitioning is a list based on the
channel attribute. Each subpartition can now be loaded independently of each other
(for each distinct channel) and added in a rolling window operation as discussed
before. The partitioning strategy addresses the business needs in the most optimal
manner.
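A sketch of such a composite range-list partitioned sales table (the column list, channel values, and partition bounds are illustrative):
CREATE TABLE sales
 (prod_id NUMBER, cust_id NUMBER, time_id DATE, channel_id CHAR(1),
  promo_id NUMBER, quantity_sold NUMBER, amount_sold NUMBER(10,2))
PARTITION BY RANGE (time_id)
SUBPARTITION BY LIST (channel_id)
SUBPARTITION TEMPLATE
 (SUBPARTITION direct   VALUES ('S'),
  SUBPARTITION indirect VALUES ('T', 'C', 'I', 'P'))
(PARTITION sales_01_2001 VALUES LESS THAN (TO_DATE('01-FEB-2001','DD-MON-YYYY')),
 PARTITION sales_02_2001 VALUES LESS THAN (TO_DATE('01-MAR-2001','DD-MON-YYYY')));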
The new, faster way of merging data is illustrated in Example 14–2 as follows.
As in previous examples, we assume that the new data for the sales table will be
staged in a separate table, new_sales. Using a single INSERT statement (which
can be parallelized), the product table can be altered to reflect the new products:
INSERT INTO product
(SELECT sales_product_id, 'Unknown Product Name', NULL, NULL ...
FROM new_sales WHERE sales_product_id NOT IN
(SELECT product_id FROM product));
Purging Data
Occasionally, it is necessary to remove large amounts of data from a data
warehouse. A very common scenario is the rolling window discussed previously, in
which older data is rolled out of the data warehouse to make room for new data.
However, sometimes other data might need to be removed from a data warehouse.
Suppose that a retail company has previously sold products from MS Software,
and that MS Software has subsequently gone out of business. The business users
of the warehouse may decide that they are no longer interested in seeing any data
related to MS Software, so this data should be deleted.
One approach to removing a large volume of data is to use parallel delete as shown
in the following statement:
DELETE FROM sales WHERE sales_product_id IN
(SELECT product_id
FROM product WHERE product_category = 'MS Software');
This SQL statement will spawn one parallel process for each partition. This
approach will be much more efficient than a serial DELETE statement, and none of
the data in the sales table will need to be moved.
However, this approach also has some disadvantages. When removing a large
percentage of rows, the DELETE statement will leave many empty row-slots in the
existing partitions. If new data is being loaded using a rolling window technique (or
is being loaded using direct-path INSERT or load), then this storage space will not
be reclaimed. Moreover, even though the DELETE statement is parallelized, there
might be more efficient methods. An alternative method is to re-create the entire
sales table, keeping the data for all product categories except MS Software.
CREATE TABLE sales2 NOLOGGING PARALLEL (DEGREE 8) AS
SELECT sales.* FROM sales, product
WHERE sales.sales_product_id = product.product_id
AND product_category <> 'MS Software';
Performing a refresh operation requires temporary space to rebuild the indexes and
can require additional space for performing the refresh operation itself. Some sites
might prefer not to refresh all of their materialized views at the same time: as soon
as some underlying detail data has been updated, all materialized views using this
data will become stale. Therefore, if you defer refreshing your materialized views,
you can either rely on your chosen rewrite integrity level to determine whether or
not a stale materialized view can be used for query rewrite, or you can temporarily
disable query rewrite with an ALTER SYSTEM SET QUERY_REWRITE_ENABLED =
false statement. After refreshing the materialized views, you can re-enable query
rewrite as the default for all sessions in the current database instance by specifying
ALTER SYSTEM SET QUERY_REWRITE_ENABLED as true. Refreshing a
materialized view automatically updates all of its indexes. In the case of full refresh,
this requires temporary sort space to rebuild all indexes during refresh. This is
because the full refresh truncates or deletes the table before inserting the new full
data volume. If insufficient temporary space is available to rebuild the indexes, then
you must explicitly drop each index or mark it UNUSABLE prior to performing the
refresh operation.
If you anticipate performing insert, update or delete operations on tables referenced
by a materialized view concurrently with the refresh of that materialized view, and
that materialized view includes joins and aggregation, Oracle recommends you use
ON COMMIT fast refresh rather than ON DEMAND fast refresh.
Complete Refresh
A complete refresh occurs when the materialized view is initially defined as BUILD
IMMEDIATE, unless the materialized view references a prebuilt table. For
materialized views using BUILD DEFERRED, a complete refresh must be requested
before it can be used for the first time. A complete refresh may be requested at any
time during the life of any materialized view. The refresh involves reading the detail
tables to compute the results for the materialized view. This can be a very
time-consuming process, especially if there are huge amounts of data to be read and
processed. Therefore, you should always consider the time required to process a
complete refresh before requesting it.
However, there are cases when the only refresh method available for an already
built materialized view is complete refresh because the materialized view does not
satisfy the conditions specified in the following section for a fast refresh.
Fast Refresh
Most data warehouses have periodic incremental updates to their detail data. As
described in "Materialized View Schema Design" on page 8-8, you can use
SQL*Loader or any bulk load utility to perform incremental loads of detail data.
Fast refresh of your materialized views is usually efficient, because instead of
having to recompute the entire materialized view, the changes are applied to the
existing data. Thus, processing only the changes can result in a very fast refresh
time.
ON COMMIT Refresh
A materialized view can be refreshed automatically using the ON COMMIT method.
Therefore, whenever a transaction commits which has updated the tables on which
a materialized view is defined, those changes will be automatically reflected in the
materialized view. The advantage of using this approach is you never have to
remember to refresh the materialized view. The only disadvantage is the time
required to complete the commit will be slightly longer because of the extra
processing involved. However, in a data warehouse, this should not be an issue
because there is unlikely to be concurrent processes trying to update the same table.
Three refresh procedures are available in the DBMS_MVIEW package for performing
ON DEMAND refresh. Each has its own unique set of parameters.
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
detailed information about the DBMS_MVIEW package and Oracle9i
Replication explains how to use it in a replication environment
Multiple materialized views can be refreshed at the same time, and they do not all
have to use the same refresh method. To give them different refresh methods,
specify multiple method codes in the same order as the list of materialized views
(without commas). For example, the following specifies that cal_month_sales_
mv be completely refreshed and fweek_pscat_sales_mv receive a fast refresh.
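Such a call might look like the following sketch; the method string 'CF' is applied positionally, C (complete) to the first view and F (fast) to the second:
EXECUTE DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV, FWEEK_PSCAT_SALES_MV', 'CF');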
If the refresh method is not specified, the default refresh method as specified in the
materialized view definition will be used.
To obtain the list of materialized views that are directly dependent on a given object
(table or materialized view), use the procedure DBMS_MVIEW.GET_MV_
DEPENDENCIES to determine the dependent materialized views for a given table,
or for deciding the order to refresh nested materialized views.
DBMS_MVIEW.GET_MV_DEPENDENCIES(mvlist IN VARCHAR2, deplist OUT VARCHAR2)
The input to this function is the name or names of the materialized view. The
output is a comma separated list of the materialized views that are defined on it.
For example, the following call:
GET_MV_DEPENDENCIES('JOHN.SALES_REG, SCOTT.PROD_TIME', deplist)
This populates deplist with the list of materialized views defined on the input
arguments. For example:
deplist <= "JOHN.SUM_SALES_WEST, JOHN.SUM_SALES_EAST, SCOTT.SUM_PROD_MONTH".
Monitoring a Refresh
While a job is running, you can query the V$SESSION_LONGOPS view to tell you
the progress of each materialized view being refreshed.
SELECT * FROM V$SESSION_LONGOPS;
Include all columns from the table likely to be used in materialized views in the
materialized view logs.
Fast refresh may be possible even if the SEQUENCE option is omitted from the
materialized view log. If it can be determined that only inserts or deletes will
occur on all the detail tables, then the materialized view log does not require the
SEQUENCE clause. However, if updates to multiple tables are likely or required
or if the specific update scenarios are unknown, make sure the SEQUENCE
clause is included.
■ Use Oracle's bulk loader utility or direct-path INSERT (INSERT with the
APPEND hint for loads).
This is a lot more efficient than conventional insert. During loading, disable all
constraints and re-enable when finished loading. Note that materialized view
logs are required regardless of whether you use direct load or conventional
DML.
Try to optimize the sequence of conventional mixed DML operations,
direct-path INSERT and the fast refresh of materialized views. You can use fast
refresh with a mixture of conventional DML and direct loads. Fast refresh can
perform significant optimizations if it finds that only direct loads have
occurred, as illustrated in the following:
1. Direct-path INSERT (SQL*Loader or INSERT /*+ APPEND */) into the
detail table
2. Refresh materialized view
3. Conventional mixed DML
4. Refresh materialized view
You can use fast refresh with conventional mixed DML (INSERT, UPDATE, and
DELETE) to the detail tables. However, fast refresh will be able to perform
significant optimizations in its processing if it detects that only inserts or deletes
have been done to the tables, such as:
■ DML INSERT or DELETE to the detail table
■ Refresh materialized views
■ DML update to the detail table
■ Refresh materialized view
It is even better to separate INSERT and DELETE operations.
If possible, refresh should be performed after each type of data change (as
shown earlier) rather than issuing only one refresh at the end. If that is not
possible, restrict the conventional DML to the table to inserts only, to get much
better refresh performance. Avoid mixing deletes and direct loads.
Furthermore, for refresh ON COMMIT, Oracle keeps track of the type of DML
done in the committed transaction. Therefore, do not perform direct-path
INSERT and DML to other tables in the same transaction, as Oracle may not be
able to optimize the refresh phase.
For ON COMMIT materialized views, where refreshes automatically occur at the
end of each transaction, it may not be possible to isolate the DML statements, in
which case keeping the transactions short will help. However, if you plan to
make numerous modifications to the detail table, it may be better to perform
them in one transaction, so that refresh of the materialized view will be
performed just once at commit time rather than after each update.
■ Oracle recommends partitioning the tables because it enables you to use:
■ Parallel DML
For large loads or refresh, enabling parallel DML will help shorten the
length of time for the operation.
■ Partition Change Tracking (PCT) fast refresh
You can refresh your materialized views fast after partition maintenance
operations on the detail tables. See "Partition Change Tracking" on page 8-35 for
details on enabling PCT for materialized views.
Partitioning the materialized view will also help refresh performance as refresh
can update the materialized view using parallel DML. For example, assume
that the detail tables and materialized view are partitioned and have a parallel
clause. The following sequence would enable Oracle to parallelize the refresh of
the materialized view.
1. Bulk load into the detail table
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML
statement
3. Refresh the materialized view
Also, Oracle recommends that the refresh be invoked after each table is loaded,
rather than load all the tables and then perform the refresh.
For refresh ON COMMIT, Oracle keeps track of the type of DML done in the
committed transaction. Oracle therefore recommends that you do not perform
direct-path and conventional DML to other tables in the same transaction because
Oracle may not be able to optimize the refresh phase. For example, the following is
not recommended:
1. Direct load new data into the fact table
2. DML into the store table
3. Commit
Also, try not to mix different types of conventional DML statements if possible. This
would again prevent using various optimizations during fast refresh. For example,
try to avoid the following:
1. Insert into the fact table
2. Delete from the fact table
3. Commit
If many updates are needed, try to group them all into one transaction because
refresh will be performed just once at commit time, rather than after each update.
When you use the DBMS_MVIEW package to refresh a number of materialized views
containing only joins with the ATOMIC parameter set to TRUE, refresh performance
may degrade if parallel DML is disabled.
In a data warehousing environment, assuming that the materialized view has a
parallel clause, the following sequence of steps is recommended:
1. Bulk load into the fact table
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML statement
3. Refresh the materialized view
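This sequence might look like the following sketch, assuming a materialized view named sum_sales_mv (the name is illustrative):
-- after the bulk load into the fact table
ALTER SESSION ENABLE PARALLEL DML;
EXECUTE DBMS_MVIEW.REFRESH('SUM_SALES_MV', 'F');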
Consider the schema in Figure 8–3. Assume all the materialized views are defined
for ON COMMIT refresh. If table sales changes, then, at commit time, join_
sales_cust_time would refresh first, and then sum_sales_cust_time and
join_sales_cust_time_prod. No specific order would apply for sum_sales_
cust_time and join_sales_cust_time_prod as they do not have any
dependencies between them.
In other words, Oracle builds a partially ordered set of materialized views and
refreshes them such that, after the successful completion of the refresh, all the
materialized views are fresh. The status of the materialized views can be checked by
querying the appropriate USER_, DBA_, or ALL_MVIEWS view.
If any of the materialized views are defined as ON DEMAND refresh (irrespective of
whether the refresh method is FAST, FORCE, or COMPLETE), you will need to refresh
them in the correct order (taking into account the dependencies between the
materialized views) because the nested materialized view will be refreshed with
respect to the current contents of the other materialized views (whether fresh or
not).
If a refresh fails during commit time, the list of materialized views that has not been
refreshed is written to the alert log, and you must manually refresh them along with
all their dependent materialized views.
Use the same DBMS_MVIEW procedures on nested materialized views that you use
on regular materialized views.
These procedures have the following behavior when used with nested materialized
views:
■ If REFRESH is applied to a materialized view my_mv that is built on other
materialized views, then my_mv will be refreshed with respect to the current
contents of the other materialized views (that is, they will not be made fresh
first).
■ If REFRESH_DEPENDENT is applied to materialized view my_mv, then only
materialized views that directly depend on my_mv will be refreshed (that is, a
materialized view that depends on a materialized view that depends on my_mv
will not be refreshed).
■ If REFRESH_ALL_MVIEWS is used, the order in which the materialized views
will be refreshed is not guaranteed.
■ GET_MV_DEPENDENCIES provides a list of the immediate (or direct)
materialized view dependencies for an object.
If the materialized view is being refreshed using the ON COMMIT method, then,
following refresh operations, consult the alert log alert_SID.log and the trace
file ora_SID_number.trc to check that no errors have occurred.
The following examples will illustrate the use of this feature. In "PCT Fast Refresh
Scenario 1", assume sales is a partitioned table using the time_id column and
products is partitioned by the prod_category column. The table times is not a
partitioned table.
As can be seen from the partial sample output from EXPLAIN_MVIEW, any
partition maintenance operation performed on the sales table will allow PCT
fast refresh. However, PCT is not possible after partition maintenance
operations or updates to the products table as there is insufficient information
contained in cust_mth_sales_mv for PCT refresh to be possible. Note that
the times table is not partitioned and hence can never allow for PCT refresh.
Oracle will apply PCT refresh if it can determine that the materialized view has
sufficient information to support PCT for all the updated tables.
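The EXPLAIN_MVIEW output referred to here can be generated as in the following sketch (results go to the MV_CAPABILITIES_TABLE created by the utlxmv.sql script):
EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW('SH.CUST_MTH_SALES_MV');
SELECT capability_name, possible, related_text, msgtxt
FROM   mv_capabilities_table
WHERE  capability_name LIKE '%PCT%';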
4. Suppose at some later point, a SPLIT operation of one partition in the sales
table becomes necessary.
ALTER TABLE SALES
SPLIT PARTITION month3 AT (TO_DATE('05-02-1998', 'DD-MM-YYYY'))
INTO (
PARTITION month3_1
TABLESPACE summ,
PARTITION month3
TABLESPACE summ
);
Fast refresh will automatically do a PCT refresh as it is the only fast refresh
possible in this scenario. However, fast refresh will not occur if a partition
maintenance operation occurs when any update has taken place to a table on
which PCT is not enabled. This is shown in "PCT Fast Refresh Scenario 2".
"PCT Fast Refresh Scenario 1" would also be appropriate if the materialized view
was created using the PMARKER clause as illustrated in the following.
CREATE MATERIALIZED VIEW cust_sales_marker_mv
BUILD IMMEDIATE
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT DBMS_MVIEW.PMARKER(s.rowid) s_marker,
SUM(s.quantity_sold), SUM(s.amount_sold),
p.prod_name, t.calendar_month_name, COUNT(*),
COUNT(s.quantity_sold), COUNT(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND
s.prod_id = p.prod_id
GROUP BY DBMS_MVIEW.PMARKER(s.rowid),
p.prod_name, t.calendar_month_name;
6. Refresh cust_mth_sales_mv.
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F',
'',TRUE,FALSE,0,0,0,FALSE);
ORA-12052: cannot fast refresh materialized view SH.CUST_MTH_SALES_MV
The materialized view is not fast refreshable because DML has occurred to a table
on which PCT fast refresh is not possible. To avoid this occurring, Oracle
recommends performing a fast refresh immediately after any partition maintenance
operation on detail tables for which partition tracking fast refresh is available.
If the situation in "PCT Fast Refresh Scenario 2" occurs, there are two possibilities:
perform a complete refresh or switch to the CONSIDER FRESH option outlined in
the following, if suitable. However, it should be noted that CONSIDER FRESH and
partition change tracking fast refresh are not compatible. Once the ALTER
MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH statement has
been issued, PCT refresh will no longer be applied to this materialized view until a
complete refresh is done.
A common situation in a warehouse is the use of rolling windows of data. In this
case, the detail table and the materialized view may contain, say, the last 12 months
of data. Every month, new data for a month is added to the table and the oldest
month is deleted (or maybe archived). PCT refresh provides a very efficient
mechanism to maintain the materialized view in this case.
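For example, the monthly maintenance that precedes the refresh in the next step
typically consists of partition operations of the following shape (the partition
names and the date bound are illustrative):
ALTER TABLE sales ADD PARTITION month13
   VALUES LESS THAN (TO_DATE('01-02-1999', 'DD-MM-YYYY'))
   TABLESPACE summ;
ALTER TABLE sales DROP PARTITION month1;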
3. Now, assuming the materialized view satisfies all conditions for PCT refresh,
refresh it as follows:
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F', '',
TRUE, FALSE,0,0,0,FALSE);
Fast refresh will automatically detect that PCT is available and perform a PCT
refresh.
■ Use CONSIDER FRESH to declare that the materialized view has been refreshed.
ALTER MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH;
Exercise care when using CONSIDER FRESH in this scenario in conjunction with
query rewrite because you may see unexpected results.
After using CONSIDER FRESH in an historical scenario, you will be able to apply
traditional fast refresh after DML and direct loads to the materialized view, but not
PCT fast refresh. This is because if the detail table partition at one time contained
data that is currently kept in aggregated form in the materialized view, PCT refresh
in attempting to resynchronize the materialized view with that partition could
delete historical data which cannot be recomputed.
Assume the sales table stores the prior year's data and the cust_mth_sales_mv
keeps the prior 10 years of data in aggregated form.
1. Remove old data from a partition in the sales table:
ALTER TABLE sales TRUNCATE PARTITION month1;
The materialized view is now considered stale and requires a refresh because
of the partition operation. However, because the detail table no longer contains
all the data associated with the partition, fast refresh cannot be attempted.
2. Therefore, alter the materialized view to tell Oracle to consider it fresh.
ALTER MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH;
Because the fast refresh detects that only INSERT statements occurred against
the sales table, it will update the materialized view with the new data.
However, the status of the materialized view will remain UNKNOWN. The only
way to return the materialized view to FRESH status is with a complete refresh,
which will also remove the historical data from the materialized view.
Change Data Capture efficiently identifies and captures data that has been added to,
updated, or removed from, Oracle relational tables, and makes the change data
available for use by applications. Change Data Capture is provided as an Oracle
database server component with Oracle9i.
This chapter introduces Change Data Capture in the following sections:
■ About Change Data Capture
■ Installation and Implementation
■ Security
■ Columns in a Change Table
■ Change Data Capture Views
■ Synchronous Mode of Data Capture
■ Publishing Change Data
■ Managing Change Tables and Subscriptions
■ Subscribing to Change Data
■ Export and Import Considerations
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
more information about the Change Data Capture publish and
subscribe PL/SQL packages.
Table 15–1 Database Extraction With and Without Change Data Capture

Extraction
With Change Data Capture: Database extraction from INSERT, UPDATE, and
DELETE operations occurs immediately, at the same time the changes occur to
the source tables.
Without Change Data Capture: Database extraction is marginal at best for
INSERT operations, and problematic for UPDATE and DELETE operations,
because the data is no longer in the table.

Staging
With Change Data Capture: Stages data directly to relational tables; there is
no need to use flat files.
Without Change Data Capture: The entire contents of tables are moved into
flat files.

Interface
With Change Data Capture: Provides an easy-to-use publish and subscribe
interface using the DBMS_LOGMNR_CDC_PUBLISH and
DBMS_LOGMNR_CDC_SUBSCRIBE packages.
Without Change Data Capture: Error prone and manpower intensive to
administer.

Cost
With Change Data Capture: Supplied with the Oracle9i (and later) database
server. Reduces overhead cost by simplifying the extraction of change data.
Without Change Data Capture: Expensive because you must write and maintain
the capture software yourself, or purchase it from a third-party vendor.
Publisher
The publisher is usually a database administrator (DBA) who is in charge of
creating and maintaining schema objects that make up the Change Data Capture
system. The publisher performs these tasks:
■ Determines the relational tables (called source tables) from which the data
warehouse application is interested in capturing change data.
■ Uses the Oracle supplied package, DBMS_LOGMNR_CDC_PUBLISH, to set up the
system to capture data from one or more source tables.
■ Publishes the change data in the form of change tables.
■ Allows controlled access to subscribers by using the SQL GRANT and REVOKE
statements to grant and revoke the SELECT privilege on change tables for users
and roles.
Subscribers
The subscribers, usually applications, are consumers of the published change data.
Subscribers subscribe to one or more sets of columns in source tables. Subscribers
perform the following tasks:
■ Use the Oracle supplied package, DBMS_LOGMNR_CDC_SUBSCRIBE, to
subscribe to source tables for controlled access to the published change data for
analysis.
■ Extend the subscription window and create a new subscriber view when the
subscriber is ready to receive a set of change data.
■ Use SELECT statements to retrieve change data from the subscriber views.
■ Drop the subscriber view and purge the subscription window when finished
processing a block of changes.
■ Drop the subscription when the subscriber no longer needs its change data.
Figure 15–1 Publish and Subscribe Model in a Change Data Capture System
For example, assume that the change tables in Figure 15–1 contain all of the
changes that occurred between Monday and Friday, and also assume that:
■ Subscriber 1 is viewing and processing data from Tuesday.
■ Subscriber 2 is viewing and processing data from Wednesday to Thursday.
Subscribers 1 and 2 each have a unique subscription window that contains a block
of transactions. Change Data Capture manages the subscription window for each
subscriber by creating a subscriber view that returns a range of transactions of
interest to that subscriber. The subscriber accesses the change data by performing
SELECT statements on the subscriber view that was generated by Change Data
Capture.
When a subscriber needs to read additional change data, the subscriber makes
procedure calls to extend the window and to create a new subscriber view. Each
subscriber can walk through the data at its own pace, while Change Data Capture
manages the data storage. As each subscriber finishes processing the data in its
subscription window, it calls procedures to drop the subscriber view and purge the
contents of the subscription window. Extending and purging windows is necessary
to prevent the change table from growing indefinitely, and to prevent the subscriber
from seeing the same data again.
Thus, Change Data Capture provides the following benefits for subscribers:
■ Guarantees that each subscriber sees all of the changes, does not miss any
changes, and does not see the same change data more than once.
■ Keeps track of multiple subscribers and gives each subscriber shared access to
change data.
■ Handles all of the storage management, automatically removing data from
change tables when it is no longer required by any of the subscribers.
[Figure 15–1 shows source tables in operational databases feeding the Change Data
Capture system: changes flow through the SYNC_SOURCE change source into change
tables in the SYNC_SET change set, and each subscriber accesses a subset of the
change table columns through its own subscriber view.]
Source System
A source system is a production database that contains source tables for which
Change Data Capture will capture changes.
Source Table
A source table is a database table that resides on the source system that contains the
data you want to capture. Changes made to the source table are immediately
reflected in the change table.
Change Source
A change source represents a source system. There is a system-generated change
source named SYNC_SOURCE.
Change Set
A change set represents the collection of change tables. There is a system-generated
change set named SYNC_SET.
Change Table
A change table contains the change data resulting from DML statements made to a
single source table. A change table consists of two things: the change data itself,
which is stored in a database table, and the system metadata necessary to maintain
the change table. A given change table can capture changes from only one source
table. In addition to published columns, the change table contains control columns
that are managed by Change Data Capture. See "Columns in a Change Table" on
page 15-9 for more information.
Publication
A publication provides a way for publishers to publish multiple change tables on
the same source table, and control subscriber access to the published change data.
For example, Publication A consists of a change table that contains all the columns
from the EMPLOYEE source table, while Publication B contains all the columns
except the salary column from the EMPLOYEE source table. Because each change
table is a separate publication, the publisher can implement security on the salary
column by allowing only selected subscribers to access Publication A.
Subscriber View
A subscriber view is a view created by Change Data Capture that returns all of the
rows in the subscription window. In Figure 15–2, the subscribers have created two
views: one on columns 7 and 8 of Source Table 3 and one on columns 4, 6, and 8 of
Source Table 4. The columns included in the view are based on the actual columns
that the subscribers subscribed to in the source table.
Subscription Window
A subscription window defines the time range of change rows that the subscriber
can currently see. The oldest row in the window is the low watermark; the newest
row in the window is the high watermark. Each subscriber has a subscription
window.
Security
You grant privileges for a change table separately from the privileges you grant for
a source table. For example, a subscriber that has privileges to perform a SELECT
operation on a source table might not have privileges to perform a SELECT
operation on a change table.
The publisher controls subscribers' access to change data by using the SQL GRANT
and REVOKE statements to grant and revoke the SELECT privilege on change tables
for users and roles. The publisher must grant the SELECT privilege before a user or
application can subscribe to the change table.
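For example, assuming the publisher owns a change table emp_ct in the schema
cdcpub and wants to let the user subscriber1 subscribe to it (all three names are
illustrative):
GRANT SELECT ON cdcpub.emp_ct TO subscriber1;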
The publisher must not grant any DML access (using either the INSERT, UPDATE, or
DELETE statements) to the subscribers on the change tables because of the risk that
a subscriber might inadvertently change the data in the change table, making it
inconsistent with its source. Furthermore, the publisher should avoid creating
change tables in schemas to which users have DML access.
Step 2: Create the Change Tables that will Contain the Changes
You need to create the change tables that will contain the changes to individual
source tables. Use the DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE
procedure to create change tables.
Create a change table for each source table to be published, and decide which
columns should be included. For update operations, decide whether to capture old
values, new values, or both.
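A call of the following shape creates such a change table. This is a sketch: the
publisher schema cdcpub, the scott.emp source table, and the column list are
illustrative, and the full set of control-column arguments is documented in the
Oracle9i Supplied PL/SQL Packages and Types Reference.
EXECUTE DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE ( \
OWNER => 'cdcpub',\
CHANGE_TABLE_NAME => 'emp_ct',\
CHANGE_SET_NAME => 'SYNC_SET',\
SOURCE_SCHEMA => 'scott',\
SOURCE_TABLE => 'emp',\
COLUMN_TYPE_LIST => 'empno NUMBER, ename VARCHAR2(10), job VARCHAR2(9), sal NUMBER',\
CAPTURE_VALUES => 'both',\
RS_ID => 'y',\
ROW_ID => 'n',\
USER_ID => 'n',\
TIMESTAMP => 'n',\
OBJECT_ID => 'n',\
SOURCE_COLMAP => 'y',\
TARGET_COLMAP => 'y',\
OPTIONS_STRING => NULL);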
This statement creates a change table named emp_ct within the change set SYNC_
SET. The column_type_list parameter identifies the columns captured by the
change table. The source_schema and source_table parameters identify the
schema and source table that reside on the production system.
The capture_values setting in the example indicates that for UPDATE operations,
the change data will contain two separate rows for each row that changed: one row
will contain the row values before the update occurred, and the other row will
contain the row values after the update occurred.
If the publisher really wants to drop the change table in spite of active
subscriptions, the DROP_CHANGE_TABLE procedure must be called with the parameter
FORCE => 'Y'. This tells Change Data Capture to override its normal safeguards and
allow the change table to be dropped despite active subscriptions. The subscriptions
will no longer be valid, and subscribers will lose access to the change data.
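For example (a sketch; the owner and change table names are illustrative, and the
exact parameter names should be confirmed in the Oracle9i Supplied PL/SQL
Packages and Types Reference):
EXECUTE DBMS_LOGMNR_CDC_PUBLISH.DROP_CHANGE_TABLE('cdcpub', 'emp_ct', 'Y');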
Note: The DROP USER CASCADE statement will drop all of a user's
change tables by using the FORCE => 'Y' option. Therefore, if
any other users have active subscriptions to the (dropped) change
table, these will no longer be valid. In addition to dropping the
user's change tables, DROP USER CASCADE also drops any
subscriptions that were held by that user.
Step 1: Find the Source Tables for which the Subscriber has Access Privileges
Query the ALL_SOURCE_TABLES view to see all of the published source tables for
which the subscriber has access privileges.
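For example, the subscriber can simply query the view:
SELECT * FROM ALL_SOURCE_TABLES;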
The following example shows how the subscriber first names the change set of
interest (SYNC_SET), and then returns a unique subscription handle that will be
used throughout the session.
EXECUTE SYS.DBMS_LOGMNR_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE ( \
CHANGE_SET => 'SYNC_SET',\
DESCRIPTION => 'Change data for emp',\
SUBSCRIPTION_HANDLE => :subhandle);
At this point, the subscriber has created a new window that begins where the
previous window ends. The new window contains any data that was added to the
change table. If no new data has been added, the EXTEND_WINDOW procedure has
no effect. To access the new change data, the subscriber must call the CREATE_
SUBSCRIBER_VIEW procedure, and select from the new subscriber view that is
generated by Change Data Capture.
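These two calls have the following shape, continuing the :subhandle example
above (the source schema and table and the :viewname bind variable are
illustrative; confirm the parameter names in the Oracle9i Supplied PL/SQL
Packages and Types Reference):
VARIABLE viewname VARCHAR2(30);
EXECUTE SYS.DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW ( \
SUBSCRIPTION_HANDLE => :subhandle);
EXECUTE SYS.DBMS_LOGMNR_CDC_SUBSCRIBE.CREATE_SUBSCRIBER_VIEW ( \
SUBSCRIPTION_HANDLE => :subhandle,\
SOURCE_SCHEMA => 'scott',\
SOURCE_TABLE => 'emp',\
VIEW_NAME => :viewname);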
– Take out a dummy subscription to preserve the change table data until real
subscriptions appear. Then, you can drop the dummy subscription.
■ When importing data into a source table for which a change table already exists,
the imported data is also recorded in any associated change tables.
Assume that you have a source table Employees that has an associated change
table CT_Employees. When you import data into Employees, that data is also
recorded in CT_Employees.
■ When importing a source table and its change table to a database where the
tables did not previously exist, Change Data Capture for that source table will
not be established until the import process completes. This protects you from
duplicating activity in the change table.
■ When exporting a source table and its associated change table, and then
importing them into a new instance, the imported source table data is not
recorded in the change table because it is already in the change table.
■ When importing a change table having the optional control ROW_ID column,
the ROW_ID columns stored in the change table have meaning only if the
associated source table has not been imported. If a source table is re-created or
imported, each row will have a new ROW_ID that is unrelated to the ROW_ID
that was previously recorded in a change table.
■ Any time a table is exported from one database and imported to another, there
is a risk that the import target already has tables or objects with the same name.
Moving a change table to a different database where a table exists that has the
same name as the source table may result in import errors.
■ If you need to move a synchronous change table or its source table, then move
both tables together and check the import log for error messages.
This chapter illustrates how to use the Summary Advisor, a tool for choosing and
understanding materialized views. The chapter contains:
■ Overview of the Summary Advisor in the DBMS_OLAP Package
■ Using the Summary Advisor
■ Estimating Materialized View Size
■ Is a Materialized View Being Used?
■ Summary Advisor Wizard
[Figure: the Summary Advisor collects workload information from the Oracle9i SQL
cache, from an Oracle Trace log formatted by the Oracle Trace Manager, and from
the warehouse's materialized views and dimensions; workload collection is
optional.]
The Summary Advisor uses four types of schema objects, some of which are defined
in the user's schema and some of which are in the system schema:
■ User schema
For both V-tables and workload tables, before the workload is available to the
recommendation process, it must be loaded into the advisor workload
repository.
■ V-tables
V-tables are generated by Oracle Trace for storing results of formatting
server-collected trace. Please note that these V-tables are different from the
V$ tables.
■ Workload tables
Workload tables are user tables that store workload information, and can
reside in any schema.
■ System schema
■ Result tables
Result tables are internal tables that store both intermediate and final
results from all Summary Advisor components.
■ Read-only views
Read-only views allow you to access recommendations, filters and
workloads. These views are MVIEW_RECOMMENDATIONS, MVIEW_
EVALUATIONS, MVIEW_FILTER, and MVIEW_WORKLOAD.
Whenever the Summary Advisor is run, the results, with the exception of
estimated size, are placed in internal tables, which can be accessed from
read-only views in the database. These results can be queried, so you do not
have to keep running the Advisor process.
If you want to view the results of the last materialized view recommendation, you
can issue the following statement:
SELECT MVIEW_OWNER, MVIEW_NAME, RECOMMENDED_ACTION, PCT_PERFORMANCE_GAIN,
BENEFIT_TO_COST_RATIO
FROM SYSTEM.MVIEW_RECOMMENDATIONS
WHERE RUNID= (SELECT MAX(RUNID) FROM SYSTEM.MVIEW_RECOMMENDATIONS)
ORDER BY RECOMMENDATION_NUMBER ASC;
The advisory functions and procedures of the DBMS_OLAP package require you to
gather structural statistics about fact and dimension table cardinalities, and the
distinct cardinalities of every dimension level column, JOIN KEY column, and fact
table key column. You do this by loading your data warehouse, then gathering
either exact or estimated statistics with the DBMS_STATS package or the ANALYZE
TABLE statement. Because gathering statistics is time-consuming and extreme
statistical accuracy is not required, it is generally preferable to estimate statistics.
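For example, estimated statistics for the sales fact table can be gathered as
follows (a sketch; the sampling percentage is illustrative, and the same call
should be repeated for each fact and dimension table):
BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'SH',
      tabname          => 'SALES',
      estimate_percent => 20,
      cascade          => TRUE);
END;
/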
Using information from the system workload table, schema metadata and statistical
information generated by the DBMS_STATS package, the Advisor engine generates
summary recommendations and summary usage evaluations and stores the results
in result tables.
To use the Summary Advisor with a workload, perform some or all of the following
steps:
1. Optionally, obtain an identifier number as a filter ID and define one or more
filter items.
2. Obtain an identifier number as a workload ID and load the workload. If a filter
was defined in step 1, it can be used as the SQL statements are collected from
the workload source to refine them.
3. Call the procedure RECOMMEND_MVIEW_STRATEGY to generate the
recommendations.
These steps can be repeated several times with different workloads to see the effect
on the materialized views.
Identifier Numbers
Most of the DBMS_OLAP procedures require a unique identifier as one of their
parameters. You obtain this by calling the procedure CREATE_ID, which is
illustrated in the following section.
DBMS_OLAP.CREATE_ID Procedure
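For example, following the pattern used in the sample sessions later in this
chapter:
VARIABLE MY_ID NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:MY_ID);
The value returned in :MY_ID is then passed to the other DBMS_OLAP calls as a
workload, filter, or run identifier.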
Workload Management
The Advisor performs best when a workload based on usage is available. The
Advisor Workload Repository is capable of storing multiple workloads, so that the
different uses of a real-world data warehousing environment can be viewed over a
long period of time and across the life cycle of database instance startup and
shutdown.
To facilitate wider use of the Summary Advisor, three types of workload are
supported:
■ Current contents of the SQL cache
■ Oracle Trace collection
■ User-specified workload
When the workload is loaded using the appropriate load_workload procedure, it
is stored in a new workload repository in the SYSTEM schema called MVIEW_
WORKLOAD whose format is shown in Table 16–2. A specific workload can be
removed by calling the PURGE_WORKLOAD routine and passing it a valid workload
ID. To remove all workloads for the current user, call PURGE_WORKLOAD and pass
the constant value DBMS_OLAP.WORKLOAD_ALL.
Once the workload has been collected using the appropriate LOAD_WORKLOAD
routine, a filter mechanism may also be applied; this lets you specify the
portion of the workload that is to be loaded into the repository.
DBMS_OLAP.LOAD_WORKLOAD_USER Procedure
The actual workload is defined in a separate table and the two parameters
owner_name and table_name describe where it is stored. There is no restriction on which
schema the workload resides in, the name for the table, or how many of these
user-defined tables exist. The only restriction is that the format of the user table
must correspond to the USER_WORKLOAD table, as described in Table 16–4:
3. Insert the queries you want advice on into the workload table.
INSERT INTO advisor_user_workload VALUES
(
'SELECT SUM(s.quantity_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id AND p.prod_category = ''Boys''
GROUP BY p.prod_category', 'SH', 'app1', 10, NULL, 5, NULL, NULL)
DBMS_OLAP.LOAD_WORKLOAD_TRACE Procedure
Oracle Trace collects two types of data. One is a duration event which causes a data
item to be collected twice: once at the start of the operation and once at the end of
the operation. The duration of the data item is the difference between the start and
end of the operation. For example, execution time is collected as a duration event. It
first collects the clock time when the operation starts. Then it collects the clock time
when the operation ends. Execution time is calculated by subtracting the start time
from the end time.
A point event is a static data item that doesn't change over time. For example, an
owner name is a static data item that would be the same at the start and the end of
an operation.
To collect, analyze and load the summary event set, you must do the following:
1. Set six initialization parameters to collect data using Oracle Trace. Enabling
these parameters incurs some additional overhead at database connection, but
is otherwise transparent.
■ ORACLE_TRACE_COLLECTION_NAME = oraclesm or oraclee
ORACLEE is the Oracle Expert collection which contains Summary Advisor
data and additional data that is only used by Oracle Expert.
ORACLESM is the Summary Advisor collection that contains only Summary
Advisor data and is the preferred collection type.
■ ORACLE_TRACE_COLLECTION_PATH = location of collection files
■ ORACLE_TRACE_COLLECTION_SIZE = 0
■ ORACLE_TRACE_ENABLE = TRUE
■ ORACLE_TRACE_FACILITY_NAME = oraclesm or oraclee
■ ORACLE_TRACE_FACILITY_PATH = location of trace facility files
2. Run the Oracle Trace Manager, specify a collection name, and select the
SUMMARY_EVENT set. Oracle Trace Manager reads information from the
associated configuration file and registers events to be logged with Oracle.
While collection is enabled, the workload information defined in the event set
gets written to a flat log file.
3. When collection is complete, Oracle Trace automatically formats the Oracle
Trace log file into a set of relations, which have the predefined synonyms
beginning with V_192216243_. Alternatively, the collection file, which usually
has an extension of .CDF, can be formatted manually using the otrcfmt utility,
as shown in this example:
otrcfmt collection_name.cdf user/password@database
DBMS_OLAP.LOAD_WORKLOAD_CACHE Procedure
Validating a Workload
Prior to loading a workload, you can call one of the three VALIDATE_WORKLOAD
procedures to check that the workload exists:
■ VALIDATE_WORKLOAD_USER
■ VALIDATE_WORKLOAD_CACHE
■ VALIDATE_WORKLOAD_TRACE
These procedures do not check that the contents of the workload are valid; they
merely check that the workload exists.
DECLARE
isitgood NUMBER;
err_text VARCHAR2(200);
BEGIN
DBMS_OLAP.VALIDATE_WORKLOAD_TRACE ('SH', isitgood, err_text);
END;
DECLARE
isitgood NUMBER;
err_text VARCHAR2(200);
BEGIN
DBMS_OLAP.VALIDATE_WORKLOAD_USER ('SH', 'USER_WORKLOAD', isitgood, err_text);
END;
Removing a Workload
When workloads are no longer needed, they can be removed using the procedure
PURGE_WORKLOAD. You can delete all workloads or a specific collection.
DBMS_OLAP.PURGE_WORKLOAD Procedure
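For example, you can purge a specific workload or all workloads (mirroring the
PURGE_FILTER example later in this chapter):
VARIABLE MY_WORKLOAD_ID NUMBER;
EXECUTE DBMS_OLAP.PURGE_WORKLOAD(:MY_WORKLOAD_ID);
EXECUTE DBMS_OLAP.PURGE_WORKLOAD(DBMS_OLAP.WORKLOAD_ALL);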
DBMS_OLAP.ADD_FILTER_ITEM Procedure
The Advisor supports ten different filter item types. For each filter item, Oracle
stores an attribute that tells the Advisor how to apply the selection rule. For
example, an APPLICATION item requires a string attribute that can be either a
single name, as in GREG, or a comma-delimited list of names like GREG, ROSE,
KALLIE, HANNAH. For a single name, the Advisor accepts a workload query only if
the application name exactly matches the supplied name. For a list of names, the
query's application name must appear in the list. Referring to this example, a
query whose application name is GREG would match either a single application
filter item containing GREG or the list GREG, ROSE, KALLIE, HANNAH. Conversely,
a query whose application is KALLIE will only match the filter item list
GREG, ROSE, KALLIE, HANNAH.
For numeric filter items such as CARDINALITY, the attribute represents a possible
range of values. The Advisor determines whether the filter item represents a
bounded range such as 500 to 1000000, or an exact match such as 1000 to 1000.
When a range value is specified as NULL, the value is treated as infinitely small
or large, depending upon which attribute is set.
Date filters, such as LASTUSE, behave similarly to numeric filters except that
the Advisor treats the range test as two dates. A value of NULL indicates infinity.
You can define a number of different types of filter as shown in Table 16–9.
When dealing with a workload, the client can optionally attach a filter to reduce or
refine the set of target SQL statements. If no filter is attached, then all target SQL
statements will be collected or used.
A new filter can be created with the CREATE_ID call. Filter items can be added to
the filter by using the ADD_FILTER_ITEM call. When a filter is created, an entry is
stored in the read-only view SYSTEM.MVIEW_FILTER.
The following is an example illustrating how to add three different types of filter
items.
1. Declare an output variable to receive the new identifier.
VARIABLE MY_ID NUMBER;
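Then create a new identifier and add the filter items to it. The following calls
are a sketch that matches the three filter items described next; the same :MY_ID
is reused for all three items.
EXECUTE DBMS_OLAP.CREATE_ID(:MY_ID);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID, 'BASETABLE',
   'SCOTT.EMP', NULL, NULL, NULL, NULL);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID, 'OWNER',
   'SCOTT, PAYROLL, PERSONNEL', NULL, NULL, NULL, NULL);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID, 'FREQUENCY', NULL, 500, NULL,
   NULL, NULL);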
This example defines a filter with three filter items. The first filter item will
only allow queries that reference the table SCOTT.EMP. The second item will accept
queries that were executed by one of the users SCOTT, PAYROLL, or PERSONNEL.
Finally, the third filter item accepts queries that execute at least 500 times.
Note that all filter items must match for a single query to be accepted. If any of
the items fail to match, then the query will not be accepted.
In the previous example, the three filter items are applied together as a single
filter. However, each filter item could have been created with its own unique
filter ID, thus creating three different filters, as illustrated in the following:
VARIABLE MY_ID NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:MY_ID);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID,'BASETABLE',
'SCOTT.EMP', NULL, NULL, NULL, NULL);
EXECUTE DBMS_OLAP.CREATE_ID(:MY_ID);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID, 'OWNER',
'SCOTT, PAYROLL, PERSONNEL', NULL, NULL, NULL, NULL);
EXECUTE DBMS_OLAP.CREATE_ID(:MY_ID);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:MY_ID, 'FREQUENCY', NULL, 500,NULL,
NULL,NULL);
Removing a Filter
A filter can be removed at any time by calling the procedure PURGE_FILTER, which
is described in the following table. You can delete a specific filter or all filters. You
can remove all filters using the purge_filter call by specifying DBMS_
OLAP.FILTER_ALL as the filter ID.
DBMS_OLAP.PURGE_FILTER Procedure
DBMS_OLAP.PURGE_FILTER Example
VARIABLE MY_FILTER_ID NUMBER;
EXECUTE DBMS_OLAP.PURGE_FILTER(:MY_FILTER_ID);
EXECUTE DBMS_OLAP.PURGE_FILTER(DBMS_OLAP.FILTER_ALL);
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
detailed information about the DBMS_OLAP package
The results from calling this package are put in the table SYSTEM.MVIEW_
RECOMMENDATIONS shown in Table 16–12. The output can be queried directly using
the MVIEW_RECOMMENDATIONS view, or a structured report can be generated using
the DBMS_OLAP.GENERATE_MVIEW_REPORT procedure.
DECLARE
   workload_id NUMBER;
   run_id      NUMBER;
BEGIN
   -- load the workload
   DBMS_OLAP.CREATE_ID(workload_id);
   DBMS_OLAP.LOAD_WORKLOAD_USER(workload_id, DBMS_OLAP.WORKLOAD_NEW,
      DBMS_OLAP.FILTER_NONE, 'SH', 'USER_WORKLOAD');
   -- run recommend_mv
   DBMS_OLAP.CREATE_ID(run_id);
   DBMS_OLAP.RECOMMEND_MVIEW_STRATEGY(run_id, workload_id, NULL, 1000000, 100,
      NULL, 'sales');
END;
To rerun the recommendation step with a larger storage budget (here 10,000,000
bytes), repeat only the recommendation portion of the block:
   -- run recommend_mv
   DBMS_OLAP.CREATE_ID(run_id);
   DBMS_OLAP.RECOMMEND_MVIEW_STRATEGY(run_id, workload_id, NULL, 10000000, 100,
      NULL, 'sales');
END;
The resulting script is an executable SQL file that can contain DROP and CREATE
statements for materialized views. For new materialized views, the names of the
materialized views are auto-generated by combining the user-specified ID and the
Rank value of the materialized views. It is recommended that you review the
generated SQL script before attempting to execute it.
The filename specification requires the same security model as described in the
GENERATE_MVIEW_REPORT routine.
*****************************************************************************/
/*****************************************************************************
** Rank 1
** Storage 0 bytes
** Gain 0.00%
** Benefit Ratio 0.00
** SELECT COUNT(*), AVG(dollar_cost)
** FROM sales
** GROUP BY store_key
*****************************************************************************/
/*****************************************************************************
** Rank 2
** Storage 6,000 bytes
** Gain 13.00%
** Benefit Ratio 874.00
*****************************************************************************/
/*****************************************************************************
** Rank 3
** Storage 6,000 bytes
** Gain 76.00%
** Benefit Ratio 8,744.00
**
** SELECT COUNT(*), MAX(dollar_cost), MIN(dollar_cost)
** FROM sh.sales
** WHERE store_key IN (10, 23)
** AND unit_sales > 5000
** GROUP BY store_key, promotion_key
*****************************************************************************/
REFRESH COMPLETE
ENABLE QUERY REWRITE AS
SELECT COUNT(*), MAX(dollar_cost), MIN(dollar_cost) FROM sh.sales
WHERE store_key IN (10,23) AND unit_sales > 5000 GROUP BY
store_key, promotion_key;
Because of the Oracle security model, report output file directories must be granted
read and write permission prior to executing this call. The call is described in
Oracle9i Java Developer’s Guide and is as follows:
EXECUTE DBMS_JAVA.GRANT_PERMISSION('Oracle-user-goes-here',
'java.io.FilePermission', 'directory-spec-goes-here/*', 'read, write');
In this table, xxxx is the filename portion of the user-supplied file specification.
All files appear in the same directory, which is the one you specify.
DBMS_OLAP.PURGE_RESULTS Procedure
DBMS_OLAP.SET_CANCELLED Procedure
Sample Session 1
REM***************************************************************
REM * Demo 1: Materialized View Recommendation With User Workload*
REM***************************************************************
REM===============================================================
REM Step 1. Define user workload table and add artificial workload queries.
REM===============================================================
CONNECT sh/sh
CREATE TABLE user_workload(
query VARCHAR2(2000),
owner VARCHAR2(40),
application VARCHAR2(30),
frequency NUMBER,
lastuse DATE,
priority NUMBER,
responsetime NUMBER,
resultsize NUMBER
)
/
INSERT INTO user_workload values
(
'SELECT SUM(s.quantity_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id and p.prod_category = ''Boys''
GROUP BY p.prod_category', 'SH', 'app1', 10, NULL, 5, NULL, NULL
)
/
INSERT INTO user_workload values
(
'SELECT SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id AND
p.prod_category = ''Girls''
GROUP BY p.prod_category',
'SH', 'app1', 10, NULL, 6, NULL, NULL
)
/
INSERT INTO user_workload values
(
'SELECT SUM(quantity_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id and
p.prod_category = ''Men''
GROUP BY p.prod_category
',
'SH', 'app1', 11, NULL, 3, NULL, NULL
)
/
INSERT INTO user_workload VALUES
(
'SELECT SUM(quantity_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id and
p.prod_category in (''Women'', ''Men'')
GROUP BY p.prod_category ', 'SH', 'app1', 1, NULL, 8, NULL, NULL
)
/
REM===================================================================
REM Step 2. Create a new identifier to identify a new collection in the
REM internal repository and load the user-defined workload into the
REM workload collection without filtering the workload.
REM=======================================================================
VARIABLE WORKLOAD_ID NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:workload_id);
EXECUTE DBMS_OLAP.LOAD_WORKLOAD_USER(:workload_id,\
DBMS_OLAP.WORKLOAD_NEW,\
DBMS_OLAP.FILTER_NONE, 'SH', 'USER_WORKLOAD');
SELECT COUNT(*) FROM SYSTEM.MVIEW_WORKLOAD
WHERE workloadid = :workload_id;
REM====================================================================
REM Step 3. Create a new identifier to identify a new filter object. Add
REM two filter items so that the filter selects workload
REM queries with priority >= 5 and frequency <= 10.
REM=====================================================================
VARIABLE filter_id NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:filter_id);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:filter_id, 'PRIORITY',
NULL, 5, NULL, NULL, NULL);
EXECUTE DBMS_OLAP.ADD_FILTER_ITEM(:filter_id, 'FREQUENCY', NULL,
NULL, 10, NULL, NULL);
SELECT COUNT(*) FROM SYSTEM.MVIEW_FILTER
WHERE filterid = :filter_id;
REM=====================================================================
REM Step 4. Recommend materialized views using the part of the previous
REM workload collection that satisfies the filter conditions. Create a new
REM identifier for the run and generate the recommendations.
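REM A possible set of statements for this step, following the pattern of the
REM other sessions in this chapter (the 100,000-byte storage budget is
REM illustrative):
REM=====================================================================
VARIABLE RUN_ID NUMBER;
EXECUTE DBMS_OLAP.CREATE_ID(:RUN_ID);
EXECUTE DBMS_OLAP.RECOMMEND_MVIEW_STRATEGY(:RUN_ID, :WORKLOAD_ID, :FILTER_ID,
   100000, 100, NULL, NULL);
SELECT COUNT(*) FROM SYSTEM.MVIEW_RECOMMENDATIONS;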
REM===================================================================
REM Step 5. Generate HTML reports on the output.
REM===================================================================
EXECUTE DBMS_OLAP.GENERATE_MVIEW_REPORT('/tmp/output1.html', :run_id,
   DBMS_OLAP.RPT_RECOMMENDATION);
REM====================================================================
REM Step 6. Clean up the current output, filter, and workload collection
REM from the internal repository, and truncate the user workload table
REM for new user workloads.
REM====================================================================
EXECUTE DBMS_OLAP.PURGE_RESULTS(:run_id);
EXECUTE DBMS_OLAP.PURGE_FILTER(:filter_id);
EXECUTE DBMS_OLAP.PURGE_WORKLOAD(:workload_id);
SELECT COUNT(*) FROM SYSTEM.MVIEW_WORKLOAD
WHERE workloadid = :WORKLOAD_ID;
TRUNCATE TABLE user_workload;
Sample Session 2
REM*******************************************************************
REM * Demo 2: Materialized View Recommendation With SQL Cache. *
REM*******************************************************************
CONNECT sh/sh
VARIABLE WORKLOAD_ID NUMBER;
VARIABLE RUN_ID NUMBER;
REM===================================================================
REM Step 1. Run some applications or some SQL queries, so that the
REM Oracle SQL Cache is populated with target queries.
REM===================================================================
REM Clear the pool of SQL queries, then run some queries such as:
SELECT SUM(s.quantity_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_category;
SELECT SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_category;
REM====================================================================
REM Step 2. Create a new identifier to identify a new collection in the
REM internal repository and grab a snapshot of the Oracle SQL cache
REM into the new collection.
REM====================================================================
EXECUTE DBMS_OLAP.CREATE_ID(:WORKLOAD_ID);
EXECUTE DBMS_OLAP.LOAD_WORKLOAD_CACHE(:WORKLOAD_ID,
DBMS_OLAP.WORKLOAD_NEW, DBMS_OLAP.FILTER_NONE, NULL, 1);
SELECT COUNT(*) FROM SYSTEM.MVIEW_WORKLOAD
WHERE workloadid = :WORKLOAD_ID;
REM====================================================================
REM Step 3. Recommend materialized views using all of the workload
REM and no filtering.
REM=====================================================================
EXECUTE DBMS_OLAP.CREATE_ID(:RUN_ID);
EXECUTE DBMS_OLAP.RECOMMEND_MVIEW_STRATEGY(:RUN_ID, :WORKLOAD_ID,
   DBMS_OLAP.FILTER_NONE, 10000000, 100, NULL, NULL);
SELECT COUNT(*) FROM SYSTEM.MVIEW_RECOMMENDATIONS;
REM===================================================================
REM Step 4. Generate HTML reports on the output.
REM====================================================================
EXECUTE DBMS_OLAP.GENERATE_MVIEW_REPORT('/tmp/output2.html', :run_id,
DBMS_OLAP.RPT_RECOMMENDATION);
REM====================================================================
If this error occurs, then at least one table or column is missing the required
statistics. To determine which object has missing statistics, issue the following
statement:
SELECT runid#, text FROM system.mview$_adv_journal;
To avoid missing critical workload queries, the current database user must have
select privileges on the tables that are targeted for materialized view analysis.
Moreover, these select privileges cannot be obtained through a role.
ESTIMATE_MVIEW_SIZE Parameters
Table 16–17 ESTIMATE_MVIEW_SIZE Procedure Parameters
Parameter Description
stmt_id Arbitrary string used to identify the statement in an EXPLAIN
PLAN
select_clause The SELECT statement to be analyzed
num_rows Estimated cardinality
num_bytes Estimated number of bytes
In the following example, the query specified in the materialized view is passed
into the ESTIMATE_SUMMARY_SIZE procedure. Note that the SQL statement is
passed in without a semicolon at the end.
DECLARE
   no_of_rows NUMBER;
   mv_size    NUMBER;
BEGIN
   DBMS_OLAP.ESTIMATE_SUMMARY_SIZE ('simple_store',
      'SELECT product_key1, product_key2,
          SUM(dollar_sales) AS sum_dollar_sales,
          SUM(unit_sales) AS sum_unit_sales,
          SUM(dollar_cost) AS sum_dollar_cost,
          SUM(customer_count) AS no_of_customers
       FROM fact GROUP BY product_key1, product_key2', no_of_rows, mv_size);
END;
The procedure returns two values: an estimate for the number of rows, and the size
of the materialized view in bytes, as illustrated in the following.
No of Rows: 17284
Size of Materialized view (bytes): 2281488
DBMS_OLAP.EVALUATE_MVIEW_STRATEGY Procedure
Table 16–18 EVALUATE_MVIEW_STRATEGY Procedure Parameters
Parameter Datatype Description
run_id NUMBER The Advisor-assigned ID for the current session
workload_id NUMBER An optional workload ID that maps to a user-supplied
workload
In the following example, the utilization of materialized views is analyzed and the
results are displayed:
EXECUTE DBMS_OLAP.EVALUATE_MVIEW_STRATEGY(:run_id, NULL, DBMS_OLAP.FILTER_NONE);
All of the steps required to maintain your materialized views can be completed by
answering the Wizard's questions. No subsequent DML operations are required.
You cannot use it to review or delete the recommendations, display the reports, or
purge the workloads or filters.
If there are any materialized views that already exist, the Summary Advisor wizard
shows how much space they are using and asks if they should be retained. Then, it
actually generates its recommendations and the screen shown in Figure 16–4 is
displayed.
The graph shown on the left of the screen shows the calculated gains for these
recommendations. You can slide the marker along the line of the graph, depending on
whether more performance is required or less storage space is to be used.
A set of materialized views will be recommended for that point on the graph. The
actual recommendations are viewed by clicking on the View/Modify
Recommendations button.
Default schema, tablespace and refresh method can be supplied for all
recommendations. Then by pressing the View/Modify Recommendations button,
each recommendation can be accepted or rejected and customized to your own
requirements as to its name and other characteristics as shown in Figure 16–5.
Finally, once you are satisfied with the recommendations, Figure 16–6 is displayed
where you can see the actual script which will be used to implement the
recommendations. At this time, this script can be saved to a file and run later, or, if
the Finish button is clicked, the recommendations are implemented.
Figure 16–7 shows the progress of the process implementing the recommendations.
When finished, the materialized views can now be displayed in Oracle Enterprise
Manager as illustrated in Figure 16–8.
This section deals with ways to improve your data warehouse’s performance, and
contains the following chapters:
■ Schema Modeling Techniques
■ SQL for Aggregation in Data Warehouses
■ SQL for Analysis in Data Warehouses
■ OLAP and Data Mining
■ Using Parallel Execution
■ Query Rewrite
17
Schema Modeling Techniques
[Figure: a third normal form schema with customers, orders, order items, and
products tables.]
Star Schemas
The star schema is perhaps the simplest data warehouse schema. It is called a star
schema because the entity-relationship diagram of this schema resembles a star,
with points radiating from a central table. The center of the star consists of a large
fact table and the points of the star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the
primary information in the data warehouse, and a number of much smaller
dimension tables (or lookup tables), each of which contains information about the
entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join,
but the dimension tables are not joined to each other. The cost-based optimizer
recognizes star queries and generates efficient execution plans for them.
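For example, a typical star query against the sh schema might look like the
following (a sketch built from the columns used in the star transformation
discussion later in this chapter):
SELECT ch.channel_desc, c.cust_state_province, SUM(s.amount_sold) AS sales
FROM sales s, times t, customers c, channels ch
WHERE s.time_id = t.time_id
  AND s.cust_id = c.cust_id
  AND s.channel_id = ch.channel_id
  AND t.calendar_quarter_desc IN ('1999-Q1', '1999-Q2')
  AND c.cust_state_province = 'CA'
  AND ch.channel_desc IN ('Internet', 'Catalog')
GROUP BY ch.channel_desc, c.cust_state_province;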
A typical fact table contains keys and measures. For example, in the sh sample
schema, the fact table, sales, contains the measures quantity_sold, amount,
and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_
id. The dimension tables are customers, times, products, channels, and
promotions. The product dimension table, for example, contains information
about each product number that appears in the fact table.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
The main advantages of star schemas are that they:
■ Provide a direct and intuitive mapping between the business entities being
analyzed by end users and the schema design.
■ Provide highly optimized performance for typical star queries.
■ Are widely supported by a large number of business intelligence tools, which
may anticipate or even require that the data-warehouse schema contain
dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
[Figure: a star schema with the sales fact table (amount_sold, quantity_sold) at
the center and the products, times, customers, and channels dimension tables at
the points.]
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star
schema, and is a type of star schema. It is called a snowflake schema because the
diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the
dimension data has been grouped into multiple tables instead of one large table. For
example, a product dimension table in a star schema might be normalized into a
products table, a product_category table, and a product_manufacturer
table in a snowflake schema. While this saves space, it increases the number of
dimension tables and requires more foreign key joins. The result is more complex
queries and reduced query performance. Figure 17–3 presents a graphical
representation of a snowflake schema.
[Figure 17–3: a snowflake schema in which the sales fact table (amount_sold,
quantity_sold) joins to the products, times, customers, and channels dimensions,
with products further normalized to suppliers and customers to countries.]
When a data warehouse satisfies these conditions, the majority of the star queries
running in the data warehouse will use a query execution strategy known as the
star transformation. The star transformation provides very efficient query
performance for star queries.
Note: Bitmap indexes are available only if you have purchased the
Oracle9i Enterprise Edition. In Oracle9i Standard Edition, bitmap
indexes and star transformation are not available.
Oracle processes this query in two phases. In the first phase, Oracle uses the bitmap
indexes on the foreign key columns of the fact table to identify and retrieve only the
necessary rows from the fact table. That is, Oracle will retrieve the result set from
the fact table using essentially the following query:
SELECT ... FROM sales
WHERE time_id IN
(SELECT time_id FROM times
WHERE calendar_quarter_desc IN('1999-Q1','1999-Q2'))
AND cust_id IN
(SELECT cust_id FROM customers WHERE cust_state_province='CA')
AND channel_id IN
(SELECT channel_id FROM channels WHERE channel_desc IN('Internet','Catalog'));
This is the transformation step of the algorithm, because the original star query has
been transformed into this subquery representation. This method of accessing the
fact table leverages the strengths of Oracle's bitmap indexes. Intuitively, bitmap
indexes provide a set-based processing scheme within a relational database. Oracle
has implemented very fast methods for doing set operations such as AND (an
intersection in standard set-based terminology), OR (a set-based union), MINUS, and
COUNT.
In this star query, a bitmap index on time_id is used to identify the set of all rows
in the fact table corresponding to sales in 1999-Q1. This set is represented as a
bitmap (a string of 1's and 0's that indicates which rows of the fact table are
members of the set).
A similar bitmap is retrieved for the fact table rows corresponding to sales in
1999-Q2. The bitmap OR operation is used to combine this set of Q1 sales with the
set of Q2 sales.
Additional set operations will be done for the customer dimension and the
product dimension. At this point in the star query processing, there are three
bitmaps. Each bitmap corresponds to a separate dimension table, and each bitmap
represents the set of rows of the fact table that satisfy that individual dimension's
constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND
operation. This final bitmap represents the set of rows in the fact table that satisfy
all of the constraints on the dimension table. This is the result set, the exact set of
rows from the fact table needed to evaluate the query. Note that none of the actual
data in the fact table has been accessed. All of these operations rely solely on the
bitmap indexes and the dimension tables. Because of the bitmap indexes'
compressed data representations, the bitmap set-based operations are extremely
efficient.
Once the result set is identified, the bitmap is used to access the actual data from the
sales table. Only those rows that are required for the end user's query are retrieved
from the fact table. At this point, Oracle has effectively joined all of the dimension
tables to the fact table using bitmap indexes. This technique provides excellent
performance because Oracle is joining all of the dimension tables to the fact table
with one logical join operation, rather than joining each dimension table to the fact
table independently.
The second phase of this query is to join these rows from the fact table (the result
set) to the dimension tables. Oracle will use the most efficient method for accessing
and joining the dimension tables. Many dimension tables are very small, and table scans
are typically the most efficient access method for these dimension tables. For large
dimension tables, table scans may not be the most efficient access method. In the
previous example, a bitmap index on product.department can be used to
quickly identify all of those products in the grocery department. Oracle's cost-based
optimizer automatically determines which access method is most appropriate for a
given dimension table, based upon the cost-based optimizer's knowledge about the
sizes and data distributions of each dimension table.
The specific join method (as well as indexing method) for each dimension table will
likewise be intelligently determined by the cost-based optimizer. A hash join is
often the most efficient algorithm for joining the dimension tables. The final answer
is returned to the user once all of the dimension tables have been joined. The query
technique of retrieving only the matching rows from one table and then joining to
another table is commonly known as a semi-join.
In this plan, the fact table is accessed through a bitmap access path based on a
bitmap AND, of three merged bitmaps. The three bitmaps are generated by the
BITMAP MERGE row source being fed bitmaps from row source trees underneath it.
Each such row source tree consists of a BITMAP KEY ITERATION row source which
fetches values from the subquery row source tree, which in this example is a full
table access. For each such value, the BITMAP KEY ITERATION row source retrieves
the bitmap from the bitmap index. After the relevant fact table rows have been
retrieved using this access path, they are joined with the dimension tables and
temporary tables to produce the answer to the query.
The processing of the same star query using the bitmap join index is similar to the
previous example. The only difference is that Oracle will utilize the join index,
instead of a single-table bitmap index, to access the customer data in the first phase
of the star query.
The difference between this plan as compared to the previous one is that the inner
part of the bitmap index scan for the customer dimension has no subselect. This is
because the join predicate information on customer.cust_state_province
can be satisfied with the bitmap join index sales_c_state_bjix.
Based on a comparison of the cost estimates for the best plans of the two
versions of the query, the optimizer then decides whether to use the transformed
or the untransformed version.
If the query requires accessing a large percentage of the rows in the fact table, it
might be better to use a full table scan and not use the transformations. However, if
the constraining predicates on the dimension tables are sufficiently selective that
only a small portion of the fact table needs to be retrieved, the plan based on the
transformation will probably be superior.
Note that the optimizer generates a subquery for a dimension table only if it decides
that it is reasonable to do so based on a number of criteria. There is no guarantee
that subqueries will be generated for all dimension tables. The optimizer may also
decide, based on the properties of the tables and the query, that the transformation
does not merit being applied to a particular query. In this case the best regular plan
will be used.
[Figure: a data cube with Product, Market, and Time dimensions.]
You can retrieve slices of data from the cube. These correspond to cross-tabular
reports such as the one shown in Table 18–1. Regional managers might study the
data by comparing slices of the cube applicable to different markets. In contrast,
product managers might compare slices that apply to different products. An ad hoc
user might work with a wide variety of constraints, working in a subset cube.
Answering multidimensional questions often involves accessing and querying huge
quantities of data, sometimes in millions of rows. Because the flood of detailed data
generated by large organizations cannot be interpreted at the lowest level,
aggregated views of the information are essential. Aggregations, such as sums and
counts, across many dimensions are vital to multidimensional analyses. Therefore,
analytical tasks require convenient and efficient data aggregation.
Optimized Performance
Not only multidimensional issues, but all types of processing can benefit from
enhanced aggregation facilities. Transaction processing, financial, and
manufacturing systems all generate large numbers of production reports.
An Aggregate Scenario
To illustrate the use of the GROUP BY extension, this chapter uses the sh data of the
sample schema. All the examples refer to data from this scenario. The hypothetical
company has sales across the world and tracks sales by both dollars and
quantities. Because there are many rows of data, the queries shown here typically
have tight constraints on their WHERE clauses to limit the results to a small number
of rows.
Consider that even a simple report such as this, with just nine values in its grid,
generates four subtotals and a grand total. The subtotals are the shaded numbers.
Half of the values needed for this report would not be calculated with a query that
used a simple GROUP BY; the subtotal and grand total rows require either additional
queries or the ROLLUP and CUBE extensions described in this chapter.
CHANNEL_DESC CO SALES$
-------------------- -- --------------
Direct Sales UK 1,378,126
Direct Sales US 2,835,557
Direct Sales 4,213,683
Internet UK 911,739
Internet US 1,732,240
Internet 2,643,979
UK 2,289,865
US 4,567,797
6,857,662
The action of ROLLUP is straightforward: it creates subtotals that roll up from the
most detailed level to a grand total, following a grouping list specified in the
ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns.
First, it calculates the standard aggregate values specified in the GROUP BY clause.
Then, it creates progressively higher-level subtotals, moving from right to left
through the list of grouping columns. Finally, it creates a grand total.
ROLLUP creates subtotals at n+1 levels, where n is the number of grouping columns.
For instance, if a query specifies ROLLUP on grouping columns of time, region,
and department (n=3), the result set will include rows at four aggregation levels.
You might want to compress your data when using ROLLUP. This is particularly
useful when there are few updates to older partitions.
See Also: Oracle9i SQL Reference for data compression syntax and
restrictions
ROLLUP Syntax
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT … GROUP BY ROLLUP(grouping_column_reference_list)
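For example, using the sh data from the aggregate scenario (a sketch; the
countries join and the country_iso_code column follow the sh sample schema and
may need adjusting for your schema):
SELECT ch.channel_desc, co.country_iso_code,
       TO_CHAR(SUM(s.amount_sold), '9,999,999,999') AS sales$
FROM sales s, customers c, countries co, channels ch
WHERE s.cust_id = c.cust_id
  AND c.country_id = co.country_id
  AND s.channel_id = ch.channel_id
GROUP BY ROLLUP(ch.channel_desc, co.country_iso_code);
This query returns the detail rows, a subtotal for each channel, and a grand total.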
Partial Rollup
You can also roll up so that only some of the sub-totals will be included. This partial
rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause creates subtotals at (2+1=3) aggregation levels.
That is, at level (expr1, expr2, expr3), (expr1, expr2), and (expr1).
CUBE is typically most suitable in queries that use columns from multiple
dimensions rather than columns representing different levels of a single dimension.
For instance, a commonly requested cross-tabulation might need subtotals for all
the combinations of month, state, and product. These are three independent
dimensions, and analysis of all possible subtotal combinations is commonplace. In
contrast, a cross-tabulation showing all possible combinations of year, month, and
day would have several values of limited interest, because there is a natural
hierarchy in the time dimension. Subtotals such as profit by day of month summed
across year would be unnecessary in most analyses. Relatively few users need to
ask "What were the total sales for the 16th of each month across the year?" See
"Hierarchy Handling in ROLLUP and CUBE" on page 18-28 for an example of
handling rollup calculations efficiently.
CUBE Syntax
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT … GROUP BY CUBE (grouping_column_reference_list)
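For example, replacing ROLLUP with CUBE in the same kind of query also produces a
subtotal for each country in addition to the channel subtotals and the grand total
(a sketch; the countries join follows the sh sample schema):
SELECT ch.channel_desc, co.country_iso_code,
       TO_CHAR(SUM(s.amount_sold), '9,999,999,999') AS sales$
FROM sales s, customers c, countries co, channels ch
WHERE s.cust_id = c.cust_id
  AND c.country_id = co.country_id
  AND s.channel_id = ch.channel_id
GROUP BY CUBE(ch.channel_desc, co.country_iso_code);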
Partial CUBE
Partial CUBE resembles partial ROLLUP in that you can limit it to certain dimensions
and precede it with columns outside the CUBE operator. In this case, subtotals of all
possible combinations are limited to the dimensions within the cube list (in
parentheses), and they are combined with the preceding items in the GROUP BY list.
GROUPING Functions
Two challenges arise with the use of ROLLUP and CUBE. First, how can you
programmatically determine which result set rows are subtotals, and how do you
find the exact level of aggregation for a given subtotal? You often need to use
subtotals in calculations such as percent-of-totals, so you need an easy way to
determine which rows are the subtotals. Second, what happens if query results
contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE?
How can you differentiate between the two?
GROUPING Function
GROUPING handles these problems. Using a single column as its argument,
GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE
operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1.
Any other type of value, including a stored NULL, returns a 0.
GROUPING Syntax
GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT … [GROUPING(dimension_column)…] …
GROUP BY … {CUBE | ROLLUP| GROUPING SETS} (dimension_column)
A program can easily identify the detail rows by a mask of "0 0 0" on the T, R, and D
columns. The first level subtotal rows have a mask of "0 0 1", the second level
subtotal rows have a mask of "0 1 1", and the overall total row has a mask of "1 1 1".
You can improve the readability of result sets by using the GROUPING and DECODE
functions as shown in Example 18–7.
To understand the previous statement, note its first column specification, which
handles the channel_desc column. Consider the first line of the previous
statement:
SELECT DECODE(GROUPING(channel_desc), 1, 'All Channels', channel_desc)AS Channel
CHANNEL_DESC C CO SALES$ CH MO CO
-------------------- - -- -------------- --------- --------- ---------
UK 4,554,487 1 1 0
US 9,370,256 1 1 0
Direct Sales 8,510,440 0 1 1
Internet 5,414,303 0 1 1
13,924,743 1 1 1
Compare the result set of Example 18–8 with that in Example 18–3 on page 18-9 to
see how Example 18–8 is a precisely specified group: it contains only the yearly
totals, regional totals aggregated over time and department, and the grand total.
GROUPING_ID Function
To find the GROUP BY level of a particular row, a query must return GROUPING
function information for each of the GROUP BY columns. If we do this using the
GROUPING function, every GROUP BY column requires another column using the
GROUPING function. For instance, a four-column GROUP BY clause needs to be
analyzed with four GROUPING functions. This is inconvenient to write in SQL and
increases the number of columns required in the query. When you want to store the
query result sets in tables, as with materialized views, the extra columns waste
storage space.
To address these problems, Oracle9i introduces the GROUPING_ID function.
GROUPING_ID returns a single number that enables you to determine the exact
GROUP BY level. For each row, GROUPING_ID takes the set of 1’s and 0’s that would
be generated if you used the appropriate GROUPING functions and concatenates
them, forming a bit vector. The bit vector is treated as a binary number, and the
number’s base-10 value is returned by the GROUPING_ID function. For instance, if
you group with the expression CUBE(a, b) the possible values are as shown in
Table 18–2.
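The following sketch shows the correspondence (the table t, its columns a and b, and the measure x are placeholders):
SELECT a, b, SUM(x),
   GROUPING(a) AS ga, GROUPING(b) AS gb,
   GROUPING_ID(a, b) AS gid
FROM t
GROUP BY CUBE(a, b);
At aggregation level (a, b) the bits are 0 and 0, so GROUPING_ID is 0; at level (a) they are 0 and 1, so GROUPING_ID is 1; at level (b) they are 1 and 0, so GROUPING_ID is 2; and for the grand total they are 1 and 1, so GROUPING_ID is 3.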
GROUP_ID Function
While the extensions to GROUP BY offer power and flexibility, they also allow
complex result sets that can include duplicate groupings. The GROUP_ID function
lets you distinguish among duplicate groupings. If there are multiple sets of rows
calculated for a given level, GROUP_ID assigns the value of 0 to all the rows in the
first set. All other sets of duplicate rows for a particular grouping are assigned
higher values, starting with 1. For example, consider the following query, which
generates a duplicate grouping:
This statement computes all 8 (2*2*2) groupings, though only the previous 3
groups are of interest to you.
Another alternative is the following statement, which is lengthy due to several
unions. This statement requires three scans of the base table, making it inefficient.
CUBE and ROLLUP can be thought of as grouping sets with very specific semantics.
For example, consider the following statement:
CUBE(a, b, c)
ROLLUP(a, b, c)
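Expressed as grouping sets, CUBE(a, b, c) is equivalent to
GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ())
and ROLLUP(a, b, c) is equivalent to
GROUPING SETS ((a, b, c), (a, b), (a), ())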
In the absence of an optimizer that looks across query blocks to generate the
execution plan, a query based on UNION would need multiple scans of the base
table, sales. This could be very inefficient as fact tables will normally be huge. Using
GROUPING SETS statements, all the groupings of interest are available in the same
query block.
Composite Columns
A composite column is a collection of columns that are treated as a unit during the
computation of groupings. You specify the columns in parentheses as in the
following statement:
ROLLUP (year, (quarter, month), day)
In this statement, the data is not rolled up across year and quarter, but is instead
equivalent to the following groupings of a UNION ALL:
■ (year, quarter, month, day),
■ (year, quarter, month),
■ (year)
■ ()
Here, (quarter, month) form a composite column and are treated as a unit. In
general, composite columns are useful in ROLLUP, CUBE, GROUPING SETS, and
concatenated groupings. For example, in CUBE or ROLLUP, composite columns
would mean skipping aggregation across certain levels. Consider the following
statement:
GROUP BY ROLLUP(a, (b, c))
Here, (b, c) is treated as a unit and rollup is not applied across (b, c). It is
as if you have an alias, for example z, for (b, c) and the GROUP BY expression
reduces to GROUP BY ROLLUP(a, z). Compare this with the normal rollup as in the
following:
GROUP BY ROLLUP(a, b, c)
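The two forms produce different sets of groupings:
GROUP BY ROLLUP(a, b, c)    -- groupings: (a, b, c), (a, b), (a), ()
GROUP BY ROLLUP(a, (b, c))  -- groupings: (a, b, c), (a), ()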
Concatenated Groupings
Concatenated groupings offer a concise way to generate useful combinations of
groupings. Groupings specified with concatenated groupings yield the
cross-product of groupings from each grouping set. The cross-product operation
enables even a small number of concatenated groupings to generate a large number
of final groups. The concatenated groupings are specified simply by listing multiple
grouping sets, cubes, and rollups, and separating them with commas. Here is an
example of concatenated grouping sets:
GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
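The cross-product of the two grouping set lists yields four groups, so the previous clause is equivalent to:
GROUP BY GROUPING SETS((a, c), (a, d), (b, c), (b, d))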
The ROLLUPs in the GROUP BY specification generate the following groups, four for
each dimension.
The concatenated grouping sets specified in the previous SQL will take the ROLLUP
aggregations listed in the table and perform a cross-product on them. The
cross-product will create the 96 (4x4x6) aggregate groups needed for a hierarchical
cube of the data. There are major advantages in using three ROLLUP expressions to
replace what would otherwise require 96 grouping set expressions: the concise SQL
is far less error-prone to develop and far easier to maintain, and it enables much
better query optimization. You can picture how a cube with more dimensions and
more levels would make the use of concatenated groupings even more
advantageous.
Note that this example could also be performed efficiently using the reporting
aggregate functions described in Chapter 19, "SQL for Analysis in Data
Warehouses".
The following topics provide information about how to improve analytical SQL
queries in a data warehouse:
■ Overview of SQL for Analysis in Data Warehouses
■ Ranking Functions
■ Windowing Aggregate Functions
■ Reporting Aggregate Functions
■ LAG/LEAD Functions
■ FIRST/LAST Functions
■ Linear Regression Functions
■ Inverse Percentile Functions
■ Hypothetical Rank and Distribution Functions
■ WIDTH_BUCKET Function
■ User-Defined Aggregate Functions
■ CASE Expressions
To perform these operations, the analytic functions add several new elements to
SQL processing. These elements build on existing SQL to allow flexible and
powerful calculation expressions. With just a few exceptions, the analytic functions
have these new elements. The processing flow is represented in Figure 19–1.
[Figure 19–1 shows the processing flow: joins, WHERE, GROUP BY, and HAVING clauses are processed first; partitions are then created and the analytic functions are applied to each row in each partition; a final ORDER BY produces the output.]
■ Processing order
Query processing using analytic functions takes place in stages. First, all joins
and all WHERE, GROUP BY, and HAVING clauses are performed. Second, the result
set is made available to the analytic functions, and all their calculations take
place. Finally, if the query has an ORDER BY clause at its end, that clause is
processed to allow for precise output ordering. The processing order is shown
in Figure 19–1.
■ Result set partitions
The analytic functions allow users to divide query result sets into groups of
rows called partitions. Note that the term partitions used with analytic
functions is unrelated to Oracle's table partitions feature. Throughout this
chapter, the term partitions refers to only the meaning related to analytic
functions. Partitions are created after the groups defined with GROUP BY
clauses, so they are available to any aggregate results such as sums and
averages. Partition divisions may be based upon any desired columns or
expressions. A query result set may be partitioned into just one partition
holding all the rows, a few large partitions, or many small partitions holding
just a few rows each.
■ Window
For each row in a partition, you can define a sliding window of data. This
window determines the range of rows used to perform the calculations for the
current row. Window sizes can be based on either a physical number of rows or
a logical interval such as time. The window has a starting row and an ending
row. Depending on its definition, the window may move at one or both ends.
For instance, a window defined for a cumulative sum function would have its
starting row fixed at the first row of its partition, and its ending row would
slide from the starting point all the way to the last row of the partition. In
contrast, a window defined for a moving average would have both its starting
and end points slide so that they maintain a constant physical or logical range.
A window can be set as large as all the rows in a partition or just a sliding
window of one row within a partition. When a window is near a border, the
function returns results for only the available rows, rather than warning you
that the results are not what you want.
When using window functions, the current row is included during calculations,
so you should only specify (n-1) when you are dealing with n items.
■ Current row
Each calculation performed with an analytic function is based on a current row
within a partition. The current row serves as the reference point determining
the start and end of the window. For instance, a centered moving average
calculation could be defined with a window that holds the current row, the six
preceding rows, and the following six rows. This would create a sliding
window of 13 rows, as shown in Figure 19–2.
[Figure 19–2 illustrates such a sliding window, marking the window start and window finish around the current row.]
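As a sketch of such a centered window (the sales and times tables from this chapter's other examples are assumed), the following query computes a 13-day centered moving average of daily sales:
SELECT t.time_id, SUM(amount_sold) AS daily_sales,
   AVG(SUM(amount_sold)) OVER (ORDER BY t.time_id
      ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING) AS centered_13_day_avg
FROM sales s, times t
WHERE s.time_id=t.time_id AND
   t.calendar_month_desc='1999-12'
GROUP BY t.time_id;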
Ranking Functions
A ranking function computes the rank of a record compared to other records in the
dataset based on the values of a set of measures. The types of ranking function are:
■ RANK and DENSE_RANK
■ CUME_DIST and PERCENT_RANK
■ NTILE
■ ROW_NUMBER
The difference between RANK and DENSE_RANK is that DENSE_RANK leaves no gaps
in ranking sequence when there are ties. That is, if you were ranking a competition
using DENSE_RANK and had three people tie for second place, you would say that
all three were in second place and that the next person came in third. The RANK
function would also give three people in second place, but the next person would
be in fifth place.
The following are some relevant points about RANK:
■ Ascending is the default sort order, which you may want to change to
descending.
■ The expressions in the optional PARTITION BY clause divide the query result
set into groups within which the RANK function operates. That is, RANK gets
reset whenever the group changes. In effect, the value expressions of the
PARTITION BY clause define the reset boundaries.
■ If the PARTITION BY clause is missing, then ranks are computed over the entire
query result set.
■ The ORDER BY clause specifies the measures (<value expression>s) on which
ranking is done and defines the order in which rows are sorted in each group
(or partition). Once the data is sorted within each partition, ranks are given to
each row starting from 1.
■ The NULLS FIRST | NULLS LAST clause indicates the position of NULLs in the
ordered sequence, either first or last in the sequence. The order of the sequence
would make NULLs compare either high or low with respect to non-NULL
values. If the sequence were in ascending order, then NULLS FIRST implies that
NULLs are smaller than all other non-NULL values and NULLS LAST implies
they are larger than non-NULL values. It is the opposite for descending order.
See the example in "Treatment of NULLs" on page 19-11.
■ If the NULLS FIRST | NULLS LAST clause is omitted, then the ordering of the
null values depends on the ASC or DESC arguments. Null values are considered
larger than any other values. If the ordering sequence is ASC, then nulls will
appear last; nulls will appear first otherwise. Nulls are considered equal to
other nulls and, therefore, the order in which nulls are presented is
non-deterministic.
Ranking Order
The following example shows how the [ASC | DESC] option changes the ranking
order.
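A minimal sketch of such a query, assuming the sales, times, and channels tables used in the other examples of this chapter:
SELECT channel_desc,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
   RANK() OVER (ORDER BY SUM(amount_sold)) AS default_rank,
   RANK() OVER (ORDER BY SUM(amount_sold) DESC NULLS LAST) AS custom_rank
FROM sales, times, channels
WHERE sales.time_id=times.time_id AND
   sales.channel_id=channels.channel_id AND
   times.calendar_month_desc IN ('2000-09', '2000-10')
GROUP BY channel_desc;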
While the data in this result is ordered on the measure SALES$, the RANK function
does not, in general, guarantee that the output will be sorted on the measures. If
you want the data sorted on SALES$ in your result, you must specify this explicitly
with an ORDER BY clause at the end of the SELECT statement.
The sales_count column breaks the ties for three pairs of values.
Note that, in the case of DENSE_RANK, the largest rank value gives the number of
distinct values in the dataset.
A single query block can contain more than one ranking function, each partitioning
the data into different groups (that is, reset on different boundaries). The groups can
be mutually exclusive. The following query ranks products based on their dollar
sales within each month (rank_of_product_per_region) and within each
channel (rank_of_product_total).
Treatment of NULLs
NULLs are treated like normal values. Also, for rank computation, a NULL value is
assumed to be equal to another NULL value. Depending on the ASC | DESC options
provided for measures and the NULLS FIRST | NULLS LAST clause, NULLs will
either sort low or high and hence, are given ranks appropriately. The following
example shows how NULLs are ranked in different cases:
SELECT calendar_year AS YEAR, calendar_quarter_number AS QTR,
calendar_month_number AS MO, SUM(amount_sold),
RANK() OVER (ORDER BY SUM(amount_sold) ASC NULLS FIRST) AS NFIRST,
RANK() OVER (ORDER BY SUM(amount_sold) ASC NULLS LAST) AS NLAST,
RANK() OVER (ORDER BY SUM(amount_sold) DESC NULLS FIRST) AS NFIRST_DESC,
RANK() OVER (ORDER BY SUM(amount_sold) DESC NULLS LAST) AS NLAST_DESC
FROM (
SELECT sales.time_id, sales.amount_sold, products.*, customers.*
FROM sales, products, customers
WHERE
sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND
prod_name IN ('Ruckpart Eclipse', 'Ukko Plain Gortex Boot')
AND country_id ='UK') v, times
WHERE v.time_id (+) =times.time_id AND
calendar_year=1999
GROUP BY calendar_year, calendar_quarter_number, calendar_month_number;
YEAR QTR MO SUM(AMOUNT_SOLD)    NFIRST     NLAST NFIRST_DESC NLAST_DESC
---- --- -- ---------------- --------- --------- ----------- ----------
1999   2  5            27431         7         3          10          6
1999   2  4            20602         6         2          11          7
1999   3  7            15296         5         1          12          8
1999   1  1                          1         9           1          9
1999   4 10                          1         9           1          9
1999   4 11                          1         9           1          9
1999   4 12                          1         9           1          9
If the value for two rows is NULL, the next group expression is used to resolve the
tie. If the tie cannot be resolved even then, the next expression is used, and so on
until the tie is resolved or else the two rows are given the same rank. For example:
Top N Ranking
You can easily obtain top N ranks by enclosing the RANK function in a subquery and
then applying a filter condition outside the subquery. For example, to obtain the top
five countries in sales for a specific month, you can issue the following statement:
SELECT * FROM
(SELECT country_id,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
RANK() OVER (ORDER BY SUM(amount_sold) DESC ) AS COUNTRY_RANK
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND
sales.channel_id=channels.channel_id AND
times.calendar_month_desc='2000-09'
GROUP BY country_id)
WHERE COUNTRY_RANK <= 5;
CO SALES$ COUNTRY_RANK
-- -------------- ------------
US 6,517,786 1
NL 3,447,121 2
UK 3,207,243 3
DE 3,194,765 4
FR 2,125,572 5
Bottom N Ranking
Bottom N is similar to top N except for the ordering sequence within the rank
expression. Using the previous example, you can order SUM(amount_sold) ascending
instead of descending.
CUME_DIST
The CUME_DIST function (defined as the inverse of percentile in some statistical
books) computes the position of a specified value relative to a set of values. The
order can be ascending or descending. Ascending is the default. The range of values
for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value x
in a set S of size N, you use the formula:
CUME_DIST(x) = (number of values in S coming before and including x
                in the specified order) / N
The semantics of various options in the CUME_DIST function are similar to those in
the RANK function. The default order is ascending, implying that the lowest value
gets the lowest CUME_DIST (as all other values come later than this value in the
order). NULLs are treated the same as they are in the RANK function. They are
counted toward both the numerator and the denominator as they are treated like
non-NULL values. The following example finds cumulative distribution of sales by
channel within each month:
SELECT calendar_month_desc AS MONTH, channel_desc,
TO_CHAR(SUM(amount_sold) , '9,999,999,999') SALES$ ,
CUME_DIST() OVER ( PARTITION BY calendar_month_desc ORDER BY
SUM(amount_sold) ) AS
CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND
sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-09', '2000-07','2000-08')
GROUP BY calendar_month_desc, channel_desc;
PERCENT_RANK
PERCENT_RANK is similar to CUME_DIST, but it uses rank values rather than row
counts in its numerator. Therefore, it returns the percent rank of a value relative to a
group of values. The function is available in many popular spreadsheets. PERCENT_
RANK of a row is calculated as:
(rank of row in its partition - 1) / (number of rows in the partition - 1)
PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1
will have a PERCENT_RANK of zero.
Its syntax is:
PERCENT_RANK ( ) OVER ( [query_partition_clause] order_by_clause )
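For example, the following sketch (using the sales and channels tables from the other examples in this chapter) computes the percent rank of each channel's total sales:
SELECT channel_desc,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
   PERCENT_RANK() OVER (ORDER BY SUM(amount_sold)) AS PERCENT_RANK_BY_CHANNEL
FROM sales, channels
WHERE sales.channel_id=channels.channel_id
GROUP BY channel_desc;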
NTILE
NTILE allows easy calculation of tertiles, quartiles, deciles and other common
summary statistics. This function divides an ordered partition into a specified
number of groups called buckets and assigns a bucket number to each row in the
partition. NTILE is a very useful calculation because it lets users divide a data set
into fourths, thirds, and other groupings.
The buckets are calculated so that each bucket has exactly the same number of rows
assigned to it or at most 1 row more than the others. For instance, if you have 100
rows in a partition and ask for an NTILE function with four buckets, 25 rows will be
assigned a value of 1, 25 rows will have value 2, and so on. These buckets are
referred to as equiheight buckets.
If the number of rows in the partition does not divide evenly (without a remainder)
into the number of buckets, then the number of rows assigned for each bucket will
differ by one at most. The extra rows will be distributed one for each bucket starting
from the lowest bucket number. For instance, if there are 103 rows in a partition
which has an NTILE(5) function, the first 21 rows will be in the first bucket, the
next 21 in the second bucket, the next 21 in the third bucket, the next 20 in the
fourth bucket and the final 20 in the fifth bucket.
The NTILE function has the following syntax:
NTILE ( expr ) OVER ( [query_partition_clause] order_by_clause )
NTILE Example
The following is an example assigning each month's sales total into one of 4
buckets:
SELECT calendar_month_desc AS MONTH ,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
NTILE(4) OVER (ORDER BY SUM(amount_sold)) AS TILE4
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND
sales.channel_id=channels.channel_id AND
times.calendar_year=1999 AND
prod_category= 'Men'
GROUP BY calendar_month_desc;
ROW_NUMBER
The ROW_NUMBER function assigns a unique number (sequentially, starting from 1,
as defined by ORDER BY) to each row within the partition. It has the following
syntax:
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )
ROW_NUMBER Example
SELECT channel_desc, calendar_month_desc,
TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,999,999') SALES$,
ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC)
AS ROW_NUMBER
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND
sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-09', '2000-10')
GROUP BY channel_desc, calendar_month_desc;
Note that there are three pairs of tie values in these results. Like NTILE, ROW_
NUMBER is a non-deterministic function, so each tied value could have its row
number switched. To ensure deterministic results, you must order on a unique key.
In most cases, that will require adding a new tie breaker column to the query and
using it in the ORDER BY specification.
In this example, the analytic function SUM defines, for each row, a window that
starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by
default, at the current row.
Nested SUMs are needed in this example since we are performing a SUM over a value
that is itself a SUM. Nested aggregations are used very often in analytic aggregate
functions.
Note that the first two rows for the three month moving average calculation in the
output data are based on a smaller interval size than specified because the window
calculation cannot reach past the data retrieved by the query. You need to consider
the different window sizes found at the borders of result sets. In other words, you
may need to modify the query to include exactly what you want.
The starting and ending rows for each product's centered moving average
calculation in the output data are based on just two days, since the window
calculation cannot reach past the data retrieved by the query. Users need to consider
the different window sizes found at the borders of result sets: the query may need
to be adjusted.
6 rows selected.
In the output, values within parentheses are from the rows with the tied ordering
key value, 04-NOV-98.
Consider the row with the output of "04-NOV-98, 3, 24". In this case, all the
other rows with TIME_ID of 04-NOV-98 (ties) are considered to belong to one
group. Therefore, the CURRENT_GROUP_SUM should include this row (that is, 3) and
its ties (that is, 2 and 2) in the window. It also includes any rows with dates up to 10
days earlier. In this data, that includes the row with date 27-OCT-98. Hence the
result is 17+(2+3+2) = 24. The calculation of CURRENT_GROUP_SUM is identical for
each of the tied rows, so the output shows three rows with the value 24.
Note that this example applies only when you use the RANGE keyword rather than
the ROWS keyword. It is also important to remember that with RANGE, you can only
use 1 ORDER BY expression in the analytic function’s ORDER BY clause. With the
ROWS keyword, you can use multiple order by expressions in the analytic function’s
order by clause.
■ 2 otherwise
■ If any of the previous days are holidays, it adjusts the count appropriately.
Note that when a window is specified using a number in a window function with
ORDER BY on a date column, it is interpreted as a number of days. You
could have also used the interval literal conversion function, as
NUMTODSINTERVAL(fn(t_timekey), 'DAY') instead of just fn(t_timekey)
to mean the same thing. You can also write a PL/SQL function that returns an
INTERVAL datatype value.
One way to handle this problem would be to add the prod_id column to the result
set and order on both time_id and prod_id.
RATIO_TO_REPORT
The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set
of values. If the expression value expression evaluates to NULL, RATIO_TO_
REPORT also evaluates to NULL, but it is treated as zero for computing the sum of
values for the denominator. Its syntax is:
RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )
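For example, the following sketch (using the sales, channels, and times tables from the other examples in this chapter) reports each channel's share of total sales for one month:
SELECT channel_desc,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
   RATIO_TO_REPORT(SUM(amount_sold)) OVER () AS ratio_to_all_channels
FROM sales, channels, times
WHERE sales.channel_id=channels.channel_id AND
   sales.time_id=times.time_id AND
   times.calendar_month_desc='2000-09'
GROUP BY channel_desc;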
LAG/LEAD Functions
The LAG and LEAD functions are useful for comparing values when the relative
positions of rows can be known reliably. They work by specifying the count of rows
which separate the target row from the current row. Since the functions provide
access to more than one row of a table at the same time without a self-join, they can
enhance processing speed. The LAG function provides access to a row at a given
offset prior to the current position, and the LEAD function provides access to a row
at a given offset after the current position.
LAG/LEAD Syntax
These functions have the following syntax:
{LAG | LEAD} ( value_expr [, offset] [, default] )
OVER ( [query_partition_clause] order_by_clause )
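A sketch comparing each month's sales with the prior month's sales (the sales and times tables from this chapter's other examples are assumed):
SELECT calendar_month_desc AS MONTH,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
   TO_CHAR(LAG(SUM(amount_sold), 1) OVER (ORDER BY calendar_month_desc),
      '9,999,999,999') AS PRIOR_MONTH_SALES$
FROM sales, times
WHERE sales.time_id=times.time_id AND
   times.calendar_year=2000
GROUP BY calendar_month_desc;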
FIRST/LAST Functions
The FIRST/LAST aggregate functions allow you to return the result of an aggregate
applied over a set of rows that rank as the first or last with respect to a given order
specification. FIRST/LAST lets you order on column A but return the result of an
aggregate applied on column B. This is valuable because it avoids the need for a
self-join or subquery, thus improving performance. These functions begin with a
tiebreaker function, which is a regular aggregate function (MIN, MAX, SUM, AVG,
COUNT, VARIANCE, STDDEV) that produces the return value. The tiebreaker function
is applied to the set of rows (one or more rows) that rank as first or last with
respect to the order specification to return a single value.
To specify the ordering used within each group, the FIRST/LAST functions add a
new clause starting with the word KEEP.
FIRST/LAST Syntax
These functions have the following syntax:
aggregate_function KEEP
( DENSE_RANK { FIRST | LAST } ORDER BY
expr [ DESC | ASC ] [NULLS { FIRST | LAST }]
[, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...
)
[OVER query_partitioning_clause]
FROM products
WHERE prod_category='Men'
GROUP BY prod_subcategory;
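As a sketch of the full pattern (the prod_list_price and prod_min_price columns of the products table are assumed), a query of this shape returns, for each subcategory, the list price of the product with the lowest minimum price and the list price of the product with the highest minimum price:
SELECT prod_subcategory,
   MIN(prod_list_price) KEEP (DENSE_RANK FIRST ORDER BY prod_min_price)
      AS LP_OF_LO_MINP,
   MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY prod_min_price)
      AS LP_OF_HI_MINP
FROM products
WHERE prod_category='Men'
GROUP BY prod_subcategory;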
A query like this can be useful for understanding the sales patterns of your different
channels. For instance, the result set here highlights that Telesales sell relatively
small volumes.
Using the FIRST and LAST functions as reporting aggregates makes it easy to
include the results in calculations such as "Salary as a percent of the highest salary."
■ REGR_SXY
Oracle applies the function to the set of (e1, e2) pairs after eliminating all pairs for
which either of e1 or e2 is null. e1 is interpreted as a value of the dependent
variable (a "y value"), and e2 is interpreted as a value of the independent variable
(an "x value"). Both expressions must be numbers.
The regression functions are all computed simultaneously during a single pass
through the data. They are frequently combined with the COVAR_POP, COVAR_
SAMP, and CORR functions.
REGR_COUNT
REGR_COUNT returns the number of non-null number pairs used to fit the
regression line. If applied to an empty set (or if there are no (e1, e2) pairs where
neither of e1 or e2 is null), the function returns 0.
REGR_R2
The REGR_R2 function computes the coefficient of determination (usually called
"R-squared" or "goodness of fit") for the regression line.
REGR_R2 returns values between 0 and 1 when the regression line is defined (slope
of the line is not null), and it returns NULL otherwise. The closer the value is to 1,
the better the regression line fits the data.
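A sketch that fits sales amount against quantity sold in a single pass (the sales table's amount_sold and quantity_sold columns are assumed):
SELECT REGR_SLOPE(amount_sold, quantity_sold) AS slope,
   REGR_INTERCEPT(amount_sold, quantity_sold) AS intercept,
   REGR_R2(amount_sold, quantity_sold) AS r_squared,
   REGR_COUNT(amount_sold, quantity_sold) AS pairs_used
FROM sales;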
Inverse percentile aggregate functions can appear in the HAVING clause of a query
like other existing aggregate functions.
As Reporting Aggregates
You can also use the aggregate functions PERCENTILE_CONT, PERCENTILE_DISC
as reporting aggregate functions. When used as reporting aggregate functions, the
syntax is similar to those of other reporting aggregates.
[PERCENTILE_CONT | PERCENTILE_DISC](constant expression)
WITHIN GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
OVER ( [PARTITION BY value expression [,...]] )
This query computes the same thing (the median credit limit for customers in this
result set), but reports the result for every row in the result set, as shown in the
following output:
SELECT cust_id, cust_credit_limit,
PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) OVER () AS perc_disc,
PERCENTILE_CONT(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';
Unlike the inverse percentile aggregates, the ORDER BY clause in the sort
specification for hypothetical rank and distribution functions may take multiple
expressions. The number of arguments and the expressions in the ORDER BY clause
should be the same and the arguments must be constant expressions of the same or
compatible type to the corresponding ORDER BY expression. The following is an
example using two arguments in several hypothetical ranking functions.
These functions can appear in the HAVING clause of a query just like other
aggregate functions. They cannot be used as either reporting aggregate functions or
windowing aggregate functions.
WIDTH_BUCKET Function
For a given expression, the WIDTH_BUCKET function returns the bucket number
that the result of this expression will be assigned after it is evaluated. You can
generate equiwidth histograms with this function. Equiwidth histograms divide
data sets into buckets whose interval size (highest value to lowest value) is equal.
The number of rows held by each bucket will vary. A related function, NTILE,
creates equiheight buckets.
Equiwidth histograms can be generated only for numeric, date or datetime types.
So the first three parameters should be all numeric expressions or all date
expressions. Other types of expressions are not allowed. If the first parameter is
NULL, the result is NULL. If the second or the third parameter is NULL, an error
message is returned, as a NULL value cannot denote any end point (or any point) for
a range in a date or numeric value dimension. The last parameter (number of
buckets) should be a numeric expression that evaluates to a positive integer value;
0, NULL, or a negative value will result in an error.
Buckets are numbered from 0 to (n+1). Bucket 0 holds the count of values less than
the minimum. Bucket(n+1) holds the count of values greater than or equal to the
maximum specified value.
WIDTH_BUCKET Syntax
The WIDTH_BUCKET takes four expressions as parameters. The first parameter is the
expression that the equiwidth histogram is for. The second and third parameters are
expressions that denote the end points of the acceptable range for the first
parameter. The fourth parameter denotes the number of buckets.
WIDTH_BUCKET(expression, minval expression, maxval expression, num buckets)
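For instance, the following sketch assigns the credit limits shown below to four equiwidth buckets spanning 0 through 20000:
SELECT cust_id, cust_credit_limit,
   WIDTH_BUCKET(cust_credit_limit, 0, 20000, 4) AS CREDIT_BUCKET
FROM customers
WHERE cust_city='Marshal';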
Consider the following data from table customers, which shows the credit limits of
17 customers. This data is gathered in the query shown in Example 19–15 on
page 19-42.
CUST_ID CUST_CREDIT_LIMIT
-------- -----------------
22110 11000
28340 5000
40800 11000
121790 9000
165400 3000
171630 1500
184090 7000
215240 5000
227700 3000
246390 11000
346070 1500
364760 5000
370990 7000
383450 1500
408370 7000
420830 1500
464440 15000
[A figure here plots the credit limits on a scale from 0 to 20000, divided into four equiwidth buckets numbered 1 through 4, with bucket 0 below the minimum and bucket 5 above the maximum.]
You can specify the bounds in the reverse order, for example, WIDTH_BUCKET
(cust_credit_limit, 20000, 0, 4). When the bounds are reversed, the buckets
will be open-closed intervals. In this example, bucket number 1 is (15000,20000],
bucket number 2 is (10000,15000], and bucket number 4 is (0,5000]. The
overflow bucket will be numbered 0 (20000, +infinity), and the underflow
bucket will be numbered 5 (-infinity, 0].
It is an error if the bucket count parameter is 0 or negative.
Staying with regular SQL will enable simpler development, and many query
operations are already well-parallelized in SQL. Even the earlier example, the skew
statistic, can be created using standard, albeit lengthy, SQL.
CASE Expressions
Oracle now supports simple and searched CASE expressions. CASE expressions are
similar in purpose to the Oracle DECODE function, but they offer more flexibility
and logical power. They are also easier to read than traditional DECODE expressions,
and offer better performance as well. They are commonly used when breaking
categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax
for a simple CASE expression is:
CASE expr WHEN comparison_expr THEN return_expr
[WHEN comparison_expr THEN return_expr]... [ELSE else_expr] END
You can specify only 255 arguments and each WHEN ... THEN pair counts as two
arguments. For a workaround to this limit, see Oracle9i SQL Reference.
CASE Example
Suppose you wanted to find the average salary of all employees in the company. If
an employee's salary is less than $2000, you want the query to use $2000 instead.
Without a CASE expression, you would have to write this query as follows:
SELECT AVG(foo(e.sal)) FROM emps e;
In this, foo is a function that returns its input if the input is greater than 2000, and
returns 2000 otherwise. The query has performance implications because it needs to
invoke a function for each row. Writing custom functions can also add to the
development load.
Using CASE expressions in the database without PL/SQL, this query can be
rewritten as:
SELECT AVG(CASE WHEN e.sal > 2000 THEN e.sal ELSE 2000 END) FROM emps e;
Using a CASE expression lets you avoid developing custom functions and can also
perform faster.
Histogram Example 1
SELECT
SUM(CASE WHEN cust_credit_limit BETWEEN 0 AND 3999 THEN 1 ELSE 0 END)
AS "0-3999",
SUM(CASE WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN 1 ELSE 0 END)
AS "4000-7999",
SUM(CASE WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN 1 ELSE 0 END)
AS "8000-11999",
SUM(CASE WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN 1 ELSE 0 END)
AS "12000-16000"
FROM customers WHERE cust_city='Marshal';
Histogram Example 2
SELECT
(CASE WHEN cust_credit_limit BETWEEN 0 AND 3999
THEN ' 0 - 3999'
WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999'
WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN ' 8000 - 11999'
WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000' END)
AS BUCKET,
COUNT(*) AS Count_in_Group
FROM customers WHERE cust_city = 'Marshal'
GROUP BY
(CASE WHEN cust_credit_limit BETWEEN 0 AND 3999
THEN ' 0 - 3999'
WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999'
WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN ' 8000 - 11999'
WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000'
END);
BUCKET COUNT_IN_GROUP
------------- --------------
0 - 3999 6
4000 - 7999 6
8000 - 11999 4
12000 - 16000 1
In large data warehouse environments, many different types of analysis can occur.
In addition to SQL queries, you may also apply more advanced analytical
operations to your data. Two major types of such analysis are OLAP (On-Line
Analytic Processing) and data mining. Rather than having a separate OLAP or data
mining engine, Oracle has integrated OLAP and data mining capabilities directly
into the database server. Oracle OLAP and Oracle Data Mining are options to the
Oracle9i Database. This chapter provides a brief introduction to these technologies,
and more detail can be found in these products’ respective documentation.
The following topics provide an introduction to Oracle’s OLAP and data mining
capabilities:
■ OLAP
■ Data Mining
OLAP
Oracle9i OLAP adds the query performance and calculation capability previously
found only in multidimensional databases to Oracle’s relational platform. In
addition, it provides a Java OLAP API that is appropriate for the development of
internet-ready analytical applications. Unlike other combinations of OLAP and
RDBMS technology, Oracle9i OLAP is not a multidimensional database using
bridges to move data from the relational data store to a multidimensional data
store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle9i
provides the benefits of a multidimensional database along with the scalability,
accessibility, security, manageability, and high availability of the Oracle9i database.
The Java OLAP API, which is specifically designed for internet-based analytical
applications, offers productive data access.
Scalability
Oracle9i OLAP is highly scalable. In today’s environment, there is tremendous
growth along three dimensions of analytic applications: number of users, size of
data, complexity of analyses. There are more users of analytical applications, and
they need access to more data to perform more sophisticated analysis and target
marketing. For example, a telephone company might want a customer dimension to
include detail such as all telephone numbers as part of an application that is used to
analyze customer turnover. This would require support for multi-million row
dimension tables and very large volumes of fact data. Oracle9i can handle very
large data sets using parallel execution and partitioning, as well as offering support
for advanced hardware and clustering.
Availability
Oracle9i includes many features that support high availability. One of the most
significant is partitioning, which allows management of precise subsets of tables
and indexes, so that management operations affect only small pieces of these data
structures. By partitioning tables and indexes, data management processing time is
reduced, thus minimizing the time data is unavailable. Another feature supporting
high availability is transportable tablespaces. With transportable tablespaces, large
data sets, including tables and indexes, can be added with almost no processing to
other databases. This enables extremely rapid data loading and updates.
Manageability
Oracle enables you to precisely control resource utilization. The Database Resource
Manager, for example, provides a mechanism for allocating the resources of a data
warehouse among different sets of end-users. Consider an environment where the
marketing department and the sales department share an OLAP system. Using the
Database Resource Manager, you could specify that the marketing department
receive at least 60 percent of the CPU resources of the machines, while the sales
department receive 40 percent of the CPU resources. You can also further specify
limits on the total number of active sessions, and the degree of parallelism of
individual queries for each department.
Another resource management facility is the progress monitor, which gives end users
and administrators the status of long-running operations. Oracle9i maintains
statistics describing the percent-complete of these operations. Oracle Enterprise
Manager enables you to view a bar-graph display of these operations showing what
percent complete they are. Moreover, any other tool or any database administrator
can also retrieve progress information directly from the Oracle data server, using
system views.
Security
Just as the demands of real-world transaction processing required Oracle to develop
robust features for scalability, manageability and backup and recovery, they led
Oracle to create industry-leading security features. The security features in Oracle
have reached the highest levels of U.S. government certification for database
trustworthiness. Oracle’s fine grained access control feature enables cell-level
security for OLAP users. Fine grained access control works with minimal burden on
query processing, and it enables efficient centralized security management.
Data Mining
Oracle enables data mining inside the database for performance and scalability.
Some of the capabilities are:
■ An API that provides programmatic control and application integration
■ Analytical capabilities with OLAP and statistical functions in the database
■ Multiple algorithms: Naïve Bayes, decision trees, clustering, and association
rules
■ Real-time and batch scoring modes
■ Multiple prediction types
■ Association insights
Data Preparation
Data preparation can create new tables or views of existing data. Both options
perform faster than moving data to an external data mining utility and offer the
programmer the option of snap-shots or real-time updates.
Oracle Data Mining provides utilities for complex, data mining-specific tasks.
Binning improves model build time and model performance, so ODM provides a
utility for user-defined binning. ODM accepts data in either single record format or
in transactional format and performs mining on transactional format data. Single record
format is most common in applications, so ODM provides a utility for transforming
single record format into transactional format.
Associated analysis for preparatory data exploration and model evaluation is
extended by Oracle’s statistical functions and OLAP capabilities. Because these also
operate within the database, they can all be incorporated into a seamless application
that shares database objects. This allows for more functional and faster applications.
Model Building
Oracle Data Mining provides four algorithms: Naïve Bayes, Decision Tree,
Clustering, and Association Rules. These algorithms address a broad spectrum of
business problems, ranging from predicting the future likelihood of a customer
purchasing a given product, to understanding which products are likely to be purchased
together in a single trip to the grocery store. All model building takes place inside
the database. Once again, the data does not need to move outside the database in
order to build the model, and therefore the entire data-mining process is
accelerated.
Model Evaluation
Models are stored in the database and directly accessible for evaluation, reporting,
and further analysis by a wide variety of tools and application functions. ODM
provides APIs for calculating traditional confusion matrixes and lift charts. It stores
the models, the underlying data, and these analysis results together in the database
to allow further analysis, reporting and application specific model management.
Scoring
Oracle Data Mining provides both batch and real-time scoring. In batch mode,
ODM takes a table as input. It scores every record, and returns a scored table as a
result. In real-time mode, parameters for a single record are passed in and the scores
are returned in a Java object.
In both modes, ODM can deliver a variety of scores. It can return a rating or
probability of a specific outcome. Alternatively it can return a predicted outcome
and the probability of that outcome occurring. Some examples follow.
■ How likely is this event to end in outcome A?
■ Which outcome is most likely to result from this event?
■ What is the probability of each possible outcome for this event?
Java API
The Oracle Data Mining API lets you build analytical models and deliver real-time
predictions in any application that supports Java. The API is based on the emerging
JSR-073 standard.
When a user issues a SQL statement, the optimizer decides whether to execute the
operations in parallel and determines the degree of parallelism (DOP) for each
operation. You can specify the number of parallel execution servers required for an
operation in various ways.
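For instance (the sales table is used here only for illustration), you can give a table a default degree of parallelism or request parallelism for a single statement with a hint:
ALTER TABLE sales PARALLEL 4;

SELECT /*+ PARALLEL(s, 4) */ COUNT(*) FROM sales s;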
If the optimizer targets the statement for parallel processing, the following sequence
of events takes place:
1. The SQL statement's foreground process becomes a parallel execution
coordinator.
2. The parallel execution coordinator obtains as many parallel execution servers as
needed (determined by the DOP) from the server pool or creates new parallel
execution servers as needed.
3. Oracle executes the statement as a sequence of operations. Each operation is
performed in parallel, if possible.
4. When statement processing is completed, the coordinator returns any resulting
data to the user process that issued the statement and returns the parallel
execution servers to the server pool.
The parallel execution coordinator calls upon the parallel execution servers during
the execution of the SQL statement, not during the parsing of the statement.
Therefore, when parallel execution is used with the shared server, the server process
that processes the EXECUTE call of a user's statement becomes the parallel execution
coordinator for the statement.
See Also:
■ "Minimum Number of Parallel Execution Servers" on
page 21-36 for information about using the initialization
parameter PARALLEL_MIN_PERCENT
■ Oracle9i Database Performance Tuning Guide and Reference for
information about monitoring an instance's pool of parallel
execution servers and determining the appropriate values for
the initialization parameters
[Figure 21–1 shows two sets of parallel execution servers (server set 1 and server set 2) with connections between them; each connection has an associated message buffer.]
When a connection is between two processes on the same instance, the servers
communicate by passing the buffers back and forth. When the connection is
between processes in different instances, the messages are sent using external
high-speed network protocols. In Figure 21–1, the DOP is equal to the number of
parallel execution servers, which in this case is n. Figure 21–1 does not show the
parallel execution coordinator. Each parallel execution server actually has an
additional connection to the parallel execution coordinator.
See Also:
■ "Setting the Degree of Parallelism" on page 21-32
■ "Parallelization Rules for SQL Statements" on page 21-38
Figure 21–2 Data Flow Diagram for a Join of EMPLOYEES and DEPARTMENTS
[The diagram shows the parallel execution coordinator at the top of an operation tree: full scans of the EMPLOYEES and DEPARTMENTS tables feed a HASH JOIN, whose output feeds a GROUP BY SORT.]
Parent operations can begin consuming rows as soon as the child operations have
produced rows. In the previous example, while the parallel execution servers are
producing rows in the FULL SCAN dept operation, another set of parallel execution
servers can begin to perform the HASH JOIN operation to consume the rows.
Each of the two operations performed concurrently is given its own set of parallel
execution servers. Therefore, both query operations and the data flow tree itself
have parallelism. The parallelism of an individual operation is called intraoperation
parallelism and the parallelism between operations in a data flow tree is called
interoperation parallelism.
Due to the producer-consumer nature of the Oracle server's operations, only two
operations in a given tree need to be performed simultaneously to minimize
execution time.
To illustrate intraoperation and interoperation parallelism, consider the following
statement:
SELECT * FROM employees ORDER BY last_name;
The execution plan implements a full scan of the employees table. This operation
is followed by a sorting of the retrieved rows, based on the value of the last_name
column. For the sake of this example, assume the last_name column is not
indexed. Also assume that the DOP for the query is set to 4, which means that four
parallel execution servers can be active for any given operation.
Figure 21–3 illustrates the parallel execution of the example query.
[Figure 21–3 shows the statement SELECT * FROM employees ORDER BY last_name executed with a DOP of 4: one set of four parallel execution servers scans the employees table and distributes rows by last_name range (A-G, H-M, N-S, T-Z) to a second set of four servers that perform the sort; the sorted results flow through the parallel execution coordinator back to the user process.]
As you can see from Figure 21–3, there are actually eight parallel execution servers
involved in the query even though the DOP is 4. This is because a parent and child
operator can be performed at the same time (interoperation parallelism).
Also note that all of the parallel execution servers involved in the scan operation
send rows to the appropriate parallel execution server performing the SORT
operation. If a row scanned by a parallel execution server contains a value for the
last_name column between A and G, that row gets sent to the first ORDER BY parallel
execution server. When the scan operation is complete, the sorting processes can
return the sorted results to the coordinator, which, in turn, returns the complete
query results to the user.
Types of Parallelism
The following types of parallelism are discussed in this section:
■ Parallel Query
■ Parallel DDL
■ Parallel DML
■ Parallel Execution of Functions
■ Other Types of Parallelism
Parallel Query
You can parallelize queries and subqueries in SELECT statements. You can also
parallelize the query portions of DDL statements and DML statements (INSERT,
UPDATE, and DELETE).
However, you cannot parallelize the query portion of a DDL or DML statement if it
references a remote object. When you issue a parallel DML or DDL statement in
which the query portion references a remote object, the operation is automatically
executed serially.
See Also:
■ "Operations That Can Be Parallelized" on page 21-3 for
information on the query operations that Oracle can parallelize
■ "Parallelizing SQL Statements" on page 21-6 for an explanation
of how the processes perform parallel queries
■ "Distributed Transaction Restrictions" on page 21-27 for
examples of queries that reference a remote object
■ "Rules for Parallelizing Queries" on page 21-38 for information
on the conditions for parallelizing a query and the factors that
determine the DOP
These scan methods can be used for index-organized tables with overflow areas and
for index-organized tables that contain LOBs.
Parallel DDL
This section includes the following topics on parallelism for DDL statements:
■ DDL Statements That Can Be Parallelized
■ CREATE TABLE ... AS SELECT in Parallel
■ Recoverability and Parallel DDL
■ Space Management for Parallel DDL
See Also:
■ Oracle9i SQL Reference for information about the syntax and use
of parallel DDL statements
■ Oracle9i Application Developer’s Guide - Large Objects (LOBs) for
information about LOB restrictions
Use the NOLOGGING clause of the CREATE TABLE, CREATE INDEX, ALTER TABLE,
and ALTER INDEX statements to disable undo and redo log generation.
See Also:
■ Oracle9i SQL Reference for a discussion of the syntax of the
CREATE TABLE statement
■ Oracle9i Database Administrator’s Guide for information about
dictionary-managed tablespaces
segments used by the parallel execution servers are larger than what is needed to
store the rows.
■ If the unused space in each temporary segment is larger than the value of the
MINIMUM EXTENT parameter set at the tablespace level, then Oracle trims the
unused space when merging rows from all of the temporary segments into the
table or index. The unused space is returned to the system free space and can be
allocated for new extents, but it cannot be coalesced into a larger segment
because it is not contiguous space (external fragmentation).
■ If the unused space in each temporary segment is smaller than the value of the
MINIMUM EXTENT parameter, then unused space cannot be trimmed when the
rows in the temporary segments are merged. This unused space is not returned
to the system free space; it becomes part of the table or index (internal
fragmentation) and is available only for subsequent inserts or for updates that
require additional space.
For example, if you specify a DOP of 3 for a CREATE TABLE ... AS SELECT
statement, but there is only one datafile in the tablespace, then internal
fragmentation may occur, as shown in Figure 21–5 on page 21-18. The pockets of
free space within the internal table extents of a datafile cannot be coalesced with
other free space and cannot be allocated as extents.
[Figure 21–5 shows the USERS tablespace with a single datafile, DATA1.ORA. Three parallel execution servers executing CREATE TABLE emp AS SELECT ... each allocate their own extent (EXTENT 1, EXTENT 2, and EXTENT 3), and the free space left at the end of each extent is available only for subsequent INSERTs (internal fragmentation).]
Parallel DML
Parallel DML (parallel INSERT, UPDATE, and DELETE) uses parallel execution
mechanisms to speed up or scale up large DML operations against large database
tables and indexes.
running an Oracle Real Application Clusters. You also have to find out about
current resource usage to balance workload across instances.
Parallel DML removes these disadvantages by performing inserts, updates, and
deletes in parallel automatically.
Refreshing Tables in a Data Warehouse System In a data warehouse system, large tables
need to be refreshed (updated) periodically with new or modified data from the
production system. You can do this efficiently by using parallel DML combined
with updatable join views. You can also use the MERGE statement.
The data that needs to be refreshed is generally loaded into a temporary table before
starting the refresh process. This table contains either new rows or rows that have
been updated since the last refresh of the data warehouse. You can use an updatable
join view with parallel UPDATE to refresh the updated rows, and you can use an
anti-hash join with parallel INSERT to refresh the new rows.
large intermediate summary tables. These summary tables are often temporary and
frequently do not need to be logged. Parallel DML can speed up the operations
against these large intermediate tables. One benefit is that you can put incremental
results in the intermediate tables and perform parallel update.
In addition, the summary tables may contain cumulative or comparison
information which has to persist beyond application sessions; thus, temporary
tables are not feasible. Parallel DML operations can speed up the changes to these
large summary tables.
Using Scoring Tables Many DSS applications score customers periodically based on a
set of criteria. The scores are usually stored in large DSS tables. The score
information is then used in making a decision, for example, inclusion in a mailing
list.
This scoring activity queries and updates a large number of rows in the large table.
Parallel DML can speed up the operations against these large tables.
Running Batch Jobs Batch jobs executed in an OLTP database during off hours have a
fixed time window in which the jobs must complete. A good way to ensure timely
job completion is to parallelize their operations. As the work load increases, more
machine resources can be added; the scaleup property of parallel operations ensures
that the time constraint can be met.
The default mode of a session is DISABLE PARALLEL DML. When parallel DML is
disabled, no DML will be executed in parallel even if the PARALLEL hint is used.
When parallel DML is enabled in a session, all DML statements in this session will
be considered for parallel execution. However, even if parallel DML is enabled, the
DML operation may still execute serially if there are no parallel hints or no tables
with a parallel attribute or if restrictions on parallel operations are violated.
The session's PARALLEL DML mode does not influence the parallelism of SELECT
statements, DDL statements, and the query portions of DML statements. Thus, if
this mode is not set, the DML operation is not parallelized, but scans or join
operations within the DML statement may still be parallelized.
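A minimal sketch of the pattern (the table and predicate are illustrative only):
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(sales, 4) */ sales
   SET amount_sold = amount_sold * 1.1
   WHERE time_id < TO_DATE('01-JAN-1999', 'DD-MON-YYYY');

COMMIT;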
See Also:
■ "Space Considerations for Parallel DML" on page 21-24
■ "Lock and Enqueue Resources for Parallel DML" on page 21-24
■ "Restrictions on Parallel DML" on page 21-24
Rollback Segments
Oracle assigns transactions to rollback segments that have the fewest active
transactions. To speed up both forward and undo operations, you should create and
bring online enough rollback segments so that at most two parallel process
transactions are assigned to one rollback segment.
The SET TRANSACTION USE ROLLBACK SEGMENT statement is ignored when
parallel DML is used because parallel DML requires more than one rollback
segment for performance.
You should create the rollback segments in tablespaces that have enough space for
them to extend when necessary. You can then set the MAXEXTENTS storage
parameters for the rollback segments to UNLIMITED. Also, set the OPTIMAL value
for the rollback segments so that after the parallel DML transactions commit, the
rollback segments are shrunk to the OPTIMAL size.
See Also: Oracle9i Backup and Recovery Concepts for details about
parallel rollback
System Recovery Recovery from a system failure requires a new startup. Recovery is
performed by the SMON process and any recovery server processes spawned by
SMON. Parallel DML statements may be recovered using parallel rollback. If the
initialization parameter COMPATIBLE is set to 8.1.3 or greater, Fast-Start
On-Demand Rollback enables terminated transactions to be recovered, on demand,
one block at a time.
Instance Recovery (Oracle Real Application Clusters) Recovery from an instance failure
in Oracle Real Application Clusters is performed by the recovery processes (that is,
the SMON processes and any recovery server processes they spawn) of other live
instances. Each recovery process of the live instances can recover the parallel
execution coordinator or parallel execution server transactions of the failed instance
independently.
subsequent serial or parallel statement (DML or query) can access the same
table again in that transaction.
– This restriction also exists after a serial direct-path INSERT statement: no
subsequent SQL statement (DML or query) can access the modified table
during that transaction.
– Queries that access the same table are allowed before a parallel DML or
direct-path INSERT statement, but not after.
– Any serial or parallel statements attempting to access a table that has
already been modified by a parallel UPDATE, DELETE, or MERGE, or a
direct-path INSERT during the same transaction are rejected with an error
message.
■ If the initialization parameter ROW_LOCKING is set to intent, then inserts,
updates, merges, and deletes are not parallelized (regardless of the serializable
mode).
■ Parallel DML operations cannot be done on tables with triggers.
■ Replication functionality is not supported for parallel DML.
■ Parallel DML cannot occur in the presence of certain constraints: self-referential
integrity, delete cascade, and deferred integrity. In addition, for direct-path
INSERT, there is no support for any referential integrity.
■ Parallel DML can be done on tables with object columns provided you are not
touching the object columns.
■ Parallel DML can be done on tables with LOB columns provided the table is
partitioned. However, intra-partition parallelism is not supported.
■ A transaction involved in a parallel DML operation cannot be or become a
distributed transaction.
■ Clustered tables are not supported.
Violations of these restrictions cause the statement to execute serially without
warnings or error messages (except for the restriction on statements accessing the
same table in a transaction, which can cause error messages). For example, an
update is serialized if it is on a nonpartitioned table.
Partitioning Key Restriction You can update the partitioning key of a partitioned
table to a new value only if the update does not cause the row to move to a new
partition. An update that does move the row is possible only if the table is defined
with the row movement clause enabled.
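For example, row movement can be enabled on a partitioned table before issuing
updates that change partitioning key values (a sketch; sales is assumed to be a
partitioned table):
ALTER TABLE sales ENABLE ROW MOVEMENT;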
Function Restrictions The function restrictions for parallel DML are the same as those
for parallel DDL and parallel query.
NOT NULL and CHECK These types of integrity constraints are allowed. They are not a
problem for parallel DML because they are enforced on the column and row level,
respectively.
UNIQUE and PRIMARY KEY These types of integrity constraints are allowed.
Delete Cascade Delete on tables having a foreign key with delete cascade is not
parallelized because parallel execution servers will try to delete rows from multiple
partitions (parent and child tables).
Deferrable Integrity Constraints If any deferrable constraints apply to the table being
operated on, the DML operation will not be parallelized.
Trigger Restrictions
A DML operation will not be parallelized if the affected tables contain enabled
triggers that may get fired as a result of the statement. This implies that DML
statements on tables that are being replicated will not be parallelized.
Relevant triggers must be disabled in order to parallelize DML on the table. Note
that, if you enable or disable triggers, the dependent shared cursors are invalidated.
See Also:
■ Oracle9i Application Developer’s Guide - Fundamentals for
information about the PRAGMA RESTRICT_REFERENCES
■ Oracle9i SQL Reference for information about CREATE
FUNCTION
See Also:
■ Oracle9i Database Utilities for information about parallel load
and SQL*Loader
■ Oracle9i User-Managed Backup and Recovery Guide for
information about parallel media recovery
■ Oracle9i Database Performance Tuning Guide and Reference for
information about parallel instance recovery
■ Oracle9i Replication for information about parallel propagation
As mentioned, you can manually adjust the parameters shown in Table 21–2, even if
you set PARALLEL_AUTOMATIC_TUNING to true. You might need to do this if you
have a highly customized environment or if your system does not perform
optimally using the completely automated settings.
See Also:
■ Oracle9i Database Reference and Oracle9i Database Performance
Tuning Guide and Reference for information
■ Oracle9i SQL Reference for the syntax of the ALTER SYSTEM
statement
■ "Forcing Parallel Execution for a Session" on page 21-48
See Also:
■ "The Parallel Execution Server Pool" on page 21-3
■ "Parallelism Between Operations" on page 21-8
■ "Default Degree of Parallelism" on page 21-35
■ "Parallelization Rules for SQL Statements" on page 21-38
Hints
You can specify hints in a SQL statement to set the DOP for a table or index and for
the caching behavior of the operation.
■ The PARALLEL hint is used only for operations on tables. You can use it to
parallelize queries and DML statements (INSERT, UPDATE, and DELETE).
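For example, the following sketch requests a DOP of four for the scan of a
hypothetical sales table (the column and predicate are illustrative):
SELECT /*+ PARALLEL(s, 4) */ COUNT(*)
FROM sales s
WHERE amount_sold > 100;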
These factors determine the default number of parallel execution servers to use.
However, the actual number of processes used is limited by their availability on the
requested instances during run time. The initialization parameter PARALLEL_MAX_
SERVERS sets an upper limit on the total number of parallel execution servers that
an instance can have.
If a minimum fraction of the desired parallel execution servers is not available
(specified by the initialization parameter PARALLEL_MIN_PERCENT), a user error is
produced. The user can then retry the query with less parallelism.
In general, you cannot assume that the time taken to perform a parallel operation
on a given number of partitions (N) with a given number of parallel execution
servers (P) will be N/P. This formula does not take into account the possibility that
some processes might have to wait while others finish working on the last
partitions. By choosing an appropriate DOP, however, you can minimize the
workload skew and optimize performance.
Degree of Parallelism The DOP for a query is determined by the following rules:
■ The query uses the maximum DOP taken from all of the table declarations
involved in the query and all of the potential indexes that are candidates to
satisfy the query (the reference objects). That is, the table or index that has the
greatest DOP determines the query's DOP (maximum query directive).
■ If a table has both a parallel hint specification in the query and a parallel
declaration in its table specification, the hint specification takes precedence over
parallel declaration specification. See Table 21–3 on page 21-45 for precedence
rules.
Decision to Parallelize The following rule determines whether the UPDATE, MERGE, or
DELETE operation should be parallelized:
The UPDATE or DELETE operation will be parallelized if and only if at least one
of the following is true:
■ The table being updated or deleted has a PARALLEL specification.
■ The PARALLEL hint is specified in the DML statement.
■ An ALTER SESSION FORCE PARALLEL DML statement has been issued
previously during the session.
If the statement contains subqueries or updatable views, then they may have their
own separate parallel hints or clauses. However, these parallel directives do not
affect the decision to parallelize the UPDATE, MERGE, or DELETE.
Although the parallel hint or clause on the tables is used by both the query and the
UPDATE, MERGE, or DELETE portions to determine parallelism, the decision to
parallelize the UPDATE, MERGE, or DELETE portion is made independently of the
query portion, and vice versa.
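As an illustration, assuming the session is enabled for parallel DML, either of the
following sketches satisfies the rule above (sales_archive and its column are
hypothetical):
ALTER SESSION FORCE PARALLEL DML PARALLEL 8;
DELETE FROM sales_archive WHERE sale_year < 1995;

DELETE /*+ PARALLEL(sales_archive, 8) */ FROM sales_archive WHERE sale_year < 1995;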
Degree of Parallelism The DOP is determined by the same rules as for the queries.
Note that in the case of UPDATE and DELETE operations, only the target table to be
modified (the only reference object) is involved. Thus, the UPDATE or DELETE
parallel hint specification takes precedence over the parallel declaration
specification of the target table. In other words, the precedence order is: MERGE,
UPDATE, DELETE hint > Session > Parallel declaration specification of target table
See Table 21–3 on page 21-45 for precedence rules.
The maximum DOP you can achieve is equal to the number of partitions (or
subpartitions in the case of composite subpartitions) in the table. A parallel
execution server can update or merge into, or delete from multiple partitions, but
each partition can only be updated or deleted by one parallel execution server.
If the DOP is less than the number of partitions, then the first process to finish work
on one partition continues working on another partition, and so on until the work is
finished on all partitions. If the DOP is greater than the number of partitions
involved in the operation, then the excess parallel execution servers will have no
work to do.
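For instance, statements along the following lines match the discussion that
follows (a sketch; parallel DML is assumed to be enabled, tbl_1 is assumed to be
declared with a PARALLEL clause, and tbl_2 is hinted with a degree of four):
UPDATE tbl_1 SET c1 = c1 + 1 WHERE c1 > 100;

UPDATE /*+ PARALLEL(tbl_2, 4) */ tbl_2 SET c1 = c1 + 1 WHERE c1 > 100;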
If tbl_1 is a partitioned table and its table definition has a parallel clause, then the
update operation is parallelized even if the scan on the table is serial (such as an
index scan), assuming that the table has more than one partition with c1 greater
than 100.
Both the scan and update operations on tbl_2 will be parallelized with degree
four.
Decision to Parallelize The following rule determines whether the INSERT operation
should be parallelized in an INSERT ... SELECT statement:
The INSERT operation will be parallelized if and only if at least one of the
following is true:
■ The PARALLEL hint is specified after the INSERT in the DML statement.
■ The table being inserted into (the reference object) has a PARALLEL
declaration specification.
■ An ALTER SESSION FORCE PARALLEL DML statement has been issued
previously during the session.
The decision to parallelize the INSERT operation is made independently of the
SELECT operation, and vice versa.
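For example, with the session enabled for parallel DML, the following sketch marks
the insert portion for parallel execution through a hint placed after the INSERT
keyword (the table names are hypothetical):
INSERT /*+ PARALLEL(orders_history, 4) */ INTO orders_history
SELECT * FROM orders;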
Parallel clauses in CREATE TABLE and ALTER TABLE statements specify table
parallelism. If a parallel clause exists in a table definition, it determines the
parallelism of DDL statements as well as queries. If the DDL statement contains
explicit parallel hints for a table, however, those hints override the effect of parallel
clauses for that table. You can use the ALTER SESSION FORCE PARALLEL DDL
statement to override parallel clauses.
Parallel CREATE INDEX or ALTER INDEX ... REBUILD The CREATE INDEX and ALTER
INDEX ... REBUILD statements can be parallelized only by a PARALLEL clause or an
ALTER SESSION FORCE PARALLEL DDL statement.
ALTER INDEX ... REBUILD can be parallelized only for a nonpartitioned index, but
ALTER INDEX ... REBUILD PARTITION can be parallelized by a PARALLEL clause
or an ALTER SESSION FORCE PARALLEL DDL statement.
The scan operation for ALTER INDEX ... REBUILD (nonpartitioned), ALTER INDEX ...
REBUILD PARTITION, and CREATE INDEX has the same parallelism as the
REBUILD or CREATE operation and uses the same DOP. If the DOP is not specified
for REBUILD or CREATE, the default is the number of CPUs.
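For example, a nonpartitioned index could be rebuilt in parallel as follows (a
sketch; the index name and degree are illustrative):
ALTER INDEX ord_customer_ix REBUILD PARALLEL 8;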
Parallel MOVE PARTITION or SPLIT PARTITION The ALTER TABLE ... MOVE PARTITION
and ALTER TABLE ... SPLIT PARTITION statements can be parallelized only by a
PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. Their
scan operations have the same parallelism as the corresponding MOVE or SPLIT
operations. If the DOP is not specified, the default is the number of CPUs.
Decision to Parallelize (Query Part) The query part of a CREATE TABLE ... AS SELECT
statement can be parallelized only if the following conditions are satisfied:
■ The query includes a parallel hint specification (PARALLEL or PARALLEL_
INDEX) or the CREATE part of the statement has a PARALLEL clause
Degree of Parallelism (Query Part) The DOP for the query part of a CREATE TABLE ...
AS SELECT statement is determined by one of the following rules:
■ The query part uses the values specified in the PARALLEL clause of the CREATE
part.
■ If the PARALLEL clause is not specified, the default DOP is the number of CPUs.
■ If the CREATE is serial, then the DOP is determined by the query.
Note that any values specified in a hint for parallelism are ignored.
Decision to Parallelize (CREATE Part) The CREATE operation of CREATE TABLE ... AS
SELECT can be parallelized only by a PARALLEL clause or an ALTER SESSION
FORCE PARALLEL DDL statement.
When the CREATE operation of CREATE TABLE ... AS SELECT is parallelized, Oracle
also parallelizes the scan operation if possible. The scan operation cannot be
parallelized if, for example:
■ The SELECT clause has a NOPARALLEL hint
■ The operation scans an index of a nonpartitioned table
When the CREATE operation is not parallelized, the SELECT can be parallelized if it
has a PARALLEL hint or if the selected table (or partitioned index) has a parallel
declaration.
Degree of Parallelism (CREATE Part) The DOP for the CREATE operation, and for the
SELECT operation if it is parallelized, is specified by the PARALLEL clause of the
CREATE statement, unless it is overridden by an ALTER SESSION FORCE
PARALLEL DDL statement. If the PARALLEL clause does not specify the DOP, the
default is the number of CPUs.
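For example, the following sketch parallelizes both the CREATE and the query parts
with a DOP of four (table and column names are illustrative):
CREATE TABLE sales_summary PARALLEL 4 AS
SELECT prod_id, SUM(amount_sold) AS total_sold
FROM sales
GROUP BY prod_id;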
Note: Once Oracle determines the DOP for a query, the DOP does
not change for the duration of the query.
It is best to use the parallel adaptive multiuser feature when users process
simultaneous parallel execution operations. If you enable PARALLEL_AUTOMATIC_
TUNING, Oracle automatically sets PARALLEL_ADAPTIVE_MULTI_USER to true.
PARALLEL_MAX_SERVERS
The recommended value for the PARALLEL_MAX_SERVERS parameter is as follows:
2 x DOP x NUMBER_OF_CONCURRENT_USERS
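For example, for a typical maximum DOP of 8 and 5 concurrent parallel execution
users, this works out to 2 x 8 x 5 = 80, which could be set in the initialization
parameter file as follows (the numbers are illustrative only):
PARALLEL_MAX_SERVERS = 80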
If the hardware system is neither CPU bound nor I/O bound, then you can increase
the number of concurrent parallel execution users on the system by adding more
query server processes. When the system becomes CPU- or I/O-bound, however,
adding more concurrent users becomes detrimental to the overall performance.
Careful setting of PARALLEL_MAX_SERVERS is an effective method of restricting
the number of concurrent parallel operations.
If users initiate too many concurrent operations, Oracle might not have enough
query server processes. In this case, Oracle executes the operations sequentially or
displays an error if PARALLEL_MIN_PERCENT is set to a value other than the
default value of 0 (zero).
This condition can be verified through the GV$SYSSTAT view by comparing the
statistics for parallel operations not downgraded and parallel operations
downgraded to serial. For example:
SELECT * FROM GV$SYSSTAT WHERE name LIKE 'Parallel operation%';
When Users Have Too Many Processes When concurrent users have too many query
server processes, memory contention (paging), I/O contention, or excessive context
switching can occur. This contention can reduce system throughput to a level lower
than if parallel execution were not used. Increase the PARALLEL_MAX_SERVERS
value only if the system has sufficient memory and I/O bandwidth for the resulting
load.
You can use operating system performance monitoring tools to determine how
much memory, swap space and I/O bandwidth are free. Look at the runq lengths
for both your CPUs and disks, as well as the service time for I/Os on the system.
Verify that sufficient swap space exists on the machine to add more
processes. Limiting the total number of query server processes might restrict the
number of concurrent users who can execute parallel operations, but system
throughput tends to remain stable.
See Also:
■ Oracle9i Database Administrator’s Guide for more information
about managing resources with user profiles
■ Oracle9i Real Application Clusters Administration for more
information on querying GV$ views
PARALLEL_MIN_SERVERS
The recommended value for the PARALLEL_MIN_SERVERS parameter is 0 (zero),
which is the default.
This parameter is used at startup and lets you specify in a single instance the
number of processes to be started and reserved for parallel operations. The syntax
is:
PARALLEL_MIN_SERVERS=n
The n variable is the number of processes you want to start and reserve for parallel
operations.
Setting PARALLEL_MIN_SERVERS balances the startup cost against memory usage.
Processes started using PARALLEL_MIN_SERVERS do not exit until the database is
shut down. This way, when a query is issued the processes are likely to be available.
It is desirable, however, to recycle query server processes periodically since the
memory these processes use can become fragmented and cause the high water mark
to slowly increase.
LARGE_POOL_SIZE or SHARED_POOL_SIZE
The following discussion of how to tune the large pool also applies to tuning the
shared pool, except as noted in "SHARED_POOL_SIZE" on page 21-56. You must
also increase the value for this memory setting by the amount you determine.
Parallel execution requires additional memory resources in addition to those
required by serial SQL execution. Additional memory is used for communication
and passing data between query server processes and the query coordinator.
There is no recommended value for LARGE_POOL_SIZE. Instead, Oracle
recommends leaving this parameter unset and having Oracle set it for you by
setting the PARALLEL_AUTOMATIC_TUNING parameter to true. The exception to
this is when the system-assigned value is inadequate for your processing
requirements.
You should reduce the value for LARGE_POOL_SIZE low enough so your database
starts. After reducing the value of LARGE_POOL_SIZE, you might see the error:
ORA-04031: unable to allocate 16084 bytes of shared memory
("large pool","unknown object","large pool heap","PX msg pool")
If so, execute the following query to determine why Oracle could not allocate the
16,084 bytes:
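A query along these lines against the V$SGASTAT view (a sketch) shows how the
large pool memory is being consumed:
SELECT NAME, SUM(BYTES)
FROM V$SGASTAT
WHERE UPPER(POOL) = 'LARGE POOL'
GROUP BY NAME;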
If you specify LARGE_POOL_SIZE and the amount of memory you need to reserve
is bigger than the pool, Oracle does not allocate all the memory it can get. Instead, it
leaves some space. When the query runs, Oracle tries to get what it needs. Oracle
uses the 560 KB and needs another 16 KB when it fails. The error does not report the
cumulative amount that is needed. The best way of determining how much more
memory is needed is to use the formulas in "Adding Memory for Message Buffers"
on page 21-53.
To resolve the problem in the current example, increase the value for LARGE_POOL_
SIZE. As shown in the sample output, the LARGE_POOL_SIZE is about 2 MB.
Depending on the amount of memory available, you could increase the value of
LARGE_POOL_SIZE to 4 MB and attempt to start your database. If Oracle continues
to display an ORA-4031 message, gradually increase the value for LARGE_POOL_
SIZE until startup is successful.
Adding Memory for Message Buffers You must increase the value for the LARGE_POOL_
SIZE or the SHARED_POOL_SIZE parameters to accommodate message buffers.
The message buffers allow query server processes to communicate with each other.
If you enable automatic parallel tuning, Oracle allocates space for the message
buffer from the large pool. Otherwise, Oracle allocates space from the shared pool.
Oracle uses a fixed number of buffers for each virtual connection between producer
query servers and consumer query servers. Connections increase as the square of
the DOP increases. For this reason, the maximum amount of memory used by
parallel execution is bound by the highest DOP allowed on your system. You can
control this value by using either the PARALLEL_MAX_SERVERS parameter or
policies and profiles.
To calculate the amount of memory required, use one of the following formulas:
■ For SMP systems:
mem in bytes = (3 x size x users x groups x connections)
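As a rough worked example, assuming size is the parallel execution message size of
4,096 bytes, users is 10 concurrent parallel execution users, groups is 2 query server
process groups for each query, and connections is 120, the formula gives
3 x 4096 x 10 x 2 x 120 = 29,491,200 bytes, or roughly 28 MB.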
Calculating Additional Memory for Cursors Parallel execution plans consume more space
in the SQL area than serial execution plans. You should regularly monitor shared
pool resource use to ensure that the memory used by both messages and cursors
can accommodate your system's processing requirements.
Evaluate the memory used as shown in your output, and alter the setting for
LARGE_POOL_SIZE based on your processing needs.
To obtain more memory usage statistics, execute the following query:
SELECT * FROM V$PX_PROCESS_SYSSTAT WHERE STATISTIC LIKE 'Buffers%';
The amount of memory used appears in the Buffers Current and Buffers HWM
statistics. Calculate a value in bytes by multiplying the number of buffers by the
value for PARALLEL_EXECUTION_MESSAGE_SIZE. Compare the high water mark
to the parallel execution message pool size to determine if you allocated too much
memory. For example, in the first output, the value for large pool as shown in px
msg pool is 38,092,812 or 38 MB. The Buffers HWM from the second output is
3,620, which when multiplied by a parallel execution message size of 4,096 is
14,827,520, or approximately 15 MB. In this case, the high water mark has reached
approximately 40 percent of its capacity.
SHARED_POOL_SIZE
As mentioned earlier, if PARALLEL_AUTOMATIC_TUNING is false, Oracle
allocates parallel execution message buffers from the shared pool. In this case, tune the shared
pool as described under the previous heading for large pool, with the following
exceptions:
■ Allow for other clients of the shared pool, such as shared cursors and stored
procedures
■ Remember that larger values improve performance in multiuser systems, but
smaller values use less memory
You must also take into account that using parallel execution generates more
cursors. Look at statistics in the V$SQLAREA view to determine how often Oracle
recompiles cursors. If the cursor hit ratio is poor, increase the size of the pool. This
happens only when you have a large number of distinct queries.
You can then monitor the number of buffers used by parallel execution in the same
way as explained previously, and compare the shared pool PX msg pool to the
current high water mark reported in output from the view V$PX_PROCESS_
SYSSTAT.
PARALLEL_MIN_PERCENT
The recommended value for the PARALLEL_MIN_PERCENT parameter is 0 (zero).
This parameter allows users to wait for an acceptable DOP, depending on the
application in use. Setting this parameter to values other than 0 (zero) causes Oracle
to return an error when the requested DOP cannot be satisfied by the system at a
given time.
For example, if you set PARALLEL_MIN_PERCENT to 50, which translates to 50
percent, and the DOP is reduced by 50 percent or greater because of the adaptive
algorithm or because of a resource limitation, then Oracle returns ORA-12827. For
example:
SELECT /*+ PARALLEL(e, 8, 1) */ d.department_id, SUM(e.salary)
FROM employees e, departments d WHERE e.department_id = d.department_id
GROUP BY d.department_id ORDER BY d.department_id;
CLUSTER_DATABASE_INSTANCES
The CLUSTER_DATABASE_INSTANCES parameter should be set to a value that is
equal to the number of instances in your Real Application Clusters environment.
PGA_AGGREGATE_TARGET
With Oracle9i, you can simplify and improve the way PGA memory is allocated, by
enabling automatic PGA memory management. In this mode, Oracle dynamically
adjusts the size of the portion of the PGA memory dedicated to work areas, based
on an overall PGA memory target explicitly set by the DBA. To enable automatic
PGA memory management, you have to set the initialization parameter PGA_
AGGREGATE_TARGET.
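For example, in the initialization parameter file (the value is illustrative and
depends on the memory available on your system):
PGA_AGGREGATE_TARGET = 1000M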
HASH_AREA_SIZE HASH_AREA_SIZE has been deprecated and you should use PGA_
AGGREGATE_TARGET instead.
SORT_AREA_SIZE SORT_AREA_SIZE has been deprecated and you should use PGA_
AGGREGATE_TARGET instead.
PARALLEL_EXECUTION_MESSAGE_SIZE
The recommended value for PARALLEL_EXECUTION_MESSAGE_SIZE is 4 KB. If
PARALLEL_AUTOMATIC_TUNING is true, the default is 4 KB. If PARALLEL_
AUTOMATIC_TUNING is false, the default is slightly greater than 2 KB.
The PARALLEL_EXECUTION_MESSAGE_SIZE parameter specifies the upper limit
for the size of parallel execution messages. The default value is operating system
specific and this value should be adequate for most applications. Larger values for
PARALLEL_EXECUTION_MESSAGE_SIZE require larger values for LARGE_POOL_
SIZE or SHARED_POOL_SIZE, depending on whether you have enabled parallel
automatic tuning.
While you might experience significantly improved response time by increasing the
value for PARALLEL_EXECUTION_MESSAGE_SIZE, memory use also drastically
increases. For example, if you double the value for PARALLEL_EXECUTION_
MESSAGE_SIZE, parallel execution requires a message source pool that is twice as
large.
Therefore, if you set PARALLEL_AUTOMATIC_TUNING to false, you must adjust
the SHARED_POOL_SIZE to accommodate parallel execution messages. If you have
set PARALLEL_AUTOMATIC_TUNING to true, but have set LARGE_POOL_SIZE
manually, then you must adjust the LARGE_POOL_SIZE to accommodate parallel
execution messages.
Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL
The parameters that affect parallel DML and parallel DDL resource consumption
are:
■ TRANSACTIONS
■ ROLLBACK_SEGMENTS
■ FAST_START_PARALLEL_ROLLBACK
■ LOG_BUFFER
■ DML_LOCKS
■ ENQUEUE_RESOURCES
Parallel inserts, updates, and deletes require more resources than serial DML
operations. Similarly, PARALLEL CREATE TABLE ... AS SELECT and PARALLEL
CREATE INDEX can require more resources. For this reason, you may need to
increase the value of several additional initialization parameters. These parameters
do not affect resources for queries.
TRANSACTIONS For parallel DML and DDL, each query server process starts a
transaction. The parallel coordinator uses the two-phase commit protocol to commit
transactions; therefore, the number of transactions being processed increases by the
DOP. As a result, you might need to increase the value of the TRANSACTIONS
initialization parameter.
The TRANSACTIONS parameter specifies the maximum number of concurrent
transactions. The default assumes no parallelism. For example, if you have a DOP
of 20, you will have 20 more new server transactions (or 40, if you have two server
sets) and 1 coordinator transaction. In this case, you should increase
TRANSACTIONS by 21 (or 41) if the transactions are running in the same instance. If
you do not set this parameter, Oracle sets it to a value equal to 1.1 x SESSIONS.
By default, the DOP is chosen to be at most two times the value of the CPU_COUNT
parameter. If the default DOP is insufficient, set the FAST_START_PARALLEL_
ROLLBACK parameter to HIGH. This gives a maximum DOP of at most four times
the value of the CPU_COUNT parameter. This feature is available by default.
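For example, in the initialization parameter file (HIGH is one of the documented
values for this parameter):
FAST_START_PARALLEL_ROLLBACK = HIGH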
DML_LOCKS This parameter specifies the maximum number of DML locks. Its value
should equal the total number of locks on all tables referenced by all users. A
parallel DML operation's lock and enqueue resource requirement is very different
from serial DML. Parallel DML holds many more locks, so you should increase the
value of the ENQUEUE_RESOURCES and DML_LOCKS parameters by equal amounts.
Table 21–4 shows the types of locks acquired by coordinator and parallel execution
server processes for different types of parallel DML statements. Using this
information, you can determine the value required for these parameters.
Consider a table with 600 partitions running with a DOP of 100. Assume all
partitions are involved in a parallel UPDATE or DELETE statement with no
row-migrations.
The coordinator acquires:
■ 1 table lock SX
■ 600 partition locks X
In total, the server processes acquire:
DB_CACHE_SIZE
When you perform parallel updates, merges, and deletes, the buffer cache behavior
is very similar to any OLTP system running a high volume of updates.
DB_BLOCK_SIZE
The recommended value for this parameter is 8 KB or 16 KB.
Set the database block size when you create the database. If you are creating a new
database, use a large block size such as 8 KB or 16 KB.
DB_FILE_MULTIBLOCK_READ_COUNT
The recommended value for this parameter is eight for 8 KB block size, or four for
16 KB block size. The default is 8.
This parameter determines how many database blocks are read with a single
operating system READ call. The upper limit for this parameter is
platform-dependent. If you set DB_FILE_MULTIBLOCK_READ_COUNT to an
excessively high value, your operating system will lower the value to the highest
allowable level when you start your database. In this case, each platform uses the
highest value possible. Maximum values generally range from 64 KB to 1 MB.
Figure (text summary): synchronous read versus asynchronous read. With a
synchronous read, the I/O that reads block #2 cannot begin until the CPU finishes
processing block #1; with an asynchronous read, the I/O for the next block overlaps
the CPU processing of the current block.
Asynchronous operations are currently supported for parallel table scans, hash
joins, sorts, and serial table scans. However, this feature can require operating
system specific configuration and may not be supported on all platforms. Check
your Oracle operating system-specific documentation.
Is There Regression?
Does parallel execution's actual performance deviate from what you expected? If
performance is as you expected, could there be an underlying performance
problem? Perhaps you have a desired outcome in mind to which you are comparing
the current outcome. Perhaps you have justifiable performance expectations that the
system does not achieve. You might have achieved this level of performance or a
particular execution plan in the past, but now, with a similar environment and
operation, the system is not meeting this goal.
If performance is not as you expected, can you quantify the deviation? For data
warehousing operations, the execution plan is key. For critical data warehousing
operations, save the EXPLAIN PLAN results. Then, as you analyze and reanalyze the
data, upgrade Oracle, and load new data, over time you can compare new
execution plans with old plans. Take this approach either proactively or reactively.
Alternatively, you might find that plan performance improves if you use hints. You
might want to understand why hints are necessary and determine how to get the
optimizer to generate the desired plan without hints. Try increasing the statistical
sample size: better statistics can give you a better plan.
■ Compute statistics. If you do not analyze often and you can spare the time, it is
a good practice to compute statistics. This is particularly important if you are
performing many joins, and it will result in better plans. Alternatively, you can
estimate statistics.
Note: Using different sample sizes can cause the plan to change.
Generally, the higher the sample size, the better the plan.
that are more or less I/O intensive, but in general each CPU should have roughly
the same amount of activity.
The statistics in V$PQ_TQSTAT show rows produced and consumed for each
parallel execution server. This is a good indication of skew and does not require
single user operation.
Operating system statistics show you the per-processor CPU utilization and
per-disk I/O activity. Concurrently running tasks make it harder to see what is
going on, however. It may be useful to run in single-user mode and check operating
system monitors that show system level CPU and I/O activity.
If I/O problems occur, you might need to reorganize your data by spreading it over
more devices. If parallel execution problems occur, check to be sure you have
followed the recommendation to spread data over at least as many devices as CPUs.
If there is no skew in workload distribution, check for the following conditions:
■ Is there device contention?
■ Is there controller contention?
■ Is the system I/O-bound with too little parallelism? If so, consider increasing
parallelism up to the number of devices.
■ Is the system CPU-bound with too much parallelism? Check the operating
system CPU monitor to see whether a lot of time is being spent in system calls.
The resource might be overcommitted, and too much parallelism might cause
processes to compete with themselves.
■ Are there more concurrent users than the system can support?
V$PX_SESSION
The V$PX_SESSION view shows data about query server sessions, groups, sets, and
server numbers. It also displays real-time data about the processes working on
behalf of parallel execution. This table includes information about the requested
DOP and the actual DOP granted to the operation.
V$PX_SESSTAT
The V$PX_SESSTAT view provides a join of the session information from V$PX_
SESSION and the V$SESSTAT table. Thus, all session statistics available to a normal
session are available for all sessions performed using parallel execution.
V$PX_PROCESS
The V$PX_PROCESS view contains information about the parallel processes,
including status, session ID, process ID, and other information.
V$PX_PROCESS_SYSSTAT
The V$PX_PROCESS_SYSSTAT view shows the status of query servers and
provides buffer allocation statistics.
V$PQ_SESSTAT
The V$PQ_SESSTAT view shows the status of all current server groups in the
system such as data about how queries allocate processes and how the multiuser
and load balancing algorithms are affecting the default and hinted values. V$PQ_
SESSTAT will be obsolete in a future release.
You might need to adjust some parameter settings to improve performance after
reviewing data from these views. In this case, refer to the discussion of "Tuning
General Parameters for Parallel Execution" on page 21-49. Query these views
periodically to monitor the progress of long-running parallel operations.
Note: For many dynamic performance views, you must set the
parameter TIMED_STATISTICS to true in order for Oracle to
collect statistics for each view. You can use the ALTER SYSTEM or
ALTER SESSION statements to turn TIMED_STATISTICS on and
off.
V$FILESTAT
The V$FILESTAT view sums read and write requests, the number of blocks, and
service times for every datafile in every tablespace. Use V$FILESTAT to diagnose
I/O and workload distribution problems.
You can join statistics from V$FILESTAT with statistics in the DBA_DATA_FILES
view to group I/O by tablespace or to find the filename for a given file number.
Using a ratio analysis, you can determine the percentage of the total tablespace
activity used by each file in the tablespace. If you make a practice of putting just one
large, heavily accessed object in a tablespace, you can use this technique to identify
objects that have a poor physical layout.
You can further diagnose disk space allocation problems using the DBA_EXTENTS
view. Ensure that space is allocated evenly from all files in the tablespace.
Monitoring V$FILESTAT during a long-running operation and then correlating I/O
activity to the EXPLAIN PLAN output is a good way to follow progress.
V$PARAMETER
The V$PARAMETER view lists the name, current value, and default value of all
system parameters. In addition, the view shows whether a parameter is a session
parameter that you can modify online with an ALTER SYSTEM or ALTER SESSION
statement.
V$PQ_TQSTAT
As a simple example, consider a hash join between two tables, with a join on a
column with only two distinct values. At best, the hash function will direct rows
with one value to parallel execution server A and rows with the other value to
parallel execution server B. A DOP of two is fine, but if the DOP is four, then at
least two parallel execution servers have no
work. To discover this type of skew, use a query similar to the following example:
SELECT dfo_number, tq_id, server_type, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type, process;
The best way to resolve this problem might be to choose a different join method; a
nested loop join might be the best option. Alternatively, if one of the join tables is
small relative to the other, a BROADCAST distribution method can be hinted using
the PQ_DISTRIBUTE hint. Note that the optimizer considers the BROADCAST
distribution method only if OPTIMIZER_FEATURES_ENABLE is set to 9.0.2 or
higher.
Now, assume that you have a join key with high cardinality, but one of the values
contains most of the data, for example, lava lamp sales by year. The only year that
had big sales was 1968, and thus, the parallel execution server for the 1968 records
will be overwhelmed. You should use the same corrective actions as described
previously.
The V$PQ_TQSTAT view provides a detailed report of message traffic at the table
queue level. V$PQ_TQSTAT data is valid only when queried from a session that is
executing parallel SQL statements. A table queue is the pipeline between query
server groups, between the parallel coordinator and a query server group, or
between a query server group and the coordinator. Table queues are represented in
EXPLAIN PLAN output by the row labels of PARALLEL_TO_PARALLEL, SERIAL_
TO_PARALLEL, or PARALLEL_TO_SERIAL, respectively.
V$PQ_TQSTAT has a row for each query server process that reads from or writes to
each table queue. A table queue connecting 10 consumer processes to 10
producer processes has 20 rows in the view. Sum the bytes column and group by
TQ_ID, the table queue identifier, to obtain the total number of bytes sent through
each table queue. Compare this with the optimizer estimates; large variations might
indicate a need to analyze the data using a larger sample.
Compute the variance of bytes grouped by TQ_ID. Large variances indicate
workload imbalances. You should investigate large variances to determine whether
the producers start out with unequal distributions of data, or whether the
distribution itself is skewed. If the data itself is skewed, this might indicate a low
cardinality, or low number of distinct values.
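A query along these lines (a sketch) computes the average and variance of bytes for
the producer rows of each table queue:
SELECT TQ_ID, AVG(BYTES), VARIANCE(BYTES)
FROM V$PQ_TQSTAT
WHERE SERVER_TYPE = 'Producer'
GROUP BY TQ_ID;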
The processes shown in the output from the previous example using
GV$PX_SESSION collaborate to complete the same task. The next example shows
the execution of a join query to determine the progress of these processes in terms
of physical reads. Use this query to track any specific statistic:
SELECT QCSID, SID, INST_ID "Inst",
SERVER_GROUP "Group", SERVER_SET "Set",
NAME "Stat Name", VALUE
FROM GV$PX_SESSTAT A, V$STATNAME B
WHERE A.STATISTIC# = B.STATISTIC#
AND NAME LIKE 'physical reads'
AND VALUE > 0
ORDER BY QCSID, QCINST_ID, SERVER_GROUP, SERVER_SET;
Use the previous type of query to track statistics in V$STATNAME. Repeat this query
as often as required to observe the progress of the query server processes.
The next query uses V$PX_PROCESS to check the status of the query servers.
SELECT * FROM V$PX_PROCESS;
14 rows selected.
Other configurations (for example, multiple partitions in one file striped over
multiple devices) will yield correct query results, but you may need to use hints or
explicitly set object attributes to select the correct DOP.
then that table is never scanned in parallel. This override occurs regardless of the
default DOP indicated by the number of CPUs, instances, and devices storing that
table.
You can adjust the DOP by using the following guidelines:
■ Modify the default DOP by changing the value for the PARALLEL_THREADS_
PER_CPU parameter.
■ Adjust the DOP either by using ALTER TABLE, ALTER SESSION, or by using
hints.
■ To increase the number of concurrent parallel operations, reduce the DOP, or set
the parameter PARALLEL_ADAPTIVE_MULTI_USER to true.
You can increase the optimizer's ability to generate parallel plans by converting
subqueries, especially correlated subqueries, into joins. Oracle can parallelize joins
more efficiently than subqueries. This also applies to updates.
TABLE ... AS SELECT or direct-path INSERT to store the result set in the database.
At a later time, users can view the result set serially.
When combined with the NOLOGGING option, the parallel version of CREATE
TABLE ... AS SELECT provides a very efficient intermediate table facility, for
example:
CREATE TABLE summary PARALLEL NOLOGGING
AS SELECT dim_1, dim_2 ..., SUM (meas_1)
FROM facts
GROUP BY dim_1, dim_2;
These tables can also be incrementally loaded with parallel INSERT. You can take
advantage of intermediate tables using the following techniques:
■ Common subqueries can be computed once and referenced many times. This
can allow some queries against star schemas (in particular, queries without
selective WHERE-clause predicates) to be better parallelized. Note that star
queries with selective WHERE-clause predicates using the star-transformation
technique can be effectively parallelized automatically without any
modification to the SQL.
■ Decompose complex queries into simpler steps in order to provide
application-level checkpoint or restart. For example, a complex multitable join
on a database 1 terabyte in size could run for dozens of hours. A failure during
this query would mean starting over from the beginning. Using CREATE TABLE
... AS SELECT or PARALLEL INSERT AS SELECT, you can rewrite the query as a
sequence of simpler queries that run for a few hours each. If a system failure
occurs, the query can be restarted from the last completed step.
■ Implement manual parallel deletes efficiently by creating a new table that omits
the unwanted rows from the original table, and then dropping the original
table. Alternatively, you can use the convenient parallel delete feature, which
directly deletes rows from the original table.
■ Create summary tables for efficient multidimensional drill-down analysis. For
example, a summary table might store the sum of revenue grouped by month,
brand, region, and salesman.
■ Reorganize tables, eliminating chained rows, compressing free space, and so on,
by copying the old table to a new table. This is much faster than export/import
and easier than reloading.
temporary extent still requires the overhead of acquiring a latch and searching
through the SGA structures, as well as SGA space consumption for the sort extent
pool.
There are several ways to optimize the parallel execution of join statements. You can
alter system configuration, adjust parameters as discussed earlier in this chapter, or
use hints, such as the PQ_DISTRIBUTE hint.
The key points when using EXPLAIN PLAN are to:
■ Verify optimizer selectivity estimates. If the optimizer thinks that only one row
will be produced from a query, it tends to favor using a nested loop. This could
be an indication that the tables are not analyzed or that the optimizer has made
an incorrect estimate about the correlation of multiple predicates on the same
table. A hint may be required to force the optimizer to use another join method.
Consequently, if the plan says only one row is produced from any particular
stage and this is incorrect, consider hints or gather statistics.
■ Watch for hash joins on low-cardinality join keys. If a join key has few distinct values,
then a hash join may not be optimal. If the number of distinct values is less than
the DOP, then some parallel query servers may be unable to work on the
particular query.
■ Consider data skew. If a join key involves excessive data skew, a hash join may
require some parallel query servers to work more than others. Consider using a
hint to cause a BROADCAST distribution method if the optimizer did not choose
it. Note that the optimizer will consider the BROADCAST distribution method
only if the OPTIMIZER_FEATURES_ENABLE initialization parameter is set to 9.0.2 or higher. See
"V$PQ_TQSTAT" on page 21-70 for further details.
segment header by decreasing the number of process free lists; this leaves more
room for transaction free lists in the segment header.
For UPDATE and DELETE operations, each server process can require its own
transaction free list. The parallel DML DOP is thus effectively limited by the
smallest number of transaction free lists available on the table and on any of the
global indexes the DML statement must maintain. For example, if the table has 25
transaction free lists and the table has two global indexes, one with 50 transaction
free lists and one with 30 transaction free lists, the DOP is limited to 25. If the table
had had 40 transaction free lists, the DOP would have been limited to 30.
The FREELISTS parameter of the STORAGE clause is used to set the number of
process free lists. By default, no process free lists are created.
The default number of transaction free lists depends on the block size. For example,
if the number of process free lists is not set explicitly, a 4 KB block has about 80
transaction free lists by default. The minimum number of transaction free lists is 25.
In this case, you should consider increasing the number of DBWn processes. If there are no
waits for free buffers, the query will not return any rows.
[NO]LOGGING Clause
The [NO]LOGGING clause applies to tables, partitions, tablespaces, and indexes.
Virtually no log is generated for certain operations (such as direct-path INSERT) if
the NOLOGGING clause is used. The NOLOGGING attribute is not specified at the
INSERT statement level but is instead specified when using the ALTER or CREATE
statement for a table, partition, index, or tablespace.
When a table or index has NOLOGGING set, neither parallel nor serial direct-path
INSERT operations generate undo or redo logs. Processes running with the
NOLOGGING option set run faster because no redo is generated. However, after a
NOLOGGING operation against a table, partition, or index, if a media failure occurs
before a backup is taken, then all tables, partitions, and indexes that have been
modified might be corrupted.
process in a second set of query processes based on key. Each process in the second
set sorts the keys and builds an index in the usual fashion. After all index pieces are
built, the parallel coordinator simply concatenates the pieces (which are ordered) to
form the final index.
Parallel local index creation uses a single server set. Each server process in the set is
assigned a table partition to scan and for which to build an index partition. Because
half as many server processes are used for a given DOP, parallel local index creation
can be run with a higher DOP.
You can optionally specify that no redo and undo logging should occur during
index creation. This can significantly improve performance but temporarily renders
the index unrecoverable. Recoverability is restored after the new index is backed
up. If your application can tolerate a window where recovery of the index requires
it to be re-created, then you should consider using the NOLOGGING clause.
The PARALLEL clause in the CREATE INDEX statement is the only way in which you
can specify the DOP for creating the index. If the DOP is not specified in the parallel
clause of CREATE INDEX, then the number of CPUs is used as the DOP. If there is no
PARALLEL clause, index creation is done serially.
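For example, the following sketch creates an index in parallel with a DOP of eight
and without redo logging (the index, table, and column names are illustrative):
CREATE INDEX sales_cust_ix ON sales (cust_id)
PARALLEL 8 NOLOGGING;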
When you add or enable a UNIQUE or PRIMARY KEY constraint on a table, you
cannot automatically create the required index in parallel. Instead, manually create
an index on the desired columns, using the CREATE INDEX statement and an
appropriate PARALLEL clause, and then add or enable the constraint. Oracle then
uses the existing index when enabling or adding the constraint.
Multiple constraints on the same table can be enabled concurrently and in parallel if
all the constraints are already in the ENABLE NOVALIDATE state. In the following
example, the ALTER TABLE ... ENABLE CONSTRAINT statement performs the table
scan that checks the constraint in parallel:
CREATE TABLE a (a1 NUMBER CONSTRAINT ach CHECK (a1 > 0) ENABLE NOVALIDATE)
PARALLEL;
ALTER TABLE a ENABLE CONSTRAINT ach;
If parallel DML is enabled and there is a PARALLEL hint or PARALLEL attribute set
for the table in the data dictionary, then inserts are parallel and appended, unless a
restriction applies. If either the PARALLEL hint or PARALLEL attribute is missing,
the insert is performed serially.
Parallelizing INSERT ... SELECT In the INSERT ... SELECT statement you can specify a
PARALLEL hint after the INSERT keyword, in addition to the hint after the SELECT
keyword. The PARALLEL hint after the INSERT keyword applies to the INSERT
operation only, and the PARALLEL hint after the SELECT keyword applies to the
SELECT operation only. Thus, parallelism of the INSERT and SELECT operations
are independent of each other. If one operation cannot be performed in parallel, it
has no effect on whether the other operation can be performed in parallel.
The ability to parallelize inserts causes a change in existing behavior if the user has
explicitly enabled the session for parallel DML and if the table in question has a
PARALLEL attribute set in the data dictionary entry. In that case, existing INSERT ...
SELECT statements that have the select operation parallelized can also have their
insert operation parallelized.
If you query multiple tables, you can specify multiple SELECT PARALLEL hints and
multiple PARALLEL attributes.
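An INSERT ... SELECT of this form might look like the following sketch (the table
names are hypothetical, and the session is assumed to be enabled for parallel DML):
INSERT /*+ PARALLEL(sales_history, 4) */ INTO sales_history
SELECT /*+ PARALLEL(sales, 4) */ * FROM sales;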
The APPEND keyword is not required in this example because it is implied by the
PARALLEL hint.
Parallelizing UPDATE and DELETE The PARALLEL hint (placed immediately after the
UPDATE or DELETE keyword) applies not only to the underlying scan operation,
but also to the UPDATE or DELETE operation. Alternatively, you can specify UPDATE
or DELETE parallelism in the PARALLEL clause specified in the definition of the
table to be modified.
If you have explicitly enabled parallel DML for the session or transaction, UPDATE
or DELETE statements that have their query operation parallelized can also have
their UPDATE or DELETE operation parallelized. Any subqueries or updatable views
in the statement can have their own separate PARALLEL hints or clauses, but these
parallel directives do not affect the decision to parallelize the update or delete. If
these operations cannot be performed in parallel, it has no effect on whether the
UPDATE or DELETE portion can be performed in parallel.
Tables must be partitioned in order to support parallel UPDATE and DELETE.
The PARALLEL hint is applied to the UPDATE operation as well as to the scan; that
is, parallelism applies both to the scan and to the UPDATE operation on the table
being modified, such as the employees table in the sketch that follows.
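The statement below is a minimal sketch (the column, adjustment, and predicate are
illustrative; parallel DML is assumed to be enabled and employees to be partitioned):
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(employees, 4) */ employees
SET salary = salary * 1.05
WHERE department_id = 100;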
You can then update the customers table with the following SQL statement:
UPDATE /*+ PARALLEL(cust_joinview) */
(SELECT /*+ PARALLEL(customers) PARALLEL(diff_customer) */
customers.c_name AS c_name, customers.c_addr AS c_addr,
diff_customer.c_name AS c_newname, diff_customer.c_addr AS c_newaddr
FROM customers, diff_customer
WHERE customers.c_key = diff_customer.c_key) cust_joinview
SET c_name = c_newname, c_addr = c_newaddr;
The base scans feeding the join view cust_joinview are done in parallel. You can
then parallelize the update to further improve performance, but only if the
customer table is partitioned.
See Also:
■ "Rewriting SQL Statements" on page 21-78
■ Oracle9i Application Developer’s Guide - Fundamentals for
information about key-preserved tables
However, you can guarantee that the subquery is transformed into an anti-hash join
by using the HASH_AJ hint. Doing so enables you to use parallel INSERT to execute
the preceding statement efficiently. Parallel INSERT is applicable even if the table is
not partitioned.
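Such a statement could resemble the following sketch, which uses the customers and
diff_customer tables of this section (parallel DML is assumed to be enabled):
INSERT /*+ PARALLEL(customers, 16) */ INTO customers
SELECT /*+ PARALLEL(diff_customer, 16) */ *
FROM diff_customer
WHERE diff_customer.c_key NOT IN (SELECT /*+ HASH_AJ */ c_key FROM customers);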
Merging in Parallel
In Oracle9i, you combine the previous updates and inserts into one statement,
commonly known as a merge. The following statement achieves the same result as
all of the statements in "Updating the Table in Parallel" on page 21-91 and "Inserting
the New Rows into the Table in Parallel" on page 21-92:
MERGE INTO customers USING diff_customer
ON (diff_customer.c_key = customers.c_key)
WHEN MATCHED THEN
UPDATE SET c_name = diff_customer.c_name, c_addr = diff_customer.c_addr
WHEN NOT MATCHED THEN
INSERT VALUES (diff_customer.c_key, diff_customer.c_data);
Use discretion in employing hints. If used, hints should come as a final step in
tuning and only when they demonstrate a necessary and significant performance
advantage. In such cases, begin with the execution plan recommended by
cost-based optimization, and go on to test the effect of hints only after you have
quantified your performance expectations. Remember that hints are powerful. If
you use them and the underlying data changes, you might need to change the hints.
Otherwise, the effectiveness of your execution plans might deteriorate.
Always use cost-based optimization unless you have an existing application that
has been hand-tuned for rule-based optimization. If you must use rule-based
optimization, rewriting a SQL statement can greatly improve application
performance.
FIRST_ROWS(n) Hint
Starting with Oracle9i, a hint called FIRST_ROWS(n), where n is a positive integer,
was added. This hint enables the optimizer to use a new optimization mode to
optimize the query to return n rows in the shortest amount of time. Oracle
Corporation recommends that you use this new hint in place of the old FIRST_
ROWS hint for online queries because the new optimization mode may improve the
response time compared to the old optimization mode.
Use the FIRST_ROWS(n) hint in cases where you want the first n number of rows
in the shortest possible time. For example, to obtain the first 10 rows in the shortest
possible time, use the hint as follows:
SELECT /*+ FIRST_ROWS(10) */ article_id
FROM articles_tab
WHERE CONTAINS(article, 'Oracle')>0
ORDER BY pub_date DESC;
however, have a small cost, so you should use it when that cost is likely to be a
small fraction of the total execution time.
If you enable dynamic statistic sampling, Oracle determines at compile time
whether a query would benefit from dynamic sampling. If so, a recursive SQL
statement is issued to scan a small, random sample of the table’s blocks, and to
apply the relevant single table predicates to estimate predicate selectivities. More
accurate selectivity and statistics estimates allow the optimizer to produce better
performing plans.
Dynamic sampling is controlled with the initialization parameter OPTIMIZER_
DYNAMIC_SAMPLING, which can be set to a value between 0 and 10, inclusive.
Increasing the value of the parameter will result in more aggressive application of
dynamic sampling, in terms of both the type (unanalyzed/analyzed) of tables
sampled and the amount of I/O spent on sampling.
The sample cardinality can also be used, in some cases, to estimate table cardinality.
Depending on the value of the OPTIMIZER_DYNAMIC_SAMPLING initialization
parameter, a certain number of blocks is read by the dynamic sampling query.
Oracle also provides the table-specific hint DYNAMIC_SAMPLING. If the table name
is omitted, the hint is considered cursor-level. If a cursor-level hint is specified
anywhere in the query (for example, in a subquery), it will apply to the entire query,
so care should be taken when specifying a cursor-level hint in a view or subquery.
The table-level hint forces dynamic sampling for the table.
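For example, a table-level hint could look like the following sketch (the table, alias,
and sampling level are illustrative):
SELECT /*+ DYNAMIC_SAMPLING(e 2) */ COUNT(*)
FROM employees e
WHERE department_id = 100;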
Cost-Based Rewrite
Query rewrite is available with cost-based optimization. Oracle optimizes the input
query with and without rewrite and selects the least costly alternative. The
optimizer rewrites a query by rewriting one or more query blocks, one at a time.
If the rewrite logic has a choice between multiple materialized views to rewrite a
query block, it selects the one that results in reading the least amount of data.
After a materialized view has been picked for a rewrite, the optimizer performs the
rewrite, and then tests whether the rewritten query can be rewritten further with
another materialized view. This process continues until no further rewrites are
possible. Then the rewritten query is optimized and the original query is optimized.
The optimizer compares these two optimizations and selects the least costly
alternative.
Since optimization is based on cost, it is important to collect statistics both on tables
involved in the query and on the tables representing materialized views. Statistics
are fundamental measures, such as the number of rows in a table, that are used to
calculate the cost of a rewritten query. They are created by using the DBMS_STATS
package.
Queries that contain in-line or named views are also candidates for query rewrite.
When a query contains a named view, the view name is used to do the matching
between a materialized view and the query. When a query contains an inline view,
the inline view can be merged into the query before matching between a
materialized view and the query occurs.
In addition, if the inline view's text definition exactly matches with that of an inline
view present in any eligible materialized view, general rewrite may be possible.
This is because, whenever a materialized view contains exactly identical inline view
text to the one present in a query, query rewrite treats such an inline view like a
named view or a table.
Figure 22–1 presents a graphical view of the cost-based approach used during the
rewrite process.
Figure 22–1 (text summary): the user's SQL enters Oracle9i, which generates one
execution plan for the rewritten query and one for the original query, chooses
between the two based on cost, and executes the chosen plan.
■ Either all or part of the results requested by the query must be obtainable from
the precomputed result stored in the materialized view.
To determine this, the optimizer may depend on some of the data relationships
declared by the user using constraints and dimensions. Such data relationships
include hierarchies, referential integrity, and uniqueness of key data, and so on.
AND s.prod_id=p.prod_id
GROUP BY p.prod_id, t.week_ending_day, s.cust_id;
You must collect statistics on the materialized views so that the optimizer can
determine whether to rewrite the queries. You can do this either on a per-object basis
or for all newly created objects without statistics.
On a per-object basis, as shown here for join_sales_time_product_mv:
EXECUTE DBMS_STATS.GATHER_TABLE_STATS ('SH','JOIN_SALES_TIME_PRODUCT_MV',
estimate_percent=>20,block_sample=>TRUE,cascade=>TRUE);
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for
further information about using the DBMS_STATS package to
maintain statistics
parameter cannot enable query rewrite for materialized views that have disabled it
with the CREATE or ALTER statement.
The NOREWRITE hint disables query rewrite in a SQL statement, overriding the
QUERY_REWRITE_ENABLED parameter, and the REWRITE hint (when used with
mv_name) restricts the eligible materialized views to those named in the hint.
With OPTIMIZER_MODE set to choose, a query will not be rewritten unless at least
one table referenced by it has been analyzed. This is because the rule-based
optimizer is used when OPTIMIZER_MODE is set to choose and none of the tables
referenced in a query have been analyzed.
You can set the level of query rewrite for a session, thus allowing different users to
work at different integrity levels. The possible statements are:
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = stale_tolerated;
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = trusted;
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = enforced;
Rewrite Hints
Hints can be included in SQL statements to control whether query rewrite occurs.
Using the NOREWRITE hint in a query prevents the optimizer from rewriting it.
The REWRITE hint with no argument in a query forces the optimizer to use a
materialized view (if any) to rewrite it regardless of the cost.
The REWRITE(mv1,mv2,...) hint with arguments forces rewrite to select the
most suitable materialized view from the list of names specified.
To prevent a rewrite, you can use the following statement:
SELECT /*+ NOREWRITE */ p.prod_subcategory, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id=p.prod_id
GROUP BY p.prod_subcategory;
Note that the scope of a rewrite hint is a query block. If a SQL statement consists of
several query blocks (SELECT clauses), you might need to specify a rewrite hint on
each query block to control the rewrite for the entire statement.
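For example, to restrict rewrite of a query block to one particular materialized view,
you can name it in the hint (the view name here is only illustrative):
SELECT /*+ REWRITE(sum_sales_pscat_week_mv) */
       p.prod_subcategory, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id=p.prod_id
GROUP BY p.prod_subcategory;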
The system privilege GRANT QUERY REWRITE lets you enable materialized views in
your own schema for query rewrite only if all tables directly referenced by the
materialized view are in that schema. The GRANT GLOBAL QUERY REWRITE
privilege allows you to enable materialized views for query rewrite even if the
materialized view references objects in other schemas.
The privileges for using materialized views for query rewrite are similar to those for
definer-rights procedures.
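For example, a DBA might grant these privileges as follows (the grantee sh is
illustrative):
GRANT QUERY REWRITE TO sh;
GRANT GLOBAL QUERY REWRITE TO sh;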
several situations where the output with rewrite can be different from that without
it.
■ A materialized view can be out of synchronization with the master copy of the
data. This generally happens because the materialized view refresh procedure is
pending following bulk load or DML operations to one or more detail tables of
a materialized view. At some data warehouse sites, this situation is desirable
because it is not uncommon for some materialized views to be refreshed at
certain time intervals.
■ The relationships implied by the dimension objects are invalid. For example,
values at a certain level in a hierarchy do not roll up to exactly one parent value.
■ The values stored in a prebuilt materialized view table might be incorrect.
■ Partition operations such as DROP and MOVE PARTITION on the detail table
could affect the results of the materialized view.
■ A wrong answer can occur because of bad data relationships defined by
unenforced table or view constraints.
When full text match fails, the optimizer then attempts a partial text match. In this
method, the text starting from the FROM clause of a query is compared against the
text starting with the FROM clause of a materialized view definition. Therefore, the
following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
AVG(s.amount_sold)
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id
AND s.prod_id=p.prod_id
AND s.cust_id=c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
Note that, under the partial text match rewrite method, the average of sales
aggregate required by the query is computed using the sum of sales and count of
sales aggregates stored in the materialized view.
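The materialized view for this example is not shown in this excerpt. A definition along
the following lines (a sketch; the exact definition may differ) would support the
partial text match, because its text from the FROM clause onward matches the query and
it stores the SUM and COUNT needed to derive AVG:
CREATE MATERIALIZED VIEW sum_sales_pscat_month_city_mv
ENABLE QUERY REWRITE AS
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold,
COUNT(s.amount_sold) AS count_amount_sold
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id
AND s.prod_id=p.prod_id
AND s.cust_id=c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;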
When neither text match succeeds, the optimizer uses a general query rewrite
method.
Text match rewrite can support set operators (UNION ALL, UNION, MINUS,
INTERSECT).
Table 22–1 Materialized View Types and General Query Rewrite Methods

Query Rewrite Checks       MV with        MV with Joins     MV with Aggregates
                           Joins Only     and Aggregates    on a Single Table
Selection Compatibility    X              X                 X
Join Compatibility         X              X                 -
Data Sufficiency           X              X                 X
Grouping Compatibility     -              X                 X
Aggregate Computability    -              X                 X
To perform these checks, the optimizer uses data relationships on which it can
depend. For example, primary key and foreign key relationships tell the optimizer
that each row in the foreign key table joins with at most one row in the primary key
table. Furthermore, if there is a NOT NULL constraint on the foreign key, it indicates
that each row in the foreign key table must join to exactly one row in the primary
key table.
Data relationships such as these are very important for query rewrite because they
tell what type of result is produced by joins, grouping, or aggregation of data.
Therefore, to maximize the rewritability of a large set of queries when such data
relationships exist in a database, they should be declared by the user.
View Constraints
Data warehouse applications recognize multi-dimensional cubes in the database by
identifying integrity constraints in the relational schema. Integrity constraints
represent primary and foreign key relationships between fact and dimension tables.
You can now establish a foreign key/primary key relationship (in RELY mode) between
the view and the fact table by adding the following constraints. Rewrite will then
take place as described in Table 22–3, for example in TRUSTED mode.
ALTER VIEW time_view ADD (CONSTRAINT time_view_pk
PRIMARY KEY (time_id) DISABLE NOVALIDATE);
ALTER VIEW time_view MODIFY CONSTRAINT time_view_pk RELY;
ALTER TABLE sales ADD (CONSTRAINT time_view_fk FOREIGN key (time_id)
REFERENCES time_view(time_id) DISABLE NOVALIDATE);
ALTER TABLE sales MODIFY CONSTRAINT time_view_fk RELY;
The following query, omitting the dimension table products, will also be rewritten
without the primary key/foreign key relationships, because the suppressed join
between sales and products is known to be lossless.
SELECT t.day_in_year,
SUM(s.amount_sold) AS sum_amount_sold
FROM time_view t, sales s
WHERE t.time_id = s.time_id
GROUP BY t.day_in_year;
losslessness of the delta materialized view join. With the additional constraints as
shown previously, this query will also rewrite.
SELECT p.prod_category,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p
WHERE p.prod_id = s.prod_id
GROUP BY p.prod_category;
To revert the changes you have made to the sales history schema, apply the
following SQL commands:
ALTER TABLE sales DROP CONSTRAINT time_view_fk;
DROP VIEW time_view;
Expression Matching
An expression that appears in a query can be replaced with a simple column in a
materialized view provided the materialized view column represents a
precomputed expression that matches with the expression in the query. If a query
can be rewritten to use a materialized view, it will be faster. This is because
materialized views contain precomputed calculations and do not need to perform
expression computation.
The expression matching is done by first converting the expressions into canonical
forms and then comparing them for equality. Therefore, two different expressions
will be matched as long as they are equivalent to each other. Further, if the entire
expression in a query fails to match an expression in a materialized view, its
subexpressions are tried in a top-down order to obtain maximal expression matching.
Consider a query that asks for sum of sales by age brackets (1-10, 11-20, 21-30, and
so on).
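The materialized view for this example is not reproduced in this excerpt. As a sketch
(the view name and the bracket expression are assumptions), a materialized view that
precomputes such brackets might be defined as:
CREATE MATERIALIZED VIEW sales_by_age_bracket_mv
ENABLE QUERY REWRITE AS
SELECT TRUNC((2000 - c.cust_year_of_birth) / 10) AS age_bracket,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
GROUP BY TRUNC((2000 - c.cust_year_of_birth) / 10);
A query that groups its SUM(s.amount_sold) by the equivalent expression
TRUNC((-c.cust_year_of_birth + 2000) / 10) matches this column once both expressions
are converted to canonical form.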
Date Folding
Date folding rewrite is a specific form of expression matching rewrite. In this type
of rewrite, a date range in a query is folded into an equivalent date range
representing higher date granules. The resulting expressions representing higher
date granules in the folded date range are matched with equivalent expressions in a
materialized view. The folding of date range into higher date granules such as
months, quarters, or years is done when the underlying datatype of the column is
an Oracle DATE. The expression matching is done based on the use of canonical
forms for the expressions.
DATE is a built-in datatype which represents ordered time units such as seconds,
days, and months, and incorporates a time hierarchy (second -> minute -> hour ->
day -> month -> quarter -> year). This hard-coded knowledge about DATE is used
in folding date ranges from lower-date granules to higher-date granules.
Specifically, folding a date value to the beginning of a month, quarter, year, or to the
end of a month, quarter, year is supported. For example, the date value
1-jan-1999 can be folded into the beginning of either year 1999 or quarter
1999-1 or month 1999-01. And, the date value 30-sep-1999 can be folded into
the end of either quarter 1999-03 or month 1999-09.
Note: Due to the way date folding works, you should be careful
when using BETWEEN and date columns. The best way to use
BETWEEN and date columns is to increment the later date by 1. In
other words, instead of using date_col BETWEEN
'1-jan-1999' AND '30-jun-1999', you should use date_
col BETWEEN '1-jan-1999' AND '1-jul-1999'. You could
also use the TRUNC function to get the equivalent result, as in
TRUNC(date_col) BETWEEN '1-jan-1999' AND
'30-jun-1999'. TRUNC will, however, strip time values.
Because date values are ordered, any range predicate specified on date columns can
be folded from lower level granules into higher level granules provided the date
range represents an integral number of higher level granules. For example, the
range predicate date_col >= '1-jan-1999' AND date_col <
'30-jun-1999' can be folded into either a month range or a quarter range using
the TO_CHAR function, which extracts specific date components from a date value.
The advantage of aggregating data by folded date values is the compression of data
achieved. Without date folding, the data is aggregated at the lowest granularity
level, resulting in increased disk space for storage and increased I/O to scan the
materialized view.
Consider a query that asks for the sum of sales by product types for the year 1998.
SELECT p.prod_category, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id=p.prod_id
AND s.time_id >= TO_DATE('01-jan-1998', 'dd-mon-yyyy')
AND s.time_id < TO_DATE('01-jan-1999', 'dd-mon-yyyy')
GROUP BY p.prod_category;
The range specified in the query represents an integral number of years, quarters, or
months. Assume that there is a materialized view mv3 that contains
pre-summarized sales by prod_type and is defined as follows:
CREATE MATERIALIZED VIEW mv3
ENABLE QUERY REWRITE
AS
SELECT prod_type, TO_CHAR(sale_date,'yyyy-mm') AS month, SUM(sales) AS sum_sales
FROM fact, product
WHERE fact.prod_id = product.prod_id
GROUP BY prod_type, TO_CHAR(sale_date, 'yyyy-mm');
The query can be rewritten by first folding the date range into the month range and
then matching the expressions representing the months with the month expression
in mv3. This rewrite is shown in two steps (first folding the date range followed by
the actual rewrite).
SELECT prod_type, SUM(sales) AS sum_sales
FROM fact, product
WHERE fact.prod_id = product.prod_id
AND TO_CHAR(sale_date, 'yyyy-mm') >= TO_CHAR(TO_DATE('01-jan-1998', 'dd-mon-yyyy'), 'yyyy-mm')
AND TO_CHAR(sale_date, 'yyyy-mm') < TO_CHAR(TO_DATE('01-jan-1999', 'dd-mon-yyyy'), 'yyyy-mm')
GROUP BY prod_type;
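The second step, the rewrite against mv3 itself, is not shown above; a sketch
consistent with the mv3 definition is:
SELECT prod_type, SUM(sum_sales) AS sum_sales
FROM mv3
WHERE month >= TO_CHAR(TO_DATE('01-jan-1998', 'dd-mon-yyyy'), 'yyyy-mm')
AND month < TO_CHAR(TO_DATE('01-jan-1999', 'dd-mon-yyyy'), 'yyyy-mm')
GROUP BY prod_type;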
Selection Compatibility
Oracle supports rewriting of queries so that they will use materialized views in
which the HAVING or WHERE clause of the materialized view contains a selection of
a subset of the data in a table or tables. A materialized view's WHERE or HAVING
clause can contain a join, a selection, or both, and still be used by a rewritten query.
Predicate clauses containing expressions, or selecting rows based on the values of
particular columns, are examples of non-join predicates.
To perform this type of query rewrite, Oracle must determine if the data requested
in the query is contained in, or is a subset of, the data stored in the materialized
view. This problem is sometimes referred to as the data containment problem or, in
more general terms, the problem of a restricted subset of data in a materialized
view. The following sections detail the conditions where Oracle can solve this
problem and thus rewrite a query to use a materialized view that contains a
restricted portion of the data in the detail table.
Selection compatibility is performed when both the query and the materialized
view contain selections (non-joins). A selection compatibility check is done on the
WHERE clause as well as the HAVING clause. If the materialized view contains selections
and the query does not, then the selection compatibility check fails because the
materialized view is more restrictive than the query. If the query has selections and
the materialized view does not, then a selection compatibility check is not needed.
Regardless, selections and any columns mentioned in them must pass the data
sufficiency check.
right-hand side contains the values. For example, color='red' means the
left-hand side is color and the right-hand side is 'red' and the relational
operator is (=).
■ LHS-constrained
When comparing a selection from the query with a selection from the
materialized view, the left-hand side of the selection is compared with the
left-hand side of the query. If they match, they are said to be LHS-constrained or
just constrained for short.
■ RHS-constrained
When comparing a selection from the query with a selection from the
materialized view, the right-hand side of the selection is compared with the
right-hand side of the query. If they match, they are said to be RHS-constrained
or just constrained. Note that before comparing the selections, the
LHS/RHS-expression is converted to a canonical form and then the comparison
is done. This means that expressions such as column1 + 5 and 5 + column1
will match and be constrained.
Although selection compatibility does not restrict the general form of the WHERE,
there is an optimal pattern and normally most queries fall into this pattern as
follows:
(join predicate AND join predicate AND ....) AND
(selection predicate AND|OR selection predicate .... )
The join compatibility check operates on the joins and the selection compatibility
operates on the selections. If the WHERE clause has an OR at the top, then the
optimizer first checks for common predicates under the OR. If found, the common
predicates are factored out from under the OR then joined with an AND back to the
OR. This helps to put the WHERE into the optimal pattern. This is done only if OR
occurs at the top of the WHERE clause. For example, if the WHERE clause is:
(sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Polo Shirt')
OR (sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Shorts')
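The common predicate sales.prod_id = prod.prod_id is factored out, so the WHERE clause
becomes (this restatement simply applies the rule just described):
sales.prod_id = prod.prod_id AND
(prod.prod_name = 'Kids Polo Shirt' OR prod.prod_name = 'Kids Shorts')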
If the WHERE is so complex that factoring cannot be done, all predicates under the
OR are treated as selections and join compatibility is not performed but selection
compatibility is still performed. In the HAVING clause, all predicates are considered
selections.
Selection compatibility categorizes selections into the following cases:
■ Simple
Simple selections are of the form expression relop constant.
■ Complex
Complex selections are of the form expression relop expression.
■ Range
Range selections are of a form such as WHERE (cust_last_name BETWEEN
'abacrombe' AND 'anakin').
Note that simple selections with relational operators (<, <=, >, >=) are also
considered range selections.
■ IN lists
Single and multi-column IN lists such as WHERE (prod_id) IN (102, 233, ...).
Note that selections of the form (column1='v1' OR column1='v2' OR
column1='v3' OR ....) are treated as a group and classified as an IN list.
■ IS [NOT] NULL
■ [NOT] LIKE
■ Other
Other selections are when selection compatibility cannot determine
containment of data. For example, EXISTS.
When comparing a selection from the query with a selection from the materialized
view, the left-hand side of the selection is compared with the left-hand side of the
query. If they match, they are said to be LHS-constrained or constrained for short.
If the selections are constrained, then the right-hand side values are checked for
containment. That is, the RHS values of the query selection must be contained by
right-hand side values of the materialized view selection.
In this example, the selections are constrained on prod_id and the right-hand side
value of the query 102 is within the range of the materialized view.
In this example, the selections are constrained on prod_id and the query range is
within the materialized view range. In this example, we notice that both query
selections are constrained by the same materialized view selection. The left-hand
side can be an expression.
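The original examples are not reproduced in this excerpt; a hypothetical pair of
predicates (the values are assumed) illustrating both cases might be:
Query:             WHERE prod_id = 102
Materialized view: WHERE prod_id BETWEEN 100 AND 200

Query:             WHERE prod_id > 110 AND prod_id < 190
Materialized view: WHERE prod_id BETWEEN 100 AND 200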
If the left-hand side and the right-hand side are constrained and the <selection
relop> is the same, then generally the selection can be dropped from the rewritten
query. Otherwise, the selection must be kept to filter out extra data from the
materialized view.
If query rewrite can drop the selection from the rewritten query, then the columns
from that selection do not have to be in the materialized view, so more rewrites can
be done with less data.
Selection compatibility requires that all selections in the materialized view be
LHS-constrained with some selection in the query. This ensures that the
materialized view data is not more restrictive than the query.
In this example, the materialized view IN lists are constrained by the columns in the
query multi-column IN list. Furthermore, the right-hand side values of the query
selection are contained by the materialized view, so rewrite will occur.
In this example, the materialized view IN-list columns are fully constrained by the
columns in the query selections. Furthermore, the right-hand side values of the
query selection are contained by the materialized view. However, the following
example fails selection compatibility check.
In this example, the materialized view IN-list column cust_city is not constrained,
so the materialized view is more restrictive than the query. Selection compatibility
also works with complex ORs. If we assume that the shape of the WHERE clause is as
follows:
follows:
(selection AND selection AND ...) OR (selection AND selection AND ...)
Each group of selections separated by AND is related and the group is called a
disjunct. The disjuncts are separated by ORs. Selection compatibility requires that
every disjunct in the query be contained by some disjunct in the materialized view.
Otherwise, the materialized view is more restrictive than the query. The
materialized view disjuncts do not have to match any query disjunct. This just
means that the materialized view has more data than the query requires. When
comparing a disjunct from the query with a disjunct of the materialized view, the
normal selection compatibility rules apply as specified in the previous discussion.
For example:
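(The original predicates are not reproduced in this excerpt; the following values are
assumed for illustration.)
Query:             WHERE (cust_city = 'Boston' AND prod_id = 102)
Materialized view: WHERE (cust_city = 'Seattle' AND prod_id < 100)
                   OR (cust_city = 'Boston' AND prod_id BETWEEN 100 AND 200)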
In this example, the query has a single disjunct (group of selections separated by
AND). The materialized view has two disjuncts separated by OR. The query disjunct
is contained by the second materialized view disjunct so selection compatibility
succeeds. It is clear that the materialized view contains more data than needed by
the query so the query can be rewritten.
The following query could be rewritten to use this materialized view because the
query asks for the amount where the customer ID is 10 and this is contained in the
materialized view.
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars
FROM times t, sales s
WHERE s.time_id = t.time_id AND s.cust_id = 10
GROUP BY t.calendar_month_desc;
Because the predicate s.cust_id = 10 selects the same data in the query and in
the materialized view, it is dropped from the rewritten query. This means the
rewritten query looks like:
SELECT mv.calendar_month_desc, mv.dollars FROM cal_month_sales_id_mv mv;
Query rewrite can also occur when the query specifies a range of values, such as
s.prod_id > 10000 and s.prod_id < 20000, as long as the range specified in
the query is within the range specified in the materialized view. For example, if
there is a materialized view defined as:
CREATE MATERIALIZED VIEW product_sales_mv
BUILD IMMEDIATE
REFRESH FORCE
ENABLE QUERY REWRITE
AS
SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales
FROM products p, sales s
WHERE p.prod_id = s.prod_id
GROUP BY prod_name
HAVING SUM(s.amount_sold) BETWEEN 5000 AND 50000;
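A query whose HAVING range lies within the range of the materialized view could then
be rewritten against product_sales_mv, for example (the narrower range is
illustrative):
SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales
FROM products p, sales s
WHERE p.prod_id = s.prod_id
GROUP BY prod_name
HAVING SUM(s.amount_sold) BETWEEN 10000 AND 20000;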
Rewrite with select expressions is also supported when the expression evaluates to
a constant, such as TO_DATE('12-SEP-1999','DD-Mon-YYYY'). For example, if
an existing materialized view is defined as:
CREATE MATERIALIZED VIEW sales_on_valentines_day_99_mv
BUILD IMMEDIATE
REFRESH FORCE
ENABLE QUERY REWRITE
AS
SELECT prod_id, cust_id, amount_sold
FROM sales s, times t
WHERE s.time_id = t.time_id
AND t.time_id = TO_DATE('14-FEB-1999', 'DD-MON-YYYY');
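A query restricted to the same day could then be rewritten against this materialized
view, along these lines (a sketch):
SELECT s.prod_id, s.cust_id, s.amount_sold
FROM sales s, times t
WHERE s.time_id = t.time_id
AND t.time_id = TO_DATE('14-FEB-1999', 'DD-MON-YYYY');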
Rewrite can also occur against a materialized view when the selection is contained
in an IN expression. For example, given the following materialized view definition:
CREATE MATERIALIZED VIEW popular_promo_sales_mv
BUILD IMMEDIATE
REFRESH FORCE
ENABLE QUERY REWRITE
AS
SELECT p.promo_name, SUM(s.amount_sold) AS sum_amount_sold
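-- The remainder of this definition is not shown in this excerpt; a plausible
-- completion (the join to promotions and the promotion names are assumptions) is:
FROM promotions p, sales s
WHERE s.promo_id = p.promo_id
AND p.promo_name IN ('coupon', 'premium')
GROUP BY p.promo_name;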
You can also use expressions in selection predicates. This process looks like the
following example:
expression relational operator constant
This is an example where the query is more restrictive than the definition of the
materialized view, so rewrite can occur. However, if the query had selected promo_
category, then it could not have been rewritten against the materialized view,
because the materialized view definition does not contain that column.
For another example, if the definition of a materialized view restricts a city name
column to Boston, then a query that selects Seattle as a value for this column
can never be rewritten with that materialized view, but a query that restricts city
name to Boston and restricts a column value that is not restricted in the
materialized view could be rewritten to use the materialized view.
All the rules noted previously also apply when predicates are combined with an OR
operator. The simple predicates, or simple predicates connected by ANDs, are
considered separately. Each predicate in the query must appear in the materialized
view if rewrite is to occur.
For example, the query could have a restriction like city='Boston' OR city
='Seattle' and to be eligible for rewrite, the materialized view that the query
might be rewritten against must have the same restriction. In fact, the materialized
view could have additional restrictions, such as city='Boston' OR
city='Seattle' OR city='Cleveland' and rewrite might still be possible.
Note, however, that the reverse is not true. If the query had the restriction city =
'Boston' OR city='Seattle' OR city='Cleveland' and the materialized
view only had the restriction city='Boston' OR city='Seattle', then rewrite
would not be possible since the query seeks more data than is contained in the
restricted subset of data stored in the materialized view.
■ Common joins that occur in both the query and the materialized view. These
joins form the common subgraph.
■ Delta joins that occur in the query but not in the materialized view. These joins
form the query delta subgraph.
■ Delta joins that occur in the materialized view but not in the query. These joins
form the materialized view delta subgraph.
These can be visualized as shown in Figure 22–2.
Figure 22–2 Query Rewrite Subgraphs
The materialized view join graph joins customers, products, and times to sales; the
joins shared with the query form the common subgraph, and the remaining joins form
the materialized view delta subgraph.
Common Joins The common join pairs between the two must be of the same type, or
the join in the query must be derivable from the join in the materialized view. For
example, if a materialized view contains an outer join of table A with table B, and a
query contains an inner join of table A with table B, the result of the inner join can
be derived by filtering the anti-join rows from the result of the outer join.
The common joins between this query and the materialized view join_sales_
time_product_mv are:
s.time_id = t.time_id AND s.prod_id = p.prod_id
In general, if you use an outer join in a materialized view containing only joins, you
should put in the materialized view either the primary key or the rowid on the right
side of the outer join. For example, in the previous example, join_sales_time_
product_oj_mv, there is a primary key on both sales and products.
Another example of when a materialized view containing only joins is used is the
case of semi-join rewrite. That is, a query contains either an EXISTS or an IN
subquery with a single table.
Consider this query, which reports the products that had sales greater than $1,000.
SELECT DISTINCT prod_name
FROM products p
WHERE EXISTS
(SELECT *
FROM sales s
WHERE p.prod_id=s.prod_id
AND s.amount_sold > 1000);
This query contains a semi-join between the products and the sales table:
s.prod_id = p.prod_id
Rewrites with semi-joins are currently restricted to materialized views with joins
only and are not available for materialized views with joins and aggregates.
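Assuming join_sales_time_product_mv exposes prod_name and amount_sold in its SELECT
list (an assumption in this sketch), the semi-join query shown previously might be
rewritten as:
SELECT DISTINCT mv.prod_name
FROM join_sales_time_product_mv mv
WHERE mv.amount_sold > 1000;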
Query Delta Joins A query delta join is a join that appears in the query but not in the
materialized view. Any number and type of delta joins in a query are allowed and
they are simply retained when the query is rewritten with a materialized view.
Upon rewrite, the materialized view is joined to the appropriate tables in the query
delta.
For example, consider the following query:
SELECT p.prod_name, t.week_ending_day, c.cust_city,
SUM(s.amount_sold)
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id
AND s.prod_id = p.prod_id
AND s.cust_id = c.cust_id
GROUP BY prod_name, week_ending_day, cust_city;
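Assuming join_sales_time_product_mv also retains cust_id and amount_sold in its
SELECT list (an assumption in this sketch), the rewritten query joins the materialized
view to the delta table customers along these lines:
SELECT mv.prod_name, mv.week_ending_day, c.cust_city,
SUM(mv.amount_sold)
FROM join_sales_time_product_mv mv, customers c
WHERE mv.cust_id = c.cust_id
GROUP BY mv.prod_name, mv.week_ending_day, c.cust_city;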
Materialized View Delta Joins A materialized view delta join is a join that appears in
the materialized view but not the query. All delta joins in a materialized view are
required to be lossless with respect to the result of common joins. A lossless join
guarantees that the result of common joins is not restricted. A lossless join is one
where, if two tables called A and B are joined together, rows in table A will always
match with rows in table B and no data will be lost, hence the term lossless join. For
example, every row with the foreign key matches a row with a primary key
provided no nulls are allowed in the foreign key. Therefore, to guarantee a lossless
join, it is necessary to have FOREIGN KEY, PRIMARY KEY, and NOT NULL constraints
on appropriate join keys. Alternatively, if the join between tables A and B is an outer
join (A being the outer table), it is lossless as it preserves all rows of table A.
All delta joins in a materialized view are required to be non-duplicating with
respect to the result of common joins. A non-duplicating join guarantees that the
result of common joins is not duplicated. For example, a non-duplicating join is one
where, if table A and table B are joined together, rows in table A will match with at
most one row in table B and no duplication occurs. To guarantee a non-duplicating
join, the key in table B must be constrained to unique values by using a primary key
or unique constraint.
Consider the following query that joins sales and times:
SELECT t.week_ending_day,
SUM(s.amount_sold)
FROM sales s, times t
WHERE s.time_id = t.time_id
AND t.week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY')
AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY week_ending_day;
The query can also be rewritten with the materialized view
join_sales_time_product_oj_mv, where foreign key constraints are not needed. This
view contains an outer join (s.prod_id=p.prod_id(+)) between sales and products. This
makes the join lossless. If p.prod_id is a primary key, then the non-duplicating
condition is satisfied as well and the optimizer will rewrite the query as follows:
SELECT week_ending_day,
SUM(amount_sold)
FROM join_sales_time_product_oj_mv
WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY')
AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY week_ending_day;
column A.X in a query with column B.X in a materialized view or vice versa. For
example, consider this query:
SELECT p.prod_name, s.time_id, t.week_ending_day,
SUM(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id=t.time_id
AND s.prod_id = p.prod_id
GROUP BY p.prod_name, s.time_id, t.week_ending_day;
Here the products table is called a joinback table because it was originally joined
in the materialized view but joined again in the rewritten query.
There are two ways to declare functional dependency:
■ Using the primary key constraint (as shown in the previous example)
■ Using the DETERMINES clause of a dimension
The DETERMINES clause of a dimension definition might be the only way you could
declare functional dependency when the column that determines another column
cannot be a primary key. For example, the products table is a denormalized
dimension table that has columns prod_id, prod_name, and prod_
subcategory, and prod_subcategory functionally determines prod_subcat_
desc and prod_category determines prod_cat_desc.
The first functional dependency can be established by declaring prod_id as the
primary key, but not the second functional dependency because the prod_
subcategory column contains duplicate values. In this situation, you can use the
DETERMINES clause of a dimension to declare the second functional dependency.
The following dimension definition illustrates how the functional dependencies are
declared:
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category)
HIERARCHY prod_rollup (
product CHILD OF
subcategory CHILD OF
category
)
ATTRIBUTE subcategory DETERMINES (products.prod_subcat_desc)
ATTRIBUTE category DETERMINES (products.prod_cat_desc);
The hierarchy prod_rollup declares hierarchical relationships that are also 1:n
functional dependencies. The 1:1 functional dependencies are declared using the
DETERMINES clause, as seen when prod_subcategory functionally determines
prod_subcat_desc.
Consider the following query:
SELECT p.prod_subcat_desc, t.week_ending_day,
SUM(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id=t.time_id
AND s.prod_id=p.prod_id
AND p.prod_subcat_desc LIKE '%Men'
GROUP BY p.prod_subcat_desc, t.week_ending_day;
In other words, the level of grouping is the same in both the query and the
materialized view.
If the grouping of data requested by a query is at a coarser level compared to the
grouping of data stored in a materialized view, the optimizer can still use the
materialized view to rewrite the query. For example, the materialized view sum_
sales_pscat_week_mv groups by week_ending_day, and prod_
subcategory. This query groups by prod_subcategory, a coarser grouping
granularity:
SELECT p.prod_subcategory, SUM(s.amount_sold) AS sum_amount
FROM sales s, products p
WHERE s.prod_id=p.prod_id
GROUP BY p.prod_subcategory;
FROM products) pv
WHERE mv.prod_subcategory=pv.prod_subcategory
GROUP BY pv.prod_subcategory, mv.week_ending_day;
Note that, for this rewrite, the data sufficiency check determines that a joinback to
the products table is necessary, and the grouping compatibility check determines
that aggregate rollup is necessary.
FROM sum_sales_pscat_month_city_mv mv
GROUP BY mv.prod_subcategory;
The argument of an aggregate such as SUM can be an arithmetic expression like A+B.
The optimizer will try to match an aggregate SUM(A+B) in a query with an
aggregate SUM(A+B) or SUM(B+A) stored in a materialized view. In other words,
expression equivalence is used when matching the argument of an aggregate in a
query with the argument of a similar aggregate in a materialized view. To
accomplish this, Oracle converts the aggregate argument expression into a
canonical form such that two different but equivalent expressions convert into the
same canonical form. For example, A*(B-C), A*B-C*A, (B-C)*A, and -A*C+A*B
all convert into the same canonical form and, therefore, they are successfully
matched.
Query Rewrite with Inline Views Oracle supports general query rewrite when the user
query contains an inline view, or a subquery in the FROM list. Query rewrite
matches inline views in the materialized view with inline views in the request
query when the text of the two inline views exactly match. In this case, rewrite
treats the matching inline view as it would a named view, and general rewrite
processing is possible.
Here is an example where the materialized view contains an inline view, and the
query has the same inline view, but the aliases for these views are different.
Previously, this query could not be rewritten because neither exact text match nor
partial text match is possible.
Here is the materialized view definition:
CREATE MATERIALIZED VIEW inline_example
ENABLE QUERY REWRITE AS
SELECT t.calendar_month_name, t.calendar_year, p.prod_category,
SUM(V1.revenue) AS sum_revenue
FROM times t, products p,
(SELECT time_id, prod_id, amount_sold*0.2 as revenue FROM sales) V1
WHERE t.time_id = V1.time_id
AND p.prod_id = V1.prod_id
GROUP BY calendar_month_name, calendar_year, prod_category ;
And here is the query that will be rewritten to use the materialized view:
SELECT t.calendar_month_name, t.calendar_year, p.prod_category,
SUM(X1.revenue) AS sum_revenue
FROM times t, products p,
(SELECT time_id, prod_id, amount_sold*0.2 AS revenue FROM sales) X1
WHERE t.time_id = X1.time_id
AND p.prod_id = X1.prod_id
GROUP BY t.calendar_month_name, t.calendar_year, p.prod_category;
Query Rewrite with Selfjoins Query rewrite is possible for queries that contain
multiple references to the same table (self joins), to the extent that general
rewrite can occur when the query and the materialized view definition use the
same aliases for the multiple references to a table. This allows Oracle to provide a
distinct identity for each table reference, which in turn allows query rewrite.
The following is an example of a materialized view and a query. In this example, the
query is missing a reference to a column in a table so an exact text match will not
work. But general query rewrite can occur because the aliases for the table
references match.
To demonstrate the self-join rewriting possibility with the Sales History schema,
we are assuming the following addition to include the actual shipping and payment
date in the fact table, referencing the same dimension table times. This is for
demonstration purposes only and will not return any results:
ALTER TABLE sales ADD (time_id_ship DATE);
ALTER TABLE sales ADD (CONSTRAINT time_id_book_fk FOREIGN key (time_id_ship)
REFERENCES times(time_id) ENABLE NOVALIDATE);
ALTER TABLE sales MODIFY CONSTRAINT time_id_book_fk RELY;
ALTER TABLE sales ADD (time_id_paid DATE);
ALTER TABLE sales ADD (CONSTRAINT time_id_paid_fk FOREIGN key (time_id_paid)
REFERENCES times(time_id) ENABLE NOVALIDATE);
ALTER TABLE sales MODIFY CONSTRAINT time_id_paid_fk RELY;
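The materialized view used in this example is not shown in this excerpt; a definition
consistent with the queries that follow (its exact form is an assumption) is:
CREATE MATERIALIZED VIEW sales_shipping_lag_mv
ENABLE QUERY REWRITE AS
SELECT s.prod_id, t1.fiscal_week_number,
t2.fiscal_week_number - t1.fiscal_week_number AS lag
FROM times t1, sales s, times t2
WHERE t1.time_id = s.time_id
AND t2.time_id = s.time_id_ship;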
The following query fails the exact text match test but is rewritten because the
aliases for the table references match:
SELECT s.prod_id,
t2.fiscal_week_number - t1.fiscal_week_number AS lag
FROM times t1, sales s, times t2
WHERE t1.time_id = s.time_id
AND t2.time_id = s.time_id_ship;
Note that Oracle performs other checks to ensure the correct match of an instance of
a multiply instanced table in the request query with the corresponding table
instance in the materialized view. For instance, in the following example, Oracle
correctly determines that the matching alias names used for the multiple instances
of the table times do not establish a match between the multiple instances of the
table times in the materialized view:
The following query cannot be rewritten using sales_shipping_lag_mv, even
though the alias names of the multiply instanced table times match, because the
joins are not compatible between the instances of times aliased by t2:
SELECT s.prod_id,
t2.fiscal_week_number - t1.fiscal_week_number AS lag
FROM times t1, sales s, times t2
WHERE t1.time_id = s.time_id AND t2.time_id = s.time_id_paid;
This request query joins the instance of the times table aliased by t2 on the
s.time_id_paid column, while the materialized view joins the instance of the
times table aliased by t2 on the s.time_id_ship column. Because the join
conditions differ, Oracle correctly determines that rewrite cannot occur.
partitioning key of the table is available in the SELECT list of the materialized view
because this is the easiest way to map a row to a stale partition. The key points
when using partially stale materialized views are:
■ Query rewrite can use a materialized view in ENFORCED or TRUSTED mode if
the rows from the materialized view used to answer the query are known to be
FRESH.
■ The fresh rows in the materialized view are identified by adding selection
predicates to the materialized view's WHERE clause. We will rewrite a query
with this materialized view if its answer is contained within this (restricted)
materialized view. Note that support for materialized views with selection
predicates is a prerequisite for this type of rewrite.
The fact table sales is partitioned based on ranges of time_id as follows:
PARTITION BY RANGE (time_id)
(PARTITION SALES_Q1_1998
VALUES LESS THAN (TO_DATE('01-APR-1998', 'DD-MON-YYYY')),
PARTITION SALES_Q2_1998
VALUES LESS THAN (TO_DATE('01-JUL-1998', 'DD-MON-YYYY')),
PARTITION SALES_Q3_1998
VALUES LESS THAN (TO_DATE('01-OCT-1998', 'DD-MON-YYYY')),
...
Suppose new data will be inserted for December 2000, which will end up in the
partition sales_q4_2000. For testing purposes, you can apply an arbitrary DML
operation on sales, changing a different partition than sales_q1_2000 when
this materialized view is fresh. For example:
INSERT INTO SALES VALUES(10,10,'01-dec-2000','S',10,123.45,54321);
Until a refresh is done, the materialized view is generically stale and cannot be used
for unlimited rewrite in enforced mode. However, because the table sales is
partitioned and not all partitions have been modified, Oracle can identify all
partitions that have not been touched. The fresh rows in the materialized view, that
means the data of all partitions where Oracle knows that no changes have occurred,
can be represented by modifying the materialized view's defining query as follows:
SELECT s.time_id, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.cust_id = c.cust_id
AND s.prod_id = p.prod_id
AND s.time_id < TO_DATE('01-OCT-2000','DD-MON-YYYY')
GROUP BY time_id, prod_subcategory, cust_city;
Note that the freshness of partially stale materialized views is tracked on a
per-partition basis, not at a logical level. Because the partitioning strategy of the
sales fact table is quarterly, changes in December 2000 cause the complete
partition sales_q4_2000 to become stale.
Consider the following query which asks for sales in quarter 1 and 2 of 2000:
SELECT s.time_id, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.cust_id = c.cust_id
AND s.prod_id = p.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2000', 'DD-MON-YYYY')
AND TO_DATE('01-JUL-2000', 'DD-MON-YYYY')
GROUP BY time_id, prod_subcategory, cust_city;
Oracle knows that those ranges of rows in the materialized view are fresh and can
therefore rewrite the query with the materialized view. The rewritten query looks as
follows:
SELECT time_id, prod_subcategory, cust_city, sum_amount_sold
FROM sum_sales_per_city_mv
WHERE time_id BETWEEN TO_DATE('01-JAN-2000', 'DD-MON-YYYY')
AND TO_DATE('01-JUL-2000', 'DD-MON-YYYY');
Instead of the partitioning key, a partition marker (a function that identifies the
partition given a rowid) can be present in the select (and GROUP BY list) of the
materialized view. You can use the materialized view to rewrite queries that require
data from only certain partitions (identifiable by the partition-marker), for instance,
queries that reference a partition-extended table-name or queries that have a
predicate specifying ranges of the partitioning keys containing entire partitions. See
Chapter 8, "Materialized Views" for details regarding the supplied partition marker
function DBMS_MVIEW.PMARKER.
The following example illustrates the use of a partition marker in the materialized
view instead of the direct usage of the partition key column.
CREATE MATERIALIZED VIEW sum_sales_per_city_2_mv
ENABLE QUERY REWRITE
AS
SELECT DBMS_MVIEW.PMARKER(s.rowid) AS pmarker,
t.fiscal_quarter_desc, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id
AND s.prod_id = p.prod_id
AND s.time_id = t.time_id
GROUP BY DBMS_MVIEW.PMARKER(s.rowid),
prod_subcategory, cust_city, fiscal_quarter_desc;
Suppose you know that the partition sales_q1_2000 is fresh and DML changes
have taken place for other partitions of the sales table. For testing purposes, you
can apply an arbitrary DML operation on sales, changing a different partition
than sales_q1_2000 when the materialized view is fresh. For example:
INSERT INTO SALES VALUES(10,10,'01-dec-2000','S',10,123.45,54321);
The same query could have been expressed with a partition-extended name as in
the following statement:
SELECT p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
Note that rewrite with a partially stale materialized view that contains a PMARKER
function can only take place when the complete data content of one or more
partitions is accessed and the predicate condition is on the partitioned fact table
itself, as shown in the earlier example.
The DBMS_MVIEW.PMARKER function gives you exactly one distinct value for each
partition. This dramatically reduces the number of rows in a potential materialized
view compared to the partitioning key itself, but you are also giving up any
detailed information about this key. The only thing you know is the partition
number and, therefore, the lower and upper boundary values. This is the trade-off
for reducing the cardinality of the range partitioning column and thus the number
of rows.
Assuming the value of pmarker for partition sales_q1_2000 is 31070, the
previously shown queries can be rewritten against the materialized view as:
SELECT mv.prod_subcategory, mv.cust_city,
SUM(mv.sum_amount_sold)
FROM sum_sales_per_city_2_mv mv
WHERE mv.pmarker = 31070
AND mv.cust_city= 'Nuernberg'
GROUP BY prod_subcategory, cust_city;
So the query can be rewritten against the materialized view without accessing stale
data.
the query and materialized view exactly match and the aliases of the duplicate
tables in both the query and materialized view exactly match. All other cases
involving inline views and self-joins will make a materialized view complex.
Oracle first tries to rewrite it with a materialized aggregate view and finds there is
none eligible (note that the single-table aggregate materialized view
sum_sales_time_product_mv cannot yet be used), and then tries a rewrite with a materialized
Because a rewrite occurred, Oracle tries the process again. This time the query can
be rewritten with the single-table aggregate materialized view
sum_sales_time_product_mv into this form:
SELECT mv.prod_name, mv.week_ending_day, mv.sum_amount_sold
FROM sum_sales_time_product_mv mv;
The term base grouping for queries with GROUP BY extensions denotes all unique
expressions present in the GROUP BY clause. In the previous query, the following
grouping (p.prod_subcategory, t.calendar_month_desc, c.cust_city) is a base
grouping.
The extensions can be present in user queries and in the queries defining
materialized views. In both cases, materialized view rewrite applies and you can
distinguish rewrite capabilities into the following scenarios:
Materialized View Has Simple GROUP BY and Query Has Extended GROUP BY
When a query contains an extended GROUP BY clause, it can be rewritten with a
materialized view if its base grouping can be rewritten using the materialized view
as listed in the rewrite rules explained in "When Does Oracle Rewrite a Query?" on
page 22-4. For example, in the following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, customers c, products p, times t
WHERE s.time_id=t.time_id
AND s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
(
(p.prod_subcategory, t.calendar_month_desc),
(c.cust_city, p.prod_subcategory)
);
A special situation arises if the query uses the EXPAND_GSET_TO_UNION hint. See
"Hint for Queries with Extended GROUP BY" on page 22-56 for an example of using
EXPAND_GSET_TO_UNION.
Materialized View Has Extended GROUP BY and Query Has Simple GROUP BY
In order for a materialized view with an extended GROUP BY to be used for rewrite,
it must satisfy two additional conditions:
■ It must contain a grouping distinguisher, which is the GROUPING_ID function
on all GROUP BY expressions. For example, if the GROUP BY clause of the
materialized view is GROUP BY CUBE(a, b), then the SELECT list should
contain GROUPING_ID(a, b).
■ The GROUP BY clause of the materialized view should not result in any
duplicate groupings. For example, GROUP BY GROUPING SETS ((a, b),
(a, b)) would disqualify a materialized view from general rewrite. (A sketch of a
materialized view satisfying both conditions follows this list.)
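The materialized view sum_grouping_set_mv referenced in the following examples is not
defined in this excerpt; a definition satisfying both conditions (a sketch, with the
grouping sets inferred from the examples that follow) is:
CREATE MATERIALIZED VIEW sum_grouping_set_mv
ENABLE QUERY REWRITE AS
SELECT p.prod_category, p.prod_subcategory,
c.cust_state_province, c.cust_city,
GROUPING_ID(p.prod_category, p.prod_subcategory,
c.cust_state_province, c.cust_city) AS gid,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
(
(p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city),
(p.prod_category, p.prod_subcategory, c.cust_city),
(p.prod_category, p.prod_subcategory)
);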
This query will be rewritten with the closest matching grouping from the
materialized view. That is, the (prod_category, prod_subcategory, cust_city)
grouping:
SELECT
prod_subcategory, cust_city,
SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_city)>
GROUP BY prod_subcategory, cust_city;
Oracle tries grouping match. The groupings in the query are matched against
groupings in the materialized view and if all are matched with no rollup, Oracle
selects them from the materialized view. For example, the following query:
SELECT
p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
(
(p.prod_category, p.prod_subcategory, c.cust_city),
(p.prod_category, p.prod_subcategory)
);
SELECT
null, p.prod_subcategory, null,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id AND s.time_id = t.time_id
GROUP BY p.prod_subcategory, t.calendar_month_desc
UNION ALL
SELECT
null, null, null,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id AND s.time_id = t.time_id
GROUP BY t.calendar_month_desc
UNION ALL
SELECT
p.prod_category, p.prod_subcategory, c.cust_state_province,
null, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory, c.cust_state_province
UNION ALL
SELECT
p.prod_category, p.prod_subcategory, null,
null, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory
Each branch is then rewritten separately using the rules from "When Does Oracle
Rewrite a Query?" on page 22-4. Using the materialized view sum_grouping_
set_mv, Oracle can rewrite only branches 3 (which requires materialized view
rollup) and 4 (which matches the materialized view exactly). The unrewritten
branches will be converted back to the extended GROUP BY form. Thus, eventually,
the query is rewritten as:
SELECT
null, p.prod_subcategory, null,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id AND s.time_id = t.time_id
GROUP BY GROUPING SETS
(
(p.prod_subcategory, t.calendar_month_desc),
(t.calendar_month_desc)
)
UNION ALL
SELECT
prod_category, prod_subcategory, cust_state_province,
null, SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_state_province, cust_city)>
GROUP BY p.prod_category, p.prod_subcategory, c.cust_state_province
UNION ALL
SELECT
prod_category, prod_subcategory, null,
null, sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category,prod_subcategory)>
Observe the following features of UNION ALL rewrite. First, a query with extended
GROUP BY is represented as an equivalent UNION ALL and recursively submitted for
rewrite optimization. The groupings that cannot be rewritten stay in the last branch
of UNION ALL and access the base data instead.
Explain Plan
The EXPLAIN PLAN facility is used as described in Oracle9i SQL Reference. For query
rewrite, all you need to check is that the object_name column in PLAN_TABLE
contains the materialized view name. If it does, then query rewrite will occur when
this query is executed.
If EXPLAIN PLAN is used on the following SQL statement, the results are placed in
the default table PLAN_TABLE. However, PLAN_TABLE must first be created using
the utlxplan.sql script.
EXPLAIN PLAN
FOR
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
For the purposes of query rewrite, the only information of interest from PLAN_
TABLE is the OBJECT_NAME, which identifies the objects that will be used to
execute this query. Therefore, you would expect to see the object name calendar_
month_sales_mv in the output as illustrated here.
SELECT object_name FROM plan_table;
OBJECT_NAME
-----------------------
CALENDAR_MONTH_SALES_MV
2 rows selected.
DBMS_MVIEW.EXPLAIN_REWRITE Procedure
It can be difficult to understand why a query did not rewrite. The rules governing
query rewrite eligibility are quite complex, involving various factors such as
constraints, dimensions, query rewrite integrity modes, freshness of the
materialized views, and the types of queries themselves. In addition, you may want
to know why query rewrite chose a particular materialized view instead of another.
To help with this matter, Oracle provides a PL/SQL procedure (DBMS_
MVIEW.EXPLAIN_REWRITE) to advise you when a query can be rewritten and, if
not, why not. Using the results from DBMS_MVIEW.EXPLAIN_REWRITE, you can
take the appropriate action needed to make a query rewrite if at all possible.
DBMS_MVIEW.EXPLAIN_REWRITE Syntax
You can obtain the output from DBMS_MVIEW.EXPLAIN_REWRITE in two ways.
The first is to use a table, while the second is to create a varray. The following shows
the basic syntax for using an output table:
DBMS_MVIEW.EXPLAIN_REWRITE (
query VARCHAR2(2000),
mv VARCHAR2(30),
statement_id VARCHAR2(30)
);
Using REWRITE_TABLE
Output of EXPLAIN_REWRITE can be directed to a table named REWRITE_TABLE.
You can create this output table by running the Oracle-supplied script
utlxrw.sql. This script can be found in the admin directory. The format of
REWRITE_TABLE is as follows.
CREATE TABLE REWRITE_TABLE(
statement_id VARCHAR2(30), -- ID for the query
mv_owner VARCHAR2(30), -- MV's schema
mv_name VARCHAR2(30), -- Name of the MV
sequence INTEGER, -- Seq # of error msg
query VARCHAR2(2000),-- user query
message VARCHAR2(512), -- EXPLAIN_REWRITE error msg
pass VARCHAR2(3), -- Query Rewrite pass no
flags INTEGER, -- For future use
reserved1 INTEGER, -- For future use
reserved2 VARCHAR2(256) -- For future use
);
'TestXRW.PRODUCT_SALES_MV', \
'SH');
Here is another example where you can see a more detailed explanation of why
some materialized views were not considered and eventually the materialized view
sales_mv was chosen as the best one.
DECLARE
querytxt VARCHAR2(500) :='SELECT cust_first_name, cust_last_name,
SUM(amount_sold) AS dollar_sales FROM sales s, customers c WHERE s.cust_id=
c.cust_id GROUP BY cust_first_name, cust_last_name';
idno VARCHAR2(30) :='ID1';
BEGIN
DBMS_MVIEW.EXPLAIN_REWRITE(querytxt, '', idno);
END;
/
SELECT message FROM rewrite_table ORDER BY sequence;
MESSAGE
--------------------------------------------------------------------------------
QSM-01082: Joining materialized view, CAL_MONTH_SALES_MV, with table, SALES, not possible
QSM-01022: a more optimal materialized view than PRODUCT_SALES_MV was used to rewrite
QSM-01022: a more optimal materialized view than FWEEK_PSCAT_SALES_MV was used to rewrite
QSM-01033: query rewritten with materialized view, SALES_MV
Using a VARRAY
You can save the output of EXPLAIN_REWRITE in a PL/SQL varray. The elements
of this array are of the type RewriteMessage, which is defined in the SYS schema
as shown in the following:
TYPE RewriteMessage IS record(
mv_owner VARCHAR2(30), -- MV's schema
mv_name VARCHAR2(30), -- Name of the MV
sequence INTEGER, -- Seq # of error msg
query VARCHAR2(2000),-- user query
message VARCHAR2(512), -- EXPLAIN_REWRITE error msg
pass VARCHAR2(3), -- Query Rewrite pass no
flags INTEGER, -- For future use
reserved1 INTEGER, -- For future use
AS
SELECT c.cust_city, c.cust_state_province,
AVG(s.amount_sold)
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
GROUP BY c.cust_city, c.cust_state_province;
The query will not rewrite with this materialized view. This can be quite confusing
to a novice user as it seems like all information required for rewrite is present in the
materialized view. The user can find out from DBMS_MVIEW.EXPLAIN_REWRITE
that AVG cannot be computed from the given materialized view. The problem is that
a ROLLUP is required here and AVG requires a COUNT or a SUM to do ROLLUP.
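A variant of the materialized view that would support this rollup (a sketch; the view
name is assumed) stores SUM and COUNT rather than AVG, so AVG at a coarser grouping
can be derived by rolling both up and dividing:
CREATE MATERIALIZED VIEW avg_sales_city_state_mv
ENABLE QUERY REWRITE AS
SELECT c.cust_city, c.cust_state_province,
SUM(s.amount_sold) AS sum_amount_sold,
COUNT(s.amount_sold) AS count_amount_sold
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
GROUP BY c.cust_city, c.cust_state_province;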
An example PL/SQL block for the previous query, using a varray as its output
medium, is as follows:
SET SERVEROUTPUT ON
DECLARE
Rewrite_Array SYS.RewriteArrayType := SYS.RewriteArrayType();
querytxt VARCHAR2(1500) := 'SELECT S.CITY, AVG(F.DOLLAR_SALES)
FROM STORE S, FACT F WHERE S.STORE_KEY = F.STORE_KEY
GROUP BY S.CITY';
i NUMBER;
BEGIN
DBMS_MVIEW.Explain_Rewrite(querytxt, 'MV_CITY_STATE', Rewrite_Array);
FOR i IN 1..Rewrite_Array.count
LOOP
DBMS_OUTPUT.PUT_LINE(Rewrite_Array(i).message);
END LOOP;
END;
/
view would contain the query results, thus eliminating the time required to perform
any complex joins and search through all the data for that which is required.
additive
Describes a fact (or measure) that can be summarized through addition. An
additive fact is the most common type of fact. Examples include sales, cost, and
profit. Contrast with nonadditive and semi-additive.
advisor
See: Summary Advisor.
aggregate
Summarized data. For example, unit sales of a particular product could be
aggregated by day, month, quarter, and year.
aggregation
The process of consolidating data values into a single value. For example, sales data
could be collected on a daily basis and then be aggregated to the week level, the
week data could be aggregated to the month level, and so on. The data can then be
referred to as aggregate data. Aggregation is synonymous with summarization,
and aggregate data is synonymous with summary data.
ancestor
A value at any level higher than a given value in a hierarchy. For example, in a Time
dimension, the value 1999 might be the ancestor of the values Q1-99 and Jan-99.
attribute
A descriptive characteristic of one or more levels. For example, the product
dimension for a clothing manufacturer might contain a level called item, one of
whose attributes is color. Attributes represent logical groupings that enable end
users to select data based on like characteristics.
Note that in relational modeling, an attribute is defined as a characteristic of an
entity. In Oracle9i, an attribute is a column in a dimension that characterizes
elements of a single level.
cardinality
From an OLTP perspective, this refers to the number of rows in a table. From a data
warehousing perspective, this typically refers to the number of distinct values in a
column. For most data warehouse DBAs, a more important issue is the degree of
cardinality.
child
A value at the level under a given value in a hierarchy. For example, in a Time
dimension, the value Jan-99 might be the child of the value Q1-99. A value can be
a child for more than one parent if the child value belongs to multiple hierarchies.
See Also:
■ hierarchy
■ level
■ parent
cleansing
The process of resolving inconsistencies and fixing the anomalies in source data,
typically as part of the ETL process.
cross product
A procedure for combining the elements in multiple sets. For example, given two
columns, each element of the first column is matched with every element of the
second column. A simple example is illustrated as follows:
Col1   Col2   Cross Product
----   ----   -------------
a      c      ac
b      d      ad
              bc
              bd
Cross products are performed when grouping sets are concatenated, as described in
Chapter 18, "SQL for Aggregation in Data Warehouses".
data mart
A data warehouse that is designed for a particular line of business, such as sales,
marketing, or finance. In a dependent data mart, the data can be derived from an
enterprise-wide data warehouse. In an independent data mart, data can be collected
directly from sources.
data source
A database, application, repository, or file that contributes data to a warehouse.
data warehouse
A relational database that is designed for query and analysis rather than transaction
processing. A data warehouse usually contains historical data that is derived from
transaction data, but it can include data from other sources. It separates analysis
workload from transaction workload and enables a business to consolidate data
from several sources.
In addition to a relational database, a data warehouse environment often consists of
an ETL solution, an OLAP engine, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.
degree of cardinality
The number of unique values of a column divided by the total number of rows in
the table. This is particularly important when deciding which indexes to build. You
typically want to use bitmap indexes on low degree of cardinality columns and
B-tree indexes on high degree of cardinality columns. As a general rule, a column
whose degree of cardinality is under 1% is a good candidate for a bitmap index.
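For example, the ratio can be estimated directly, and a low-cardinality column then becomes a bitmap index candidate. This is a sketch only; the customers table and gender column are hypothetical:

-- Estimate the degree of cardinality of a column
SELECT COUNT(DISTINCT gender) / COUNT(*) AS degree_of_cardinality
FROM   customers;

-- A column with very few distinct values is a typical bitmap index candidate
CREATE BITMAP INDEX customers_gender_bix ON customers (gender);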
denormalize
The process of allowing redundancy in a table. Contrast with normalize.
detail
See: fact table.
detail table
See: fact table.
dimension
The term dimension is commonly used in two ways:
■ A general term for any characteristic that is used to specify the members of a
data set. The three most common dimensions in sales-oriented data warehouses are
time, geography, and product. Most dimensions have hierarchies.
■ An object defined in a database to enable queries to navigate dimensions. In
Oracle9i, a dimension is a database object that defines hierarchical
(parent/child) relationships between pairs of column sets. In Oracle Express, a
dimension is a database object that consists of a list of values.
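A minimal sketch of the second usage, based on the times table of the sample sh schema (the dimension and hierarchy names here are illustrative only):

CREATE DIMENSION times_dim
   LEVEL month   IS times.calendar_month_desc
   LEVEL quarter IS times.calendar_quarter_desc
   LEVEL year    IS times.calendar_year
   HIERARCHY calendar_rollup (
      month   CHILD OF
      quarter CHILD OF
      year);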
dimension table
Dimension tables describe the business entities of an enterprise, represented as
hierarchical, categorical information such as time, departments, locations, and
products. Dimension tables are sometimes called lookup or reference tables.
dimension value
One element in the list that makes up a dimension. For example, a computer
company might have dimension values in the product dimension called LAPPC and
DESKPC. Values in the geography dimension might include Boston and Paris.
Values in the time dimension might include MAY96 and JAN97.
drill
To navigate from one item to a set of related items. Drilling typically involves
navigating up and down through the levels in a hierarchy. When selecting data, you
can expand or collapse a hierarchy by drilling down or up in it, respectively.
drill down
To expand the view to include child values that are associated with parent values in
the hierarchy.
drill up
To collapse the list of descendant values that are associated with a parent value in
the hierarchy.
element
An object or process. For example, a dimension is an object, a mapping is a process,
and both are elements.
entity
Entity is used in database modeling. In relational databases, it typically maps to a
table.
ETL
Extraction, transformation, and loading. ETL refers to the methods involved in
accessing and manipulating source data and loading it into a data warehouse. The
order in which these processes are performed varies.
Note that ETT (extraction, transformation, transportation) and ETM (extraction,
transformation, move) are sometimes used instead of ETL.
See Also:
■ data warehouse
■ extraction
■ transformation
■ transportation
extraction
The process of taking data out of a source as part of an initial phase of ETL.
fact
Data, usually numeric and additive, that can be examined and analyzed. Examples
include sales, cost, and profit. Fact and measure are synonymous; fact is more
commonly used with relational environments, measure is more commonly used
with multidimensional environments.
fact table
A table in a star schema that contains facts. A fact table typically has two types of
columns: those that contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite key that is made up of
all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables).
A fact table usually contains facts with the same level of aggregation.
fast refresh
An operation that applies only the data changes to a materialized view, thus
eliminating the need to rebuild the materialized view from scratch.
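For instance, a fast refresh can be requested through the DBMS_MVIEW package. The materialized view name below is hypothetical, and fast refresh presupposes materialized view logs on the detail tables:

-- 'F' requests a fast (incremental) refresh of the named materialized view
EXECUTE DBMS_MVIEW.REFRESH('SALES_BY_CITY_MV', 'F');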
file-to-table mapping
Maps data from flat files to tables in the warehouse.
hierarchy
A logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time dimension,
a hierarchy might be used to aggregate data from the Month level to the Quarter
level to the Year level. Hierarchies can be defined in Oracle9i as part of the
dimension object. A hierarchy can also be used to define a navigational drill path,
regardless of whether the levels in the hierarchy represent aggregated totals.
level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the Month, Quarter, and Year levels.
mapping
The definition of the relationship and data flow between source and target objects.
materialized view
A pre-computed table comprising aggregated or joined data from fact and possibly
dimension tables. Also known as a summary or aggregate table.
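A minimal sketch, reusing the STORE and FACT tables from the example earlier in this section (the materialized view name is illustrative only):

CREATE MATERIALIZED VIEW mv_city_sales
   BUILD IMMEDIATE
   ENABLE QUERY REWRITE
   AS SELECT s.city, AVG(f.dollar_sales) AS avg_sales
      FROM   store s, fact f
      WHERE  s.store_key = f.store_key
      GROUP BY s.city;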
measure
See: fact.
metadata
Data that describes data and other structures, such as objects, business rules, and
processes. For example, the schema design of a data warehouse is typically stored in
a repository as metadata, which is used to generate scripts used to build and
populate the data warehouse. A repository contains metadata.
Examples include: for data, the definition of a source-to-target transformation that is
used to generate and populate the data warehouse; for information, definitions of
tables, columns, and associations that are stored inside a relational modeling tool;
and for business rules, a rule such as discount by 10 percent after selling 1,000 items.
model
An object that represents something to be made. A representative style, plan, or
design. Metadata that defines the structure of the data warehouse.
nonadditive
Describes a fact (or measure) that cannot be summarized through addition. An
example includes Average. Contrast with additive and semi-additive.
normalize
In a relational database, the process of removing redundancy in data by separating
the data into multiple tables. Contrast with denormalize.
OLAP
See: online analytical processing (OLAP).
OLTP
See: online transaction processing (OLTP).
parallelism
Breaking down a task so that several processes do part of the work. When multiple
CPUs each do their portion simultaneously, very large performance gains are
possible.
parallel execution
Breaking down a task so that several processes do part of the work. When multiple
CPUs each do their portion simultaneously, very large performance gains are
possible.
parent
A value at the level above a given value in a hierarchy. For example, in a Time
dimension, the value Q1-99 might be the parent of the value Jan-99.
See Also:
■ child
■ hierarchy
■ level
partition
Very large tables and indexes can be difficult and time-consuming to work with. To
improve manageability, you can break your tables and indexes into smaller pieces
called partitions.
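A brief sketch of range partitioning; the table, columns, and partition boundaries here are hypothetical:

CREATE TABLE sales_part (
   sale_id   NUMBER,
   sale_date DATE,
   amount    NUMBER)
PARTITION BY RANGE (sale_date) (
   PARTITION sales_q1_1999 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
   PARTITION sales_q2_1999 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')));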
pivoting
A transformation where each record in an input stream is converted to many
records in the appropriate table in the data warehouse. This is particularly
important when taking data from nonrelational databases.
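One common way to express pivoting in Oracle9i is a multitable INSERT; the tables and columns below are hypothetical and serve only as a sketch:

-- Turn one input row with four quarterly columns into four fact rows
INSERT ALL
   INTO sales_facts (product_id, quarter, amount) VALUES (product_id, 'Q1', q1_amount)
   INTO sales_facts (product_id, quarter, amount) VALUES (product_id, 'Q2', q2_amount)
   INTO sales_facts (product_id, quarter, amount) VALUES (product_id, 'Q3', q3_amount)
   INTO sales_facts (product_id, quarter, amount) VALUES (product_id, 'Q4', q4_amount)
SELECT product_id, q1_amount, q2_amount, q3_amount, q4_amount
FROM   quarterly_sales_input;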
publisher
Usually a database administrator who is in charge of creating and maintaining
schema objects that make up the Change Data Capture system.
refresh
The mechanism whereby materialized views are changed to reflect new data.
schema
A collection of related database objects. Relational schemas are grouped by database
user ID and include tables, views, and other objects. Whenever possible, a sample
schema called sh is used throughout this Guide.
semi-additive
Describes a fact (or measure) that can be summarized through addition along some,
but not all, dimensions. Examples include headcount and on-hand stock: stock levels
can be added across stores at a single point in time, but adding them across time
periods is not meaningful. Contrast with additive and nonadditive.
snowflake schema
A type of star schema in which the dimension tables are partly or fully normalized.
source
A database, application, file, or other storage facility from which the data in a data
warehouse is derived.
source system
A database, application, file, or other storage facility from which the data in a data
warehouse is derived.
staging area
A place where data is processed before entering the warehouse.
staging file
A file used when data is processed before entering the warehouse.
star query
A join between a fact table and a number of dimension tables. Each dimension table
is joined to the fact table using a primary key to foreign key join, but the dimension
tables are not joined to each other.
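A sketch of a star query, assuming the sales, times, and products tables of the sample sh schema mentioned under schema:

SELECT t.calendar_quarter_desc, p.prod_category,
       SUM(f.amount_sold) AS total_sales
FROM   sales f, times t, products p
WHERE  f.time_id = t.time_id          -- fact joined to each dimension by foreign key
AND    f.prod_id = p.prod_id          -- dimension tables are not joined to each other
GROUP BY t.calendar_quarter_desc, p.prod_category;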
star schema
A relational schema whose design represents a multidimensional data model. The
star schema consists of one or more fact tables and one or more dimension tables
that are related through foreign keys.
subject area
A classification system that represents or distinguishes parts of an organization or
areas of knowledge. A data mart is often developed to support a subject area such
as sales, marketing, or geography.
subscribers
Consumers of the published change data. These are normally applications.
summary
See: materialized view.
Summary Advisor
The Summary Advisor recommends which materialized views to retain, create, and
drop. It helps database administrators manage materialized views. It is a GUI in
Oracle Enterprise Manager, and has similar capabilities to the DBMS_OLAP package.
target
Holds the intermediate or final results of any part of the ETL process. The target of
the entire ETL process is the data warehouse.
transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating data from
multiple sources.
transportation
The process of moving copied or transformed data from a source to a data
warehouse.
unique identifier
An identifier whose purpose is to distinguish between occurrences of the same item
when it appears in more than one place.
update frequency
How often a data warehouse is updated with new information. For example, a
warehouse might be updated nightly from an OLTP system.
update window
The length of time available for updating a warehouse. For example, you might
have 8 hours at night to update your warehouse.
validation
The process of verifying metadata definitions and configuration parameters.
versioning
The ability to create new versions of a data warehouse project for new requirements
and changes.
Index
A data warehouses
star queries, 17-4
access decision support, 21-2
controlling to change data, 15-3 decision support systems (DSS), 6-3
adaptive multiuser parallel SQL, 21-14
algorithm for, 21-47 direct-path INSERT, 21-21
definition, 21-47 parallel DML, 21-20
ADD PARTITION clause, 5-32 ARCH processes
affinity multiple, 21-84
parallel DML, 21-76 architecture
partitions, 21-75 data warehouse, 1-5
aggregates, 8-13, 22-64 MPP, 21-76
computability check, 22-42 SMP, 21-76
ALL_SOURCE_TABLES view, 15-15 asynchronous I/O, 21-64
ALTER MATERIALIZED VIEW statement, 8-22 attributes, 2-3, 9-6
enabling query rewrite, 22-7
ALTER SESSION statement
ENABLE PARALLEL DML clause, 21-21 B
FORCE PARALLEL DDL clause, 21-42, 21-45 backups
create or rebuild index, 21-43, 21-46 disk mirroring, 4-11
create table as select, 21-44, 21-45 bandwidth, 5-2, 21-2
move or split partition, 21-43, 21-46 bitmap indexes, 6-2
FORCE PARALLEL DML clause nulls and, 6-5
insert, 21-41, 21-42, 21-45 on partitioned tables, 6-6
update and delete, 21-39, 21-40, 21-45 parallel query and DML, 6-3
ALTER TABLE statement bitmap join indexes, 6-6
NOLOGGING clause, 21-88 block range granules, 5-3
altering dimensions, 9-13 B-tree indexes, 6-10
analytic functions bitmap indexes versus, 6-3
concepts, 19-3 build methods, 8-23
analyzing data
for parallel processing, 21-68
APPEND hint, 21-88 C
applications cardinality
degree of, 6-3 view, 7-7, 22-14
CASE expressions, 19-44 with partitioning, 7-7
change data with query rewrite, 22-63
controlling access to, 15-3 cost-based optimization, 21-92
publishing, 15-3 parallel execution, 21-92
Change Data Capture, 11-5 cost-based rewrite, 22-3
change sets CPU
definition, 15-7 utilization, 5-2, 21-2
SYNC_SET, 15-7 CREATE DIMENSION statement, 9-4
change source CREATE INDEX statement, 21-86
definition, 15-6 rules of parallelism, 21-43
SYNC_SOURCE, 15-6 CREATE MATERIALIZED VIEW statement, 8-22
change tables enabling query rewrite, 22-7
contain published data, 15-3 CREATE SNAPSHOT statement, 8-3
definition, 15-7 CREATE TABLE AS SELECT statement, 21-67,
importing for Change Data Capture, 15-21 21-78
CLUSTER_DATABASE_INSTANCES initialization rules of parallelism
parameter index-organized tables, 21-14
and parallel execution, 21-57 CREATE TABLE statement
columns AS SELECT
cardinality, 6-3 decision support systems, 21-14
common joins, 22-32 rules of parallelism, 21-43
COMPATIBLE initialization parameter, 13-28, 22-8 space fragmentation, 21-16
COMPLETE clause, 8-26 temporary storage space, 21-16
complete refresh, 14-13 parallelism, 21-14
complex queries index-organized tables, 21-14
snowflake schemas, 17-5 CUBE clause, 18-10
composite partial, 18-12
columns, 18-21 when to use, 18-10
partitioning, 5-9 cubes
partitioning methods, 5-9 hierarchical, 8-42
performance considerations, 5-12, 5-15 CUME_DIST function, 19-13
compression
See data segment compression, 8-23
concatenated groupings, 18-24
D
concatenated ROLLUP, 8-43 data
concurrent users integrity of
increasing the number of, 21-50 parallel DML restrictions, 21-26
CONSIDER FRESH clause, 14-30 partitioning, 5-4
constraints, 7-2, 9-11 purging, 14-11
foreign key, 7-5 sufficiency check, 22-37
parallel create table, 21-43 transformation, 13-9
RELY, 7-6 transportation, 12-2
states, 7-3 data compression
unique, 7-4 See data segment compression, 8-23
data cubes EXPLAIN_MVIEW procedure, 8-53
hierarchical, 18-26 EXPLAIN_REWRITE procedure, 22-57
data manipulation language REFRESH procedure, 14-12, 14-15
parallel DML, 21-18 REFRESH_ALL_MVIEWS procedure, 14-12
transaction model for parallel DML, 21-22 REFRESH_DEPENDENT procedure, 14-12
data marts, 1-7 DBMS_OLAP package, 16-3, 16-4, 16-5
data mining, 20-4 ADD_FILTER_ITEM procedure, 16-18
data segment compression, 3-5 LOAD_WORKLOAD_TRACE procedure, 16-12
bitmap indexes, 5-18 PURGE_FILTER procedure, 16-23
materialized views, 8-23 PURGE_RESULTS procedure, 16-32
partitioning, 3-5, 5-17 PURGE_WORKLOAD procedure, 16-18
data transformation SET_CANCELLED procedure, 16-32
multistage, 13-2 DBMS_STATS package, 16-6, 22-3
pipelined, 13-4 decision support systems (DSS)
data warehouse, 8-2 bitmap indexes, 6-3
architectures, 1-5 disk striping, 21-75
dimension tables, 8-7 parallel DML, 21-20
dimensions, 17-4 parallel SQL, 21-14, 21-20
fact tables, 8-7 performance, 21-20
logical design, 2-2 scoring tables, 21-21
partitioned tables, 5-10 default partition, 5-8
physical design, 3-2 degree of cardinality, 6-3
refresh tips, 14-18 degree of parallelism, 21-32, 21-38, 21-40
refreshing table data, 21-20 and adaptive multiuser, 21-47
star queries, 17-4 between query operations, 21-9
database parallel SQL, 21-34
extraction DELETE statement
with and without Change Data parallel DELETE statement, 21-39
Capture, 15-2 DEMO_DIM package, 9-10
scalability, 21-20 DENSE_RANK function, 19-5
staging, 8-2 design
database writer process (DBWn) logical, 3-2
tuning, 21-84 physical, 3-2
date folding detail tables, 8-7
with query rewrite, 22-18 dimension tables, 2-5, 8-7, 17-4
DB_BLOCK_SIZE initialization parameter, 21-63 normalized, 9-9
and parallel query, 21-63 Dimension Wizard, 9-14
DB_FILE_MULTIBLOCK_READ_COUNT dimensional modeling, 2-3
initialization parameter, 21-63 dimensions, 2-6, 9-2, 9-11
DBA_DATA_FILES view, 21-70 altering, 9-13
DBA_EXTENTS view, 21-70 analyzing, 18-3
DBMS_LOGMNR_CDC_PUBLISH package, 15-3 creating, 9-4
DBMS_LOGMNR_CDC_SUBSCRIBE definition, 9-2
package, 15-3 dimension tables, 8-7
DBMS_MVIEW package, 14-14 dropping, 9-14
hierarchies, 2-6 EXPLAIN PLAN statement, 21-66, 22-56
hierarchies overview, 2-6 query parallelization, 21-81
multiple, 18-3 star transformations, 17-9
star joins, 17-4 exporting
star queries, 17-4 a source table
validating, 9-12 change data capture, 15-20
with query rewrite, 22-63 EXP utility, 11-10
direct-path INSERT expression matching
restrictions, 21-24 with query rewrite, 22-17
disk affinity extend window
disabling with MPP, 4-6 to create a new view, 15-3
parallel DML, 21-76 extents
partitions, 21-75 parallel DDL, 21-16
disk striping size, 13-28
affinity, 21-75 external tables, 13-6
DISK_ASYNCH_IO initialization parameter, 21-64 extraction, transformation, and loading (ETL), 10-2
distributed transactions overview, 10-2
parallel DDL restrictions, 21-11 process, 7-2
parallel DML restrictions, 21-11, 21-27 extractions
DML statements data files, 11-8
captured by Change Data Capture, 15-4 distributed operations, 11-11
DML_LOCKS initialization parameter, 21-61 full, 11-3
drilling down, 9-2 incremental, 11-3
hierarchies, 9-2 OCI, 11-10
DROP MATERIALIZED VIEW statement, 8-22 online, 11-4
prebuilt tables, 8-33 overview, 11-2
DROP PARTITION clause, 5-33 physical, 11-4
dropping Pro*C, 11-10
dimensions, 9-14 SQL*Plus, 11-8
materialized views, 8-52
F
E fact tables, 2-5
ENFORCED mode, 22-10 star joins, 17-4
ENQUEUE_RESOURCES initialization star queries, 17-4
parameter, 21-61 facts, 9-2
entity, 2-2 FAST clause, 8-26
estimating materialized view size, 16-38 fast refresh, 14-14
ETL. See extraction, transformation, and loading restrictions, 8-27
(ETL), 10-2 FAST_START_PARALLEL_ROLLBACK
EVALUATE_MVIEW_STRATEGY package, 16-39 initialization parameter, 21-60
EXCHANGE PARTITION statement, 7-7 features, new, xxxiii
execution plans FIRST_ROWS(n) hint, 21-93
parallel operations, 21-66 FIRST_VALUE function, 19-24
star transformations, 17-9 FIRST/LAST functions, 19-28
FORCE clause, 8-26 granting access to change data, 15-3
foreign key granules, 5-3
constraints, 7-5 block range, 5-3
joins partition, 5-4
snowflake schemas, 17-5 GROUP_ID function, 18-17
fragmentation grouping
parallel DDL, 21-16 compatibility check, 22-40
FREELISTS parameter, 21-84 conditions, 22-64
full partition-wise joins, 5-21 GROUPING function, 18-13
functions when to use, 18-16
COUNT, 6-5 GROUPING_ID function, 18-17
CUME_DIST, 19-13 GROUPING_SETS expression, 18-19
DENSE_RANK, 19-5 groups, instance, 21-37
FIRST_VALUE, 19-24 GV$FILESTAT view, 21-68
FIRST/LAST, 19-28
GROUP_ID, 18-17
GROUPING, 18-13
H
GROUPING_ID, 18-17 hash partitioning, 5-7
LAG/LEAD, 19-27 HASH_AREA_SIZE initialization parameter
LAST_VALUE, 19-24 and parallel execution, 21-59
linear regression, 19-31 hierarchical cubes, 8-42
NTILE, 19-14 hierarchies, 9-2
parallel execution, 21-28 how used, 2-6
PERCENT_RANK, 19-14 multiple, 9-7
RANK, 19-5 overview, 2-6
ranking, 19-5 rolling up and drilling down, 9-2
RATIO_TO_REPORT, 19-27 hints
REGR_AVGX, 19-32 FIRST_ROWS(n), 21-93
REGR_AVGY, 19-32 PARALLEL, 21-34
REGR_COUNT, 19-32 PARALLEL_INDEX, 21-35
REGR_INTERCEPT, 19-32 query rewrite, 22-8, 22-9
REGR_SLOPE, 19-32 histograms
REGR_SXX, 19-33 creating with user-defined buckets, 19-45
REGR_SXY, 19-33 hypothetical rank, 19-38
REGR_SYY, 19-33
reporting, 19-24 I
ROW_NUMBER, 19-16
WIDTH_BUCKET, 19-42 importing
windowing, 19-17 a change table
Change Data Capture, 15-21
a source table
G Change Data Capture, 15-20
global indexes
indexes, 21-83 bitmap indexes, 6-6
striping, 4-6 bitmap join, 6-6
B-tree, 6-10 STAR_TRANSFORMATION_ENABLED, 17-6
cardinality, 6-3 TAPE_ASYNCH_IO, 21-64
creating in parallel, 21-85 TIMED_STATISTICS, 21-69
global, 21-83 TRANSACTIONS, 21-60
local, 21-83 INSERT statement
nulls and, 6-5 functionality, 21-87
parallel creation, 21-85, 21-86 parallelizing INSERT ... SELECT, 21-41
parallel DDL storage, 21-16 instance groups for parallel operations, 21-37
parallel local, 21-86 instance recovery
partitioned tables, 6-6 SMON process, 21-24
partitioning, 5-9 instances
STORAGE clause, 21-86 instance groups, 21-37
index-organized tables integrity constraints, 7-2
parallel CREATE, 21-14 integrity rules
parallel queries, 21-11 parallel DML restrictions, 21-26
INITIAL extent size, 13-28 invalidating
initialization parameters materialized views, 8-50
CLUSTER_DATABASE_INSTANCES, 21-57 I/O
COMPATIBLE, 13-28, 22-8 asynchronous, 21-64
DB_BLOCK_SIZE, 21-63 parallel execution, 5-2, 21-2
DB_FILE_MULTIBLOCK_READ_ striping to avoid bottleneck, 4-2
COUNT, 21-63
DISK_ASYNCH_IO, 21-64
DML_LOCKS, 21-61
J
ENQUEUE_RESOURCES, 21-61 Java
FAST_START_PARALLEL_ROLLBACK, 21-60 used by Change Data Capture, 15-8
HASH_AREA_SIZE, 21-59 JOB_QUEUE_PROCESSES initialization
JOB_QUEUE_PROCESSES, 14-18 parameter, 14-18
LARGE_POOL_SIZE, 21-52 join compatibility, 22-31
LOG_BUFFER, 21-61 joins
MULTIBLOCK_READ_COUNT, 13-28 full partition-wise, 5-21
OPTIMIZER_MODE, 14-18, 21-93, 22-8 partial partition-wise, 5-27
PARALLEL_ADAPTIVE_MULTI_USER, 21-47 partition-wise, 5-21
PARALLEL_AUTOMATIC_TUNING, 21-30 star joins, 17-4
PARALLEL_EXECUTION_MESSAGE_ star queries, 17-4
SIZE, 21-58, 21-59
PARALLEL_MAX_SERVERS, 14-18, 21-4, 21-50 K
PARALLEL_MIN_PERCENT, 21-36, 21-50,
21-57 key lookups, 13-33
PARALLEL_MIN_SERVERS, 21-3, 21-4, 21-51 keys, 8-7, 17-4
PARALLEL_THREADS_PER_CPU, 21-30
PGA_AGGREGATE_TARGET, 14-18 L
QUERY_REWRITE_ENABLED, 22-7, 22-8
ROLLBACK_SEGMENTS, 21-60 LAG/LEAD functions, 19-27
SHARED_POOL_SIZE, 21-52, 21-56 LARGE_POOL_SIZE initialization
parameter, 21-52 estimating size, 16-38
LAST_VALUE function, 19-24 invalidating, 8-50
level relationships, 2-6 logs, 11-7
purpose, 2-7 naming, 8-22
levels, 2-6, 2-7 nested, 8-18
linear regression functions, 19-31 OLAP, 8-41
list partitioning, 5-7 OLAP cubes, 8-41
load partitioned tables, 14-26
parallel, 13-31 partitioning, 8-35
LOB datatypes prebuilt, 8-22
restrictions query rewrite
parallel DDL, 21-14 hints, 22-8, 22-9
parallel DML, 21-25 matching join graphs, 8-24
local indexes, 6-3, 6-6, 21-83 parameters, 22-8
local striping, 4-5 privileges, 22-10
locks refresh dependent, 14-16
parallel DML, 21-24 refreshing, 8-26, 14-12
LOG_BUFFER initialization parameter refreshing all, 14-16
and parallel execution, 21-61 registration, 8-33
LOGGING clause, 21-84 restrictions, 8-24
logging mode rewrites
parallel DDL, 21-14, 21-15 enabling, 22-7
logical design, 3-2 schema design, 8-8
lookup tables, 17-4 schema design guidelines, 8-8
See dimension tables, 8-7 security, 8-50
star queries, 17-4 set operators, 8-47
storage characteristics, 8-23
types of, 8-12
M uses for, 8-2
manual MAXEXTENTS keyword, 13-28
refresh, 14-14 MAXEXTENTS UNLIMITED storage
striping, 4-4 parameter, 21-23
massively parallel processing (MPP) measures, 8-7, 17-4
affinity, 21-75, 21-76 memory
disk affinity, 4-6 configure at 2 levels, 21-58
massively parallel systems, 5-2, 21-2 MERGE operation, 13-10
materialized views MERGE PARTITIONS clause, 5-35
aggregates, 8-13 MERGE statement, 14-9
altering, 8-51 MINIMUM EXTENT parameter, 21-17
build methods, 8-23 mirroring
containing only joins, 8-16 disks, 4-10
creating, 8-21 monitoring
data segment compression, 8-23 parallel processing, 21-68
delta joins, 22-35 refresh, 14-19
dropping, 8-33, 8-52 MOVE PARTITION statement
rules of parallelism, 21-43 batch jobs, 21-21
MULTIBLOCK_READ_COUNT initialization parallel DML, 21-20
parameter, 13-28 ON COMMIT clause, 8-26
multiple archiver processes, 21-84 ON DEMAND clause, 8-26
multiple hierarchies, 9-7 OPTIMAL storage parameter, 21-23
MV_CAPABILITIES_TABLE table, 8-54 optimizations
MVIEW_WORKLOAD view, 16-2 parallel SQL, 21-6
query rewrite
enabling, 22-7
N hints, 22-8, 22-9
nested materialized views, 8-18 matching join graphs, 8-24
refreshing, 14-23 query rewrites
restrictions, 8-21 privileges, 22-10
nested tables optimizer
restrictions, 21-13 with rewrite, 22-2
NEVER clause, 8-27 OPTIMIZER_MODE initialization
new features, xxxiii parameter, 14-18, 21-93, 22-8
NOAPPEND hint, 21-88 Oracle Real Application Clusters
NOARCHIVELOG mode, 21-85 disk affinity, 21-75
nodes instance groups, 21-37
disk affinity in Real Application Clusters, 21-75 parallel load, 13-31
NOLOGGING clause, 21-79, 21-84, 21-86 system monitor process and, 21-24
with APPEND hint, 21-88 ORDER BY clause, 8-31
NOLOGGING mode outer joins
parallel DDL, 21-14, 21-15 with query rewrite, 22-63
nonvolatile data, 1-3
NOPARALLEL attribute, 21-77
NOREWRITE hint, 22-8, 22-9 P
NTILE function, 19-14 PARALLEL clause, 21-87, 21-88
nulls parallelization rules, 21-38
indexes and, 6-5 PARALLEL CREATE INDEX statement, 21-60
PARALLEL CREATE TABLE AS SELECT statement
resources required, 21-60
O parallel DDL, 21-13
object types extent allocation, 21-16
parallel query, 21-12 parallelization rules, 21-38
restrictions, 21-13 partitioned tables and indexes, 21-13
restrictions restrictions
parallel DDL, 21-14 LOBs, 21-14
parallel DML, 21-25 object types, 21-13, 21-14
OLAP, 20-2 parallel delete, 21-39
materialized views, 8-41 parallel DELETE statement, 21-39
OLAP cubes parallel DML, 21-18
materialized views, 8-41 applications, 21-20
OLTP database bitmap indexes, 6-3
degree of parallelism, 21-38, 21-40 number of parallel execution servers, 21-3
enabling PARALLEL DML, 21-21 optimizer, 21-6
lock and enqueue resources, 21-24 parallelization rules, 21-38
parallelization rules, 21-38 shared server, 21-4
recovery, 21-23 summary or rollup tables, 21-14
restrictions, 21-24 parallel update, 21-39
object types, 21-13, 21-25 parallel UPDATE statement, 21-39
remote transactions, 21-27 PARALLEL_ADAPTIVE_MULTI_USER
rollback segments, 21-23 initialization parameter, 21-47
transaction model, 21-22 PARALLEL_AUTOMATIC_TUNING initialization
parallel execution parameter, 21-30
cost-based optimization, 21-92 PARALLEL_EXECUTION_MESSAGE_SIZE
index creation, 21-85 initialization parameter, 21-58, 21-59
interoperator parallelism, 21-9 PARALLEL_INDEX hint, 21-35
intraoperator parallelism, 21-9 PARALLEL_MAX_SERVERS initialization
introduction, 5-2 parameter, 14-18, 21-4, 21-50
I/O parameters, 21-63 and parallel execution, 21-49
method of, 21-31 PARALLEL_MIN_PERCENT initialization
plans, 21-66 parameter, 21-36, 21-50, 21-57
process classification, 4-2, 4-6, 4-9, 4-12 PARALLEL_MIN_SERVERS initialization
resource parameters, 21-58 parameter, 21-3, 21-4, 21-51
rewriting SQL, 21-78 PARALLEL_THREADS_PER_CPU initialization
solving problems, 21-77 parameter, 21-30, 21-48
tuning, 5-2, 21-2 parallelism, 5-2
PARALLEL hint, 21-34, 21-77, 21-87 degree, 21-32
parallelization rules, 21-38 degree, overriding, 21-77
UPDATE and DELETE, 21-39 enabling for tables and queries, 21-46
parallel load interoperator, 21-9
example, 13-31 intraoperator, 21-9
Oracle Real Application Clusters, 13-31 parameters
using, 13-25 FREELISTS, 21-84
parallel partition-wise joins partition
performance considerations, 5-30 default, 5-8
parallel query, 21-11 granules, 5-4
bitmap indexes, 6-3 Partition Change Tracking (PCT), 8-35, 14-26
index-organized tables, 21-11 partitioned tables
object types, 21-12 data warehouses, 5-10
restrictions, 21-13 example, 13-29
parallelization rules, 21-38 partitioning, 11-7
parallel scan operations, 4-3 composite, 5-9
parallel SQL data, 5-4
allocating rows to parallel execution data segment compression, 5-17
servers, 21-7 bitmap indexes, 5-18
degree of parallelism, 21-34 hash, 5-7
instance groups, 21-37 indexes, 5-9
list, 5-7 processing, 21-50
materialized views, 8-35 classes of parallel execution, 4-2, 4-6, 4-9, 4-12
prebuilt tables, 8-40 pruning
range, 5-6 partitions, 5-19, 21-75
range-list, 5-15 using DATE columns, 5-20
partitions publication
adding, 5-32 definition, 15-7
affinity, 21-75 publisher tasks, 15-3
bitmap indexes, 6-6 publishers
coalescing, 5-36 capture data, 15-3
dropping, 5-33 determines the source tables, 15-3
exchanging, 5-34 publish change data, 15-3
merging, 5-35 purpose, 15-3
moving, 5-34 purging data, 14-11
parallel DDL, 21-13
partition pruning
disk striping and, 21-75
Q
pruning, 5-19 queries
range partitioning ad hoc, 21-14
disk striping and, 21-75 enabling parallelism for, 21-46
rules of parallelism, 21-43, 21-45 star queries, 17-4
splitting, 5-35 query delta joins, 22-35
truncating, 5-35 query rewrite
partition-wise joins, 5-21 controlling, 22-8
benefits of, 5-29 correctness, 22-10
full, 5-21 enabling, 22-7
partial, 5-27 hints, 22-8, 22-9
PERCENT_RANK function, 19-14 matching join graphs, 8-24
performance methods, 22-11
DSS database, 21-20 parameters, 22-8
PGA_AGGREGATE_TARGET initialization privileges, 22-10
parameter, 14-18 restrictions, 8-25
physical design, 3-2 when it occurs, 22-4
structures, 3-4 QUERY_REWRITE_ENABLED initialization
pivoting, 13-35 parameter, 22-7, 22-8
plans
star transformations, 17-9 R
PL/SQL packages
for publish and subscribe tasks, 15-3 RAID
prebuilt materialized views, 8-22 configurations, 4-9
PRIMARY KEY constraints, 21-86 range partitioning, 5-6
process monitor process (PMON) performance considerations, 5-9
parallel DML process recovery, 21-23 range-list partitioning, 5-15
processes RANK function, 19-5
and memory contention in parallel ranking functions, 19-5
RATIO_TO_REPORT function, 19-27 direct-path INSERT, 21-24
REBUILD INDEX PARTITION statement fast refresh, 8-27
rules of parallelism, 21-43 nested materialized views, 8-21
REBUILD INDEX statement nested tables, 21-13
rules of parallelism, 21-43 parallel DDL, 21-14
recovery remote transactions, 21-11
instance recovery parallel DML, 21-24
parallel DML, 21-24 remote transactions, 21-11, 21-27
SMON process, 21-24 query rewrite, 8-25
media, with striping, 4-10 result set, 17-7
parallel DML, 21-23 revoking access to change data, 15-3
redo buffer allocation retries, 21-61 REWRITE hint, 22-8, 22-9
reference tables rewrites
See dimension tables, 8-7 hints, 22-9
refresh parameters, 22-8
monitoring, 14-19 privileges, 22-10
options, 8-25 query optimizations
refreshing hints, 22-8, 22-9
materialized views, 14-12 matching join graphs, 8-24
nested materialized views, 14-23 rollback segments, 21-60
partitioning, 14-2 MAXEXTENTS UNLIMITED, 21-23
REGR_AVGX function, 19-32 OPTIMAL, 21-23
REGR_AVGY function, 19-32 parallel DML, 21-23
REGR_COUNT function, 19-32 ROLLBACK_SEGMENTS initialization
REGR_INTERCEPT function, 19-32 parameter, 21-60
REGR_R2 function, 19-32 rolling up hierarchies, 9-2
REGR_SLOPE function, 19-32 ROLLUP, 18-6
REGR_SXX function, 19-33 concatenated, 8-43
REGR_SXY function, 19-33 partial, 18-8
REGR_SYY function, 19-33 when to use, 18-7
regression root level, 2-7
detecting, 21-66 ROW_NUMBER function, 19-16
RELY constraints, 7-6 RULE hint, 21-93
remote transactions
parallel DML and DDL restrictions, 21-11
replication
S
restrictions sar UNIX command, 21-74
parallel DML, 21-25 scalability
reporting functions, 19-24 batch jobs, 21-21
resources parallel DML, 21-20
consumption, parameters affecting, 21-58, 21-60 scalable operations, 21-81
limiting for users, 21-51 schemas, 17-2
limits, 21-49 design guidelines for materialized views, 8-8
parallel query usage, 21-58 snowflake, 2-3
restrictions star, 2-3, 17-4
third normal form, 17-2 defining fact tables, 2-6
SELECT privilege dimensional model, 2-4, 17-4
granting and revoking for access to change star transformations, 17-7
data, 15-3 restrictions, 17-12
sessions STAR_TRANSFORMATION_ENABLED
enabling parallel DML, 21-21 initialization parameter, 17-6
set operators statistics, 22-65
materialized views, 8-47 estimating, 21-67
shared server operating system, 21-74
parallel SQL execution, 21-4 storage
SHARED_POOL_SIZE initialization fragmentation in parallel DDL, 21-16
parameter, 21-56 STORAGE clause
SHARED_POOL_SIZE parameter, 21-52 parallel execution, 21-16
single table aggregate requirements, 8-15 parallel query, 21-86
skewing parallel DML workload, 21-37 storage parameters
SMP architecture MAXEXTENTS UNLIMITED, 21-23
disk affinity, 21-76 OPTIMAL (in rollback segments), 21-23
snowflake schemas, 17-5 striping, 4-2
complex queries, 17-5 analyzing, 4-6
SORT_AREA_SIZE initialization parameter automatic, 4-3
and parallel execution, 21-59 example, 13-25
source systems, 11-2 global, 4-5
definition, 15-6 local, 4-5
source tables manual, 4-4
definition, 15-6 media recovery, 4-10
exporting for Change Data Capture, 15-20 subpartition
importing for Change Data Capture, 15-20 mapping, 5-14
space management template, 5-14
MINIMUM EXTENT parameter, 21-17 subqueries
parallel DDL, 21-16 in DDL statements, 21-14
SPLIT PARTITION clause, 5-32, 5-35 subscriber views
rules of parallelism, 21-43 definition, 15-7
SQL statements dropping, 15-3
parallelizing, 21-3, 21-6 removing, 15-3
SQL*Loader, 13-25 subscribers
staging definition, 15-5
areas, 1-6 drop the subscriber view, 15-3
databases, 8-2 drop the subscription, 15-3
files, 8-2 extend the window to create a new view, 15-3
STALE_TOLERATED mode, 22-10 purge the subscription window, 15-3
star joins, 17-4 purpose, 15-3
star queries, 17-4 removing subscriber views, 15-3
star transformation, 17-7 retrieve change data from the subscriber
star schemas views, 15-3
advantages, 2-4 subscribe to source tables, 15-3
tasks, 15-3 third normal form
subscription window queries, 17-3
purging, 15-3 schemas, 17-2
Summary Advisor, 16-2 TIMED_STATISTICS initialization
Wizard, 16-40 parameter, 21-69
summary management timestamps, 11-6
components, 8-5 transactions
summary tables, 2-5 distributed
symmetric multiprocessors, 5-2, 21-2 parallel DDL restrictions, 21-11
SYNC_SET change set parallel DML restrictions, 21-11, 21-27
system-generated change set, 15-7 TRANSACTIONS initialization parameter, 21-60
SYNC_SOURCE change source transformations, 13-2
system-generated change source, 15-6 scenarios, 13-25
system monitor process (SMON) SQL and PL/SQL, 13-9
Oracle Real Application Clusters and, 21-24 SQL*Loader, 13-5
parallel DML instance recovery, 21-24 transportable tablespaces, 11-5, 12-3, 12-6
parallel DML system recovery, 21-24 transportation
definition, 12-2
distributed operations, 12-2
T flat files, 12-2
table queues, 21-71 triggers, 11-7
tables restrictions, 21-27
detail tables, 8-7 parallel DML, 21-25
dimension tables (lookup tables), 8-7 TRUNCATE PARTITION clause, 5-35
dimensions TRUSTED mode, 22-10
star queries, 17-4 two-phase commit, 21-60
enabling parallelism for, 21-46
external, 13-6
fact tables, 8-7 U
star queries, 17-4 unique
historical, 21-21 constraints, 7-4, 21-86
lookup tables (dimension tables), 17-4 identifier, 2-3, 3-2
parallel creation, 21-14 UNLIMITED extents, 21-23
parallel DDL storage, 21-16 update frequencies, 8-12
refreshing in data warehouse, 21-20 UPDATE statement
STORAGE clause with parallel execution, 21-16 parallel UPDATE statement, 21-39
summary or rollup, 21-14 update windows, 8-12
tablespaces user resources
creating, example, 13-27 limiting, 21-51
transportable, 11-5, 12-3, 12-6
TAPE_ASYNCH_IO initialization parameter, 21-64
temporary segments
V
parallel DDL, 21-16 V$FILESTAT view
text match, 22-12 and parallel query, 21-69
with query rewrite, 22-63 V$PARAMETER view, 21-70
V$PQ_SESSTAT view, 21-67, 21-69
V$PQ_SYSSTAT view, 21-67
V$PQ_TQSTAT view, 21-68, 21-70
V$PX_PROCESS view, 21-69
V$PX_SESSION view, 21-68
V$PX_SESSTAT view, 21-69
V$SESSTAT view, 21-71, 21-74
V$SYSSTAT view, 21-61, 21-71, 21-84
validating dimensions, 9-12
view constraints, 7-7, 22-14
views
ALL_SOURCE_TABLES, 15-15
DBA_DATA_FILES, 21-70
DBA_EXTENTS, 21-70
V$FILESTAT, 21-69
V$PARAMETER, 21-70
V$PQ_SESSTAT, 21-69
V$PQ_TQSTAT, 21-70
V$PX_PROCESS, 21-69
V$SESSTAT, 21-71, 21-74
V$SYSSTAT, 21-71
vmstat UNIX command, 21-74
W
WIDTH_BUCKET function, 19-42
windowing functions, 19-17
workloads
distribution, 21-67
skewing, 21-37