De Lab Programs
Usage notes
Examples
Start by creating and loading a table with information about sales from a chain store that has
branches in different cities and states/territories.
Run a cube query that shows profit by city, state, and total across all states. The example below
shows a query that has three “levels”:
Each city.
Each state.
All revenue combined.
This example uses ORDER BY state, city NULLS LAST to ensure that each state’s rollup
comes immediately after all of the cities in that state, and that the final rollup appears at the end
of the output.
Some rollup rows contain NULL values. For example, the last row in the table contains a NULL
value for the city and a NULL value for the state because the data is for all cities and states, not a
specific city and state.
Both GROUP BY CUBE and GROUP BY ROLLUP produce one row for each city/state pair, and
both GROUP BY clauses also produce rollup rows for each individual state and for all states
combined. The difference between the two GROUP BY clauses is that GROUP BY CUBE also
produces an output row for each city name (‘Miami’, ‘SJ’, etc.).
Be careful using GROUP BY CUBE on hierarchical data. In this example, the row for “SJ” contains
totals for both the city named “SJ” in the state of “CA” and the city named “SJ” in the territory
of “PR”, even though the only relationship between those cities is that they have the same name.
In general, use GROUP BY ROLLUP to analyze hierarchical data, and GROUP BY CUBE to analyze
data across independent axes.
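The difference between the two clauses can be sketched by computing the grouping sets by hand in Python. The rows in `sales` and their profit figures are made-up illustrative values (not the data loaded in the example above), and `aggregate` is a hypothetical helper, not part of any database API.

```python
from collections import defaultdict

# Hypothetical (state, city, profit) rows; the figures are illustrative only.
sales = [
    ("CA", "SF", 100),
    ("CA", "SJ", 200),
    ("FL", "Miami", 50),
    ("PR", "SJ", 30),
]

def aggregate(rows, keep_state, keep_city):
    """Sum profit per group; a dropped column is reported as None (SQL NULL)."""
    totals = defaultdict(int)
    for state, city, profit in rows:
        key = (state if keep_state else None, city if keep_city else None)
        totals[key] += profit
    return dict(totals)

# ROLLUP(state, city) uses the grouping sets (state, city), (state), ().
rollup = {}
for keep_state, keep_city in [(True, True), (True, False), (False, False)]:
    rollup.update(aggregate(sales, keep_state, keep_city))

# CUBE(state, city) adds the (city) grouping set on top of ROLLUP's.
cube = dict(rollup)
cube.update(aggregate(sales, False, True))

print(rollup[("CA", None)])  # rollup row for one state
print(rollup[(None, None)])  # grand total
print(cube[(None, "SJ")])    # combines SJ in CA with SJ in PR
```

The `(None, "SJ")` row exists only in the CUBE result, and it adds together two unrelated cities that happen to share a name, which is the pitfall described above.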
The last decade has seen a tremendous increase in the use of query, reporting, and on-
line analytical processing (OLAP) tools, often in conjunction with data warehouses
and data marts. Enterprises exploring new markets and facing greater competition
expect these tools to provide the maximum possible decision-making value from their
data resources.
ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY
clause. ROLLUP creates subtotals at any level of aggregation needed, from the most
detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single
statement to calculate all possible combinations of subtotals. CUBE can generate the
information needed in cross-tab reports with a single query. To enhance performance,
both CUBE and ROLLUP are parallelized: multiple processes can simultaneously execute
both types of statements.
Enhanced Top-N queries enable more efficient retrieval of the largest and smallest
values of a data set. This chapter presents concepts, syntax, and examples
of CUBE, ROLLUP and Top-N analysis.
Analyzing across Multiple Dimensions
Show total sales across all products at increasing aggregation levels: from
state to country to region for 1996 and 1997.
Create a cross-tabular analysis of our operations showing expenses by
territory in South America for 1996 and 1997. Include all possible subtotals.
List the top 10 sales representatives in Asia according to 1997 sales revenue
for automotive products and rank their commissions.
To visualize data that has many dimensions, analysts commonly use the analogy of a
data "cube," that is, a space where facts are stored at the intersection of n
dimensions. Figure 20-1 shows a data cube and how it could be used differently by
various groups. The cube stores sales data organized by the dimensions of Product,
Market, and Time.
Figure 20-1 Cube and Views by Different Users
We can retrieve "slices" of data from the cube. These correspond to cross-tabular
reports such as the one shown in Table 20-1. Regional managers might study the data
by comparing slices of the cube applicable to different markets. In contrast, product
managers might compare slices that apply to different products. An ad hoc user might
work with a wide variety of constraints, working in a subset cube.
Optimized Performance
Not only multi-dimensional issues, but all types of processing can benefit from
enhanced aggregation facilities. Transaction processing, financial and manufacturing
systems--all of these generate large numbers of production reports needing substantial
system resources. Improved efficiency when creating these reports will reduce system
load. In fact, any computer process that aggregates data from details to higher levels
needs optimized performance.
To leverage the power of the database server, powerful aggregation commands should
be available inside the SQL engine. New extensions in Oracle provide these features
and bring many benefits.
Oracle8i provides all these benefits with the new CUBE and ROLLUP extensions to
the GROUP BY clause. These extensions adhere to the ANSI and ISO proposals for
SQL3, a draft standard for enhancements to SQL.
A Scenario
To illustrate CUBE, ROLLUP, and Top-N queries, this chapter uses a hypothetical
videotape sales and rental company. All the examples given refer to data from this
scenario. The hypothetical company has stores in several regions and tracks sales and
profit information. The data is categorized by three dimensions: Time, Department,
and Region. The time dimensions are 1996 and 1997, the departments are Video Sales
and Video Rentals, and the regions are East, West, and Central.
Table 20-1 is a sample cross-tabular report showing the total profit by region and
department in 1997:
ROLLUP
Syntax
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
ROLLUP(grouping_column_reference_list)
Details
ROLLUP's action is straightforward: it creates subtotals which "roll up" from the most
detailed level to a grand total, following a grouping list specified in
the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns.
First, it calculates the standard aggregate values specified in the GROUP BY clause. Then,
it creates progressively higher-level subtotals, moving from right to left through the
list of grouping columns. Finally, it creates a grand total.
ROLLUP will create subtotals at n+1 levels, where n is the number of grouping columns.
For instance, if a query specifies ROLLUP on grouping columns of Time, Region, and
Department (n=3), the result set will include rows at four aggregation levels.
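The right-to-left progression can be sketched in Python by listing the grouping sets that a ROLLUP implies. The helper name `rollup_grouping_sets` is hypothetical and only illustrates the rule described above; it is not how the database implements ROLLUP.

```python
def rollup_grouping_sets(columns):
    """Return the grouping sets of ROLLUP(columns), most detailed first."""
    # Drop one column at a time from the right; the empty tuple at the end
    # stands for the grand total.
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

levels = rollup_grouping_sets(["Time", "Region", "Department"])
for level in levels:
    print(level)
```

For three grouping columns the list has four entries: the full grouping, (Time, Region), (Time), and the empty grouping for the grand total.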
Example
This example of ROLLUP uses the data in the video store database.
SELECT Time, Region, Department,
SUM(Profit) AS Profit FROM Sales
GROUP BY ROLLUP(Time, Region, Department);
As you can see in Table 20-2, this query returns the following sets of rows:
The NULL values returned by ROLLUP and CUBE are not always the traditional NULL value
meaning "value unknown." Instead, a NULL may indicate that its row is a subtotal. For
instance, the first NULL value shown in Table 20-2 is in the Department column.
This NULL means that the row is a subtotal for "All Departments" for the Central
region in 1996. To avoid introducing another non-value in the database system, these
subtotal values are not given a special tag.
See the section "GROUPING Function" for details on how the NULLs representing
subtotals are distinguished from NULLs stored in the data.
Note:
The NULLs shown in the figures of this paper are displayed only for clarity:
in standard Oracle output these cells would be blank.
The result set in Table 20-2 could be generated by combining four SELECT statements
with UNION ALL, as shown below. This is a subtotal across three dimensions. Notice that
a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT
statements linked with UNION ALL.
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION ALL
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION ALL
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
UNION ALL
SELECT '', '', '', SUM(Profit)
FROM Sales;
The approach shown in the SQL above has two shortcomings compared to using
the ROLLUP operator. First, the syntax is complex, requiring more effort to generate and
understand. Second, and more importantly, query execution is inefficient because the
optimizer receives no guidance about the user's overall goal. Each of the
four SELECT statements above causes table access even though all the needed subtotals
could be gathered with a single pass. The ROLLUP extension makes the desired result
explicit and gathers its results with just one table access.
The more columns used in a ROLLUP clause, the greater the savings versus
the UNION approach. For instance, if a four-column ROLLUP replaces a UNION of
5 SELECT statements, the reduction in table access is four-fifths or 80%.
Some data access tools calculate subtotals on the client side and thereby avoid the
multiple SELECT statements described above. While this approach can work, it places
significant loads on the computing environment. For large reports, the client must
have substantial memory and processing power to handle the subtotaling tasks. Even
if the client has the necessary resources, a heavy processing burden for subtotal
calculations may slow down the client in its performance of other activities.
CUBE
Note that the subtotals created by ROLLUP are only a fraction of possible subtotal
combinations. For instance, in the cross-tab shown in Table 20-1, the departmental
totals across regions (279,000 and 319,000) would not be calculated by
a ROLLUP(Time, Region, Department) clause. To generate those numbers would require
a ROLLUP clause with the grouping columns specified in a different order: ROLLUP(Time,
Department, Region). The easiest way to generate the full set of subtotals needed for
cross-tabular reports such as those needed for Figure 20-1 is to use the CUBE extension.
CUBE enables a SELECT statement to calculate subtotals for all possible combinations of
a group of dimensions. It also calculates a grand total. This is the set of information
typically needed for all cross-tabular reports, so CUBE can calculate a cross-tabular
report with a single SELECT statement. Like ROLLUP, CUBE is a simple extension to
the GROUP BY clause, and its syntax is also easy to learn.
Syntax
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
CUBE (grouping_column_reference_list)
Details
CUBE takes a specified set of grouping columns and creates subtotals for all possible
combinations of them. In terms of multi-dimensional analysis, CUBE generates all the
subtotals that could be calculated for a data cube with the specified dimensions. If you
have specified CUBE(Time, Region, Department), the result set will include all the
values that would be included in an equivalent ROLLUP statement plus additional
combinations. For instance, in Table 20-1, the departmental totals across regions
(279,000 and 319,000) would not be calculated by a ROLLUP(Time, Region,
Department) clause, but they would be calculated by a CUBE(Time, Region,
Department) clause. If there are n columns specified for a CUBE, there will be 2^n
combinations of subtotals returned. Table 20-3 gives an example of a
three-dimension CUBE.
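The count of 2^n combinations follows from the fact that every subset of the grouping columns is a grouping set. A minimal Python sketch, in which `cube_grouping_sets` is a hypothetical helper used only for illustration:

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Return every subset of columns, i.e. the grouping sets of CUBE(columns)."""
    grouping_sets = []
    for size in range(len(columns), -1, -1):
        grouping_sets.extend(combinations(columns, size))
    return grouping_sets

sets = cube_grouping_sets(["Time", "Region", "Department"])
for grouping_set in sets:
    print(grouping_set)
print(len(sets))
```

The set (Time, Department) appears in this list but is not among the groupings of ROLLUP(Time, Region, Department), which is why the departmental totals across regions require CUBE.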
Example
This example of CUBE uses the data in the video store database.
SELECT Time, Region, Department,
SUM(Profit) AS Profit FROM Sales
GROUP BY CUBE (Time, Region, Department);
Table 20-3 shows the results of this query.
There are five basic analytical operations that can be performed on an OLAP
cube:
1. Drill down: The drill-down operation converts less detailed data
into more detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is
performed by moving down in the concept hierarchy of Time dimension
(Quarter -> Month).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is
selected by applying the following criteria to the dimensions:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It fixes a single dimension of the OLAP cube to one value, which
results in a new sub-cube. In the cube given in the overview section, Slice is
performed on the dimension Time = “Q1”.
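The dice and slice operations can be sketched in Python as filters over fact rows. The rows in `facts` and the unit counts are hypothetical, since the cube from the overview section is not reproduced here; `dice` and `slice_time` are illustrative helper names.

```python
# Hypothetical fact rows: (location, quarter, item, units_sold).
facts = [
    ("Delhi", "Q1", "Car", 10),
    ("Delhi", "Q2", "Bus", 4),
    ("Kolkata", "Q1", "Car", 7),
    ("Kolkata", "Q3", "Car", 5),
    ("Mumbai", "Q1", "Phone", 20),
    ("Mumbai", "Q2", "Car", 3),
]

def dice(rows, locations, quarters, items):
    """Keep rows whose value on every listed dimension is in the allowed set."""
    return [r for r in rows
            if r[0] in locations and r[1] in quarters and r[2] in items]

def slice_time(rows, quarter):
    """Fix the Time dimension to one value and drop it from each row."""
    return [(loc, item, units) for loc, q, item, units in rows if q == quarter]

diced = dice(facts, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
sliced = slice_time(facts, "Q1")
print(diced)
print(sliced)
```

Dice keeps all three dimensions but restricts each to a set of values, while slice removes the Time dimension entirely by pinning it to "Q1".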
a. Suppose that the data for analysis includes the attribute age. The age values for the data tuples
are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70. Write a C program to implement smoothing by bin means to smooth the
data, using a bin depth of 3.
b. Write a C program to calculate the correlation coefficient. Use the following data to check your
code.
Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the
following result:
Answer:
Step 1 : Since the bin depth is 3, every bin holds 3 values. There are
27 values in total, so there will be 9 bins
BIN 1 : 13,15,16
BIN 2 : 16,19,20
BIN 3 : 20,21,22
BIN 4 : 22,25,25
BIN 5 : 25,25,30
BIN 6 : 33,33,35
BIN 7 : 35,35,35
BIN 8 : 36,40,45
BIN 9 : 46,52,70
Step 2 : Now every bin value will be replaced by the respective mean of that
bin
BIN 1 : 14.67,14.67,14.67
BIN 2 : 18.33,18.33,18.33
BIN 3 : 21,21,21
BIN 4 : 24,24,24
BIN 5 : 26.67,26.67,26.67
BIN 6 : 33.67,33.67,33.67
BIN 7 : 35,35,35
BIN 8 : 40.33,40.33,40.33
BIN 9 : 56,56,56
In Smoothing by bin means, each value in a bin is replaced by the mean value
of the bin. In general, the larger the width the greater the effect of the
smoothing.
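The two steps above can be sketched in Python. The exercise asks for a C program; this is only a compact illustration of the same logic, and the function name `smooth_by_bin_means` is a hypothetical choice. Means are rounded to two decimals to match the figures listed in Step 2.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def smooth_by_bin_means(values, depth):
    """Equal-depth binning: replace each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed

smoothed = smooth_by_bin_means(ages, 3)
for i in range(0, len(smoothed), 3):
    print(f"BIN {i // 3 + 1} : {smoothed[i:i + 3]}")
```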
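#include <bits/stdc++.h>
using namespace std;

// ... definition of correlationCoefficient(X, Y, n) ...
    return corr;
}

// Driver function
int main()
{
    // ... declarations of the sample arrays X and Y ...
    int n = sizeof(X) / sizeof(X[0]);
    cout << correlationCoefficient(X, Y, n);
    return 0;
}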
Output
0.953463
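A complete version of the calculation can be sketched in Python using the standard computational form of the Pearson correlation coefficient. The arrays `X` and `Y` below are illustrative values assumed for this sketch, not the hospital data from the exercise; with these inputs the result rounds to the value 0.953463 printed above.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient of the paired samples xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Illustrative arrays, assumed here for demonstration only.
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
r = correlation_coefficient(X, Y)
print(round(r, 6))
```

The function assumes both lists have the same length and that neither is constant, since a constant list makes the denominator zero.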
6. Write a C program to implement:
(a) min-max normalization
(b) z-score normalization
(c) Normalization by decimal scaling.
Answer:
Z-score normalization rescales each value using the formula:
Z = (X − μ) / σ
Where,
Z is the Z-score.
X is the value of the data point.
μ is the mean of the dataset.
σ is the standard deviation of the dataset.
Practical Example: Z-Score Normalization in Python
Here is a simple example of how to perform Z-score normalization using
Python:
Step 1: Importing the required Libraries
import numpy as np: This imports the NumPy library and gives it the alias
np, which is a common convention.
import matplotlib.pyplot as plt: This imports the pyplot module from the
Matplotlib library and gives it the alias plt.
data = np.array([70, 80, 90, 100, 110]): This creates a NumPy
array named data containing the sample test scores.
import numpy as np
Output:
Original data: [ 70 80 90 100 110]
Mean: 90.0
Standard Deviation: 14.142135623730951
Z-scores: [-1.41421356 -0.70710678 0. 0.70710678
1.41421356]
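The figures in the output above can be reproduced with a short sketch. It uses the standard library's `statistics` module instead of NumPy so that it runs without third-party packages; `pstdev` is the population standard deviation, which is also what `np.std` computes by default. The data values are taken from the "Original data" line of the output.

```python
from statistics import fmean, pstdev

# Sample test scores, as listed in the "Original data" output line.
data = [70, 80, 90, 100, 110]

mean = fmean(data)
std_dev = pstdev(data)  # population standard deviation
z_scores = [(x - mean) / std_dev for x in data]

print("Original data:", data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Z-scores:", [round(z, 8) for z in z_scores])
```

After the transformation the values have mean 0 and standard deviation 1, which is the defining property of Z-score normalization.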
from decimal import Decimal
a = Decimal(-1)
b = Decimal('0.142857')
Output :
Decimal value a : -1
Decimal value b : 0.142857
# normalize() method
a = Decimal('-3.14')
b = Decimal('321e+5')
Output :
Decimal value a : -3.14
Decimal value b : 3.21E+7
Decimal a with normalize() method : -3.14
Decimal b with normalize() method : 3.21E+7
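Note that `Decimal.normalize()` only strips trailing zeros from a single number; it does not rescale a dataset. Min-max normalization and normalization by decimal scaling, as asked in parts (a) and (c), can be sketched as follows. The function names and the `scores` list are hypothetical choices for illustration, and the exercise's C implementation would follow the same arithmetic.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumes the values are not all equal
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]

def decimal_scaling_normalize(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scores = [70, 80, 90, 100, 110]  # hypothetical sample values
min_max = min_max_normalize(scores)
scaled, j = decimal_scaling_normalize(scores)

print("Min-max:", min_max)
print("Decimal scaling (j = %d):" % j, scaled)
```

Min-max normalization preserves the relative spacing of the values within the target range, while decimal scaling only moves the decimal point, so the largest absolute value decides the divisor.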
Solution:
a) The mean of the data is x̄ = (1/N) Σ(i=1..N) xi = 809/27 ≈ 30. The median of the data is the middle value of the
ordered set, which is 25.
b) The mode of the data is the value with the highest frequency. In this example, 25 and 35 both
occur with the same highest frequency (four times each), so the data is bimodal.
c) The midrange of the data is the average of the largest (70) and smallest (13) values in the data set:
(70 + 13)/2 = 41.5
d) First Quartile (Q1) = ((n+1)/4)th term = ((27+1)/4)th term = 7th term, which is 20. It is also known as the 25th percentile.
-The second quartile, also called the 50th percentile or the median, is given as: Second
Quartile (Q2) = ((n+1)/2)th term = 14th term = 25
-The third quartile, or the 75th percentile (Q3), is given as: Third Quartile (Q3) = (3(n+1)/4)th