
DE LAB PROGRAMS

1. Creating Star Schema / Snowflake Schema / Fact Constellation schema using any tool
a) All Electronics sales application.
b) Identify the facts and dimensions for a banking environment.

Star Schema: The star schema is a widely used schema design in data warehousing. It features a central fact table that holds the primary data or measures, such as sales, revenue, or quantities. The fact table is connected to multiple dimension tables, each representing different attributes or characteristics related to the data in the fact table. The dimension tables are not directly connected to each other, creating a simple and easy-to-understand structure.

 Simplicity: Star schema is the simplest and most straightforward schema design, with fewer tables and relationships. It provides ease of understanding, querying, and report generation.
 Denormalization: Dimension tables in star schema are often denormalized, meaning they may contain redundant data to optimize query performance.

Example: Consider a retail data warehouse. The fact table might contain sales data with measures like “Total Sales” and “Quantity Sold.” The dimension tables could include “Product” with attributes like “Product ID,” “Product Name,” and “Category,” and “Time” with attributes like “Date,” “Month,” and “Year.” The fact table connects to these dimension tables through foreign keys, allowing analysts to perform queries like “Total Sales by Product Category” or “Quantity Sold by Date.”
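The following is a minimal sketch of this star schema using pandas DataFrames; the table names, column names, and data are illustrative assumptions, not part of the exercise.

import pandas as pd

# Dimension tables: descriptive attributes.
product_dim = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Laptop", "Phone", "Tablet"],
    "category": ["Computers", "Mobiles", "Mobiles"],
})
time_dim = pd.DataFrame({
    "time_id": [101, 102],
    "date": ["2024-01-15", "2024-02-10"],
    "month": ["Jan", "Feb"],
    "year": [2024, 2024],
})

# Fact table: measures plus foreign keys into the dimensions.
sales_fact = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "time_id": [101, 101, 102, 102],
    "total_sales": [1200.0, 800.0, 850.0, 400.0],
    "quantity_sold": [1, 2, 2, 1],
})

# "Total Sales by Product Category": one join from fact to dimension, then aggregate.
report = (sales_fact.merge(product_dim, on="product_id")
          .groupby("category")["total_sales"].sum())
print(report)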
Snowflake Schema: The snowflake schema is an extension of the
star schema, designed to further reduce data redundancy by
normalizing the dimension tables. In a snowflake schema,
dimension tables are broken down into multiple related sub-
tables. This normalization creates a more complex structure with
additional levels of relationships, reducing storage requirements
but potentially increasing query complexity due to the need for
additional joins.

 Normalization: Snowflake schema normalizes dimension tables, resulting in more tables and more complex relationships compared to the star schema.
 Space Efficiency: Due to normalization, the snowflake schema may require less storage space for dimension data but may lead to more complex queries due to additional joins.

Example: Continuing with the retail data warehouse example, in a snowflake schema, the “Product” dimension may be normalized into sub-tables like “Product_Category,” “Product_Subcategory,” and “Product_Details,” each holding specific attributes related to the product. This normalization allows for efficient storage of data, but it may require more complex queries to navigate through the snowflake structure.
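Adapting the earlier sketch to a snowflake layout, the Product dimension below is split into a product table plus a separate category table, so the same report needs one more join; again the names and data are only assumptions for illustration.

import pandas as pd

product_category = pd.DataFrame({
    "category_id": [10, 20],
    "category_name": ["Computers", "Mobiles"],
})
product_dim = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Laptop", "Phone", "Tablet"],
    "category_id": [10, 20, 20],   # foreign key into product_category
})
sales_fact = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "total_sales": [1200.0, 800.0, 850.0, 400.0],
})

# Same question as before, but now it takes two joins instead of one.
report = (sales_fact.merge(product_dim, on="product_id")
          .merge(product_category, on="category_id")
          .groupby("category_name")["total_sales"].sum())
print(report)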
Fact Constellation (Galaxy Schema): The fact constellation
schema, also known as a galaxy schema, is a more complex
design that involves multiple fact tables sharing dimension
tables. It is used when there are multiple fact tables with different
measures and each fact table is related to several common
dimension tables.

 Complexity: Fact constellation schema is the most complex among the three designs, as it involves multiple interconnected star schemas.
 Flexibility: This schema design offers more flexibility in modeling complex and diverse business scenarios, allowing multiple fact tables to coexist and share dimensions.

Example: In a data warehouse for a healthcare organization, there could be multiple fact tables representing different metrics like patient admissions, medical procedures, and medication dispensing. These fact tables would share common dimension tables like “Patient,” “Doctor,” and “Date.” The fact constellation schema allows analysts to analyze different aspects of healthcare operations while efficiently reusing shared dimension tables.
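A compact sketch of the shared-dimension idea, with two hypothetical fact tables (admissions and procedures) reusing one Patient dimension; all names and values are invented for illustration.

import pandas as pd

patient_dim = pd.DataFrame({
    "patient_id": [1, 2],
    "patient_name": ["A. Rao", "B. Devi"],
})
admissions_fact = pd.DataFrame({"patient_id": [1, 1, 2], "admissions": [1, 1, 1]})
procedures_fact = pd.DataFrame({"patient_id": [1, 2, 2], "procedures": [2, 1, 3]})

# Each fact table joins to the same shared dimension independently.
print(admissions_fact.merge(patient_dim, on="patient_id")
      .groupby("patient_name")["admissions"].sum())
print(procedures_fact.merge(patient_dim, on="patient_id")
      .groupby("patient_name")["procedures"].sum())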
2. Compute all the cuboids of a 4D cube using group-bys.
Parameters
column_alias

Column alias appearing in the query block’s SELECT list.

position

Position of an expression in the SELECT list.

expr

Any expression on tables in the current scope.

Usage notes

 Snowflake allows up to 7 elements (equivalent to 128 grouping sets) in each cube.

Examples

Start by creating and loading a table with information about sales from a chain store that has
branches in different cities and states/territories.

-- Create some tables and insert some rows.
CREATE TABLE products (product_ID INTEGER, wholesale_price REAL);
INSERT INTO products (product_ID, wholesale_price) VALUES
    (1, 1.00),
    (2, 2.00);

CREATE TABLE sales (product_ID INTEGER, retail_price REAL,
    quantity INTEGER, city VARCHAR, state VARCHAR);
INSERT INTO sales (product_id, retail_price, quantity, city, state) VALUES
    (1, 2.00, 1, 'SF', 'CA'),
    (1, 2.00, 2, 'SJ', 'CA'),
    (2, 5.00, 4, 'SF', 'CA'),
    (2, 5.00, 8, 'SJ', 'CA'),
    (2, 5.00, 16, 'Miami', 'FL'),
    (2, 5.00, 32, 'Orlando', 'FL'),
    (2, 5.00, 64, 'SJ', 'PR');

Run a cube query that shows profit by city, state, and total across all states. The example below
shows a query that has three “levels”:

 Each city.
 Each state.
 All revenue combined.

This example uses ORDER BY state, city NULLS LAST to ensure that each state’s rollup
comes immediately after all of the cities in that state, and that the final rollup appears at the end
of the output.

SELECT state, city, SUM((s.retail_price - p.wholesale_price) * s.quantity) AS profit
FROM products AS p, sales AS s
WHERE s.product_ID = p.product_ID
GROUP BY CUBE (state, city)
ORDER BY state, city NULLS LAST;
+-------+---------+--------+
| STATE | CITY    | PROFIT |
|-------+---------+--------|
| CA    | SF      |     13 |
| CA    | SJ      |     26 |
| CA    | NULL    |     39 |
| FL    | Miami   |     48 |
| FL    | Orlando |     96 |
| FL    | NULL    |    144 |
| PR    | SJ      |    192 |
| PR    | NULL    |    192 |
| NULL  | Miami   |     48 |
| NULL  | Orlando |     96 |
| NULL  | SF      |     13 |
| NULL  | SJ      |    218 |
| NULL  | NULL    |    375 |
+-------+---------+--------+

Some rollup rows contain NULL values. For example, the last row in the table contains a NULL
value for the city and a NULL value for the state because the data is for all cities and states, not a
specific city and state.

Both GROUP BY CUBE and GROUP BY ROLLUP produce one row for each city/state pair, and
both GROUP BY clauses also produce rollup rows for each individual state and for all states
combined. The difference between the two GROUP BY clauses is that GROUP BY CUBE also
produces an output row for each city name (‘Miami’, ‘SJ’, etc.).
Be careful using GROUP BY CUBE on hierarchical data. In this example, the row for “SJ” contains
totals for both the city named “SJ” in the state of “CA” and the city named “SJ” in the territory
of “PR”, even though the only relationship between those cities is that they have the same name.
In general, use GROUP BY ROLLUP to analyze hierarchical data, and GROUP BY CUBE to analyze
data across independent axes.
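The example above cubes only two dimensions. For the stated exercise, computing all the cuboids of a 4D cube with plain group-bys means running one GROUP BY per subset of the four dimensions, 2^4 = 16 cuboids in all (the empty subset is the apex, or grand total). A minimal pandas sketch, with illustrative dimension names and made-up data:

import pandas as pd
from itertools import combinations

data = pd.DataFrame({
    "product": ["P1", "P1", "P2", "P2"],
    "city":    ["SF", "SJ", "SF", "Miami"],
    "state":   ["CA", "CA", "CA", "FL"],
    "year":    [1996, 1997, 1996, 1997],
    "sales":   [10, 20, 40, 80],
})
dimensions = ["product", "city", "state", "year"]

# One GROUP BY for every subset of the four dimensions: 16 cuboids in total.
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        if dims:
            cuboid = data.groupby(list(dims))["sales"].sum().reset_index()
        else:
            # The empty grouping list is the apex cuboid (grand total).
            cuboid = pd.DataFrame({"sales": [data["sales"].sum()]})
        print("Cuboid", dims if dims else "(apex)")
        print(cuboid, "\n")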

3. Compute all the cuboids of a 4D cube using the ROLLUP and CUBE operators of Oracle SQL.
Overview of CUBE, ROLLUP, and Top-N Queries

The last decade has seen a tremendous increase in the use of query, reporting, and on-line analytical processing (OLAP) tools, often in conjunction with data warehouses and data marts. Enterprises exploring new markets and facing greater competition expect these tools to provide the maximum possible decision-making value from their data resources.

Oracle expands its long-standing support for analytical applications in Oracle8i release 8.1.5 with the CUBE and ROLLUP extensions to SQL. Oracle also provides optimized performance and simplified syntax for Top-N queries. These enhancements make important calculations significantly easier and more efficient, enhancing database performance, scalability and simplicity.

ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY
clause. ROLLUP creates subtotals at any level of aggregation needed, from the most
detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single
statement to calculate all possible combinations of subtotals. CUBE can generate the
information needed in cross-tab reports with a single query. To enhance performance,
both CUBE and ROLLUP are parallelized: multiple processes can simultaneously execute
both types of statements.

See Also:

For information on parallel execution, see Oracle8i Concepts.

Enhanced Top-N queries enable more efficient retrieval of the largest and smallest
values of a data set. This chapter presents concepts, syntax, and examples
of CUBE, ROLLUP and Top-N analysis.
Analyzing across Multiple Dimensions

One of the key concepts in decision support systems is "multi-dimensional analysis": examining the enterprise from all necessary combinations of dimensions. We use the term "dimension" to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as "facts." The facts may be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.

Here are some examples of multi-dimensional requests:

 Show total sales across all products at increasing aggregation levels: from
state to country to region for 1996 and 1997.
 Create a cross-tabular analysis of our operations showing expenses by
territory in South America for 1996 and 1997. Include all possible subtotals.
 List the top 10 sales representatives in Asia according to 1997 sales revenue for automotive products and rank their commissions.

All the requests above constrain multiple dimensions. Many multi-dimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.

To visualize data that has many dimensions, analysts commonly use the analogy of a
data "cube," that is, a space where facts are stored at the intersection of n
dimensions. Figure 20-1 shows a data cube and how it could be used differently by
various groups. The cube stores sales data organized by the dimensions of Product,
Market, and Time.
Figure 20-1 Cube and Views by Different Users

We can retrieve "slices" of data from the cube. These correspond to cross-tabular
reports such as the one shown in Table 20-1. Regional managers might study the data
by comparing slices of the cube applicable to different markets. In contrast, product
managers might compare slices that apply to different products. An ad hoc user might
work with a wide variety of constraints, working in a subset cube.

Answering multi-dimensional questions often involves huge quantities of data, sometimes millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Subtotals across many dimensions are vital to multi-dimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.

Optimized Performance

Not only multi-dimensional issues, but all types of processing can benefit from
enhanced aggregation facilities. Transaction processing, financial and manufacturing
systems--all of these generate large numbers of production reports needing substantial
system resources. Improved efficiency when creating these reports will reduce system
load. In fact, any computer process that aggregates data from details to higher levels
needs optimized performance.
To leverage the power of the database server, powerful aggregation commands should
be available inside the SQL engine. New extensions in Oracle provide these features
and bring many benefits, including:

 Simplified programming requiring less SQL code for many tasks
 Quicker and more efficient query processing
 Reduced client processing loads and network traffic because aggregation work is shifted to servers
 Opportunities for caching aggregations because similar queries can leverage existing work

Oracle8i provides all these benefits with the new CUBE and ROLLUP extensions to
the GROUP BY clause. These extensions adhere to the ANSI and ISO proposals for
SQL3, a draft standard for enhancements to SQL.

A Scenario

To illustrate CUBE, ROLLUP, and Top-N queries, this chapter uses a hypothetical
videotape sales and rental company. All the examples given refer to data from this
scenario. The hypothetical company has stores in several regions and tracks sales and
profit information. The data is categorized by three dimensions: Time, Department,
and Region. The time dimensions are 1996 and 1997, the departments are Video Sales
and Video Rentals, and the regions are East, West, and Central.

Table 20-1 is a sample cross-tabular report showing the total profit by region and
department in 1997:

Table 20-1 Simple Cross-Tabular Report, with Subtotals Shaded (1997)

Region     Video Rental Profit   Video Sales Profit   Total Profit
Central     82,000                85,000              167,000
East       101,000               137,000              238,000
West        96,000                97,000              193,000
Total      279,000               319,000              598,000


Consider that even a simple report like Table 20-1, with just twelve values in its grid,
needs five subtotals and a grand total. The subtotals are the shaded numbers, such as
Video Rental Profits across regions, namely, 279,000, and Eastern region profits
across department, namely, 238,000. Half of the values needed for this report would
not be calculated with a query that used a standard SUM() and GROUP BY. Database
commands that offer improved calculation of subtotals bring major benefits to
querying, reporting and analytical operations.

ROLLUP

ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding minimal overhead to a query.

Syntax

ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
ROLLUP(grouping_column_reference_list)

Details

ROLLUP's action is straightforward: it creates subtotals which "roll up" from the most
detailed level to a grand total, following a grouping list specified in
the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns.
First, it calculates the standard aggregate values specified in the GROUP BY clause. Then,
it creates progressively higher-level subtotals, moving from right to left through the
list of grouping columns. Finally, it creates a grand total.

ROLLUP will create subtotals at n+1 levels, where n is the number of grouping columns.
For instance, if a query specifies ROLLUP on grouping columns of Time, Region, and
Department ( n=3), the result set will include rows at four aggregation levels.

Example

This example of ROLLUP uses the data in the video store database.
SELECT Time, Region, Department,
   SUM(Profit) AS Profit FROM sales
GROUP BY ROLLUP(Time, Region, Department);
As you can see in Table 20-2, this query returns the following sets of rows:

 Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
 First-level subtotals aggregating across Department for each combination of Time and Region
 Second-level subtotals aggregating across Region and Department for each Time value
 A grand total row

Table 20-2 ROLLUP Aggregation across Three Dimensions

Time    Region    Department    Profit
1996    Central   VideoRental     75,000
1996    Central   VideoSales      74,000
1996    Central   [NULL]         149,000
1996    East      VideoRental     89,000
1996    East      VideoSales     115,000
1996    East      [NULL]         204,000
1996    West      VideoRental     87,000
1996    West      VideoSales      86,000
1996    West      [NULL]         173,000
1996    [NULL]    [NULL]         526,000
1997    Central   VideoRental     82,000
1997    Central   VideoSales      85,000
1997    Central   [NULL]         167,000
1997    East      VideoRental    101,000
1997    East      VideoSales     137,000
1997    East      [NULL]         238,000
1997    West      VideoRental     96,000
1997    West      VideoSales      97,000
1997    West      [NULL]         193,000
1997    [NULL]    [NULL]         598,000
[NULL]  [NULL]    [NULL]       1,124,000
Interpreting "[NULL]" Values in Results

The NULL values returned by ROLLUP and CUBE are not always the traditional NULL value
meaning "value unknown." Instead, a NULL may indicate that its row is a subtotal. For
instance, the first NULL value shown in Table 20-2 is in the Department column.
This NULL means that the row is a subtotal for "All Departments" for the Central
region in 1996. To avoid introducing another non-value in the database system, these
subtotal values are not given a special tag.

See the section "GROUPING Function" for details on how the NULLs representing
subtotals are distinguished from NULLs stored in the data.

Note:

The NULLs shown in the figures of this paper are displayed only for clarity:
in standard Oracle output these cells would be blank.

Calculating Subtotals without ROLLUP

The result set in Table 20-2 could be generated by the UNION of four SELECT statements, as shown below. This is a subtotal across three dimensions. Notice that a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked with UNION ALL.
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION ALL
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION ALL
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
UNION ALL
SELECT '', '', '', SUM(Profit)
FROM Sales;
The approach shown in the SQL above has two shortcomings compared to using
the ROLLUP operator. First, the syntax is complex, requiring more effort to generate and
understand. Second, and more importantly, query execution is inefficient because the
optimizer receives no guidance about the user's overall goal. Each of the
four SELECT statements above causes table access even though all the needed subtotals
could be gathered with a single pass. The ROLLUP extension makes the desired result
explicit and gathers its results with just one table access.

The more columns used in a ROLLUP clause, the greater the savings versus
the UNION approach. For instance, if a four-column ROLLUP replaces a UNION of
5 SELECT statements, the reduction in table access is four-fifths or 80%.

Some data access tools calculate subtotals on the client side and thereby avoid the
multiple SELECT statements described above. While this approach can work, it places
significant loads on the computing environment. For large reports, the client must
have substantial memory and processing power to handle the subtotaling tasks. Even
if the client has the necessary resources, a heavy processing burden for subtotal
calculations may slow down the client in its performance of other activities.

When to Use ROLLUP

Use the ROLLUP extension in tasks involving subtotals.

 It is very helpful for subtotaling along a hierarchical dimension such as time or geography. For instance, a query could specify a ROLLUP of year/month/day or country/state/city.
 It simplifies and speeds the population and maintenance of summary tables. Data warehouse administrators may want to make extensive use of it. Note that population of summary tables is even faster if the ROLLUP query executes in parallel.

See Also:

For information on parallel execution, see Oracle8i Concepts.

CUBE

Note that the subtotals created by ROLLUP are only a fraction of possible subtotal
combinations. For instance, in the cross-tab shown in Table 20-1, the departmental
totals across regions (279,000 and 319,000) would not be calculated by
a ROLLUP(Time, Region, Department) clause. To generate those numbers would require
a ROLLUP clause with the grouping columns specified in a different order: ROLLUP(Time,
Department, Region). The easiest way to generate the full set of subtotals needed for
cross-tabular reports such as those needed for Figure 20-1 is to use the CUBE extension.

CUBE enables a SELECT statement to calculate subtotals for all possible combinations of
a group of dimensions. It also calculates a grand total. This is the set of information
typically needed for all cross-tabular reports, so CUBE can calculate a cross-tabular
report with a single SELECT statement. Like ROLLUP, CUBE is a simple extension to
the GROUP BY clause, and its syntax is also easy to learn.

Syntax

CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
CUBE (grouping_column_reference_list)

Details

CUBE takes a specified set of grouping columns and creates subtotals for all possible
combinations of them. In terms of multi-dimensional analysis, CUBE generates all the
subtotals that could be calculated for a data cube with the specified dimensions. If you
have specified CUBE(Time, Region, Department), the result set will include all the
values that would be included in an equivalent ROLLUP statement plus additional
combinations. For instance, in Table 20-1, the departmental totals across regions
(279,000 and 319,000) would not be calculated by a ROLLUP(Time, Region,
Department) clause, but they would be calculated by a CUBE(Time, Region,
Department) clause. If there are n columns specified for a CUBE, there will be 2^n combinations of subtotals returned. Table 20-3 gives an example of a three-dimension CUBE.

Example

This example of CUBE uses the data in the video store database.
SELECT Time, Region, Department,
   SUM(Profit) AS Profit FROM sales
GROUP BY CUBE (Time, Region, Department);
Table 20-3 shows the results of this query.

Table 20-3 Cube Aggregation across Three Dimensions

Time    Region    Department    Profit
1996    Central   VideoRental     75,000
1996    Central   VideoSales      74,000
1996    Central   [NULL]         149,000
1996    East      VideoRental     89,000
1996    East      VideoSales     115,000
1996    East      [NULL]         204,000
1996    West      VideoRental     87,000
1996    West      VideoSales      86,000
1996    West      [NULL]         173,000
1996    [NULL]    VideoRental    251,000
1996    [NULL]    VideoSales     275,000
1996    [NULL]    [NULL]         526,000
1997    Central   VideoRental     82,000
1997    Central   VideoSales      85,000
1997    Central   [NULL]         167,000
1997    East      VideoRental    101,000
1997    East      VideoSales     137,000
1997    East      [NULL]         238,000
1997    West      VideoRental     96,000
1997    West      VideoSales      97,000
1997    West      [NULL]         193,000
1997    [NULL]    VideoRental    279,000
1997    [NULL]    VideoSales     319,000
1997    [NULL]    [NULL]         598,000
[NULL]  Central   VideoRental    157,000
[NULL]  Central   VideoSales     159,000
[NULL]  Central   [NULL]         316,000
[NULL]  East      VideoRental    190,000
[NULL]  East      VideoSales     252,000
[NULL]  East      [NULL]         442,000
[NULL]  West      VideoRental    183,000
[NULL]  West      VideoSales     183,000
[NULL]  West      [NULL]         366,000
[NULL]  [NULL]    VideoRental    530,000
[NULL]  [NULL]    VideoSales     594,000
[NULL]  [NULL]    [NULL]       1,124,000

4. SQL queries for implementing different OLAP operations.
OLAP Operations in DBMS
OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.
OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube (a short pandas sketch illustrating these operations follows the list):
1. Drill down: In drill-down operation, the less detailed data is converted
into highly detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is
performed by moving down in the concept hierarchy of Time dimension
(Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is
selected by selecting following dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube which results in a
new sub-cube creation. In the cube given in the overview section, Slice is
performed on the dimension Time = “Q1”.

5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it.
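Although the exercise asks for SQL queries, the following pandas sketch illustrates the five operations on a small, made-up sales cube (the same selections map to SQL GROUP BY and WHERE clauses); the dimension values (Delhi/Kolkata, Q1/Q2, Car/Bus) mirror the criteria above, but the table, column names, and figures are assumptions for illustration only.

import pandas as pd

cube = pd.DataFrame({
    "location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
    "time":     ["Q1",    "Q2",    "Q1",      "Q2",      "Q1"],
    "item":     ["Car",   "Bus",   "Car",     "Bus",     "Car"],
    "sales":    [100, 150, 120, 80, 200],
})

# Roll-up: aggregate away dimensions (here, sum over time and item per location).
rollup = cube.groupby("location")["sales"].sum()

# Drill-down: the inverse of roll-up; group by more dimensions for finer detail.
drill_down = cube.groupby(["location", "time"])["sales"].sum()

# Slice: fix a single dimension value, producing a sub-cube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select a sub-cube with criteria on two or more dimensions.
dice = cube[cube["location"].isin(["Delhi", "Kolkata"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. locations as rows and quarters as columns.
pivot = cube.pivot_table(index="location", columns="time",
                         values="sales", aggfunc="sum")

print(rollup, drill_down, slice_q1, dice, pivot, sep="\n\n")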
5. Write high level language programs to implement different data preprocessing
techniques.

a. Suppose that the data for analysis includes the attribute age. The age values for the data tuples
are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70. Write a C program to implement smoothing by bin means to smooth the
data, using a bin depth of 3.

b. Write a C program to calculate the correlation coefficient. Use the following data to check your
code.

Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the
following result:

Are these two variables positively or negatively correlated?

Answer:
Step 1: Since the bin depth is 3, every bin will have 3 values. We have 27 values in total, so there will be 9 bins.

BIN 1 : 13,15,16
BIN 2 : 16,19,20
BIN 3 : 20,21,22
BIN 4 : 22,25,25
BIN 5 : 25,25,30
BIN 6 : 33,33,35
BIN 7 : 35,35,35
BIN 8 : 36,40,45
BIN 9 : 46,52,70
Step 2 : Now every bin value will be replaced by the respective mean of that
bin
BIN 1 : 14.67,14.67,14.67
BIN 2 : 18.33,18.33,18.33
BIN 3 : 21,21,21
BIN 4 : 24,24,24
BIN 5 : 26.67,26.67,26.67
BIN 6 : 33.67,33.67,33.67
BIN 7 : 35,35,35
BIN 8 : 40.33,40.33,40.33
BIN 9 : 56,56,56
In Smoothing by bin means, each value in a bin is replaced by the mean value
of the bin. In general, the larger the width the greater the effect of the
smoothing.
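The exercise asks for a C program; the short sketch below uses Python for brevity, and the same loop structure ports directly to C. It reproduces the bin means of the worked answer.

# Smoothing by bin means with a bin depth of 3.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
depth = 3

smoothed = []
for i in range(0, len(ages), depth):
    bin_values = ages[i:i + depth]
    bin_mean = sum(bin_values) / len(bin_values)
    # Every value in the bin is replaced by the bin mean.
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)   # starts 14.67, 14.67, 14.67, 18.33, ... as in the worked answer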

Program to find correlation coefficient

Correlation coefficient
  = (5 * 3000 - 105 * 140) / sqrt((5 * 2295 - 105^2) * (5 * 3964 - 140^2))
  = 300 / sqrt(450 * 220)
  = 0.953463
Example
Input  : X[] = {43, 21, 25, 42, 57, 59}
         Y[] = {99, 65, 79, 75, 87, 81}
Output : 0.529809

Input  : X[] = {15, 18, 21, 24, 27}
         Y[] = {25, 25, 27, 31, 32}
Output : 0.953463
// Program to find correlation coefficient
#include <bits/stdc++.h>
using namespace std;

// function that returns the correlation coefficient.
float correlationCoefficient(int X[], int Y[], int n)
{
    int sum_X = 0, sum_Y = 0, sum_XY = 0;
    int squareSum_X = 0, squareSum_Y = 0;

    for (int i = 0; i < n; i++)
    {
        // sum of elements of array X.
        sum_X = sum_X + X[i];

        // sum of elements of array Y.
        sum_Y = sum_Y + Y[i];

        // sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i];

        // sum of squares of array elements.
        squareSum_X = squareSum_X + X[i] * X[i];
        squareSum_Y = squareSum_Y + Y[i] * Y[i];
    }

    // use the formula for calculating the correlation coefficient.
    float corr = (float)(n * sum_XY - sum_X * sum_Y)
                 / sqrt((float)(n * squareSum_X - sum_X * sum_X)
                        * (n * squareSum_Y - sum_Y * sum_Y));

    return corr;
}

// Driver function
int main()
{
    int X[] = {15, 18, 21, 24, 27};
    int Y[] = {25, 25, 27, 31, 32};

    // Find the size of the array.
    int n = sizeof(X) / sizeof(X[0]);

    // Function call to correlationCoefficient.
    cout << correlationCoefficient(X, Y, n);

    return 0;
}
Output
0.953463
6. Write a C program to implement:
(a) min-max normalization
(b) z-score normalization
(c) Normalization by decimal scaling.

Ans: Example:
Here, we will discuss an example as follows.

Normalize the following group of data:
1000, 2000, 3000, 9000
using min-max normalization by setting new_min = 0 and new_max = 1.

Solution:
Here, new_max(A) = 1 and new_min(A) = 0, as given in the question.
max(A) = 9000, as the maximum value among 1000, 2000, 3000, 9000 is 9000.
min(A) = 1000, as the minimum value among 1000, 2000, 3000, 9000 is 1000.
The min-max formula is v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A).

Case 1: normalizing 1000 –
v = 1000; putting all values in the formula, we get
v' = ((1000 - 1000) / (9000 - 1000)) * (1 - 0) + 0 = 0

Case 2: normalizing 2000 –
v = 2000; putting all values in the formula, we get
v' = ((2000 - 1000) / (9000 - 1000)) * (1 - 0) + 0 = 0.125

Case 3: normalizing 3000 –
v = 3000; putting all values in the formula, we get
v' = ((3000 - 1000) / (9000 - 1000)) * (1 - 0) + 0 = 0.25

Case 4: normalizing 9000 –
v = 9000; putting all values in the formula, we get
v' = ((9000 - 1000) / (9000 - 1000)) * (1 - 0) + 0 = 1

Outcome:
Hence, the normalized values of 1000, 2000, 3000, 9000 are 0, 0.125, 0.25, 1.
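The exercise asks for a C program; here is a brief Python sketch of the same calculation (the arithmetic ports directly to C), reproducing the worked values above.

# Min-max normalization to the range [new_min, new_max] = [0, 1].
values = [1000, 2000, 3000, 9000]
old_min, old_max = min(values), max(values)
new_min, new_max = 0.0, 1.0

normalized = [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
              for v in values]
print(normalized)   # [0.0, 0.125, 0.25, 1.0]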

(b) z-score normalization:

Z = (X − μ) / σ

Where,
 Z is the Z-score.
 X is the value of the data point.
 μ is the mean of the dataset.
 σ is the standard deviation of the dataset.
Practical Example: Z-Score Normalization in Python
Here is a simple example of how to perform Z-score normalization using
Python:
Step 1: Import the required library and create the data
 import numpy as np: This imports the NumPy library and gives it the alias np, which is a common convention.
 data = np.array([70, 80, 90, 100, 110]): This creates a NumPy array named data containing the sample test scores.

import numpy as np

# Sample data: test scores
data = np.array([70, 80, 90, 100, 110])

Step 2: Calculate the Mean


mean = np.mean(data): This calculates the mean (average) of the data array
using the mean function from NumPy and stores it in the variable mean.

# Calculate the mean


mean = np.mean(data)
Step 3: Calculate the standard deviation
std_dev = np.std(data): This calculates the standard deviation of the data
array using the std function from NumPy and stores it in the variable std_dev

# Calculate the standard deviation


std_dev = np.std(data)

Step 4: Perform Z-score normalization


z_scores = (data - mean) / std_dev: This applies the Z-score normalization
formula to each element in the data array. It subtracts the mean from each
data point and divides the result by the standard deviation. The resulting Z-
scores are stored in the array z_scores.
# Perform Z-score normalization
z_scores = (data - mean) / std_dev
# Print the results
print("Original data:", data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Z-scores:", z_scores)

Output:
Original data: [ 70 80 90 100 110]
Mean: 90.0
Standard Deviation: 14.142135623730951
Z-scores: [-1.41421356 -0.70710678 0. 0.70710678
1.41421356]

(c) Normalization by decimal scaling.

Decimal#normalize(): normalize() is a Decimal class method which returns the simplest form of the Decimal value.

Syntax: Decimal.normalize()
Parameter: Decimal values
Return: the simplest form of the Decimal value.


Code #1 : Example for normalize() method

# Python Program explaining
# normalize() method

# loading decimal library
from decimal import *

# Initializing decimal values
a = Decimal(-1)
b = Decimal('0.142857')

# printing Decimal values
print("Decimal value a : ", a)
print("Decimal value b : ", b)

# Using Decimal.normalize() method
print("\n\nDecimal a with normalize() method : ", a.normalize())
print("Decimal b with normalize() method : ", b.normalize())

Output :
Decimal value a : -1
Decimal value b : 0.142857

Decimal a with normalize() method :  -1
Decimal b with normalize() method :  0.142857

Code #2 : Example for normalize() method

# Python Program explaining
# normalize() method

# loading decimal library
from decimal import *

# Initializing decimal values
a = Decimal('-3.14')
b = Decimal('321e+5')   # note: no spaces inside the literal, or Decimal raises an error

# printing Decimal values
print("Decimal value a : ", a)
print("Decimal value b : ", b)

# Using Decimal.normalize() method
print("\n\nDecimal a with normalize() method : ", a.normalize())
print("Decimal b with normalize() method : ", b.normalize())

Output :
Decimal value a : -3.14
Decimal value b : 3.21E+7
Decimal a with normalize() method : -3.14
Decimal b with normalize() method : 3.21E+7
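Note that Decimal.normalize() only canonicalizes a Decimal's representation; decimal-scaling normalization in the data-preprocessing sense divides every value by a power of 10 (10^j, with the smallest j such that the largest absolute normalized value is below 1). A minimal sketch with made-up sample values:

# Decimal-scaling normalization: v' = v / 10**j.
values = [-10, 201, 301, -401, 501, 601, 701]

max_abs = max(abs(v) for v in values)
j = 0
while max_abs / (10 ** j) >= 1:   # smallest j with max(|v'|) < 1
    j += 1

normalized = [v / (10 ** j) for v in values]
print(j, normalized)   # j = 3, e.g. 701 -> 0.701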

7. Write a high level program for the following:
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i) What is the mean of the data? What is median?
ii) What is the mode of the data? Comment on the data’s modality (bimodal,
trimodal, etc..)
iii) What is mid-range of the data?
iv) Can you find the first quartile(Q1) and the third quartile (Q3) of the data?
v) Give the five number summary of the data.

Solution:

a) The mean of the data is x̄ = (1/N) Σ xi = 809/27 ≈ 30. The median of the data is the middle value of the ordered set, which is 25.

b) The mode of the data is the value with the highest frequency. In this example, 25 and 35 both have the same highest frequency, and hence the data is bimodal in nature.

c) The midrange of the data is the average of the largest (70) and smallest (13) values in the data set: (70 + 13) / 2 = 41.5.
d) First Quartile (Q1) = ((n + 1)/4)th term = ((27 + 1)/4)th = 7th term, which is 20. It is also known as the lower quartile.

- The second quartile (the 50th percentile, or the median) is given as: Second Quartile (Q2) = ((n + 1)/2)th term = 14th term = 25.

- The third quartile (the 75th percentile) is given as: Third Quartile (Q3) = (3(n + 1)/4)th term = 21st term = 35, also known as the upper quartile.

- The interquartile range is calculated as: Upper Quartile − Lower Quartile = 35 − 20 = 15.

e) The five-number summary of the data is: Minimum = 13, Q1 = 20, Median = 25, Q3 = 35, Maximum = 70.
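A brief Python sketch for this exercise, computing the same statistics programmatically (the quartile positions follow the (n + 1)/4 rule used in the worked solution):

# Mean, median, mode, mid-range, quartiles and five-number summary of the age data.
from statistics import mean, median, multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
n = len(ages)

print("Mean:", round(mean(ages), 2))
print("Median:", median(ages))
print("Mode(s):", multimode(ages))               # bimodal: 25 and 35
print("Mid-range:", (min(ages) + max(ages)) / 2)

q1 = ages[(n + 1) // 4 - 1]        # 7th value
q3 = ages[3 * (n + 1) // 4 - 1]    # 21st value
print("Q1:", q1, "Q3:", q3)
print("Five-number summary:", [min(ages), q1, median(ages), q3, max(ages)])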
