DWM Compiled Notes

BCA/[Link](IT) - 5th SEM [Link]

CLASS: BCA / B.Sc (IT) 5th Sem

SUBJECT: Data Warehousing and Mining

Notes as per IKGPTU Syllabus

FACULTY OF COMPUTER SCIENCE AND IT, SBS COLLEGE, LUDHIANA

SBS ©PROPRIETARY


INDEX

Course Code: UGCA1931


Course Name: Data Warehouse and Mining

Program: BCA L: 3 T: 1 P: 0
Branch: Computer Applications Credits: 4
Semester: 5th Contact hours: 44 hours
Theory/Practical: Theory Percentage of numerical/design problems: 20%
Internal max. marks: 40 Duration of end semester exam (ESE): 3hrs
External max. marks: 60 Elective status: Elective
Total marks: 100

Course
Prerequisite: -NA-
Co requisite: -NA-
Additional material required in ESE: -NA-

Outcomes: After completing this course, students will be able to:


CO# Course outcomes
CO1 Highlight the need of Data Warehousing & Mining
CO2 Differentiate between the Transactional and Analytical data models.
CO3 Identify the real-life applications where data mining can be applied.
CO4 Apply different data mining algorithms on wide range of data sets.
CO5 Explain the role of visualization in data representation and analysis.

Detailed Contents (with contact hours)

Unit-I (11 contact hours)

Need for strategic information, difference between operational and informational data stores.
Data warehouse definition, characteristics, Data warehouse role and structure, OLAP Operations,
Data mart, Difference between data mart and data warehouse, approaches to build a data
warehouse, Building a data warehouse, Metadata & its types. [CO1]

Unit-II (11 contact hours)

Data Pre-processing: Need, Data Summarization, Methods. Denormalization, Multidimensional
data model, Schemas for multi-dimensional data (Star schema, Snowflake Schema, Fact
Constellation Schema), Difference between different schemas. Data warehouse architecture,
OLAP servers, Indexing OLAP Data, OLAP query processing, Data cube computation. [CO2]


Unit-III (12 contact hours)

Data Mining: Definition, Data Mining process, Data mining methodology, Data mining tasks,
Mining various Data types & issues. Attribute-Oriented Induction, Association rule mining,
Frequent itemset mining, The Apriori Algorithm, Mining multilevel association rules. [CO3]

Unit-IV

Overview of classification, Classification process, Decision tree, Decision Tree Induction,
Attribute Selection Measures. Overview of classifier's accuracy, evaluating classifier's
accuracy, Techniques for accuracy estimation, Increasing the accuracy of classifier. [CO4]

Introduction to Clustering, Types of clusters, Clustering methods, Data visualization &
various data visualization tools. [CO5]
Textbooks:
1. Berson, Data Warehousing, Data Mining & OLAP, Tata McGraw-Hill.
2. Han J., Kamber M. and Pei J., Data Mining: Concepts and Techniques, Morgan Kaufmann
Publishers (2011), 3rd ed.
3. Pudi V., Krishna P.R., Data Mining, Oxford University Press (2009), 1st ed.
4. Adriaans P., Zantinge D., Data Mining, Pearson Education (1996), 1st ed.
5. Ponniah P., Data Warehousing Fundamentals, Wiley-Interscience Publication (2001), 1st ed.

UNIT-I
 Data Warehousing:
Data warehouse is a subject oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts to take informed decisions in an
organization.

An operational database undergoes frequent changes on a daily basis on account


of the transactions that take place. Suppose a business executive wants to analyze
previous feedback on any data such as a product, a supplier, or any consumer data,
then the executive will have no data available to analyze because the previous data
has been updated due to transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along
with this generalized and consolidated view of data, a data warehouse also provides Online
Analytical Processing (OLAP) tools. These tools help in interactive and effective analysis of
data in a multidimensional space. This analysis results in data generalization and data
mining.

Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels
of abstraction. That is why the data warehouse has now become an important platform for data
analysis and online analytical processing.

There are many types of data warehouses, but these are the three most common:

Enterprise data warehouse – Provides a central repository tailored to support decision-making
for the entire enterprise.

Operational Data Store – Similar to the enterprise warehouse in terms of scope, but data is
refreshed in near real time and can be used for operational reporting.

Data Mart – This is a subset of a data warehouse used to support a specific region, business
unit or functional area.

 Understanding a Data Warehouse

A data warehouse is a database, which is kept separate from the organization's


operational database.

There is no frequent updating done in a data warehouse.

It possesses consolidated historical data, which helps the organization to analyze


its business.

A data warehouse helps executives to organize, understand, and use their data to
take strategic decisions.

Data warehouse systems help in the integration of a diversity of application systems.

A data warehouse system helps in consolidated historical data analysis.

 Benefits of a Data Warehouse


Organizations have a common goal – to make better business decisions. A data
warehouse, once implemented into your business intelligence framework, can
benefit your company in numerous ways. A data warehouse:

1. Delivers enhanced business intelligence

By having access to information from various sources from a single platform,


decision makers will no longer need to rely on limited data or their instinct.
Additionally, data warehouses can effortlessly be applied to a business’s
processes, for instance, market segmentation, sales, risk, inventory, and financial
management.

2. Saves time

A data warehouse standardizes, preserves, and stores data from distinct sources,
aiding the consolidation and integration of all the data. Since critical data is
available to all users, it allows them to make informed decisions on key aspects. In
addition, executives can query the data themselves with little to no IT support,
saving more time and money.

3. Enhances data quality and consistency

A data warehouse converts data from multiple sources into a consistent format.
Since the data from across the organization is standardized, each department will
produce results that are consistent. This will lead to more accurate data, which will
become the basis for solid decisions.

4. Generates a high Return on Investment (ROI)

Companies that invest in a data warehouse experience higher revenues and cost savings than
those that haven't.

5. Provides competitive advantage

Data warehouses help organizations get a holistic view of their current standing and evaluate
opportunities and risks, thus providing companies with a competitive advantage.

6. Improves the decision-making process

Data warehousing provides better insights to decision makers by maintaining a


cohesive database of current and historical data. By transforming data into
purposeful information, decision makers can perform more functional, precise,
and reliable analysis and create more useful reports with ease.

7. Enables organizations to forecast with confidence

Data professionals can analyze business data to make market forecasts, identify potential
KPIs, and gauge predicted results, allowing key personnel to plan accordingly.

8. Streamlines the flow of information

Data warehousing facilitates the flow of information through a network


connecting all related or non-related parties.

 The Disadvantages of a Data Warehouse

Data warehouses are relational databases that act as data analysis tools,
aggregating data from multiple departments of a business into one data store.
Data warehouses are typically updated as an end-of-day batch job, rather than
being churned by real time transactional data. Their primary benefits are giving
managers better and timelier data to make strategic decisions for the company.
However, they have some drawbacks as well.

Extra Reporting Work

Depending on the size of the organization, a data warehouse runs the risk of imposing
extra work on departments. Each type of data that's needed in the warehouse
typically has to be generated by the IT teams in each division of the business.
This can be as simple as duplicating data from an existing database, but at other
times, it involves gathering data from customers or employees that wasn't
gathered before.

Cost/Benefit Ratio

A commonly cited disadvantage of data warehousing is the cost/benefit analysis.


A data warehouse is a big IT project, and like many big IT projects, it can suck a
lot of IT man hours and budgetary money to generate a tool that doesn't get used
often enough to justify the implementation expense. This is completely
sidestepping the issue of the expense of maintaining the data warehouse and
updating it as the business grows and adapts to the market.

Data Ownership Concerns

Data warehouses are often, but not always, Software as a Service


implementations, or cloud services applications. Your data security in this
environment is only as good as your cloud vendor. Even if implemented locally,
there are concerns about data access throughout the company. Make sure that the
people doing the analysis are individuals that your organization trusts, especially
with customers' personal data. A data warehouse that leaks customer data is a
privacy and public relations nightmare.

Data Flexibility

Data warehouses tend to have static data sets with minimal ability to "drill
down" to specific solutions. The data is imported and filtered through a schema,
and it is often days or weeks old by the time it's actually used. In addition, data
warehouses are usually subject to ad hoc queries and are thus notoriously
difficult to tune for processing speed and query speed. While the queries are
often ad hoc, the queries are limited by what data relations were set when the
aggregation was assembled.
Difference between data warehouse and data mart

Definition
  Data Warehouse: A Data Warehouse is a large repository of data collected from different
  organizations or departments within a corporation.
  Data Mart: A data mart is a subtype (subset) of a Data Warehouse, designed to meet the needs
  of a certain user group.

Usage
  Data Warehouse: It helps to take strategic decisions.
  Data Mart: It helps to take tactical decisions for the business.

Objective
  Data Warehouse: The main objective of a Data Warehouse is to provide an integrated environment
  and a coherent picture of the business at a point in time.
  Data Mart: A data mart is mostly used in a business division at the department level.

Designing
  Data Warehouse: The designing process of a Data Warehouse is quite difficult.
  Data Mart: The designing process of a Data Mart is easy.

Model
  Data Warehouse: May or may not use a dimensional model. However, it can feed dimensional
  models.
  Data Mart: It is built focused on a dimensional model using a star schema.

Data Handling
  Data Warehouse: Data warehousing covers a large area of the corporation, which is why it takes
  a long time to process.
  Data Mart: Data marts are easy to use, design and implement as they handle only small amounts
  of data.

Focus
  Data Warehouse: Data warehousing is broadly focused on all the departments. It is possible
  that it can even represent the entire company.
  Data Mart: A Data Mart is subject-oriented, and it is used at a department level.

Data type
  Data Warehouse: The data stored inside the Data Warehouse is always detailed when compared
  with a data mart.
  Data Mart: Data Marts are built for particular user groups; therefore, the data is short and
  limited.

Subject-area
  Data Warehouse: The main objective of a Data Warehouse is to provide an integrated environment
  and a coherent picture of the business at a point in time.
  Data Mart: Mostly holds only one subject area, for example, sales figures.

Data storing
  Data Warehouse: Designed to store enterprise-wide decision data, not just marketing data.
  Data Mart: Dimensional modelling and star schema design are employed to optimize the
  performance of the access layer.

Design
  Data Warehouse: Time variance and non-volatile design are strictly enforced.
  Data Mart: Mostly includes consolidation data structures to meet the subject area's query and
  reporting needs.

Data value
  Data Warehouse: Read-only from the end-users' standpoint.
  Data Mart: Transaction data, regardless of grain, fed directly from the Data Warehouse.

Scope
  Data Warehouse: Data warehousing is more helpful as it can bring in information from any
  department.
  Data Mart: A data mart contains data of a specific department of a company. There may be
  separate data marts for sales, finance, marketing, etc., and each has limited usage.

Source
  Data Warehouse: In a Data Warehouse, data comes from many sources.
  Data Mart: In a Data Mart, data comes from very few sources.

Size
  Data Warehouse: The size of a Data Warehouse may range from 100 GB to 1 TB+.
  Data Mart: The size of a Data Mart is less than 100 GB.

Implementation time
  Data Warehouse: The implementation process of a Data Warehouse can extend from months to
  years.
  Data Mart: The implementation process of a Data Mart is restricted to a few months.
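As an illustration of the "subset" relationship described above, the short Python sketch below
derives a department-level sales data mart from a wider warehouse table. The table and column
names (department, region, sales_amount) are assumptions made for the example, not part of the
syllabus.

# Illustrative sketch: deriving a department-level data mart from a warehouse table.
import pandas as pd

# A tiny stand-in for an enterprise-wide warehouse table
warehouse = pd.DataFrame({
    "department":   ["Sales", "Sales", "Finance", "Marketing"],
    "region":       ["North", "South", "North",   "South"],
    "sales_amount": [1200,     800,     0,          450],
})

# A "sales data mart" keeps only the rows and columns the sales group needs
sales_mart = warehouse.loc[warehouse["department"] == "Sales",
                           ["region", "sales_amount"]].copy()

print(sales_mart)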

 Need For Data Warehouse


1. Improving integration

An organization registers data in various systems which support the various


business processes. In order to create an overall picture of business operations,
customers, and suppliers – thus creating a single version of the truth – the data
must come together in one place and be made compatible. Both external (from the
environment) and internal data (from ERP and financial systems) should merge
into the data warehouse and then be grouped.

2. Speeding up response times

The source systems are fully optimized in order to process many small
transactions, such as orders, in a short time. Generating information about the
performance of the organization only requires a few large ‘transactions’ in which
large volumes of data are gathered and aggregated. The structure of a data
warehouse is specifically designed to quickly analyze such large volumes of (big)
data.

3. Faster and more flexible reporting

The structure of both data warehouses and data marts enables end users to report
in a flexible manner and to quickly perform interactive analysis based on various
predefined angles (dimensions). They may, for example, with a single mouse click
jump from year level, to quarter, to month level, and quickly switch between the
customer dimension and the product dimension, all while the indicator remains
fixed. In this way, end users can actually mix the data and thus quickly gain
knowledge about business operations and performance indicators.

4. Recording changes to build history



Source systems don’t usually keep a history of certain data. For example, if a
customer relocates or a product moves to a different product group, the (old)
values will most likely be overwritten. This means they disappear from the system
– or at least they’re very difficult to trace back.

That’s really bad, because in order to generate reliable information, we actually


need these old values, as users sometimes want to be able to look back in time. In
other words: we want to be able to look at the organization’s performance from a
historical perspective – in accordance with the organizational structure and
product classifications of that time – instead of in the current context. A data
warehouse ensures that data changes in the source system are recorded, which
enables historical analysis.
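The sketch below illustrates one common way a warehouse can record such changes instead of
overwriting them (similar in spirit to a "type 2" slowly changing dimension). The table,
column names and dates are assumptions used only for illustration.

# Hypothetical sketch of recording changes to build history (old values are kept).
import pandas as pd

customer_dim = pd.DataFrame([
    {"customer_id": 1, "city": "Ludhiana", "valid_from": "2020-01-01",
     "valid_to": None, "current": True},
])

def record_relocation(dim, customer_id, new_city, change_date):
    """Close the old row and append a new one, so old values are never lost."""
    mask = (dim["customer_id"] == customer_id) & (dim["current"])
    dim.loc[mask, ["valid_to", "current"]] = [change_date, False]
    new_row = {"customer_id": customer_id, "city": new_city,
               "valid_from": change_date, "valid_to": None, "current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

customer_dim = record_relocation(customer_dim, 1, "Jalandhar", "2023-06-15")
print(customer_dim)   # both the old and the new city are kept, with their dates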

5. Increasing data quality

Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality.
When we use a data warehouse, we can greatly improve the data quality, either
through – where possible – correcting the data while loading or by tackling the
problem at its source.

6. Unburdening the IT department

A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure
for that data warehouse and those BI tools. The following principle applies: the
better the architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support,
where necessary. Yet, what we see in practice is that many of the more complex
reports end up being created by the IT department. This is mostly due to users
lacking either the time or the knowledge.

7. Increasing findability

When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and
thus establish information and knowledge faster. All the goals of the data

warehouse serve the aims of Business Intelligence: making better decisions faster
at all levels within the organization and even across organizational boundaries.

 Data Warehouse Features


The key features of a data warehouse are discussed below −

Subject Oriented − A data warehouse is subject oriented because it provides


information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it focuses on
modelling and analysis of data for decision making.

Integrated − A data warehouse is constructed by integrating data from


heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.

Time Variant − The data collected in a data warehouse is identified with a


particular time period. The data in a data warehouse provides information from the
historical point of view.

Non-volatile − Non-volatile means the previous data is not erased when new data is added to
it. A data warehouse is kept separate from the operational database, and therefore frequent
changes in the operational database are not reflected in the data warehouse.
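A minimal Python sketch of the time-variant and non-volatile ideas: loads only append
snapshots tagged with a date, and existing rows are never updated or deleted. The table and
column names are illustrative assumptions.

# Illustrative append-only (non-volatile), time-stamped (time-variant) fact table.
import pandas as pd

fact_sales = pd.DataFrame([{"load_date": "2024-01-01", "product": "Pen", "units_sold": 100}])

def load_snapshot(fact, load_date, rows):
    """Append a new day's snapshot; never modify what is already stored."""
    snapshot = pd.DataFrame(rows)
    snapshot["load_date"] = load_date
    return pd.concat([fact, snapshot], ignore_index=True)

fact_sales = load_snapshot(fact_sales, "2024-01-02",
                           [{"product": "Pen", "units_sold": 120}])

# Historical analysis: compare the two points in time
print(fact_sales.groupby("load_date")["units_sold"].sum())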

 Data Warehouse Applications


Data warehouse helps business executives to organize, analyze, and use their data
for decision making. Data warehouses are widely used in the following fields −

 Financial services

 Banking services

 Consumer goods

 Retail sectors

 Controlled manufacturing

 Characteristics and Functions of Data warehouse

A data warehouse is most useful when its users have a shared way of describing and analyzing
the trends it captures around specific subjects. The major characteristics of a data warehouse
are:

Subject-oriented –
A data warehouse is always subject-oriented, as it delivers information about a theme rather
than the organization's ongoing operations. The data warehousing process is designed to handle
a specific, well-defined theme, such as sales, distribution, or marketing.

A data warehouse never puts emphasis only on current operations. Instead, it focuses on
modelling and analysis of data for decision making. It also delivers an easy and precise view
of a particular theme by eliminating data that is not required to make the decisions.

Integrated –
Integration is closely related to subject orientation. It means establishing a shared,
reliable format for all similar data coming from different databases, so that the data
residing in the warehouse is stored in a shared and generally accepted manner.

A data warehouse is built by integrating data from various sources, such as a mainframe and a
relational database. It must have reliable naming conventions, formats and codes. Integration
of the data warehouse helps in effective analysis of data. Consistency in naming conventions,
attribute measures, encoding structures etc. should be ensured.

Time-Variant –
Data in the warehouse is maintained over different intervals of time, such as weekly, monthly,
or annually. The time horizon of a data warehouse is much wider than that of operational
(OLTP) systems. The data residing in the data warehouse is associated with a specific interval
of time and delivers information from a historical perspective; it contains an element of
time, explicitly or implicitly. Another feature of time-variance is that once data is stored
in the data warehouse, it cannot be modified, altered, or updated.

Non-Volatile –
As the name suggests, the data residing in the data warehouse is permanent: data is not erased
or deleted when new data is inserted. The warehouse accumulates a mammoth quantity of data
over time, which supports analysis of the business.

Data in the warehouse is read-only and refreshed at particular intervals. This is beneficial
in analyzing historical data and in understanding what has happened. It does not need
transaction processing, recovery, or concurrency control mechanisms. Operations such as
delete, update, and insert that are done in an operational application are absent in the data
warehouse environment. The two types of data operations done in the data warehouse are:

 Data Loading

 Data Access

 Data warehouse role and structure


The main purpose of the Data Warehouse is to integrate corporate data across an organization.
It contains the "single version of the truth" for the organization, carefully constructed from
data stored in disparate internal and external operational databases. Data is stored at a very
granular level of detail.

The data warehouse is composed of data structures populated by data extracted


from the OLTP database and transformed to fit a flatter schema. Ultimately
the warehouse structures are exposed as star schemas through views of fact and
dimension tables.
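The following sketch shows, under assumed table and column names, how fact and dimension
tables form a star schema and how a typical star-schema query joins and aggregates them. It
uses an in-memory SQLite database purely for illustration.

# Illustrative star schema: one fact table joined to two dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, units INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery');
INSERT INTO dim_date    VALUES (1, '2024-01-01', 'Jan'), (2, '2024-02-01', 'Feb');
INSERT INTO fact_sales  VALUES (1, 1, 100, 500.0), (2, 1, 40, 800.0), (1, 2, 70, 350.0);
""")

# A star-schema query: join the fact table to its dimensions and aggregate
rows = con.execute("""
    SELECT d.month, p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key    = f.date_key
    GROUP BY d.month, p.category
""").fetchall()
print(rows)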

 Data Warehouse Cost



Budgeting – Without cost justification, projects will always be in jeopardy.


During future budget cycles, management will be looking for ways to reduce cost
and if there is no documented reason for completing the project, they are likely to
forget the flush of excitement that accompanied the project’s initiation.

Staffing – Without cost justification, staffing with the right people may be
difficult. By having some real dollar numbers to back up a request, the request is
more likely to be satisfied.

Prioritization – Without a cost/benefit analysis, project prioritization is difficult


and management has little to compare projects other than a gut-feel that one is
more important than another. Without cost/benefit analysis, the line-of-business
manager with the most power is likely to get her project approved. The project
that is most important to the enterprise may never be implemented.

Controlling Costs
Costs can and must be controlled. It is the project manager who has the
responsibility for controlling costs along with the other responsibilities. Adhering
to the Project Agreement is a major start for controlling costs. The Project
Agreement specifies the data that will be in the data warehouse, the periods for
which the data is kept, the number of users and predefined queries and reports.
Any one of these factors, if not held in check, will increase the cost and possibly
the schedule of the project. A primary role of the project manager will be to
control scope creep.

Additional Support

User Support staff or the Help Desk staff will be the users’ primary contact when
there are problems. Providing adequate User Support will require more people,
and more training of those people, to answer questions and help the users through
difficult situations. The cost for the additional people, the training and possibly an
upgrade in the number and knowledge-level of the staff answering the phones
must be added into the data warehouse costs.

Consultants and Contractors

Consultant and contractor expenses can balloon a project’s cost. Consultants are
used to supplement the lack of experience of the project team, contractors are used
to supplement the lack of skilled personnel. There are two types of
consultant/contractors:

Product specific contractors – These persons are brought in because they know
the product. They can either help or actually install the product, and they can tune
the product. They will customize the product, if it is necessary. The product-
specific consultants may either be in the employ of the tool vendor or may be
independent. An example of their services would be installing and using an ETL
tool to extract, transform and load data from your source files to the data
warehouse. In this activity they may be generating the ETL code on their own or
working with your people in this endeavor.

General data warehouse consultants – These consultants may have a specific


niche such as data modeling, performance, data mining, tool selection,
requirements gathering or project planning. They will typically be involved for a
shorter period of time than the product-specific consultant/contractor. They have
two roles that are equally important. The first is working with your people to
complete a task such as selecting a query tool or developing a project plan. The
second is the knowledge transfer to your staff so they can perform the activity the
next time on their own. Just as in the case of the product-specific
consultant/contractor, your goal is to make your staff as self-sufficient as soon as
possible.

Products

The software products that support the data warehouse can be very expensive. The
first thing to consider is which categories of tools you need. Do not bring in more
categories of products than you need. Do not try to accomplish everything with
your first implementation. Be very selective.

Hopefully, you have someone in your organization experienced in dealing with


vendors and understanding their contracts. You will be working closely with this
person. They will know the things to watch out for in a contract, but you will need
to give them some help to acquaint them with data warehousing. You will also
have to give them some warning if you heard anything negative about the vendor.
Your contract people will know how to include protection in the contract to keep
the vendor from arbitrarily raising their prices.

Existing tools

Your organization most likely already has an RDBMS. Should you have to pay for
it as part of your data warehouse project? If there is a site license, there may be no
charge to your department or you may have to pay a portion of the site license.

You may have to pay if the data warehouse will be on another CPU, and if the
RDBMS is charged by CPU. You may have to pay an upgrade if the data
warehouse requires going to a larger CPU, and if there is an additional cost for the
larger CPU.

Capacity planning

Capacity planning for a data warehouse is extremely difficult because:

The actual amount of data that will be in the warehouse is very difficult to
anticipate.

The number of users will also be difficult to estimate.

The number of queries each user will run is difficult to anticipate.

The time of day and the day in the week when the queries will be run is difficult to
guess (we know there will not be an even distribution, expecting more activity at
month-end, etc.).

The nature of the queries, the number of I/Os, the internal processing is almost
impossible to estimate.

Hardware Costs

For the data warehouse, you will need CPUs, disks, networks and desktop
workstations. The hardware vendors can help size the machines and disks. Be
aware that unanticipated growth of the data, increased number of users and
increased usage will explode the hardware costs. Existing desktop workstations
may not be able to support the query tool. Do not ask the query tool vendor for the
minimum desktop configuration. Ask for the recommended configuration. Call
references to find out if and how they had to upgrade their desktop workstations.

Raw Data Multiplier

There are many debates over how much disk is needed as a multiplier of the raw
data. Besides the raw data itself, space is needed for indexes, summary tables and
working space. Additional space may be needed for replicated data that may be
required for both performance and security reasons. The actual space is very
dependent on how much is indexed and how many summary tables are needed.
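A rough, illustrative sizing calculation based on the raw data multiplier idea above; the
multiplier value and figures used here are assumptions for the example, not recommendations.

# Illustrative disk sizing using a raw-data multiplier (all values assumed).
raw_data_gb = 200                     # estimated raw data volume
multiplier = 3.0                      # allowance for indexes, summary tables, working space
replication_gb = 50                   # extra space for replicated data, if any

total_disk_gb = raw_data_gb * multiplier + replication_gb
print(f"Plan for roughly {total_disk_gb:.0f} GB of disk")   # 650 GB in this example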

Existing Hardware

How should you account for existing hardware that can be used for the data
warehouse? It may mean you do not have to buy any additional hardware. The
Y2K testing may have required hardware that is now redundant and unused.
Should that be included in our data warehouse cost? It is a safe assumption that
your organization will need additional hardware in the future. By using the
redundant hardware for the data warehouse, it means that additional hardware for
non-data warehouse purposes must be purchased sooner. You may be able to defer
the cost of the redundant hardware; you will eventually have to pay. At the time
the hardware is purchased, it will undoubtedly be less than today’s costs.

Controlling Hardware Costs

Your ability to control hardware costs will depend primarily on whether your
organization has a chargeback system. Even though department heads are
supposed to have the best interests of the organization at heart, what they care
most about is meeting their performance objectives. These, of course, include the
costs assigned to their department. If department heads are paying for what they
get, they will be more thoughtful about asking for resources that may not be cost
justified. We had an experience with a user asking to store ten years worth of
detailed data. When he was presented with the bill (an additional $1.5 million), he
decided that two years worth of data was adequate.

Internal People Costs

These people are getting paid anyway regardless of whether we use them on this
project or not. Why should we have to include their costs in our budget? We have
to assume these people would be working on other productive projects. Otherwise,
there is no reason for the organization to keep them employed. Count on having to
include the fully burdened costs of the people on your project. Keep in mind that
you are much better off with a small team of highly skilled and dedicated workers
than with a larger team of the type of people to avoid for your project.

User Training

User training is usually done on the premises and not at a vendor site. There are
four cost areas for user training that must be considered.

The cost to engage a trainer from the outside or the time it takes for your in-house
trainer to develop and teach the class.

The facilities including the desktop workstations for the workshop.

The time the users spend away from the job being in class, and the time it takes
them to become proficient with the tool.

If not all the users are in the same location, travel expenses for either the users or
the trainer must be included.

IT Training

Generic training may be appropriate. Examples are classes in logical data


modeling, data warehouse project management or star schema database designs.
Data warehouse conferences and seminars can provide an overall perspective as
well as training in specific areas. IT will need to attend training on the complex
tools and products. IT will also need enough time to work with the products to
become proficient. The cost of training is sometimes included in the price of the
tool.

On-Going Costs

Most organizations focus on the cost to implement the initial data warehouse
application and give little thought to on-going expense. Over a period of years, the
continuing cost will very likely exceed the cost of the initial application. The data
warehouse will grow in size, in the number of users and in the number of queries
and reports. The database will not remain static. New data will be added,
sometimes more than for the initial implementation and the design most probably
will change, and the database will need to be tuned. New software will be
introduced, new releases will be installed and some interfaces will have to be
rewritten. As the data warehouse grows, the hardware and network will have to be
upgraded.

 Operational & Informational Data Stores


Operational Data Store(ODS)

The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day to day operations of the business.
The data frequently changes as updates are made and reflect the current value of
the last transactions.

Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.

Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.

Operational data store benefits

 An ODS provides current, clean data from multiple sources in a single


place, and the benefits apply primarily to business operations.

 The ODS provides a consolidated repository into which previously isolated


or inefficiently communicating IT systems can feed.

 ODS reporting, which is focused on a snapshot of operational data, can be


more sophisticated than reports from individual underlying systems. The
ODS is architected to provide a consolidated view of data integrated from
multiple systems, so reports can provide a holistic perspective on
operational processes.

 The up-to-date view into operational status also makes it easier for users to
diagnose problems before digging into component systems. For example, an
ODS enables service representatives to immediately find a customer order,
its status, and any troubleshooting information that might be helpful.

 An ODS contains critical, time-sensitive business rules, such as those


automatically notifying a financial institution when a customer has
overdrawn an account. These rules, in aggregate, are a kind of process
automation that greatly improves efficiency, which would be impossible
without current and integrated operational data

Informational Data Store (IDS)

 There are some functions that go on within the enterprise that have to do
with planning, forecasting and managing the organization. These functions
are also critical to the survival of the organization, especially in our current
fast-paced world.

 Functions like "marketing planning", "engineering planning" and "financial
analysis" also require information systems to support them. But these
functions are different from operational ones, and the types of systems and
information required are also different.

 Informational systems have to do with analyzing data and making decisions,


often major decisions.

 Where operational data needs are normally focused upon a single area,
informational data needs often span a number of different areas and need
large amounts of related operational data.

Differences between OPERATIONAL & INFORMATIONAL Data Stores

Data Content
  Operational: Current values, day-to-day values
  Informational: Archived, derived, summarized, historical

Data Structure
  Operational: Optimized for transactions
  Informational: Optimized for complex queries

Access Frequency
  Operational: High
  Informational: Medium to low

Access Type
  Operational: Read, update, delete
  Informational: Read only

Queries
  Operational: Predictable, repetitive
  Informational: Ad hoc, random

Response Time
  Operational: Sub-seconds
  Informational: Several seconds to minutes

Kind of Users
  Operational: Clerks, DBAs, database professionals
  Informational: Knowledge workers, e.g. analysts, managers, executives

Number of Users
  Operational: Large number, thousands
  Informational: Relatively small number, hundreds

Usage
  Operational: Used to run the business
  Informational: Used to analyse the state of the business

Focus
  Operational: Focused on storing data
  Informational: Focused on outputting information

Models
  Operational: E-R Model
  Informational: Star Schema, Snowflake Schema, Fact Constellation

 Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases due to the following reasons −

An operational database is constructed for well-known tasks and workloads such as searching
particular records, indexing, etc. In contrast, data warehouse queries are often complex and
they present a general form of data.

Operational databases support concurrent processing of multiple transactions.


Concurrency control and recovery mechanisms are required for operational
databases to ensure robustness and consistency of the database.

An operational database query allows read and modify operations, while an OLAP query needs
only read-only access to stored data.

An operational database maintains current data. On the other hand, a data


warehouse maintains historical data.
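The contrast can be sketched with a small example: an OLTP-style statement reads and modifies
a single current record, while an OLAP-style query is read-only and presents a general,
summarized form of the data. The schema below is invented for illustration.

# Illustrative OLTP vs OLAP access patterns against an invented SQLite table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT,
                     order_date TEXT, amount REAL, status TEXT);
INSERT INTO orders VALUES
  (1, 'Asha', '2024-01-05', 250.0, 'OPEN'),
  (2, 'Ravi', '2024-01-06', 400.0, 'OPEN'),
  (3, 'Asha', '2024-02-10', 150.0, 'OPEN');
""")

# OLTP: locate one current record and modify it
con.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = 2")

# OLAP: read-only query presenting a general, summarized form of the data
summary = con.execute("""
    SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total
    FROM orders GROUP BY month
""").fetchall()
print(summary)   # e.g. [('2024-01', 650.0), ('2024-02', 150.0)]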

 What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that
enables analysts, managers, and executives to gain insight into data through fast, consistent,
interactive access to a wide variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of the enterprise as understood
by the user.

OLAP implements multidimensional analysis of business information and supports the capability
for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation
Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end users to perform
ad hoc analysis of data in multiple dimensions, providing the insight and understanding they
require for better decision making.

Who uses OLAP and Why?

OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:

 Budgeting

 Activity-based costing

 Financial performance analysis

 And financial modeling

Sales and Marketing

 Sales analysis and forecasting

 Market research analysis

 Promotion analysis

 Customer analysis

 Market and customer segmentation

Production

 Production planning

 Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a
data model more intuitive to them than a tabular model. This model is called a
Dimensional Model.

The second purpose is to enable fast query response that is usually difficult to
achieve using tabular models.

 Characteristics of OLAP

The FASMI characteristics of OLAP take their name from the first letters of the following
characteristics:

Fast

It means that the system is targeted to deliver most responses to the user within about five
seconds, with the simplest analyses taking no more than one second and very few taking more
than 20 seconds.

Analysis

It means that the system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target user.
Although some pre-programming may be needed, it is not acceptable if all application
definitions have to be programmed; the user must be able to define new ad hoc calculations as
part of the analysis and to report on the data in any desired way without having to write
code. Products (like Oracle Discoverer) that do not allow adequate end-user-oriented
calculation flexibility are therefore excluded.

Share

It means that the system implements all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate level. Not all
applications need users to write data back, but for the increasing number that do, the system
should be able to manage multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. OLAP system must provide a multidimensional


conceptual view of the data, including full support for hierarchies, as this is
certainly the most logical method to analyze business and organizations.

Information

The system should be able to hold all the data needed by the applications. Data
sparsity should be handled in an efficient manner.

The main characteristics of OLAP are as follows:

Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. This helps in carrying out slice and dice
operations.

Multi-User Support: Since OLAP systems are shared, the OLAP operation should provide normal
database operations, including retrieval, update, concurrency control, integrity, and
security.

Accessibility: OLAP acts as a mediator between data warehouses and front-end.


The OLAP operations should be sitting between data sources (e.g., data
warehouses) and an OLAP front-end.

Storing OLAP results: OLAP results are kept separate from data sources.

Uniform reporting performance: Increasing the number of dimensions or the database size should
not significantly degrade the reporting performance of the OLAP system.

OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.

OLAP system should ignore all missing values and compute correct aggregate
values.

OLAP facilitate interactive query and complex analysis for the users.

OLAP allows users to drill down for greater detail or roll up for aggregations of metrics
along a single business dimension or across multiple dimensions (see the sketch after this
list).

OLAP provides the ability to perform intricate calculations and comparisons.



OLAP presents results in a number of meaningful ways, including charts and


graphs.
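A small pandas sketch of the drill-down, roll-up and slice ideas listed above; the dataset and
column names are invented for illustration.

# Illustrative roll-up, drill-down and slice operations with pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "product": ["Pen", "Notebook", "Pen", "Pen"],
    "amount":  [100, 200, 150, 120],
})

# Roll-up: aggregate the measure to a coarser level (year only)
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: move to a finer level (year -> quarter -> product)
drilldown = sales.groupby(["year", "quarter", "product"])["amount"].sum()

# Slice: fix one dimension (year = 2023) and look at the resulting sub-cube
slice_2023 = sales[sales["year"] == 2023]

print(rollup, drilldown, slice_2023, sep="\n\n")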

Benefits of OLAP
OLAP holds several benefits for businesses:

 OLAP helps managers in decision-making through the multidimensional views of data
that it efficiently provides, thus increasing their productivity.

 OLAP applications are self-sufficient owing to the inherent flexibility they provide
to the organized databases.

 It facilitates simulation of business models and problems, through extensive


management of analysis-capabilities.

 In conjunction with a data warehouse, OLAP can be used to support a reduction in the
application backlog, faster data retrieval, and a reduction in query drag.

Types of OLAP
There are three main types of OLAP servers:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on


multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.

Relational OLAP (ROLAP) Server

These are intermediate servers which stand in between a relational back-end


server and user frontend tools.

They use a relational or extended-relational DBMS to save and handle warehouse


data, and OLAP middleware to provide missing pieces.

ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each method of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
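A hedged sketch of that point: a ROLAP front end can translate each slice or dice selection
into an extra predicate in the WHERE clause of the generated SQL. The table and column names
below are assumptions, not a real product's API.

# Illustrative translation of slice/dice selections into SQL WHERE predicates.
def build_rolap_query(measures, dimensions, filters):
    """Build a simple aggregate SQL statement from the user's selections."""
    select_cols = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
    where = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    group_by = ", ".join(dimensions)
    sql = f"SELECT {select_cols} FROM sales_fact"
    if where:
        sql += f" WHERE {where}"          # the slice/dice shows up here
    return sql + f" GROUP BY {group_by}"

# Dicing on region and year adds two WHERE predicates:
print(build_rolap_query(["amount"], ["product"],
                        {"region": "North", "year": "2024"}))
# SELECT product, SUM(amount) AS amount FROM sales_fact
#   WHERE region = 'North' AND year = '2024' GROUP BY product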

Relational OLAP Architecture


ROLAP Architecture includes the following components

 Database server.

 ROLAP server.

 Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding the need to structure the data around the desired view.

Some products in this segment have supported reliable SQL engines to help the
complexity of multidimensional analysis. This includes creating multiple SQL
statements to handle user requests, being 'RDBMS' aware and also being capable
of generating the SQL statements based on the optimizer of the DBMS engine.

Advantages

Can handle large amounts of information: The data size limitation of ROLAP technology depends
on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the data amount.

Disadvantages

 Performance can be slow: Each ROLAP report is essentially a SQL query (or
multiple SQL queries) against the relational database, so the query time can be
prolonged if the underlying data size is large.

 Limited by SQL functionalities: ROLAP technology relies upon developing SQL
statements to query the relational database, and SQL statements do not suit
all needs.

 Multidimensional OLAP (MOLAP) Server



A MOLAP system is based on a native logical model that directly supports


multidimensional data and operations. Data are stored physically into
multidimensional arrays, and positional techniques are used to access them.

One of the significant distinctions of MOLAP against a ROLAP is that data are
summarized and are stored in an optimized format in a multidimensional cube,
instead of in a relational database. In MOLAP model, data are structured into
proprietary formats by client's reporting requirements with the calculations pre-
generated on the cubes.
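A minimal sketch of the MOLAP idea: measures stored in a multidimensional array and accessed
positionally, with a roll-up computed by summing along an axis. The dimension members and
figures are invented.

# Illustrative MOLAP-style cube as a multidimensional array with positional access.
import numpy as np

products = ["Pen", "Notebook"]          # axis 0
regions  = ["North", "South"]           # axis 1
months   = ["Jan", "Feb", "Mar"]        # axis 2

# A 2 x 2 x 3 cube of pre-aggregated sales amounts
cube = np.array([[[10, 12, 9],  [7,  8, 11]],
                 [[20, 18, 25], [15, 14, 16]]])

# Positional access: sales of "Notebook" in "South" during "Feb"
cell = cube[products.index("Notebook"), regions.index("South"), months.index("Feb")]

# Roll-up over the month dimension (sum along axis 2)
by_product_region = cube.sum(axis=2)

print(cell)               # 14
print(by_product_region)  # 2 x 2 totals per product and region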

MOLAP Architecture
MOLAP Architecture includes the following components

 Database server.

 MOLAP server.

 Front-end tool.

MOLAP structure primarily reads the precompiled data. MOLAP structure has
limited capabilities to dynamically create aggregations or to evaluate results which
have not been pre-calculated and stored.

Applications requiring iterative and comprehensive time-series analysis of trends


are well suited for MOLAP technology (e.g., financial analysis and budgeting).

Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sniper's TM/1, Planning Science's Gentium and Kenan Technology's Multiway.

Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors can solve these problems by providing access from MOLAP tools
to detailed data in an RDBMS.

Advantages
 Excellent Performance: A MOLAP cube is built for fast information
retrieval, and is optimal for slicing and dicing operations.

 Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only possible, but they
return quickly.

Disadvantages
 Limited in the amount of information it can handle: Because all
calculations are performed when the cube is built, it is not possible to
contain a large amount of data in the cube itself.

 Requires additional investment: Cube technology is generally proprietary


and does not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are other investments in human and capital resources
are needed.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems save more substantial quantities of detailed data in
the relational tables while the aggregations are stored in the pre-calculated cubes.
HOLAP can also drill through from the cube down to the relational tables for detailed data.
The Microsoft SQL Server 2000 provides a hybrid OLAP server.

Advantages of HOLAP
 HOLAP provide benefits of both MOLAP and ROLAP.

 It provides fast access at all levels of aggregation.

 HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server and the detail record remains in the
relational database. So no duplicate copy of the detail record is maintained.

Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP and
ROLAP servers.

Other Types
There are also less popular types of OLAP upon which one could stumble every so often. Some of
the less common styles existing in the OLAP industry are listed below.

Web-Enabled OLAP (WOLAP) Server

WOLAP pertains to an OLAP application which is accessible via a web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered
architecture consisting of three components: a client, a middleware, and a database server.

Desktop OLAP (DOLAP) Server



DOLAP permits a user to download a section of the data from the database or
source, and work with that dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server

Mobile OLAP enables users to access and work on OLAP data and applications
remotely through the use of their mobile devices.

Spatial OLAP (SOLAP) Server

SOLAP includes the capabilities of both Geographic Information Systems (GIS)


and OLAP into a single user interface. It facilitates the management of both spatial
and non-spatial data.

Difference between ROLAP, MOLAP, and HOLAP

Full form
  ROLAP: ROLAP stands for Relational Online Analytical Processing.
  MOLAP: MOLAP stands for Multidimensional Online Analytical Processing.
  HOLAP: HOLAP stands for Hybrid Online Analytical Processing.

Storage of aggregations
  ROLAP: The ROLAP storage mode causes the aggregations of the partition to be stored in
  indexed views in the relational database that was specified in the partition's data source.
  MOLAP: The MOLAP storage mode causes the aggregations of the partition and a copy of its
  source data to be stored in a multidimensional structure in Analysis Services when the
  partition is processed.
  HOLAP: The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP
  causes the aggregations of the partition to be stored in a multidimensional structure in an
  SQL Server Analysis Services instance.

Copy of source data
  ROLAP: ROLAP does not cause a copy of the source data to be stored in the Analysis Services
  data folders. Instead, when the result cannot be derived from the query cache, the indexed
  views in the data source are accessed to answer queries.
  MOLAP: The MOLAP structure is highly optimized to maximize query performance. The storage can
  be on the computer where the partition is defined or on another computer running Analysis
  Services. Because a copy of the source data resides in the multidimensional structure,
  queries can be resolved without accessing the partition's source data.
  HOLAP: HOLAP does not cause a copy of the source data to be stored. For queries that access
  only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Query performance
  ROLAP: Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP
  storage modes. Processing time is also frequently slower with ROLAP.
  MOLAP: Query response times can be reduced substantially by using aggregations. The data in
  the partition's MOLAP structure is only as current as the most recent processing of the
  partition.
  HOLAP: Queries that access source data (for example, drilling down to an atomic cube cell for
  which there is no aggregation data) must retrieve data from the relational database and will
  not be as fast as they would be if the source data were stored in the MOLAP structure.

 OLTP (On-Line Transaction Processing)

OLTP (On-Line Transaction Processing) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP systems is on very
fast query processing, maintaining data integrity in multi-access environments, and
effectiveness measured by the number of transactions per second. An OLTP database contains
detailed and current data, and the schema used to store the transactional database is the
entity model (usually 3NF).

Advantages of an OLTP System:

 OLTP Systems are user friendly and can be used by anyone having basic
understanding

 It allows its user to perform operations like read, write and delete data
quickly.

 It responds to its user actions immediately as it can process query very


quickly.

 These systems are the original source of the data.

 It helps to administrate and run fundamental business tasks

 It helps in widening the customer base of an organization by simplifying
individual processes

Characteristics of OLTP
Following are important characteristics of OLTP:

 OLTP uses transactions that include small amounts of data.

 Indexed data in the database can be accessed easily.

 OLTP has a large number of users.

 It has fast response times

 Databases are directly accessible to end-users

 OLTP uses a fully normalized schema for database consistency.

 The response time of OLTP system is short.

 It strictly performs only the predefined operations on a small number of


records.

 OLTP stores the records of the last few days or a week.

 It supports complex data models and tables.
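As a small illustration of the short-transaction behaviour listed above, the sketch below runs
one OLTP-style transaction against an invented SQLite account table; either both updates
commit or neither does.

# Illustrative short OLTP transaction (commit or roll back as a unit).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 300.0)])

try:
    with con:   # one short OLTP transaction: both updates commit or neither does
        con.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
        con.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")
except sqlite3.Error:
    pass        # on error the 'with' block rolls back, preserving integrity

print(con.execute("SELECT * FROM account").fetchall())   # [(1, 400.0), (2, 400.0)]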


Data Warehouse (OLAP) vs Operational Database (OLTP)

1. OLAP: It involves historical processing of information.
   OLTP: It involves day-to-day processing.

2. OLAP: OLAP systems are used by knowledge workers such as executives, managers, and analysts.
   OLTP: OLTP systems are used by clerks, DBAs, or database professionals.

3. OLAP: It is used to analyze the business.
   OLTP: It is used to run the business.

4. OLAP: It focuses on Information out.
   OLTP: It focuses on Data in.

5. OLAP: It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema.
   OLTP: It is based on the Entity Relationship Model.

6. OLAP: It is subject oriented.
   OLTP: It is application oriented.

7. OLAP: It contains historical data.
   OLTP: It contains current data.

8. OLAP: It provides summarized and consolidated data.
   OLTP: It provides primitive and highly detailed data.

9. OLAP: It provides a summarized and multidimensional view of data.
   OLTP: It provides a detailed and flat relational view of data.

10. OLAP: The number of users is in hundreds.
    OLTP: The number of users is in thousands.

11. OLAP: The number of records accessed is in millions.
    OLTP: The number of records accessed is in tens.

12. OLAP: The database size is from 100 GB to 100 TB.
    OLTP: The database size is from 100 MB to 100 GB.

13. OLAP: These are highly flexible.
    OLTP: It provides high performance.

 Difference between OLAP & OLTP

Basic: OLTP is an online transactional system and manages database modification; OLAP is an online data retrieving and data analysis system.

Focus: OLTP inserts, updates and deletes information from the database; OLAP extracts data for analysis that helps in decision making.

Data: OLTP and its transactions are the original source of data; different OLTP databases become the source of data for OLAP.

Transaction: OLTP has short transactions; OLAP has long transactions.

Time: The processing time of a transaction is comparatively less in OLTP and comparatively more in OLAP.

Queries: OLTP uses simpler queries; OLAP uses complex queries.

Normalization: Tables in an OLTP database are normalized (3NF); tables in an OLAP database are not normalized.

Integrity: An OLTP database must maintain data integrity constraints; an OLAP database is not frequently modified, so data integrity is not affected.



UNIT- II


 Building a Data Warehouse

A data warehouse is a heterogeneous collection of different data sources organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing it, because during the design phase there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:
 Subject oriented
 Integrated
 Time variant
 Non-volatile
Building a Data Warehouse –
Some steps that are needed for building any data warehouse are as follows:

 To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft has come up with an excellent tool, available free of cost when you purchase Microsoft SQL Server.

 To transform the transactional data:
There are various DBMSs in which companies store their data, such as MS Access, MS SQL Server, Oracle and Sybase. Companies also save data in spreadsheets, flat files, mail systems, etc. Relating data from all these sources is done while building a data warehouse.

 To load the (transformed) data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns. Transformation of the data can be performed at two stages: while loading the data into the dimensional model, or while extracting the data from its origins.

 To purchase a front-end reporting tool:
Top-notch analytical tools are available in the market from several major vendors. Microsoft has also released a cost-effective tool of its own, Data Analyzer.


For the warehouse there is an acquisition of the data. Multiple and heterogeneous sources (for example, databases) must be used for data extraction. Data must be made consistent within the warehouse: names, meanings and domains of data from unrelated sources must be reconciled, and data from the various sources must be installed in the data model of the warehouse. Further steps are (a minimal ETL sketch in Python follows this list):
 To provide the time-variant data
 To store the data as per the data model of the warehouse
 Purging the data
 To support the updating of the warehouse data
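
A minimal extract-transform-load sketch of these steps in Python is given below. It is only an illustration: the source file sales_src.csv, the database file warehouse.db, the table fact_sales and the column names are all assumed for the example and do not refer to any particular product.

    # Minimal ETL sketch; file, table and column names are assumed for illustration only.
    import csv
    import sqlite3

    def extract(path):
        # Extract: read transactional records from a flat-file source.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: reconcile formats and domains (trim/uppercase city, cast amount).
        return [(r["sale_date"], r["city"].strip().upper(), float(r["amount"]))
                for r in rows]

    def load(rows, db="warehouse.db"):
        # Load: append the cleaned rows into a simple fact table.
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS fact_sales"
                    "(sale_date TEXT, city TEXT, amount REAL)")
        con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("sales_src.csv")))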

 Design considerations

Several structural design considerations should be taken into account for economical and efficient welding. Many of these apply to other joining methods, and all apply to both subassemblies and the complete structure.

Recognize and analyze the design problem: Designs must perform well under expected and worst-case conditions. The designer should consider this before sitting down at the drawing board or CAD terminal. Considerations include: Is it more economical to build an irregular shape from welded pieces or to cut it from a plate, with the accompanying waste? Can bending replace a welded joint? Are preformed sections available? How, when, and how much should the structure be welded? Can weight be reduced cost-effectively by using welded joints? Will fewer parts offer equal or better performance?

Determine load conditions: Structures will be subject to tension, compression, torsion, and bending. These loads must be calculated under service conditions. Locations of critical loads must be determined and the structure designed to handle the loads efficiently. Careful designers will locate joints away from high-stress areas when possible.

Consider producibility: The most elegant design is useless if it cannot be made efficiently. Welders cannot always fabricate what designers think up. Designers should spend time in the shop and consult foremen or manufacturing engineers during design to become familiar with the challenges of translating drawings into products.

Optimize layout: When drawing the preliminary design, engineers should plan the layout to reduce waste when the pieces are cut from plate. Preformed beams, channels, and tubes also may reduce costs without sacrificing quality.

Anticipate plate preparation: Many designers assume that metals are homogeneous, but real-world metal does not have equal properties in all directions. Therefore, the type of plates used should be considered.

Consider using standard sections and forms: Preformed sections and forms should be used whenever possible. Specifying standard sections for welding is usually cheaper than welding many individual parts. In particular, specifying bent components is preferable to making welded corners.

Select weld-joint design: There are five basic types of joints: butt joints, corner joints, T-joints, lap joints, and edge joints. In addition, the American Welding Society recognizes about 80 different types of welding and joining processes. Each process has its own characteristics and capabilities, so joint design must be suitable for the desired welding process. In addition, the joint design will affect access to the weld.

Restrain size and number of welds: Welds should match, not exceed, the strength of the base metal for full joint efficiency. Overwelding is unnecessary, increases costs, and reduces strength.

Welding sometimes induces distortions and residual stresses in structures. It is best to specify the minimum amount of welding needed. To check for overwelding, determine joint stresses versus stress in the adjoining members.

 Implementation Considerations

i. Access Tools

Currently no single tool in the market can handle all possible data warehouse access needs. Therefore, most implementations rely on a suite of tools.

ii. Data Placement Strategies

As the data warehouse grows, there are at least two options for data placement. One is to put some of the data in the data warehouse onto another storage medium (WORM, RAID). The second option is to distribute the data in the data warehouse across multiple servers.

iii. Data Extraction, Cleanup, Transformation, and Migration

As components of the data warehouse architecture, proper attention must be given to data extraction, which represents a critical success factor for a data warehouse architecture.

1. The ability to identify data in the data source environments that can be read by the conversion tool is important.
2. Support for flat files (VSAM, ISAM, IDMS) is critical, since the bulk of corporate data is still maintained in this type of data storage.
3. The capability to merge data from multiple data stores is required in many installations.
4. The specification interface to indicate the data to be extracted and the conversion criteria is important.
5. The ability to read information from data dictionaries or import information from a repository product is desired.

iv. Metadata

A frequently occurring problem in data warehousing is the problem of communicating to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Metadata needs to be collected as the warehouse is designed and built.

 Data Warehouse - Technical Considerations

The following technical issues are required to be considered for designing and implementing a data warehouse.

1. Hardware platform for the data warehouse
2. DBMS for supporting the data warehouse
3. Communication and network infrastructure for the data warehouse
4. Software tools for building, operating and using the data warehouse

1. Hardware platforms for data warehouse:- Organizations normally like to utilize already existing platforms for data warehouse development. However, the disk storage requirements for a data warehouse will be significantly large, especially in comparison with a single application.

Thus, hardware with large data storage capacity is essential for data warehousing. For every data size identified, the disk space provided should be two to three times that of the data, to accommodate processing, indexing etc.

2. DBMS for supporting data warehouse:- After hardware selection, the most important factor is the DBMS selection. This determines the speed and performance of the data warehousing environment. The requirements of a DBMS for a data warehousing environment are scalability, performance in high-volume storage and processing, and throughput in traffic. All the well-known RDBMS vendors, such as IBM, Oracle and Sybase, support parallel database processing, and some of them have improved their architectures so as to better suit the specialized requirements of a data warehouse.

3. Communication and network infrastructure for a data warehouse:- A data warehouse can be internet (web) enabled or intranet enabled, as the choice may be. If web enabled, the networking is taken care of by the internet. If only intranet based, then the appropriate LAN operational environment should be provided so as to be accessible to all the identified users.

4. Software tools for building, operating, and using data warehouse:- No data warehouse vendor currently provides a comprehensive single-window software tool capable of handling all aspects of a data warehousing project implementation.

The types of access and reporting are as follows:

 Statistical analysis and forecasting
 Data visualization, graphing, and charting
 Complex textual search
 Ad hoc user-specific queries
 Predefined repeatable queries
 Reporting and analysis by drilling down


 Data Pre-processing
Today's real-world databases are highly subject to noisy, missing and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Some of the major reasons for noisy data are:

 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data entry.
 Errors in data transmission can also occur.
 There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.

Objectives of Data Preprocessing

 Size reduction of the input space:- Reducing the number of input variables or the size of the input space is a common goal of preprocessing. The objective is to get a reasonable overview of the data set without losing the most important relationships in the data. If the input space is large, one may identify the most important input variables and eliminate the unimportant ones, or combine several variables into a single variable.
 Smoother relationship:- Another commonly used type of preprocessing is problem transformation. The original problem is transformed into a simpler problem.
 Data normalization:- For many practical problems, the units used to measure each of the input variables can make the range of some values much larger than others. This results in unnecessarily complex relationships by making the nature of the mapping along some dimensions much different from others.
 Noise reduction:- A sequence of data may involve useful data, noisy data, and inconsistent data. Preprocessing may reduce the noisy and inconsistent data. Data corrupted with noise can be recovered with preprocessing techniques.

 Feature extraction:- If the key attributes or features characterizing the data can be extracted, the problem encountered can be solved more easily.

 Data Summarization

Data summarization is, simply put, the presentation of a short conclusion drawn from a large body of data: after the processing is done, the final result is declared in the form of summarized data. Data summarization has great importance in data mining, as programmers and developers increasingly work on big data. Earlier it was difficult to present results, but now there are many relevant tools in the market that can be used in programming, or wherever needed, on the data.

Why Data Summarization?

We need summarization in the mining process because we live in a digital world where data is transferred in seconds, much faster than human capability. In the corporate field, employees work on huge volumes of data derived from different sources such as social networks, media, newspapers, books, cloud storage, etc. This can make it difficult to summarize the data. Sometimes the data volume is unexpected, because when data is retrieved from relational sources it is hard to predict how much data will be stored in the database.
As a result, data becomes more complex and it takes time to summarize the information. The solution is to always retrieve data by category, i.e., to filter the data while retrieving it. The data summarization technique gives good-quality summaries of the data, and a customer or user can benefit from them in their research.

 Data Preprocessing Techniques

Data Cleaning

Data cleaning is the process of cleaning raw data by handling irrelevant and missing tuples. While working on machine learning projects, the data sets we take might not be perfect: they may have many impurities and noisy values, and often actual data is missing. The major problems we face during data cleaning are:

1. Missing Values: If it is noted that there are many tuples that have no recorded value for several attributes, then the missing values can be filled in for the attribute by various methods described below:

   1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
   2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
   3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
   4. Use the attribute mean to fill in the missing value.
   5. Use the attribute mean for all samples belonging to the same class as the given tuple.
   6. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction.

Methods 3 to 6 bias the data. The filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values.
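
A minimal sketch of methods 4 and 5 above, assuming the pandas library; the 'class' and 'income' columns and their values are invented for illustration.

    # Mean-based imputation (methods 4 and 5 above); column names and values are invented.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [30.0, np.nan, 50.0, np.nan, 70.0],
    })

    # Method 4: fill missing values with the overall attribute mean (50.0 here).
    overall = df["income"].fillna(df["income"].mean())

    # Method 5: fill with the mean of samples belonging to the same class.
    by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

    print(overall.tolist())   # [30.0, 50.0, 50.0, 50.0, 70.0]
    print(by_class.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]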

2. Noisy Data: Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can the data be "smoothed" to remove the noise? The following data smoothing techniques describe this (a small binning sketch is given at the end of this data cleaning discussion):

   1. Binning methods: Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.


   2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups or "clusters".

   3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones. This is much faster than having to manually search through the entire database. The garbage patterns can then be removed from the (training) database.

   4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.

3. Inconsistent data: There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect the violation of known data constraints. For example, known functional dependencies between attributes can be used to find values contradicting the functional constraints.
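
The binning technique referred to above can be sketched as follows: the values are sorted, partitioned into equal-frequency bins, and each value is replaced by its bin mean (smoothing by bin means). The price values below are illustrative only.

    # Equal-frequency binning with smoothing by bin means (illustrative values).
    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bin_size = 3

    smoothed = []
    for i in range(0, len(prices), bin_size):
        bin_vals = prices[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)          # the bin's mean value
        smoothed.extend([round(mean, 1)] * len(bin_vals))

    print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]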

 Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:


1. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0 (a small sketch is given after this list).
2. Smoothing works to remove the noise from data. Such techniques include
binning, clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple granularities.
4. Generalization of the data, where low level or 'primitive' (raw) data are
replaced by higher level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher level
concepts, like city or county. Similarly, values for numeric attributes, like age,
may be mapped to higher level concepts, like young, middle-aged, and senior.
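
For example, min-max normalization rescales a value v of an attribute into the range [0, 1] as v' = (v - min) / (max - min). A small sketch with invented values:

    # Min-max normalization of an attribute to the range [0, 1] (invented values).
    values = [12000, 35000, 58000, 73600, 98000]
    lo, hi = min(values), max(values)

    normalized = [(v - lo) / (hi - lo) for v in values]
    print([round(v, 3) for v in normalized])   # [0.0, 0.267, 0.535, 0.716, 1.0]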

 Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Consider a concept hierarchy for
the dimension location. City values for location include Vancouver, Toronto, New
York, and Chicago. Each city, however, can be mapped to the province or state to
which it belongs. For example, Vancouver can be mapped to British Columbia,
and Chicago to Illinois. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings
form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
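
Such a hierarchy can be represented by simple mappings and used to generalize a low-level city value to a higher-level concept. The dictionaries below are a small sketch built only from the cities mentioned above.

    # A small location concept hierarchy: city -> province/state -> country.
    city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                     "New York": "New York", "Chicago": "Illinois"}
    state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                        "New York": "United States", "Illinois": "United States"}

    def generalize(city, level):
        # Climb the hierarchy: level 0 = city, 1 = province/state, 2 = country.
        state = city_to_state[city]
        return [city, state, state_to_country[state]][level]

    print(generalize("Vancouver", 2))   # Canada
    print(generalize("Chicago", 1))     # Illinois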


 Patterns and Models

A pattern is an entity used to represent an abstract concept or a physical object. It may contain several attributes or features to characterize an object. Data preprocessing removes irrelevant information and extracts the key features of the data to simplify a pattern recognition problem without throwing away any important information. Patterns provide a common framework for describing problems and solutions.

Models are a cornerstone of design. Engineers build a model of a car to work out any details before putting it into production. In the same manner, system designers develop models to explore ideas and improve the understanding of the database design.

A data model is a graphical view of data created for analysis and design purposes. Data modeling means designing data warehouse databases in detail. It can be defined as an integrated collection of concepts that can be used to describe the structure of the database, including data types, relationships between data, and constraints that should apply to the data.

A data model comprises the following three components:

1) Structural part:- It consists of a set of rules according to which the database can be constructed.

2) Manipulative part:- It defines the types of operations that are allowed on the data. This includes the operations that are used for updating or retrieving data from the database and for changing the structure of the database.

3) Integrity rules:- Rules which ensure that the data is accurate.


 Multi Dimensional Data Model

A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for an item may contain the attributes item name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or measures, and keys to the related dimension tables.

Logical cubes:- Logical cubes are designed to organize measures that have the same shape or dimensions. Measures in the same cube have the same relationship to other logical objects and can easily be analyzed and displayed together.


Logical measures:- With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions. The measures are organized according to the dimensions, which typically include a time dimension.
Logical dimensions:- Dimensions contain a set of unique values that identify and categorize data. Dimensions represent the different views of an entity that an organization is interested in. For example, a store will create a sales data warehouse in order to keep track of the store sales with respect to different dimensions such as time, branch and location.

 Data Cube

When data is grouped or combined in multidimensional matrices, these are called data cubes. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping part alone; and so on.
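
The eight views correspond to the 2^3 possible group-bys of the three dimensions part, supplier and customer (including the empty group-by, i.e. the grand total). A small pandas sketch of materializing them; the rows are invented for illustration.

    # Materializing all 2^3 cuboids of sales(part, supplier, customer, sale_price).
    from itertools import combinations
    import pandas as pd

    sales = pd.DataFrame({
        "part":       ["p1", "p1", "p2"],
        "supplier":   ["s1", "s2", "s1"],
        "customer":   ["c1", "c1", "c2"],
        "sale_price": [10, 20, 30],
    })

    dims = ["part", "supplier", "customer"]
    cuboids = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            if group:      # e.g. the 'p' view, the 'ps' view, ..., the 'psc' view
                cuboids[group] = sales.groupby(list(group))["sale_price"].sum()
            else:          # the empty group-by: total sales over the whole relation
                cuboids[group] = sales["sale_price"].sum()

    print(cuboids[("part",)])   # total sales grouped by part alone (the 'p' view)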


Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).

3-Dimensional Cuboids

Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.


 Schemas For Multidimensional Data

Multidimensional schemas are especially designed to model data warehouse systems. The schemas are designed to address the unique needs of very large databases designed for the analytical purpose (OLAP).

Types of Data Warehouse Schema:

Following are the 3 chief types of multidimensional schemas, each having its unique advantages.

 Star Schema
 Snowflake Schema
 Galaxy Schema

1. Star Schema

In the STAR schema, the center of the star can have one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema is the simplest type of data warehouse schema. It is also known as the Star Join Schema and is optimized for querying large data sets.

In the following example, the fact table is at the center and contains keys to every dimension table, like Dealer_ID, Model_ID, Date_ID, Product_ID, Branch_ID, and other attributes like Units sold and Revenue.


Characteristics of Star Schema:

 Every dimension in a star schema is represented with only a one-dimension table.
 The dimension table should contain the set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.
 The schema is widely supported by BI tools.

2. Snowflake Schema

A SNOWFLAKE SCHEMA is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema, and it adds additional dimensions. The dimension tables are normalized, which splits data into additional tables.

In the following example, Country is further normalized into an individual table.


Characteristics of Snowflake Schema:

 The main benefit of the snowflake schema is that it uses smaller disk space.
 It is easier to implement a dimension that is added to the schema.
 Due to multiple tables, query performance is reduced.
 The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance efforts because of the additional lookup tables.

Star Vs Snowflake Schema: Key Differences

1. Star Schema: Hierarchies for the dimensions are stored in the dimensional table. Snowflake Schema: Hierarchies are divided into separate tables.
2. Star Schema: It contains a fact table surrounded by dimension tables. Snowflake Schema: One fact table is surrounded by dimension tables which are in turn surrounded by dimension tables.
3. Star Schema: Only a single join creates the relationship between the fact table and any dimension table. Snowflake Schema: Many joins are required to fetch the data.
4. Star Schema: Simple DB design. Snowflake Schema: Very complex DB design.
5. Star Schema: Denormalized data structure, and queries also run faster. Snowflake Schema: Normalized data structure.
6. Star Schema: High level of data redundancy. Snowflake Schema: Very low level of data redundancy.
7. Star Schema: A single dimension table contains aggregated data. Snowflake Schema: Data is split into different dimension tables.
8. Star Schema: Cube processing is faster. Snowflake Schema: Cube processing might be slow because of the complex joins.
9. Star Schema: Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. Snowflake Schema: Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.

3. Fact Constellation Schema (Galaxy Schema)

A GALAXY SCHEMA contains two fact tables that share dimension tables between them. It is also called a Fact Constellation Schema. The schema is viewed as a collection of stars, hence the name Galaxy Schema.

As you can see in the above example, there are two fact tables:

1. Revenue
2. Product

In a Galaxy schema, shared dimensions are called conformed dimensions.

Characteristics of Galaxy Schema:

 The dimensions in this schema are separated into separate dimensions based
on the various levels of hierarchy.
 For example, if geography has four levels of hierarchy like region, country,
state, and city then Galaxy schema should have four dimensions.
 Moreover, it is possible to build this type of schema by splitting the one-star
schema into more Star schemes.
 The dimensions are large in this schema which is needed to build based on
the levels of hierarchy.
 This schema is helpful for aggregating fact tables for better understanding.

 Data Warehouse Architecture

A data warehouse architecture is a method of defining the overall architecture of data communication, processing and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support the user's ad hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Operational System

An operational system is a method used in data warehousing to refer to a system that is used to process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data


This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.

End-User Access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools.

Examples of end-user access tools are:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process the operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.


Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.

 Data Warehouse Design


A data warehouse is a single data repository where a record from multiple data
sources is integrated for online business analytical processing (OLAP). This
implies a data warehouse needs to meet the requirements from all the business
stages within the entire organization. Thus, data warehouse design is a hugely
complex, lengthy, and hence error-prone process. Furthermore, business analytical
functions change over time, which results in changes in the requirements for the
systems. Therefore, data warehouse and OLAP systems are dynamic, and the
design process is continuous.

There are two approaches

1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach

In the "top-down" design approach, a data warehouse is described as a subject-oriented, time-variant, non-volatile and integrated data repository for the entire enterprise. Data from different sources are validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The data warehouse stores "atomic" information, the data at the lowest level of granularity, from which dimensional data marts can be built by selecting the data required for specific business subjects or particular departments. It is a data-driven approach, as the information is gathered and integrated first and then business requirements by subjects for building data marts are formulated. The advantage of this method is that it supports a single integrated data source, so data marts built from it will be consistent when they overlap.

Advantages of top-down design

 Data marts are loaded from the data warehouse.
 Developing a new data mart from the data warehouse is very easy.

Disadvantages of top-down design

 This technique is inflexible to changing departmental needs.
 The cost of implementing the project is high.


Bottom-Up Design Approach

In the "bottom-up" approach, a data warehouse is described as "a copy of transaction data specifically architected for query and analysis," termed the star schema. In this approach, a data mart is created first to provide the necessary reporting and analytical capabilities for particular business processes (or subjects). Thus it is a business-driven approach, in contrast to Inmon's data-driven approach.

The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data warehouse. Also, the risk of failure is lower. This method is inherently incremental, and it allows the project team to learn and grow.


Advantages of bottom-up design

 Documents can be generated quickly.
 The data warehouse can be extended to accommodate new business units.
 It is just developing new data marts and then integrating them with other data marts.

Disadvantages of bottom-up design

 The locations of the data warehouse and the data marts are reversed in the
bottom-up approach design.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

1. Top-Down: Breaks the vast problem into smaller subproblems. Bottom-Up: Solves the essential low-level problems and integrates them into a higher one.
2. Top-Down: Inherently architected; not a union of several data marts. Bottom-Up: Inherently incremental; can schedule essential data marts first.
3. Top-Down: Single, central storage of information about the content. Bottom-Up: Departmental information stored.
4. Top-Down: Centralized rules and control. Bottom-Up: Departmental rules and control.
5. Top-Down: It includes redundant information. Bottom-Up: Redundancy can be removed.
6. Top-Down: It may see quick results if implemented with repetitions. Bottom-Up: Less risk of failure, favorable return on investment, and proof of techniques.


 Three-Tier Data Warehouse Architecture

Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse
database server. It is the relational database system. We use the back end
tools and utilities to feed data into the bottom tier. These back end tools and
utilities perform the Extract, Clean, Load, and refresh functions.
 Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations on
multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly
implements the multidimensional data and operations.
 Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of data warehouse


From the architecture point of view, there are three data warehouse models:

1. Enterprise warehouse
2. Data marts
3. Virtual warehouse

Enterprise Warehouse:- An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems and from external information providers. It requires extensive business modeling and may take many years to design and build.

Data Marts:- A data mart consists of a subset of corporate-wide data that is of value to a specific group of users. The scope is restricted to specific selected subjects. The data contained in a data mart tend to be summarized.

Virtual Warehouse:- A virtual warehouse is a set of views over operational databases. A virtual warehouse is essentially a business database. The data found in a virtual warehouse is usually copied from multiple sources throughout a production system.

 OLAP

OLAP (online analytical processing) is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view. OLAP business intelligence queries often aid in trend analysis, financial reporting, sales forecasting, budgeting and other planning purposes.

For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.

Working of an OLAP system

To facilitate this kind of analysis, data is collected from multiple data sources, stored in data warehouses, then cleansed and organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time period) derived from dimension tables in the data warehouses. Dimensions are then populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.

Basic analytical operations of OLAP

Four types of analytical operations in OLAP are:

1. Roll-up

2. Drill-down
3. Slice and dice
4. Pivot (rotate)

1) Roll-up:

Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in 2 ways:

1. Reducing dimensions
2. Climbing up the concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.

Consider the following diagram (a small roll-up sketch in Python follows this list):

 In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
 The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up.
 In this aggregation process, data is aggregated as the location hierarchy moves up from city to country.
 In the roll-up process at least one or more dimensions need to be removed. In this example, the Quarter dimension is removed.
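
A minimal pandas sketch of this roll-up; the 440 and 1560 figures come from the example above, while the country mapping and the extra Toronto row are invented so that the grouping is visible.

    # Roll-up along the location hierarchy: city -> country.
    import pandas as pd

    cube = pd.DataFrame({
        "city":    ["New Jersey", "Los Angeles", "Toronto"],
        "country": ["USA", "USA", "Canada"],
        "sales":   [440, 1560, 395],
    })

    # Climbing up the concept hierarchy: aggregate city-level sales to country level.
    rolled_up = cube.groupby("country")["sales"].sum()
    print(rolled_up["USA"])   # 2000  (440 + 1560)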

2) Drill-down

In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:

 Moving down the concept hierarchy
 Increasing a dimension

Consider the diagram above:

 Quarter Q1 is drilled down to the months January, February, and March. Corresponding sales are also registered.
 In this example, the dimension month is added.

3) Slice:

Here, one dimension is selected, and a new sub-cube is created.

The following diagram explains how the slice operation is performed:

 The dimension Time is sliced with Q1 as the filter.
 A new cube is created altogether.

Dice:

This operation is similar to a slice. The difference is that in dice you select 2 or more dimensions, which results in the creation of a sub-cube. A small slice-and-dice sketch in Python is given below.
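
Slice and dice are simply selections on the cube. A small pandas sketch with invented rows: the slice fixes Time = Q1, while the dice selects on two dimensions at once.

    # Slice (one dimension fixed) and dice (selection on two or more dimensions).
    import pandas as pd

    cube = pd.DataFrame({
        "time":     ["Q1", "Q1", "Q2", "Q2"],
        "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
        "sales":    [605, 825, 680, 952],
    })

    slice_q1 = cube[cube["time"] == "Q1"]                 # slice: time = Q1
    dice = cube[(cube["time"].isin(["Q1", "Q2"])) &       # dice: sub-cube on both
                (cube["location"] == "Vancouver")]        # time and location
    print(slice_q1)
    print(dice)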


4) Pivot

In Pivot, you rotate the data axes to provide a substitute presentation of data.

In the following example, the pivot is based on item types.


 Types of OLAP systems

OLAP Hierarchical Structure

 Relational OLAP (ROLAP): ROLAP is an extended RDBMS along with multidimensional data mapping to perform the standard relational operations.

 Multidimensional OLAP (MOLAP): MOLAP implements operations in multidimensional data.

 Hybrid Online Analytical Processing (HOLAP): In the HOLAP approach the aggregated totals are stored in a multidimensional database while the detailed data is stored in the relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.

 Desktop OLAP (DOLAP): In Desktop OLAP, a user downloads a part of the data from the database locally, or onto their desktop, and analyzes it. DOLAP is relatively cheaper to deploy as it offers very few functionalities compared to other OLAP systems.

 Web OLAP (WOLAP): Web OLAP is an OLAP system accessible via a web browser. WOLAP has a three-tiered architecture. It consists of three components: client, middleware, and a database server.

 Mobile OLAP: Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.

 Spatial OLAP (SOLAP): SOLAP is created to facilitate the management of both spatial and non-spatial data in a Geographic Information System (GIS).

ROLAP

ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest growing type of OLAP.

Advantages of the ROLAP model:

 High data efficiency. It offers high data efficiency because query performance and access language are optimized particularly for multidimensional data analysis.
 Scalability. This type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.

Drawbacks of the ROLAP model:

 Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
 Aggregate data limitations: ROLAP tools use SQL for all calculation of aggregate data, which is limiting for complex computations.
 Slow query performance: Query performance in this model is slow when compared with MOLAP.

MOLAP

MOLAP uses array-based multidimensional storage engines to display multidimensional views of data. Basically, they use an OLAP cube.

Hybrid OLAP

Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:

1. Aggregated or computed data is stored in a multidimensional OLAP cube.
2. Detailed information is stored in a relational database.

Benefits of Hybrid OLAP:

 This kind of OLAP helps to economize on disk space, and it also remains compact, which helps to avoid issues related to access speed and convenience.
 HOLAP uses cube technology, which allows faster performance for all types of data.
 ROLAP data is instantly updated, and HOLAP users have access to this real-time, instantly updated data, while MOLAP brings cleaning and conversion of data, thereby improving data relevance. This brings the best of both worlds.

Drawbacks of Hybrid OLAP:

 Greater complexity level: The major drawback of HOLAP systems is that they support both ROLAP and MOLAP tools and applications. Thus, they are very complicated.
 Potential overlaps: There are higher chances of overlapping, especially in their functionalities.

Advantages of OLAP

 OLAP is a platform for all types of business analysis, including planning, budgeting, reporting, and analysis.
 Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
 Quickly create and analyze "what if" scenarios.
 Easily search the OLAP database for broad or specific terms.
 OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
 Allows users to slice and dice cube data by various dimensions, measures, and filters.
 It is good for analyzing time series.
 Finding clusters and outliers is easy with OLAP.
 It is a powerful online analytical processing and visualization system which provides faster response times.

Disadvantages of OLAP

 OLAP requires organizing data into a star or snowflake schema. These schemas are complicated to implement and administer.
 You cannot have a large number of dimensions in a single OLAP cube.
 Transactional data cannot be accessed with an OLAP system.
 Any modification in an OLAP cube needs a full update of the cube. This is a time-consuming process.

OLAP software then locates the intersection of dimensions, such as all products
sold in the Eastern region above a certain price during a certain time period, and
displays them. The result is the "measure"; each OLAP cube has at least one to
perhaps hundreds of measures, which are derived from information stored in fact
tables in the data warehouse.

Types of OLAP systems

OLAP (online analytical processing) systems typically fall into one of three types:

 Multidimensional OLAP (MOLAP) is OLAP that indexes directly into a multidimensional database.

 Relational OLAP (ROLAP) is OLAP that performs dynamic multidimensional analysis of data stored in a relational database.

 Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP was developed to combine the greater data capacity of ROLAP with the superior processing capability of MOLAP.

 Indexing & Querying in OLAP

To facilitate efficient data access, most data warehouse systems support index structures and materialized views. Two indexing techniques that are popular for OLAP data are:

 Bitmap Indexing
 Join Indexing

1) Bitmap Indexing

 The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
 A bitmap index is a very efficient method for storing sparse data columns. Sparse data columns are ones which contain data values from a very small set of possibilities.
 In the bitmap index for a given attribute, there is a distinct bit vector for each value V in the domain of the attribute.
 If the domain for the attribute consists of n values, then n bits are needed for each entry in the bitmap index.
 The length of each bit vector is equal to the number of records in the base table.
 It is not suitable for high-cardinality domains.

A small sketch of building a bitmap index is given below.
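
A minimal sketch of building the bit vectors for a low-cardinality column; the column name and values are invented for illustration.

    # Building a bitmap index: one bit vector per distinct value of the attribute.
    rows = ["Home", "Office", "Home", "Web", "Office", "Home"]   # a 'channel' column

    bitmap = {}
    for position, value in enumerate(rows):
        # Each distinct value gets a bit vector as long as the base table.
        vec = bitmap.setdefault(value, [0] * len(rows))
        vec[position] = 1

    print(bitmap["Home"])   # [1, 0, 1, 0, 0, 1]
    # Selections such as channel = 'Home' OR channel = 'Web' are then answered
    # with fast bitwise operations over these vectors.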

2) Join Indexing

 A join index is an index built on the result of a join.
 The join indexing method gained popularity from its use in relational database query processing.
 In data warehousing, join indexing is especially useful in the star schema model to join the records of the fact table with the corresponding dimension table.
 Consider two relations R(RID, A) and S(B, SID) that join on attributes A and B. Then the join index contains the pairs (RID, SID), where RID and SID are record identifiers from the R and S relations. A small sketch follows below.
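
A small sketch of such a join index for the relations R(RID, A) and S(B, SID) described above; the tuples are invented.

    # Join index: precomputed (RID, SID) pairs for the rows of R and S that join on A = B.
    R = [(1, "TV"), (2, "Radio"), (3, "TV")]          # tuples of (RID, A)
    S = [("TV", 10), ("Phone", 11), ("Radio", 12)]    # tuples of (B, SID)

    join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]
    print(join_index)   # [(1, 10), (2, 12), (3, 10)]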

 Querying in OLAP

OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions. The source data for OLAP is online transactional processing (OLTP) databases that are commonly stored in data warehouses.

OLAP is implemented in a multi-user client/server environment and offers consistently rapid response to queries, regardless of database size and complexity.

The purpose of constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows:


1. Determine which operations should be performed on the available cuboids:- This involves transforming any selection, projection, roll-up and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.

2. Determine to which materialized cuboids the relevant operations should be applied:- This involves identifying all of the materialized cuboids that may potentially be used to answer the query, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.

 OLAM (Online Analytical Mining)

Online analytical mining integrates online analytical processing (OLAP) and data mining. It represents a promising direction for mining large databases and data warehouses.

Importance of OLAM
OLAM is important for the following reasons −
 High quality of data in data warehouses − The data mining tools are
required to work on integrated, consistent, and cleaned data. These steps are
very costly in the preprocessing of data. The data warehouses constructed
by such preprocessing are valuable sources of high quality data for OLAP
and data mining as well.
 Available information processing infrastructure surrounding data
warehouses − Information processing infrastructure refers to accessing,
integration, consolidation, and transformation of multiple heterogeneous
databases, web-accessing and service facilities, reporting and OLAP
analysis tools.
 OLAP−based exploratory data analysis − Exploratory data analysis is
required for effective data mining. OLAM provides facility for data mining
on various subset of data and at different levels of abstraction.
 Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

 OLAM Architecture

An OLAM engine performs analytical mining on data cubes in a similar manner as an OLAP engine performs online analytical processing. Therefore, it is suggested to have an integrated OLAM and OLAP architecture, where the OLAM and OLAP engines both accept users' online queries via a graphical user interface.

An OLAM engine can perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering and time series analysis. Therefore, it usually consists of multiple, integrated data mining modules, making it more sophisticated than an OLAP engine. There is no fundamental difference between the data cube required for OLAM and that required for OLAP, although OLAM analysis might require more powerful data cube construction and accessing tools.

 Efficient methods of cube computation

Data cube computation is an important task in data warehouse implementation. The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing. However, such computation is challenging since it may require large computational time and storage space. This section explores efficient methods for data cube computation.

Multiway Array Aggregation for full cube computation

 The multiway array aggregation (or simply MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure.
 It is a typical MOLAP (Multidimensional Online Analytical Processing) approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations.

A different approach is developed for the array-based cube construction, as follows:

1. Partition the array into chunks:- A chunk is a sub-cube that is small enough to fit into the memory available for cube computation. Chunking is a method for dividing an N-dimensional array into small N-dimensional chunks, where each chunk is stored as an object on disk. The chunks are compressed so as to remove wasted space resulting from empty array cells.

2. Compute aggregates by visiting cube cells:- The order in which cells are visited can be optimized so as to minimize the number of times that each cell must be revisited, thereby reducing memory access and storage costs.

BUC (Bottom-Up Construction): Computing iceberg cubes from the apex cuboid downward

 BUC (bottom-up construction) is an algorithm for the computation of sparse and iceberg cubes.
 Unlike MultiWay, BUC constructs the cube from the apex cuboid towards the base cuboid. This allows BUC to share data partitioning costs.
 This representation of a lattice of cuboids, with the apex at the top and the base at the bottom, is commonly accepted in data warehousing. It consolidates the notions of drill-down and roll-up.

Star Cubing: computing iceberg cubes using a dynamic star-tree structure

 Star Cubing integrates top-down and bottom-up cube computation and explores multidimensional aggregation.
 It operates from a data structure called a star-tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.
 A key idea behind Star Cubing is the concept of shared dimensions.
 The order of computation is from the base cuboid upwards towards the apex cuboid. This order of computation is similar to that of MultiWay.


 Discovery Driven Exploration of data cubes

A data cube may have a large number of cuboids and each cuboids, and each
cuboid may contain in large number of cells. With such an extremely large space,
it becomes a burden for users just browse a cube. Tools need to be developed to
assist users in intelligently exploring the huge aggregated space of a data cube.

Discovery driven exploration is such a cube exploration approach. The main


features of this approach are:

 In discovery driven exploration, pre computed measures or procedures


indicating data exceptions are used to guide the user in the data analysis
process, at all levels of aggregation.

 In this approach, an exception is a data cube cell’s value that is significantly


different from the expected value, based on a statistical model.

 This approach considers variations and patterns in the measure's value across all of the dimensions to which a cell belongs.

 Visual cues, such as background color, are used to reflect the degree of exception of each cell, based on the precomputed exception indicators.

 The computation of exception indicators can be overlapped with cube


construction, so that the overall construction of data cubes for discovery
driven exploration is efficient.


 Three measures are used as exception indicators to help identify data


anomalies.

1. SelfExp: - This indicates the degree of surprise of the cell value,


relative to other cells at the same levels of aggregation.

2. InExp: - This indicates the degree of surprise somewhere under the


cell, if we were to drill down from it.

3. PathExp: - This indicates the degree of surprise for each drill down
path from the cell.

 Attribute Oriented Induction In Data Mining - Data


Characterization
Attribute-Oriented Induction

The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989 (KDD '89 workshop), a few years before the introduction of the data cube approach.

The data cube approach can be considered as a data warehouse-based, precomputation-oriented, materialized approach.

It performs off-line aggregation before an OLAP or data mining query is submitted


for processing.

On the other hand, the attribute-oriented induction approach is, at least in its initial proposal, a relational database query-oriented, generalization-based, online data analysis technique.

However, there is no inherent barrier distinguishing the two approaches based on


online aggregation versus offline precomputation.

Some aggregations in the data cube can be computed online, while offline precomputation of the multidimensional space can speed up attribute-oriented induction as well.

What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; data mining would be more appropriately named knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. The key properties of data mining are: automatic discovery of patterns, prediction of likely outcomes, creation of actionable information, and a focus on large datasets and databases.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities.

Data Mining Methodology

Tasks of Data Mining

Data mining involves six common classes of tasks:

Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.

Association rule learning (Dependency modelling) – Searches for relationships between


variables. For example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently bought


together and use this information for marketing purposes. This is sometimes referred to as
market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including visualization and report generation.

Architecture of Data Mining

A typical data mining system may have the following major components.


1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds,
and metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to help
focus the search, and performing exploratory datamining based on the intermediate data mining
results. In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.


Data Mining Process:


Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data. The general experimental procedure adapted to data-mining problems
involves the following steps:

1. State the problem and formulate the hypothesis: Most data-based modeling studies are
performed in a particular application domain. Hence, domain-specific knowledge and experience
are usually necessary to come up with a meaningful problem statement. Unfortunately, many
application studies tend to focus on the data-mining technique at the expense of a clear problem
statement. In this step, a modeler usually specifies a set of variables for the unknown dependency
and, if possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the combined expertise of the application domain and of data-mining modeling. In practice, it usually means a close interaction between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire
data-mining process.

2. Collect the data: This step is concerned with how the data are generated and collected. In
general, there are two distinct possibilities. The first is when the data-generation process is under
the control of an expert (modeler): this approach is known as a designed experiment. The second
possibility is when the expert cannot influence the data- generation process: this is known as the
observational approach. An observational setting, namely, random data generation, is assumed in
most data-mining applications. Typically, the sampling distribution is completely unknown after
data is collected, or it is partially and implicitly given in the data-collection procedure. It is very
important, however, to understand how data collection affects its theoretical distribution, since
such a priori knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model and the data
used later for testing and applying a model come from the same, unknown, sampling distribution.


If this is not the case, the estimated model cannot be successfully used in a final application of
the results.

3. Preprocessing the data: In the observational setting, data are usually "collected" from the
existing databases, data warehouses, and data marts. Data preprocessing usually includes at least
two common tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with
most observations. Commonly, outliers result from measurement errors, coding and recording
errors, and, sometimes, are natural, abnormal values. Such nonrepresentative samples can
seriously affect the model produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as
variable scaling and different types of encoding. For example, one feature with the range [0, 1]
and the other with the range [−100, 1000] will not have the same weights in the applied
technique; they will also influence the final data-mining results differently. Therefore, it is
recommended to scale them and bring both features to the same weight for further analysis. Also,
application-specific encoding methods usually achieve dimensionality reduction by providing a
smaller number of informative features for subsequent data modeling. These two classes of
preprocessing tasks are only illustrative examples of a large spectrum of preprocessing activities
in a data-mining process. Data-preprocessing steps should not be considered completely
independent from other data-mining phases. In every iteration of the data-mining process, all
activities, together, could define new and improved data sets for subsequent iterations. Generally,
a good preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and encoding.
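A minimal sketch of the scaling step described above, using plain Python lists as input (the feature values are hypothetical); it rescales a wide-ranged feature with min-max normalization and, alternatively, with z-score normalization so that both features contribute comparable weights.

import statistics

# Two hypothetical features on very different scales, as in the example above.
f1 = [0.2, 0.5, 0.9, 0.1]          # already roughly in [0, 1]
f2 = [-100, 250, 1000, 400]        # range roughly [-100, 1000]

def min_max(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values at 0 with unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

print(min_max(f2))   # f2 brought into [0, 1], comparable to f1
print(z_score(f2))   # alternative: standardized f2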

4. Estimate the model: The selection and implementation of the appropriate data-mining
technique is the main task in this phase. This process is not straightforward; usually, in practice,
the implementation is based on several models, and selecting the best one is an additional task.


5. Interpret the model and draw conclusions: In most cases, data-mining models should help
in decision making. Hence, such models need to be interpretable to be useful because humans are
not likely to base their decisions on complex "black box" models. Note that the goals of accuracy
of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple
models are more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific techniques to validate the results. A user does not want hundreds of pages of numeric results. He
does not understand them; he cannot summarize, interpret, and use them for successful decision
making.

Major Issues in Data Mining


Mining different kinds of knowledge in databases. - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.

Incorporation of background knowledge. - To guide the discovery process and to express the
discovered patterns, background knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.


Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. Without such data cleaning methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns that merely represent common knowledge or lack novelty are not interesting, so measures are needed to identify the genuinely interesting ones.

Efficiency and scalability of data mining algorithms. - To effectively extract information from
huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms. - The factors such as huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.

 Association Rule Mining

Association rule mining is a technique used to find interesting and frequent patterns in transactional, spatial, temporal, or other databases, and to establish associations or relations among those patterns (also known as itemsets) in order to discover knowledge.

Association rules can be applied in various fields like network management,


catalog design, clustering, classification, marketing etc.


A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can help retailers do selective marketing and plan their shelf space.

 Market Basket Analysis

Suppose, as manager of a branch of HTC, you would like to learn more about the buying habits of your customers. You may want to know which groups or sets of items customers are likely to purchase on a given trip to the store.

To answer your question, market basket analysis may be performed on the retail
data of customer transactions at your store. The result may be used to plan
marketing or advertising strategies , as well as catalogue design. For instance,
market basket analysis may help managers design different store layouts. In one
strategy, items that are frequently purchased together can be placed in proximity to
further encourage the sale of such items together. If customers who purchase
computers also tend to buy cell phones at the same time, then placing the computer
display close to the cell phone display may help to increase the sales of both items.
In an alternative strategy, placing computers and cell phones at the opposite ends
of the store may attract the customers who purchase such items to pick up other
items along the way.

Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of computers as well as printers.


Association rule mining is a two-step process.

1. Find all frequent itemsets or patterns.

2. Generate strong association rules from the frequent itemsets.

Market basket analysis is just one form of association rule mining. In fact, there
are many kinds of association rules. Association rules can be classified in
various ways, based on the following criteria.

 Based on the types of values handled in the rule:- If a rule concerns


associations between the presence or absence of items, it is a Boolean
association rule.

If a rule describes associations between quantitative items or attributes,


then it is a quantitative association rule. In these rules, quantitative
values for items or attributes are partitioned into intervals.

 Based on the dimensions of data involved in the rule:- If the items or


attributes in an association rule reference only one dimension, then it is a
single dimensional association rule. The rule could be written as:

buys[X, "computer"] ⇒ buys[X, "cell phone"]

 Based on the levels of abstractions involved in the rule set:- Some


methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of association rules mined includes the following rules:

age[X, "30…39"] ⇒ buys[X, "laptop computer"]


age[X, "30…39"] ⇒ buys[X, "computer"]

 Based on various extensions to association mining :- Association


mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining max-patterns and frequent closed itemsets.

 Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find the (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.

Apriori Property

All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Consider the following dataset; we will find the frequent itemsets and generate association rules for them.


Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us the itemset L1.

Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
 Check whether all subsets of each itemset are frequent; if not, remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check each itemset in this way.)
 Now find the support count of these itemsets by searching in the dataset.


(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us the itemset L2.

Step-3:

 Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common; so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
 Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
 Find the support count of the remaining itemsets by searching in the dataset.


(II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us the itemset L3.

Step-4:
 Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that they should have (K-2) elements in common; so here, for L3, the first two elements (items) should match.
 Check whether all subsets of this itemset are frequent (here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent). So there is no itemset in C4.
 We stop here because no further frequent itemsets are found. A small code sketch of this level-wise procedure follows.
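A minimal, self-contained sketch of the level-wise Apriori procedure in plain Python; the transaction list below is a small hypothetical set over items I1–I5 with min_support = 2, so the exact itemsets found here need not match the figures above.

from itertools import combinations

transactions = [
    {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
    {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
    {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
]
min_support = 2

def support(itemset):
    # Support count = number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
frequent = {}
k = 1
while Lk:
    frequent.update({fs: support(fs) for fs in Lk})
    k += 1
    # Join step: merge pairs of (k-1)-itemsets that differ in one item.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = {c for c in candidates if support(c) >= min_support}

for fs, sup in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(fs), 'support =', sup)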

Limitations of Apriori algorithm


 Computationally Expensive. Even though the Apriori algorithm reduces the number of candidate itemsets to consider, this number could still be huge when store inventories are large or when the support threshold is low. An alternative is to reduce the number of comparisons by using advanced data structures, such as hash tables, to sort candidate itemsets more efficiently.
 Spurious Associations. Analysis of large inventories would involve more
itemset configurations, and the support threshold might have to be lowered
to detect certain associations. However, lowering the support threshold
might also increase the number of spurious associations detected. To ensure
that identified associations are generalizable, they could first be distilled
from a training dataset, before having their support and confidence assessed
in a separate test dataset.


 Mining Multilevel Association Rules

Multilevel association means mining the data at different levels of abstraction. For many applications, it is difficult to find strong associations among data items at low levels of abstraction due to the sparsity of data in multidimensional space.

Strong associations discovered at high concept levels may represent common-sense knowledge. However, what represents common sense to one user may seem new or novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction.

Approaches to mining multilevel Association Rules


“How can we mine multilevel association rules efficiently using concept hierarchies?” Let's look at some approaches based on a support-confidence framework.

In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working down towards the lower, more specific concept levels, until no more frequent itemsets can be found.

 Uniform minimum support for all levels (referred to as uniform


support) :- The same minimum support threshold or limit is used when
mining at each level of abstraction.


In the figure, a minimum support threshold of 5% is used throughout (e.g., for mining from "computer" downward to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, whereas "desktop computer" is not.

The uniform support approach, however, has some difficulties. It is unlikely that
items at lower levels of abstraction will occur as frequently as those at higher
levels of abstraction.

 Reduced minimum support at lower levels (referred to as reduced


support):- Each abstraction level has its own minimum support
threshold or limit. The deeper the abstraction levels, the smaller the
corresponding threshold.

 For example, in the figure, the minimum support thresholds for levels 1 and 2 are 6% and 4%, respectively. In this way, "computer", "laptop computer", and "desktop computer" are all considered frequent. A small code sketch of per-level thresholds follows.
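A minimal sketch of reduced minimum support across two concept levels in plain Python; the tiny transaction list, the item hierarchy, and the 50%/30% thresholds are hypothetical, chosen only to mirror the uniform-versus-reduced-support discussion above.

transactions = [
    {'laptop computer'}, {'laptop computer'}, {'laptop computer'},
    {'desktop computer'}, {'desktop computer'}, {'printer'},
]
# Hypothetical concept hierarchy: both kinds of computer roll up to "computer".
parent = {'laptop computer': 'computer', 'desktop computer': 'computer',
          'printer': 'office supplies'}
n = len(transactions)
min_support = {1: 0.50, 2: 0.30}   # reduced support: lower threshold at level 2

def freq(item, level):
    if level == 1:   # roll items up to their parent concept before counting
        count = sum(1 for t in transactions if any(parent[i] == item for i in t))
    else:
        count = sum(1 for t in transactions if item in t)
    return count / n

for item, level in [('computer', 1), ('laptop computer', 2), ('desktop computer', 2)]:
    status = 'frequent' if freq(item, level) >= min_support[level] else 'not frequent'
    print(item, status)

With a uniform 50% threshold, "desktop computer" (2 of 6 transactions) would not be frequent; with the reduced 30% threshold at level 2, all three items are reported as frequent, as in the discussion above.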


UNIT-IV


I. Overview of classification
Classification is a form of data analysis that extracts models describing
important data classes. Such models, called classifiers, predict categorical
(discrete, unordered) class labels. For example, we can build a classification
model to categorize bank loan applications as either safe or risky. Such
analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine
learning, pattern recognition, and statistics. Most algorithms are memory
resident, typically assuming a small data size. Recent data mining research
has been built on such work, developing scalable classification and
prediction techniques capable of handling large amounts of disk-resident
data. Classification has numerous applications, including fraud detection,
target marketing, performance prediction, manufacturing, and medical
diagnosis.
This unit focuses on classification, where a model or classifier is constructed to predict class (categorical) labels.


II. Classification process


“How does classification work?” Data classification is a two-step process,
consisting of a learning step (where a classification model is constructed)
and a classification step (where the model is used to predict class labels for
given data)

In the context of classification, data tuples can be referred to as samples,


examples, instances, data points, or objects.

The data classification process: (a) Learning: Training data are analyzed by
a classification algorithm. Here, the class label attribute is loan decision, and
the learned model or classifier is represented in the form of classification
rules. (b) Classification: Test data are used to estimate the accuracy of the
classification rules. If accuracy is considered acceptable, the rules can be
applied to the classification of new data tuples.

The accuracy of a classifier on a given test set is the percentage of test set
tuples that are correctly classified by the classifier.


III. Decision tree


A decision tree is a tree-like model that makes decisions based on a set of
rules. Each internal node represents a test on an attribute, each branch
represents the outcome of the test, and each leaf node represents a class
label. Decision trees are interpretable and widely used for classification.
IV. Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled
training tuples. A decision tree is a flowchart-like tree structure, where each
internal node (non-leaf node) denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (or terminal node)
holds a class label. The topmost node in a tree is the root node.

A decision tree for the concept buys a computer, indicating whether an All
Electronics customer is likely to purchase a computer. Each internal (non-leaf)
node represents a test on an attribute. Each leaf node represents a class (either buys
computer = yes or buys computer = no).


Suppose we want to build a decision tree to predict whether a person is likely to


buy a new car based on their demographic and behavior data. The decision tree
starts with the root node, which represents the entire dataset. The root node splits
the dataset based on the “income” attribute. If the person’s income is less than or
equal to Rs.50,000, the decision tree follows the left branch, and if the income is
greater than Rs.50,000, the decision tree follows the right branch.

The left branch leads to a node that represents the “age” attribute. If the person’s
age is less than or equal to 30, the decision tree follows the left branch, and if the
age is greater than 30, the decision tree follows the right branch. The right branch
leads to a leaf node that predicts that the person is unlikely to buy a new car.

The left branch leads to another node that represents the “education” attribute. If
the person’s education level is less than or equal to high school, the decision tree
follows the left branch, and if the education level is greater than high school, the
decision tree follows the right branch. The left branch leads to a leaf node that
predicts that the person is unlikely to buy a new car. The right branch leads to
another node that represents the “credit score” attribute. If the person’s credit score
is less than or equal to 650, the decision tree follows the left branch, and if the
credit score is greater than 650, the decision tree follows the right branch. The left
branch leads to a leaf node that predicts that the person is unlikely to buy a new
car. The right branch leads to a leaf node that predicts that the person is likely to
buy a new car.

In summary, a decision tree is a graphical representation of all the possible


outcomes of a decision based on the input data. It is a powerful tool for modeling
and predicting outcomes in a wide range of domains, including business, finance,
healthcare, and more.

Decision tree induction is the process of constructing a decision tree from a


given dataset. It involves selecting the best attribute at each node to split the
data into subsets, recursively repeating this process until a stopping criterion
is met (e.g., a certain depth is reached).
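A minimal sketch of decision tree induction on data shaped like the car-buying example above, assuming scikit-learn is installed; the feature values, the depth limit, and the class labels are hypothetical.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training tuples: [income (thousands Rs.), age, credit score]
X = [[40, 25, 600], [45, 35, 640], [60, 28, 700], [75, 45, 620],
     [80, 32, 680], [55, 50, 710], [30, 22, 580], [90, 40, 730]]
y = ['no', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes']   # class label: buys car

# Induce the tree; the stopping criterion here is a maximum depth of 3.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3).fit(X, y)

# Inspect the learned splits and classify a new, unseen tuple.
print(export_text(tree, feature_names=['income', 'age', 'credit_score']))
print(tree.predict([[65, 29, 690]]))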


V. Attribute Selection Measures


Attribute selection measures help decide the order in which attributes are
chosen during decision tree induction.

Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split. The attribute
selection measure provides a ranking for each attribute describing the given
training tuples.

Popular measures include:

Information Gain: Measures the reduction in uncertainty about the class


label.
The attribute with the highest information gain is chosen as the splitting
attribute for node N. This attribute minimizes the information needed to
classify the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions. Such an approach minimizes
the expected number of tests needed to classify a given tuple and guarantees
that a simple (but not necessarily the simplest) tree is found. The expected information needed to classify a tuple in D is given by

Info(D) = − Σ p_i log2(p_i), summed over the m classes,

where p_i is the probability that an arbitrary tuple in D belongs to class C_i.

Gain Ratio: Adjusts information gain to handle its bias towards attributes with many values. The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values.
The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D),

where SplitInfo_A(D) = − Σ (|D_j| / |D|) log2(|D_j| / |D|), summed over the v partitions D_j produced by splitting D on attribute A.
The attribute with the maximum gain ratio is selected as the splitting
attribute. Note, however, that as the split information approaches 0, the ratio


becomes unstable. A constraint is added to avoid this, whereby the


information gain of the test selected must be large—at least as great as the
average gain over all tests examined.

Gini Index: Measures impurity or disorder in a set of data.


The Gini index is used in CART. Using the notation previously described, the Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ p_i², summed over the m classes,

where p_i is the probability that a tuple in D belongs to class C_i. The Gini index considers a binary split for each attribute.
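A minimal sketch in plain Python that computes Info(D), Gain(A), GainRatio(A), and Gini(D) as defined above; the class counts and the three-way partition below are hypothetical.

from math import log2

def info(counts):
    """Expected information (entropy): Info(D) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# D has 9 "yes" and 5 "no" tuples; attribute A splits D into three partitions.
D = [9, 5]
partitions = [[2, 3], [4, 0], [3, 2]]   # class counts in each partition of A

info_D = info(D)
info_A = sum(sum(p) / sum(D) * info(p) for p in partitions)   # Info_A(D)
gain_A = info_D - info_A                                      # information gain
split_info = info([sum(p) for p in partitions])               # SplitInfo_A(D)

print('Info(D)      =', round(info_D, 3))
print('Gain(A)      =', round(gain_A, 3))
print('GainRatio(A) =', round(gain_A / split_info, 3))
print('Gini(D)      =', round(gini(D), 3))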

VI. Overview of classifier’s accuracy


Classifier accuracy is a measure of how well a classification model correctly
predicts the class labels. It is defined as the ratio of correctly predicted
instances to the total instances. The classification accuracy is the ratio of the
number of correct predictions to the total number of input samples.
This section presents measures for assessing how good or how “accurate”
your classifier is at predicting the class label of tuples.
There are four terms we need to know that are the “building blocks” used in
computing many evaluation measures. Understanding them will make it easy
to grasp the meaning of the various measures.


 True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
 True negatives(TN): These are the negative tuples that were
correctly labeled by the classifier. Let TN be the number of true
negatives.
 False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys computer =
no for which the classifier predicted buys computer = yes). Let FP be
the number of false positives.
 False negatives (FN): These are the positive tuples that were
mislabeled as negative (e.g., tuples of class buys computer = yes for
which the classifier predicted buys computer = no). Let FN be the
number of false negatives.

VII. Evaluating classifier’s accuracy


Common metrics for evaluating classifier accuracy include:

Accuracy: The proportion of correctly classified instances. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Precision: The ratio of true positive instances to the total predicted


positive instances.

Recall (Sensitivity): The ratio of true positive instances to the total


actual positive instances.

F1 Score: The harmonic mean of precision and recall.
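A minimal sketch computing the four measures above from hypothetical TP, TN, FP, and FN counts.

# Hypothetical counts from a test set of 1000 tuples.
TP, TN, FP, FN = 90, 860, 40, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)          # sensitivity / true positive rate
f1        = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} '
      f'recall={recall:.3f} F1={f1:.3f}')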

VIII. Techniques for accuracy estimation


We will see techniques to evaluate the accuracy of classifiers.
 Holdout
In the holdout method, the given dataset is randomly divided into three subsets:
The training set is the subset of the dataset used to build predictive models.
The validation set is the subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
The test set, or unseen examples, is the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting has probably occurred.
Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set.

 Random Subsampling
 Random subsampling is a variation of the holdout method in which the holdout method is repeated K times.
 Each repetition involves randomly splitting the data into a training set and a test set.
 A model is trained on the training set, and the mean square error (MSE) is obtained from its predictions on the test set.
 As the MSE depends on the particular split, a single split is not reliable; a new split can give a new MSE, so the estimates from the K repetitions are averaged.


 Cross-Validation
 K-fold cross-validation is used when only a limited amount of data is available, to achieve an unbiased estimate of the model's performance.
 Here, we divide the data into K subsets of equal size.
 We build models K times, each time leaving out one of the subsets from training and using it as the test set.
 If K equals the sample size, this is called "leave-one-out" cross-validation. (A code sketch follows the bootstrapping notes below.)

 Bootstrapping
 Bootstrapping is a technique used to make estimations from the data by taking the average of estimates obtained from smaller data samples.
 The bootstrapping method involves iteratively resampling a dataset with replacement.
 With resampling, instead of estimating the statistic only once on the complete data, we can estimate it many times.
 Repeating this multiple times gives a vector of estimates.


 Bootstrapping can compute variance, expected value, and other


relevant statistics of these estimates.
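A minimal sketch of k-fold cross-validation and bootstrap accuracy estimation, assuming scikit-learn and NumPy are installed; the Iris data and the decision tree classifier are stand-ins for any dataset and model.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3)

# K-fold cross-validation: K models, each tested on the one fold it did not see.
scores = cross_val_score(clf, X, y, cv=5)
print('5-fold accuracy estimate:', round(float(scores.mean()), 3))

# Bootstrapping: resample the dataset with replacement, fit on each bootstrap
# sample, and score on the records left out, giving a vector of estimates.
rng = np.random.default_rng(0)
n = len(X)
accs = []
for _ in range(50):
    idx = rng.integers(0, n, n)               # n indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag records for testing
    clf.fit(X[idx], y[idx])
    accs.append(clf.score(X[oob], y[oob]))
print('bootstrap accuracy estimate:', round(float(np.mean(accs)), 3))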

IX. Increasing the accuracy of classifier


Machine learning largely relies on classification models, and the accuracy of
these models is a key performance indicator. It can be difficult to increase a
classification model's accuracy since it depends on several variables,
including data quality, model complexity, hyperparameters, and others.

We will look at a few methods for improving a classification model's accuracy.

 Data Preprocessing

o Each machine learning project must include data preprocessing since


the model's performance may be greatly impacted by the quality of the
training data. There are various processes in preprocessing, like
cleaning, normalization, and feature engineering. Here are some
recommendations for preparing data to increase a classification
model's accuracy:

o Data cleansing: Remove or correct missing values, outliers, and duplicate data points to clean up the data. Techniques like mean imputation, median imputation, or eliminating rows or columns with missing data can all be used to accomplish this.

o To make sure that all characteristics are scaled equally, normalize the
data. Techniques like min−max normalization, z−score normalization,
or log transformation can be used for this.

o Feature engineering is the process of building new features from


already existing ones to reflect the underlying data more accurately.
Techniques like polynomial features, interaction features, or feature
selection can be used for this.


 Feature Selection

o The process of choosing the most pertinent characteristics from a


dataset that might aid in classification is known as feature selection.
The complexity of the model may be reduced, and overfitting can be
avoided with the use of feature selection. Feature selection methods
include the following:

o Analysis of Correlation: The correlation between each characteristic


and the target variable is determined during a correlation analysis.
High correlation features may be used for the model.

o Sorting features according to their significance in the classification


process is known as "feature importance ranking." Techniques like
decision tree-based feature importance or permutation importance can
be used for this.

o Dimensionality Reduction: It is possible to decrease the number of


features in a dataset while keeping most of the data by using
dimensionality reduction techniques like PCA.

 Model Selection

o The accuracy of the model can be considerably impacted by the


classification algorithm selection. Various data kinds or categorization
jobs may lend themselves to different algorithms performing better.
These are a few typical categorization methods:

o Logistic Regression: A linear model that may be applied to binary


classification is logistic regression. It operates by calculating the
likelihood of a binary result depending on the properties of the input.

o Decision Trees: Decision trees are non−linear models that may be


applied to multi−class classification as well as binary classification.
Based on the input characteristics, they divide the input space into
more manageable chunks.


o Support Vector Machines (SVM): SVM is a non-linear model that may be applied to multi-class classification as well as binary classification. The method finds a hyperplane, based on the input characteristics, that maximally separates the classes in the input data.

o Random Forest: To increase the model's accuracy, random forest is an


ensemble approach that mixes different decision trees. It operates by
combining the forecasts from many decision trees.

 Hyperparameter Tuning

o Options for model configuration known as hyperparameters cannot be


inferred from data. The hyperparameters are tweaked to enhance the
model's performance. Listed below are numerous approaches to
hyperparameter tuning:
o Grid Search: In grid search, a grid of hyperparameter values is defined, and the model's performance is evaluated for each possible combination (a sketch follows this list).

o Random Search: In random search, values for the model's


hyperparameters are selected at random from a distribution, and the
model's performance is evaluated for each set of hyperparameters.

o Bayesian optimization involves using a probabilistic model to predict


how the model will perform given different values for its
hyperparameters to select the hyperparameters that will maximize the
performance of the model.
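A minimal sketch of grid search, assuming scikit-learn is installed; the estimator and the parameter grid are hypothetical choices.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {'max_depth': [2, 3, 4, 5], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)

print('best hyperparameters:', search.best_params_)
print('best cross-validated accuracy:', round(float(search.best_score_), 3))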

 Imbalanced Data

o In classification tasks, unbalanced data frequently arises when one


class has a disproportionately large number of data points compared to
the other class. Biased models might result from unbalanced data and
underperform for minority classes. The following are some methods
for dealing with unbalanced data:


o Oversampling: To equalize the number of data points in each class, oversampling entails reproducing the minority class data points (a sketch follows this list).
o Undersampling: To balance the number of data points in each class, undersampling entails randomly eliminating data points from the majority class.

o Cost-sensitive learning entails assigning different misclassification costs to different classes. This can help lessen the model's bias towards the majority class.
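A minimal sketch of random oversampling in plain Python; the 95/5 class split and the record format are hypothetical.

import random

# Hypothetical imbalanced dataset: 95 majority records, 5 minority records.
majority = [({'feature': i}, 'majority') for i in range(95)]
minority = [({'feature': i}, 'minority') for i in range(5)]

# Duplicate randomly chosen minority records until the two classes are balanced.
oversampled = minority + [random.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled
random.shuffle(balanced)

print(len(majority), len(oversampled))   # 95 95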


X. Introduction to Clustering
Clustering is the process of grouping a set of data objects into multiple
groups or clusters so that objects within a cluster have high similarity but are
very dissimilar to objects in other clusters. Dissimilarities and similarities
are assessed based on the attribute values describing the objects and often
involve distance measures. Clustering as a data mining tool has its roots in
many application areas such as biology, security, business intelligence, and
Web search.
Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets. Each subset is a cluster, such that
objects in a cluster are like one another, yet dissimilar to objects in other
clusters. The set of clusters resulting from a cluster analysis can be referred
to as a clustering. In this context, different clustering methods may generate different clusterings on the same data set. The partitioning is not performed
by humans, but by the clustering algorithm. Hence, clustering is useful in
that it can lead to the discovery of previously unknown groups within the
data.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
In business intelligence, clustering can be used to organize many customers
into groups, where customers within a group share strongly similar characteristics. This facilitates the development of business strategies for
enhanced customer relationship management. Moreover, consider a
consultant company with many projects. To improve project management,
clustering can be applied to partition projects into categories based on
similarity so that project auditing and diagnosis (to improve project delivery


and outcomes) can be conducted effectively.


Clustering has also found many applications in Web search. For example, a
keyword search may often return a very large number of hits (i.e., pages
relevant to the search) due to the extremely large number of web pages.
Clustering can be used to organize the search results into groups and present
the results in a concise and easily accessible way. Moreover, clustering
techniques have been developed to cluster documents into topics, which are
commonly used in information retrieval practice. As a data mining function,
cluster analysis can be used as a standalone tool to gain insight into the
distribution of data, to observe the characteristics of each cluster, and to
focus on a particular set of clusters for further analysis. Alternatively, it may
serve as a preprocessing step for other algorithms, such as characterization,
attribute subset selection, and classification, which would then operate on
the detected clusters and the selected attributes or features.

Requirements for Cluster Analysis


 Scalability: Many clustering algorithms work well on small data sets
containing fewer than several hundred data objects; however, a large
database may contain millions or even billions of objects, particularly
in Web search scenarios. Clustering only a sample of a given large
data set may lead to biased results. Therefore, highly scalable
clustering algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms
are designed to cluster numeric (interval-based) data. However,
applications may require clustering other data types, such as binary,
nominal (categorical), and ordinal data, or mixtures of these data


types. Recently, more and more applications need clustering


techniques for complex data types such as graphs, sequences, images,
and documents.
 Discovery of clusters with arbitrary shape: Many clustering
algorithms determine clusters based on Euclidean or Manhattan
distance measures . Algorithms based on such distance measures tend
to find spherical clusters with similar size and density. However, a
cluster could be of any shape.
 Requirements for domain knowledge to determine input
parameters: Many clustering algorithms require users to provide
domain knowledge in the form of input parameters such as the desired
number of clusters. Consequently, the clustering results may be
sensitive to such parameters. Parameters are often hard to determine,
especially for high-dimensionality data sets and where users have yet
to grasp a deep understanding of their data. Requiring the
specification of domain knowledge not only burdens users, but also
makes the quality of clustering difficult to control.
 Ability to deal with noisy data: Most real-world data sets contain
outliers and/or missing, unknown, or erroneous data. Sensor readings,
for example, are often noisy—some readings may be inaccurate due to
the sensing mechanisms, and some readings may be wrong due to
interferences from surrounding transient objects. Clustering
algorithms can be sensitive to such noise and may produce poor-
quality clusters.
 Incremental clustering and insensitivity to input order: In many applications, incremental updates (representing newer data) may arrive at any time. Some


clustering algorithms cannot incorporate this into existing clustering


structures and, instead, have to recompute a new clustering from
scratch. Clustering algorithms may also be sensitive to the input data
order. That is, given a set of data objects, clustering algorithms may
return dramatically different clustering depending on the order in
which the objects are presented.
 Capability of clustering high-dimensionality data: A data set can
contain numerous dimensions or attributes. When clustering
documents, for example, each keyword can be regarded as a
dimension, and there are often thousands of keywords. Most
clustering algorithms are good at handling low-dimensional data such
as data sets involving only two or three dimensions. Finding clusters
of data objects in a high dimensional space is challenging.
 Constraint-based clustering: Real-world applications may need to
perform clustering under various kinds of constraints. A challenging
task is to find data groups with good clustering behavior that satisfy
specified constraints.
 Interpretability and usability: Users want clustering results to be
interpretable, comprehensible, and usable. That is, clustering may
need to be tied in with specific semantic interpretations and
applications. It is important to study how an application goal may
influence the selection of clustering features and clustering methods.


XI. Types of clusters


 Partitioning methods: It is used to make partitions on the data to
form clusters. If “n” partitions are done on “p” objects of the
database, then each partition is represented by a cluster and n < p.
The two conditions which need to be satisfied by this partitioning clustering method are:
 Each object should belong to exactly one group.
 There should be no group without at least a single object.
In the partitioning method, there is one technique called iterative
relocation, which means the object will be moved from one group to
another to improve the partitioning.
 Hierarchical Method: In this method, a hierarchical decomposition
of the given set of data objects is created. Hierarchical methods can be classified based on how the hierarchical decomposition is formed. There are two types of approaches for creating the hierarchical decomposition:
 Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. Thereafter the method keeps merging the objects or groups that are close to one another, i.e., that exhibit similar properties. This merging process continues until the termination condition holds.
 Divisive Approach: The divisive approach is also known as the top-down
approach. In this approach, we would start with the data objects that are in the
same cluster. The group of individual clusters is divided into small clusters by
continuous iteration. The iteration continues until the condition of termination is
met or until each cluster contains one object.


Once the group is split or merged then it can never be undone as it is a rigid
method and is not so flexible. The two approaches which can be used to improve
the Hierarchical Clustering Quality in Data Mining are: –

 One should carefully analyze the linkages of the object at every partitioning
of hierarchical clustering.
 One can integrate hierarchical agglomeration with other clustering approaches. In this approach, the objects are first grouped into micro-clusters; macro-clustering is then performed on the micro-clusters.

 Density-Based Method: The density-based method mainly focuses on


density. In this method, the given cluster will keep on growing continuously
as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster. The radius of a given cluster must
contain at least a minimum number of points.

 Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. The major advantage of the grid-based method is its fast processing time, which typically depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.


XII. Clustering methods


1. K-Means: A Centroid-Based Technique: A centroid-based
partitioning technique uses the centroid of a cluster. ‘K’ in the name
of the algorithm represents the number of groups/clusters we want to
classify our items into. The algorithm will categorize the items into k
groups or clusters of similarity. To calculate that similarity, we will
use the Euclidean distance as a measurement.

The algorithm works as follows:

 First, we randomly initialize k points, called means or cluster


centroids.
 We categorize each item to its closest mean and we update the mean’s
coordinates, which are the averages of the items categorized in that
cluster so far.
 We repeat the process for a given number of iterations and at the end,
we have our clusters.

The “points” mentioned above are called means because they are the
mean values of the items categorized in them.
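A minimal sketch of the K-means loop just described, in plain Python with NumPy; the 2-D points and k = 2 are hypothetical.

import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                   [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
k = 2
rng = np.random.default_rng(0)
# Step 1: randomly initialize k cluster centroids (means) from the data.
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):  # a fixed number of iterations, as described above
    # Step 2: assign each point to its closest mean (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: update each mean to the average of the points assigned to it.
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(labels)      # cluster index of each point
print(centroids)   # final cluster means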


2. K-Medoids clustering: The K-Medoids (also called Partitioning Around Medoids, PAM) algorithm was proposed in 1987 by Kaufman and Rousseeuw. A medoid can be defined as the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum.
The dissimilarity of the medoid(Ci) and object(Pi) is calculated by
using E = |Pi – Ci|
 Initialize: select k random points out of the n data points as
the medoids.
 Associate each data point to the closest medoid by using
any common distance metric methods.
 While the cost decreases: for each medoid m and for each data point o which is not a medoid:
 Swap m and o, associate each data point to the closest medoid, and recompute the cost.
 If the total cost is more than that in the previous step, undo the swap.

3. BIRCH Clustering: Clustering algorithms like K-means clustering


do not perform clustering very efficiently, and it is difficult to process large datasets with a limited amount of resources. So, regular clustering algorithms do not scale well in terms of running
time and quality as the size of the dataset increases. This is where
BIRCH clustering comes in. Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH) is a clustering algorithm that


can cluster large datasets by first generating a small and compact


summary of the large dataset that retains as much information as
possible. This smaller summary is then clustered instead of clustering
the larger dataset. BIRCH is often used to complement other
clustering algorithms by creating a summary of the dataset that the
other clustering algorithm can now use. However, BIRCH has one
major drawback – it can only process metric attributes. A metric
attribute is any attribute whose values can be represented in Euclidean
space i.e., no categorical attributes should be present.
Before we implement BIRCH, we must understand two important
terms:
Clustering Feature (CF): BIRCH summarizes large datasets into
smaller, dense regions called Clustering Feature (CF) entries.
Formally, a Clustering Feature entry is defined as an ordered
triple, (N, LS, SS) where ‘N’ is the number of data points in the
cluster, ‘LS’ is the linear sum of the data points and ‘SS’ is the
squared sum of the data points in the cluster. It is possible for a CF
entry to be composed of other CF entries.
CF – Tree: The CF tree is the actual compact representation that we
have been speaking of so far. A CF tree is a tree where each leaf node
contains a sub-cluster. Every entry in a CF tree contains a pointer to a
child node and a CF entry made up of the sum of CF entries in the
child nodes. A leaf node can hold only a limited number of entries, and each leaf entry must satisfy a threshold requirement on the size (radius or diameter) of the sub-cluster it summarizes; this threshold controls how compact the CF-tree summary is.
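A minimal sketch of BIRCH, assuming scikit-learn is installed; the blob data, the threshold, and the cluster count are hypothetical.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Hypothetical numeric (metric) data: 1,000 points around 3 centers.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# threshold and branching_factor control how compact the CF-tree summary is;
# the CF-tree is built first, then its sub-clusters are grouped into 3 clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(labels[:10])                     # cluster label of the first 10 points
print(len(model.subcluster_centers_))  # number of CF sub-clusters in the summary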


4. DBSCAN Clustering: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster, the neighborhood of a given radius must contain at least a minimum number of points.

Parameters Required for DBSCAN Algorithm

 eps: It defines the neighborhood around a data point, i.e. if
the distance between two points is less than or equal to eps,
they are considered neighbors. If the eps value is chosen too
small, a large part of the data will be treated as outliers. If it
is chosen very large, the clusters will merge and most of the
data points will end up in the same cluster. One way to find a
suitable eps value is the k-distance graph.
 MinPts: The minimum number of neighbors (data points)
required within the eps radius. The larger the dataset, the
larger the value of MinPts that should be chosen. As a rule of
thumb, MinPts can be derived from the number of dimensions
D in the dataset as MinPts >= D + 1, and it should be chosen
to be at least 3.

Steps Used in DBSCAN Algorithm

 Find all the neighboring points within eps of each point and
identify the core points, i.e. points with more than MinPts
neighbors.
 For each core point, if it is not already assigned to a cluster,
create a new cluster.
 Recursively find all its density-connected points and assign
them to the same cluster as the core point.
 Points a and b are said to be density connected if there
exists a point c which has enough points in its neighborhood
and both a and b are within eps distance of it. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d,
and d is a neighbor of e, which in turn is a neighbor of a,
then b is density connected to a.
 Iterate through the remaining unvisited points in the
dataset; points that do not belong to any cluster are noise.
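
An illustrative sketch using the DBSCAN implementation in scikit-learn
(assuming that library is installed; the toy data below is made up): eps is
the neighborhood radius, min_samples plays the role of MinPts, and points
assigned to no cluster are labeled -1 (noise).

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense1 = rng.normal(loc=0.0, scale=0.2, size=(100, 2))    # first dense region
dense2 = rng.normal(loc=3.0, scale=0.2, size=(100, 2))    # second dense region
outliers = rng.uniform(low=-2.0, high=5.0, size=(10, 2))  # sparse noise points
X = np.vstack([dense1, dense2, outliers])

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                       # -1 marks noise

print("clusters found:", len(set(labels) - {-1}))
print("noise points:", list(labels).count(-1))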


XIII. Data visualization

Data visualization is the graphical representation of information and data in
a pictorial or graphical format (for example, charts, graphs, and maps). Data
visualization tools provide an accessible way to see and understand trends,
patterns, and outliers in data. Data visualization tools and technologies are
essential for analyzing massive amounts of information and making data-
driven decisions. The idea of using pictures to understand data has been in
use for centuries. General types of data visualization are charts, tables,
graphs, maps, and dashboards.
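
As a small illustration (not from the textbook), the snippet below draws one of
the most common visualization types, a bar chart, using matplotlib (assuming
that library is installed); the quarterly sales figures are invented purely for
the example.

import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]            # hypothetical sales figures

plt.bar(quarters, sales, color="steelblue")
plt.title("Quarterly sales (illustrative data)")
plt.xlabel("Quarter")
plt.ylabel("Units sold")
plt.show()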

Advantages of Data Visualization


1. Better Comparison: In business, we often need to compare the
performance of two components or two situations. The conventional
approach is to go through the massive amount of information for both
situations and then analyze it, which clearly takes a great deal of time.


2. A Better Method: Putting the information from both perspectives into a
pictorial form tackles this difficulty and gives a much better understanding
of the situation. For instance, Google Trends helps us understand data
related to top searches or queries in pictorial or graphical form.
3. Simple Sharing of Data: With visual representation of information,
organizations gain a new channel of communication. Rather than sharing
cumbersome raw data, sharing visual information engages the audience and
conveys the message in a form that is easier to absorb.
4. Sales Analysis: With the help of data visualization, a salesperson can
easily understand the sales graph of products. With visualization tools such
as heat maps, they can understand the factors that are pushing the sales
numbers up as well as the reasons that are pulling them down. Data
visualization also helps in understanding trends and other variables such as
the types of customers interested in buying, repeat customers, the effect of
geography, and so on.
5. Finding Relations Between Events: A business is influenced by a lot of
factors. Finding the relationships between these factors or events helps
decision-makers understand the issues related to their business. For
example, the e-commerce market is nothing new today. Every year during
festive seasons such as Christmas or Thanksgiving, the sales of online
businesses go up. So, if an online business doing an average of $1 million of
business in a particular quarter sees its sales rise in the next, it can quickly
identify the events that correspond to the increase.


6. Exploring Opportunities and Trends: With the vast amounts of data
available, business leaders can explore the data in depth for the trends and
opportunities around them. Using data visualization, experts can find
patterns in the behavior of their customers, thereby paving the way to
explore trends and opportunities for the business.

Disadvantages of data visualization


1. Can be time-consuming: Creating visualizations can be a time-consuming
process, especially when dealing with large and complex datasets. This can
slow down the machine learning workflow and reduce productivity.
2. Can be misleading: While data visualization can help identify patterns and
relationships in data, it can also be misleading if not done correctly.
Visualizations can create the impression of patterns or trends that may not
actually exist, leading to incorrect conclusions and poor decision-making.
3. Can be difficult to interpret: Some types of visualizations, such as those
that involve 3D or interactive elements, can be difficult to interpret and
understand. This can lead to confusion and misinterpretation of the data.
4. May not be suitable for all types of data: Certain types of data, such as
text or audio data, may not lend themselves well to visualization. In these
cases, alternative methods of analysis may be more appropriate.
5. May not be accessible to all users: Some users may have visual
impairments or other disabilities that make it difficult or impossible for them
to interpret visualizations. In these cases, alternative methods of presenting
data may be necessary to ensure accessibility.


Importance of Data Visualization


1. Data Visualization Discovers the Trends in Data
The most important thing that data visualization does is discover the trends
in data. After all, it is much easier to observe data trends when all the data is
laid out in front of you in a visual form as compared to data in a table.
2. Data Visualization Provides a Perspective on the Data
Data Visualization provides a perspective on data by showing its meaning in
the larger scheme of things. It demonstrates how particular data references
stand with respect to the overall data picture.
3. Data Visualization Puts the Data into the Correct Context
It is very difficult to understand the context of the data without data
visualization. Context, the whole set of circumstances surrounding the data,
is very hard to grasp by just reading numbers in a table.
4. Data Visualization Saves Time
It is faster to gather insights from the data using data visualization than by
just studying a table of numbers.
5. Data Visualization Tells a Data Story
Data visualization is also a medium to tell a data story to the viewers.
Visualization can be used to present the data facts in an easy-to-understand
form while telling a story and leading the viewers to an inevitable
conclusion. This data story, like any other type of story, should have a good
beginning, a basic plot, and an ending that it is leading towards.


XIV. Various data visualization tools

Data Visualization Tools are software platforms that provide information in a
visual format such as a graph, chart, etc. to make it easily understandable and
usable. Data visualization tools are popular because they allow analysts and
statisticians to create visual data models easily according to their specifications,
conveniently providing an interface, database connections, and machine learning
tools all in one place.

The following are the 10 best Data Visualization Tools


 Tableau
 Looker
 Zoho Analytics
 Sisense
 IBM Cognos Analytics
 Qlik Sense
 Domo
 Microsoft Power BI
 Klipfolio
 SAP Analytics Cloud

1. Tableau

Tableau is a data visualization tool that can be used by data analysts, scientists,
statisticians, etc. to visualize the data and get a clear opinion based on the data
analysis. Tableau is very famous as it can take in data and produce the required
data visualization output in a very short time. And it can do this while providing
the highest level of security with a guarantee to handle security issues as soon as
they arise or are found by users.


Tableau also allows its users to prepare, clean, and format their data and then
create data visualizations to obtain actionable insights that can be shared with other
users. Tableau is available for individual data analysts or at scale for business
teams and organizations. It provides a 14-day free trial followed by the paid
version.

2. Looker

Looker is a data visualization tool that can go in-depth into the data and analyze it
to obtain useful insights. It provides real-time dashboards of the data for more in-
depth analysis so that businesses can make instant decisions based on the data
visualizations obtained. Looker also provides connections with Redshift,
Snowflake, and BigQuery, as well as more than 50 SQL-supported dialects so you
can connect to multiple databases without any issues.

Looker data visualizations can be shared with anyone using any tool. Also, you can
export these files in any format immediately. It also provides customer support
wherein you can ask any question and it shall be answered. A price quote can be
obtained by submitting a form.

3. Microsoft Power BI

Microsoft Power BI is a Data Visualization platform focused on creating a data-
driven business intelligence culture in all companies today. To fulfill this, it offers
self-service analytics tools that can be used to analyze, aggregate, and share data in
a meaningful fashion.

Microsoft Power BI offers hundreds of data visualizations to its customers along
with built-in Artificial Intelligence capabilities and Excel integration facilities. And
all this is very pocket friendly at a $9.99 monthly price per user for the Microsoft


Power BI Pro. It also provides you with multiple support systems such as FAQs,
forums, and live chat support with the staff.
