DWM Compiled Notes
Class: BCA / B.Sc (IT), 5th Semester
Program: BCA
Branch: Computer Applications
Semester: 5th
L: 3  T: 1  P: 0
Credits: 4
Contact hours: 44 hours
Theory/Practical: Theory
Percentage of numerical/design problems: 20%
Internal max. marks: 40
External max. marks: 60
Total marks: 100
Duration of end semester exam (ESE): 3 hrs
Elective status: Elective
Course prerequisite: NA
Co-requisite: NA
Additional material required in ESE: NA
INDEX
Unit-I
Unit-II
Unit-III
Unit-IV
UNIT-I
Data Warehousing:
Data warehouse is a subject oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts to take informed decisions in an
organization.
There are many types of data warehouses, but these three are the most common:
A data warehouse helps executives organize, understand, and use their data to
make strategic decisions.
2. Saves time
A data warehouse standardizes, preserves, and stores data from distinct sources,
aiding the consolidation and integration of all the data. Since critical data is
available to all users, it allows them to make informed decisions on key aspects. In
addition, executives can query the data themselves with little to no IT support,
saving more time and money.
A data warehouse converts data from multiple sources into a consistent format.
Since the data from across the organization is standardized, each department will
produce results that are consistent. This will lead to more accurate data, which will
become the basis for solid decisions.
Companies that invest in a data warehouse experience higher revenues and cost
savings than those that haven't. Data warehouses also help organizations get a
holistic view of their current standing and evaluate opportunities and risks, thus
providing companies with a competitive advantage.
Data professionals can analyze business data to make market forecasts, identify
potential KPIs, and gauge predicted results, allowing key personnel to plan
accordingly.
Data warehouses are relational databases that act as data analysis tools,
aggregating data from multiple departments of a business into one data store.
Data warehouses are typically updated as an end-of-day batch job, rather than
being fed by real-time transactional data. Their primary benefit is giving
managers better and timelier data on which to base strategic decisions for the
company. However, they have some drawbacks as well.
Depending on the size of the organization, a data warehouse runs the risk of
imposing extra work on departments. Each type of data that's needed in the
warehouse typically has to be generated by the IT teams in each division of the
business.
This can be as simple as duplicating data from an existing database, but at other
times, it involves gathering data from customers or employees that wasn't
gathered before.
Cost/Benefit Ratio
Data Flexibility
Data warehouses tend to have static data sets with minimal ability to "drill
down" to specific solutions. The data is imported and filtered through a schema,
and it is often days or weeks old by the time it's actually used. In addition, data
warehouses are usually subject to ad hoc queries and are thus notoriously
difficult to tune for processing speed and query speed. While the queries are
often ad hoc, the queries are limited by what data relations were set when the
aggregation was assembled.
Difference between data warehouse and data mart
Data type: The data stored inside the data warehouse is always detailed, whereas
data marts are built for particular user groups and therefore hold a shorter,
summarized form of the data.
The source systems are fully optimized in order to process many small
transactions, such as orders, in a short time. Generating information about the
performance of the organization only requires a few large ‘transactions’ in which
large volumes of data are gathered and aggregated. The structure of a data
warehouse is specifically designed to quickly analyze such large volumes of (big)
data.
The structure of both data warehouses and data marts enables end users to report
in a flexible manner and to quickly perform interactive analysis based on various
predefined angles (dimensions). For example, with a single mouse click they may
jump from the year level to the quarter or month level, and quickly switch between
the customer dimension and the product dimension, all while the indicator remains
fixed. In this way, end users can flexibly combine the data and quickly gain
knowledge about business operations and performance indicators.
Source systems don’t usually keep a history of certain data. For example, if a
customer relocates or a product moves to a different product group, the (old)
values will most likely be overwritten. This means they disappear from the system
– or at least they’re very difficult to trace back.
Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality.
When we use a data warehouse, we can greatly improve the data quality, either
through – where possible – correcting the data while loading or by tackling the
problem at its source.
A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure
for that data warehouse and those BI tools. The following principle applies: the
better the architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support,
where necessary. Yet, what we see in practice is that many of the more complex
reports end up being created by the IT department. This is mostly due to users
lacking either the time or the knowledge.
7. Increasing findability
When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and
thus establish information and knowledge faster. All the goals of the data
warehouse serve the aims of Business Intelligence: making better decisions faster
at all levels within the organization and even across organizational boundaries.
Non-volatile − Non-volatile means the previous data is not erased when new data
is added. A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not reflected in the data
warehouse.
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
A data warehouse works best when users have a shared way of describing the
trends within each subject area. Below are the
major characteristics of a data warehouse:
Subject-oriented –
A data warehouse is subject oriented because it delivers information about a
theme rather than about the organization's ongoing operations. The data
warehousing process is designed to handle a specific, well-defined theme, such as
sales, distribution, or marketing.
A data warehouse does not put emphasis on current operations alone. Instead, it
focuses on presenting and analyzing data to support decision making. It also
delivers a simple and precise view of a particular theme by eliminating data that is
not required for the decisions.
Integrated –
Integration is closely related to subject orientation: the data must be kept in a
consistent format. Integration means establishing a common way of representing
all similar data coming from different databases, and the data must reside in the
data warehouse in a shared, generally accepted manner.
A data warehouse is built by integrating data from various sources, such as a
mainframe and a relational database. It must have consistent naming conventions,
formats, and codes. Integration of the data warehouse enables effective analysis of
data; consistency in naming conventions, attribute measures, encoding structures,
etc. must be ensured.
Time-Variant –
Data is maintained for different intervals of time, such as weekly, monthly, or
annually. The time horizon of a data warehouse is much wider than that of
operational (OLTP) systems. The data residing in the data warehouse is associated
with a specific interval of time and provides information from a historical
perspective; it contains an element of time, either explicitly or implicitly. Another
feature of time-variance is that once data is stored in the data warehouse, it cannot
be modified, altered, or updated.
Non-Volatile –
As the name suggests, the data residing in the data warehouse is permanent: data
is not erased or deleted when new data is inserted. The warehouse therefore
accumulates a very large quantity of historical data, which is read and analyzed for
the business rather than modified by day-to-day transactions.
Data Loading
Data Access
Staffing – Without cost justification, staffing with the right people may be
difficult. By having some real dollar numbers to back up a request, the request is
more likely to be satisfied.
Controlling Costs
Costs can and must be controlled. It is the project manager who has the
responsibility for controlling costs along with the other responsibilities. Adhering
to the Project Agreement is a major start for controlling costs. The Project
Agreement specifies the data that will be in the data warehouse, the periods for
which the data is kept, the number of users and predefined queries and reports.
Any one of these factors, if not held in check, will increase the cost and possibly
the schedule of the project. A primary role of the project manager will be to
control scope creep.
Additional Support
User Support staff or the Help Desk staff will be the users’ primary contact when
there are problems. Providing adequate User Support will require more people,
and more training of those people, to answer questions and help the users through
difficult situations. The cost for the additional people, the training and possibly an
upgrade in the number and knowledge-level of the staff answering the phones
must be added into the data warehouse costs.
Consultant and contractor expenses can balloon a project's cost. Consultants are
used to supplement the project team's lack of experience; contractors are used to
supplement the lack of skilled personnel. There are two types of
consultants/contractors:
Product specific contractors – These persons are brought in because they know
the product. They can either help or actually install the product, and they can tune
the product. They will customize the product, if it is necessary. The product-
specific consultants may either be in the employ of the tool vendor or may be
independent. An example of their services would be installing and using an ETL
tool to extract, transform and load data from your source files to the data
warehouse. In this activity they may be generating the ETL code on their own or
working with your people in this endeavor.
Products
The software products that support the data warehouse can be very expensive. The
first thing to consider is which categories of tools you need. Do not bring in more
categories of products than you need. Do not try to accomplish everything with
your first implementation. Be very selective.
Existing tools
Your organization most likely already has an RDBMS. Should you have to pay for
it as part of your data warehouse project? If there is a site license, there may be no
charge to your department or you may have to pay a portion of the site license.
You may have to pay if the data warehouse will be on another CPU, and if the
RDBMS is charged by CPU. You may have to pay an upgrade if the data
warehouse requires going to a larger CPU, and if there is an additional cost for the
larger CPU.
Capacity planning
The actual amount of data that will be in the warehouse is very difficult to
anticipate.
The time of day and the day of the week when the queries will be run are difficult
to predict (we know there will not be an even distribution, expecting more activity
at month-end, etc.).
The nature of the queries, the number of I/Os, and the internal processing are
almost impossible to estimate.
Hardware Costs
For the data warehouse, you will need CPUs, disks, networks and desktop
workstations. The hardware vendors can help size the machines and disks. Be
aware that unanticipated growth of the data, increased number of users and
increased usage will explode the hardware costs. Existing desktop workstations
may not be able to support the query tool. Do not ask the query tool vendor for the
minimum desktop configuration. Ask for the recommended configuration. Call
references to find out if and how they had to upgrade their desktop workstations.
There are many debates over how much disk is needed as a multiplier of the raw
data. Besides the raw data itself, space is needed for indexes, summary tables and
working space. Additional space may be needed for replicated data that may be
required for both performance and security reasons. The actual space is very
dependent on how much is indexed and how many summary tables are needed.
Existing Hardware
How should you account for existing hardware that can be used for the data
warehouse? It may mean you do not have to buy any additional hardware. The
Y2K testing may have required hardware that is now redundant and unused.
Should that be included in our data warehouse cost? It is a safe assumption that
your organization will need additional hardware in the future. By using the
redundant hardware for the data warehouse, it means that additional hardware for
non-data warehouse purposes must be purchased sooner. You may be able to defer
the cost of the redundant hardware; you will eventually have to pay. At the time
the hardware is purchased, it will undoubtedly be less than today’s costs.
Your ability to control hardware costs will depend primarily on whether your
organization has a chargeback system. Even though department heads are
supposed to have the best interests of the organization at heart, what they care
most about is meeting their performance objectives. These, of course, include the
costs assigned to their department. If department heads are paying for what they
get, they will be more thoughtful about asking for resources that may not be cost
justified. We had an experience with a user asking to store ten years' worth of
detailed data. When he was presented with the bill (an additional $1.5 million), he
decided that two years' worth of data was adequate.
These people are getting paid anyway regardless of whether we use them on this
project or not. Why should we have to include their costs in our budget? We have
to assume these people would be working on other productive projects. Otherwise,
there is no reason for the organization to keep them employed. Count on having to
include the fully burdened costs of the people on your project. Keep in mind that
you are much better off with a small team of highly skilled and dedicated workers
than with a larger team of the type of people to avoid for your project.
User Training
User training is usually done on the premises and not at a vendor site. There are
four cost areas for user training that must be considered.
The cost to engage a trainer from the outside or the time it takes for your in-house
trainer to develop and teach the class.
The time the users spend away from the job being in class, and the time it takes
them to become proficient with the tool.
If not all the users are in the same location, travel expenses for either the users or
the trainer must be included.
IT Training
On-Going Costs
Most organizations focus on the cost to implement the initial data warehouse
application and give little thought to on-going expense. Over a period of years, the
continuing cost will very likely exceed the cost of the initial application. The data
warehouse will grow in size, in the number of users and in the number of queries
and reports. The database will not remain static. New data will be added,
sometimes more than for the initial implementation and the design most probably
will change, and the database will need to be tuned. New software will be
introduced, new releases will be installed and some interfaces will have to be
rewritten. As the data warehouse grows, the hardware and network will have to be
upgraded.
The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day to day operations of the business.
The data frequently changes as updates are made and reflect the current value of
the last transactions.
Data warehouse systems serve users or knowledge workers for the purpose of data
analysis and decision making. Such systems can organize and present information
in specific formats to accommodate the diverse needs of various users. These
systems are known as Online Analytical Processing (OLAP) systems.
The up-to-date view into operational status also makes it easier for users to
diagnose problems before digging into component systems. For example, an
ODS enables service representatives to immediately find a customer order,
its status, and any troubleshooting information that might be helpful.
There are some functions within the enterprise that have to do with planning,
forecasting, and managing the organization. These functions are also critical to the
survival of the organization, especially in our current fast-paced world.
Where operational data needs are normally focused upon a single area,
informational data needs often span a number of different areas and need
large amounts of related operational data.
OPERATIONAL        INFORMATIONAL
Production         Budgeting
                   Activity-based costing
                   Promotion analysis
                   Customer analysis
                   Production planning
                   Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a
data model more intuitive to them than a tabular model. This model is called a
Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to
achieve using tabular models.
Characteristics of OLAP
The FASMI characteristics of OLAP take their name from the first letters of the
following characteristics:
Fast
The system should deliver most responses to the user within about five seconds,
with the simplest analyses taking no more than one second and very few taking
more than 20 seconds.
Analysis
The system should be able to cope with any business logic and statistical analysis
that is relevant to the application and the user, while keeping it easy enough for
the target user. Although some pre-programming may be needed, the user must be
able to define new ad hoc calculations as part of the analysis and to report on the
data in any desired way without having to program. Products (such as Oracle
Discoverer) that do not allow adequate end-user-oriented calculation flexibility are
therefore excluded.
Share
The system should implement all the security requirements for confidentiality
and, if multiple write access is needed, concurrent update locking at an
appropriate level. Not all applications need users to write data back, but for the
increasing number that do, the system should be able to handle multiple updates
in a timely, secure manner.
Multidimensional
The system must provide a multidimensional conceptual view of the data,
including full support for hierarchies.
Information
The system should be able to hold all the data needed by the applications. Data
sparsity should be handled in an efficient manner.
Multi-User Support: Since OLAP techniques are shared, the OLAP operation
should provide normal database operations, including retrieval, update,
concurrency control, integrity, and security.
Storing OLAP results: OLAP results are kept separate from data sources.
OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
OLAP system should ignore all missing values and compute correct aggregate
values.
OLAP facilitates interactive querying and complex analysis for the users.
OLAP allows users to drill down for greater detail or roll up for aggregations of
metrics along a single business dimension or across multiple dimensions.
Benefits of OLAP
OLAP offers several benefits for businesses:
Types of OLAP
There are three main types of OLAP servers:
HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In
essence, each method of slicing and dicing is equivalent to adding a "WHERE"
clause to the SQL statement.
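To illustrate this point, the short Python sketch below (using the standard sqlite3 module and an invented sales table; the table and column names are assumptions, not part of any particular ROLAP product) shows how a slice and a dice each reduce to adding WHERE conditions to the same base aggregation query.

import sqlite3

# Build a tiny, hypothetical fact table in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("East", "Ball", 2023, 100.0),
    ("East", "Bat", 2023, 150.0),
    ("West", "Ball", 2023, 200.0),
    ("West", "Ball", 2024, 250.0),
])

base = "SELECT region, product, SUM(amount) FROM sales"

# Slice: fix a single dimension value (year = 2023) -> one WHERE condition.
slice_sql = base + " WHERE year = 2023 GROUP BY region, product"

# Dice: restrict two or more dimensions -> additional WHERE conditions.
dice_sql = base + " WHERE year = 2023 AND region = 'East' GROUP BY region, product"

print(list(con.execute(slice_sql)))
print(list(con.execute(dice_sql)))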
ROLAP architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.
Some products in this segment provide strong SQL engines to handle the
complexity of multidimensional analysis. This includes creating multiple SQL
statements to handle user requests, being 'RDBMS'-aware, and being capable of
generating SQL statements tuned to the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP
technology depends on the data size of the underlying RDBMS, so ROLAP itself
does not restrict the amount of data.
Disadvantages
MOLAP
One of the significant distinctions of MOLAP from ROLAP is that the data is
summarized and stored in an optimized format in a multidimensional cube,
instead of in a relational database. In the MOLAP model, data is structured into
proprietary formats according to the client's reporting requirements, with the
calculations pre-generated on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components
Database server.
MOLAP server.
Front-end tool.
The MOLAP structure primarily reads precompiled data and has limited
capabilities to dynamically create aggregations or to evaluate results that have not
been pre-calculated and stored.
Advantages
Excellent Performance: A MOLAP cube is built for fast information
retrieval, and is optimal for slicing and dicing operations.
Disadvantages
Limited in the amount of information it can handle: Because all
calculations are performed when the cube is built, it is not possible to
contain a large amount of data in the cube itself.
HOLAP incorporates the best features of MOLAP and ROLAP into a single
architecture. HOLAP systems save more substantial quantities of detailed data in
the relational tables while the aggregations are stored in the pre-calculated cubes.
HOLAP can also drill through from the cube down to the relational tables for
detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
Advantages of HOLAP
HOLAP provide benefits of both MOLAP and ROLAP.
HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server and the detail record remains in the
relational database. So no duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP and
ROLAP servers.
Other Types
There are also less popular types of OLAP styles which one may stumble upon
every so often. We have listed some of the less popular brands existing in the
OLAP industry.
WOLAP pertains to OLAP application which is accessible via the web browser.
Unlike traditional client/server OLAP applications, WOLAP is considered to have
a three-tiered architecture which consists of three components: a client, a
middleware, and a database server.
DOLAP permits a user to download a section of the data from the database or
source, and work with that dataset locally, or on their desktop.
Mobile OLAP enables users to access and work on OLAP data and applications
remotely through the use of their mobile devices.
Comparison of ROLAP, MOLAP, and HOLAP storage modes:
ROLAP stands for Relational Online Analytical Processing, MOLAP stands for
Multidimensional Online Analytical Processing, and HOLAP stands for Hybrid
Online Analytical Processing.
Storage of aggregations: The ROLAP storage mode causes the aggregations of the
partition to be stored in indexed views in the relational database that was specified
in the partition's data source. In the MOLAP storage mode, the aggregations of the
partition and a copy of its source data are stored in a multidimensional structure in
Analysis Services when the partition is processed. The HOLAP storage mode
combines attributes of both MOLAP and ROLAP; like MOLAP, HOLAP causes
the aggregations of the partition to be stored in a multidimensional structure in an
SQL Server Analysis Services instance.
Copy of source data: ROLAP does not cause a copy of the source data to be stored
in the Analysis Services data folders; instead, when the result cannot be derived
from the query cache, the indexed views in the data source are accessed to answer
queries. The MOLAP structure is highly optimized to maximize query
performance; its storage area can be on the computer where the partition is defined
or on another computer running Analysis Services, and because a copy of the
source data resides in the multidimensional structure, queries can be resolved
without accessing the partition's source data. HOLAP does not cause a copy of the
source data to be stored; for queries that access only summary data in the
aggregations of a partition, HOLAP is the equivalent of MOLAP.
Query and processing performance: Query response is frequently slower with
ROLAP storage than with the MOLAP or HOLAP storage modes, and processing
time is also frequently slower with ROLAP. With MOLAP, query response times
can be reduced substantially by using aggregations, although the data in the
partition's MOLAP structure is only as current as the most recent processing of the
partition. With HOLAP, queries that access source data (for example, drilling
down to an atomic cube cell for which there is no aggregation information) must
retrieve data from the relational database and will not be as fast as they would be
if the source data were stored in the MOLAP structure.
OLTP systems are user friendly and can be used by anyone with a basic
understanding of the application.
They allow users to perform operations such as reading, writing, and deleting data
quickly.
Characteristics of OLTP
Following are important characteristics of OLTP:
Basis for comparison – OLTP vs. OLAP: OLTP is an online transactional system
that manages database modification, whereas OLAP is an online analysis and data
retrieval system that supports decision making. OLTP is characterized by a large
number of short online transactions, whereas OLAP works with data consolidated
from OLTP databases.
UNIT-II
For the warehouse there is an acquisition of data: data must be extracted from
multiple, heterogeneous sources, for example databases. The data must be made
consistent within the warehouse, which requires reconciliation of names,
meanings, and domains of data from unrelated sources. The data from the various
sources must also be installed according to the data model of the warehouse.
To provide time-variant data
To store the data as per the data model of the warehouse
Purging the data
To support the updating of the warehouse data
Design considerations
Recognize and analyze the design problem: Designs must perform well under
expected and worst-case conditions. The designer should consider this before
sitting down at the drawing board or CAD terminal. Considerations include: Is it
more economical to build an irregular shape from welded pieces or to cut it from a
plate, with the accompanying waste? Can bending replace a welded joint? Are
preformed sections available? How, when, and how much should the structure be
welded? Can weight be reduced cost-effectively by using welded joints? Will
fewer parts offer equal or better performance?
Optimize layout: When drawing the preliminary design, engineers should plan
layout to reduce waste when the pieces are cut from plate. Preformed beams,
channels, and tubes also may reduce costs without sacrificing quality.
Consider using standard sections and forms: Preformed sections and forms
should be used whenever possible. Specifying standard sections for welding is
usually cheaper than welding many individual parts. In particular, specifying bent
components is preferable to making welded corners.
Select weld-joint design: There are five basic types of joints: butt joints, corner
joints, T-joints, lap joints, and edge joints. In addition, the American Welding
Society recognizes about 80 different types of welding and joining processes.
Each process has its own characteristics and capabilities, so joint design must be
suitable for the desired welding process. In addition, the joint design will affect
access to the weld.
Restrain size and number of welds: Welds should match, not exceed, the
strength of the base metal for full joint efficiency. Over welding is unnecessary,
increases costs, and reduces strength.
Implementation Considerations
I. Access Tools
Currently no single tool in the market can handle all possible data warehouse
access needs. Therefore, most implementations rely on a suite of tools.
As the data warehouse grows, there are at least two options for data placement.
One is to move some of the data warehouse data onto other storage media
(WORM, RAID). The second option is to distribute the data warehouse data
across multiple servers.
1. The ability to identify data in the data source environments that can be
read by the conversion tool is important.
2. Support for flat files (VSAM, ISAM, IDMS) is critical, since the bulk of
the corporate data is still maintained in this type of data storage.
iv. Metadata
The disk storage requirements for a data warehouse will be significantly large,
especially in comparison with a single application.
Thus, hardware with large data storage capacity is essential for data warehousing.
For every data size identified, the disk space provided should be two to three times
that of the data to accommodate processing, indexing, etc.
4. Software tools for building, operating, and using the data warehouse: Not all
data warehouse vendors currently provide comprehensive, single-window
software tools capable of handling all aspects of a data warehousing project
implementation.
Data Pre-processing
Today's real-world databases are highly susceptible to noisy, missing, and
inconsistent data due to their typically huge size and their likely origin from
multiple, heterogeneous sources. Incomplete data can occur for a number of
reasons. Attributes of interest may not always be available, such as customer
information for sales transaction data. Other data may not be included simply
because it was not considered important at the time of entry. Some of the major
reasons for noisy data are:
Data Summarization
Why do we need summarization of data in the mining process? We live in a
digital world where data is transferred in seconds, far faster than human
capability. In the corporate field, employees work on huge volumes of data
derived from different sources such as social networks, media, newspapers, books,
and cloud storage, and this can make it difficult to summarize the data. Sometimes
the data volume is also unexpected, because when you retrieve data from
relational sources you cannot predict how much data will be stored in the
database.
As a result, the data becomes more complex and takes time to summarize. One
solution is to always retrieve data by category, that is, to apply filtering when you
retrieve data. Beyond that, data summarization techniques provide a good-quality
condensed view of the data, from which a customer or user can benefit in their
research.
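As a minimal sketch of category-wise summarization, the following Python/pandas snippet (the source names and record counts are invented for illustration) condenses raw rows into per-source totals instead of scanning every row.

import pandas as pd

# Hypothetical raw records pulled from different sources.
df = pd.DataFrame({
    "source": ["social", "social", "media", "newspaper", "media"],
    "records": [120, 80, 200, 50, 150],
})

# Summarize by category rather than presenting every raw row.
summary = df.groupby("source")["records"].agg(["count", "sum", "mean"])
print(summary)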
Data Cleaning
Data cleaning is the process of cleaning raw data by handling irrelevant and
missing tuples. While working on machine learning projects, the data sets we take
might not be perfect: they may contain many impurities and noisy values, and
often some of the actual data is missing. The major problems we face during data
cleaning are:
1. Missing Values: If it is noted that there are many tuples that have no
recorded value for several attributes, then the missing values can be filled in
for the attribute by various methods described below:
Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This
method is not very effective, unless the tuple contains several attributes
with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many
missing values.
Use a global constant to fill in the missing value: Replace all missing
attribute values with the same constant, such as a label like "Unknown"
or -∞. If missing values are replaced by, say, "Unknown", then the
mining program may mistakenly think that they form an interesting
concept, since they all have a value in common, namely "Unknown".
Hence, although this method is simple, it is not recommended.
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class as
the given tuple.
Use the most probable value to fill in the missing value: This may be
determined with inference-based tools using a Bayesian formalism or
decision tree induction.
Methods 3 to 6 bias the data. The filled-in value may not be correct. Method
6, however, is a popular strategy. In comparison to the other methods, it uses
the most information from the present data to predict missing values.
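A small pandas sketch of three of the strategies above (a global constant, the overall attribute mean, and the class-wise mean); the tiny data set and its column names are assumptions made for illustration only.

import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Use a global constant to fill in the missing value (a sentinel such as -1,
# or a label like "Unknown" for categorical attributes).
const_filled = df["income"].fillna(-1)

# Use the attribute mean to fill in the missing value.
mean_filled = df["income"].fillna(df["income"].mean())

# Use the attribute mean for all samples belonging to the same class as the given tuple.
class_mean_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(const_filled.tolist())
print(mean_filled.tolist())
print(class_mean_filled.tolist())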
This is much faster than having to manually search through the entire
database. The garbage patterns can then be removed from the (training)
database.
4. Regression: Data can be smoothed by fitting the data to a function,
such as with regression. Linear regression involves finding the "best"
line to fit two variables, so that one variable can be used to predict the
other. Multiple linear regression is an extension of linear regression,
where more than two variables are involved and the data are fit to a
multidimensional surface. Using regression to find a mathematical
equation to fit the data helps smooth out the noise.
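A rough sketch of regression-based smoothing with NumPy; the x and y values are invented, and np.polyfit stands in for any least-squares fitting routine.

import numpy as np

# Noisy observations of two related variables (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Find the "best" line y = w*x + b fitting the two variables (least squares).
w, b = np.polyfit(x, y, deg=1)

# Smoothed values: predictions from the fitted line replace the noisy observations.
y_smoothed = w * x + b
print(w, b, y_smoothed)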
Data Transformation
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Consider a concept hierarchy for
the dimension location. City values for location include Vancouver, Toronto, New
York, and Chicago. Each city, however, can be mapped to the province or state to
which it belongs. For example, Vancouver can be mapped to British Columbia,
and Chicago to Illinois. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings
form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
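A concept hierarchy like the one above can be represented as simple nested mappings. The Python sketch below uses the cities mentioned in the text; the generalize helper is illustrative, not a standard API.

# city -> province/state -> country, following the location hierarchy in the text.
city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "United States", "Illinois": "United States"}

def generalize(city, level):
    # Map a low-level city value up the location hierarchy.
    state = city_to_state[city]
    return state if level == "state" else state_to_country[state]

print(generalize("Vancouver", "state"))   # British Columbia
print(generalize("Chicago", "country"))   # United States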
Models are a cornerstone of design. Engineers build a model of a car to work out
any details before putting it into production. In the same manner, system designers
develop models to explore ideas and improve the understanding of the database
design.
A data model is a graphical view of data created for analysis and design purposes.
Data modeling is the detailed design of data warehouse databases. A data model
can be defined as an integrated collection of concepts that describe the structure of
the database, including data types, relationships between data, and constraints that
should apply to the data.
2) Manipulative part: It defines the types of operations that are allowed on the
data. This includes the operations used for updating or retrieving data from the
database and for changing the structure of the database.
Logical measures: With logical measures, cells of the logical cube are filled with
facts collected about an organization's operations or functions. The measures are
organized according to the dimensions, which typically include a time dimension.
Logical dimensions: Dimensions contain a set of unique values that identify and
categorize data. Dimensions represent the different views of an entity that an
organization is interested in. For example, a store will create a sales data
warehouse in order to keep track of store sales with respect to different
dimensions such as time, branch, and location.
Data Cube
For example, a relation with the schema sales(part, supplier, customer, sale-price)
can be materialized into a set of eight views, where psc indicates a view consisting
of aggregate function values (such as total sales) computed by grouping the three
attributes part, supplier, and customer; p indicates a view composed of the
corresponding aggregate function values calculated by grouping part alone; and so
on.
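The 2^3 = 8 group-by views of sales(part, supplier, customer) can be enumerated mechanically. The following is a hedged sketch using pandas on an invented sales relation, not the materialization strategy of any particular warehouse product.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part": ["p1", "p1", "p2"],
    "supplier": ["s1", "s2", "s1"],
    "customer": ["c1", "c1", "c2"],
    "sale_price": [10.0, 20.0, 30.0],
})

dims = ["part", "supplier", "customer"]
cuboids = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        if group:  # views such as psc, ps, pc, sc, p, s, c
            cuboids[group] = sales.groupby(list(group))["sale_price"].sum()
        else:      # the apex view: total sales with no grouping
            cuboids[group] = sales["sale_price"].sum()

print(len(cuboids))  # 8 views in total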
Example: In the 2-D representation, we look at the All Electronics sales data for
items sold per quarter in the city of Vancouver. The measure displayed is dollars
sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For
example, suppose we would like to view the data according to time and item, as
well as location, for the cities Chicago, New York, Toronto, and Vancouver. The
measure displayed is dollars sold (in thousands). Such 3-D data can be represented
as a series of 2-D tables.
Following are the 3 chief types of multidimensional schemas, each having its
unique advantages.
Star Schema
Snowflake Schema
Galaxy Schema
1. Star Schema
In the STAR Schema, the center of the star can have one fact table and a number
of associated dimension tables. It is known as star schema as its structure
resembles a star. The star schema is the simplest type of Data Warehouse schema.
It is also known as Star Join Schema and is optimized for querying large data sets.
In the following example, the fact table is at the center and contains keys to every
dimension table, such as Dealer_ID, Model_ID, Date_ID, Product_ID, and
Branch_ID, along with other attributes like units sold and revenue.
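A minimal star-schema sketch in Python using the standard sqlite3 module. The dimension and measure names roughly follow the example above, but the exact tables, keys, and figures are assumptions made for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY, branch_name  TEXT);
-- The central fact table holds only keys to the dimensions plus the measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Model-A"), (2, "Model-B")])
con.executemany("INSERT INTO dim_branch VALUES (?, ?)", [(10, "North"), (20, "South")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 10, 5, 500.0), (2, 10, 3, 450.0), (1, 20, 7, 700.0)])

# A typical star join: the fact table joined directly to each dimension table.
query = """
SELECT p.product_name, b.branch_name, SUM(f.units_sold), SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_branch  b ON f.branch_id  = b.branch_id
GROUP BY p.product_name, b.branch_name
"""
for row in con.execute(query):
    print(row)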
2. Snowflake Schema
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to implement when a dimension is added to the schema.
Due to the multiple tables, query performance is reduced.
The primary challenge of the snowflake schema is that it requires more
maintenance effort because of the larger number of lookup tables.
A GALAXY SCHEMA contains two fact tables that share dimension tables
between them. It is also called a Fact Constellation Schema. The schema is viewed
as a collection of stars, hence the name Galaxy Schema.
As you can see in the above example, there are two fact tables:
1. Revenue
2. Product
The dimensions in this schema are split into separate dimensions based on the
various levels of the hierarchy.
For example, if geography has four levels of hierarchy like region, country,
state, and city then Galaxy schema should have four dimensions.
Moreover, it is possible to build this type of schema by splitting one star schema
into multiple star schemas.
The number of dimensions in this schema is large, since they are built based on
the levels of the hierarchy.
This schema is helpful for aggregating fact tables for better understanding.
Data Warehouse applications are designed to support the user ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting, and
trend analysis.
Data warehouses and their architectures vary depending upon the elements of an
organization's situation.
Operational System
Flat Files
A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding
and working with particular instances of data easier. For example, author, date
created, date modified, and file size are examples of very basic document
metadata.
This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into
the warehouse.
We must clean and process the operational data before putting it into the
warehouse.
A staging area simplifies data cleansing and consolidation for operational method
coming from multiple source systems, especially for enterprise data warehouses
where all relevant data of an enterprise is consolidated.
We may want to customize our warehouse's architecture for multiple groups within
our organization.
The figure illustrates an example where purchasing, sales, and stocks are separated.
In this example, a financial analyst wants to analyze historical data for purchases
and sales or mine historical information to make predictions about customer
behavior.
A data warehouse is a single data repository where a record from multiple data
sources is integrated for online business analytical processing (OLAP). This
implies a data warehouse needs to meet the requirements from all the business
stages within the entire organization. Thus, data warehouse design is a hugely
complex, lengthy, and hence error-prone process. Furthermore, business analytical
functions change over time, which results in changes in the requirements for the
systems. Therefore, data warehouse and OLAP systems are dynamic, and the
design process is continuous.
1. "top-down" approach
2. "bottom-up" approach
The advantage of the "bottom-up" design approach is that it has quick ROI, as
developing a data mart, a data warehouse for a single subject, takes far less time
and effort than developing an enterprise-wide data warehouse. Also, the risk of
failure is even less. This method is inherently incremental. This method allows the
project team to learn and grow.
The locations of the data warehouse and the data marts are reversed in the
bottom-up approach design.
Top-down approach: breaks the vast problem into smaller subproblems.
Bottom-up approach: solves the essential low-level problems and integrates them
into a higher one.
From the architecture point of view, there are three data warehouse models:
1. Enterprise warehouse
2. Data marts
3. Virtual warehouse
Data Marts: A data mart consists of a subset of corporate-wide data that is of
value to a specific group of users. Its scope is restricted to specific selected
subjects, and the data contained in a data mart tends to be summarized.
OLAP
For example, a user can request that data be analyzed to display a spreadsheet
showing all of a company's beach ball products sold in Florida in the month of
July, compare revenue figures with those for the same products in September and
then see a comparison of other product sales in Florida in the same time period.
To facilitate this kind of analysis, data is collected from multiple data sources and
stored in data warehouses then cleansed and organized into data cubes.
Each OLAP cube contains data categorized by dimensions (such as customers,
geographic sales region, and time period) derived from dimension tables in the
data warehouse. Dimensions are then populated by members (such as customer
names, countries, and months) that are organized hierarchically. OLAP cubes are
often pre-summarized across dimensions to drastically improve query time over
relational databases.
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
Roll-up can be performed by:
1. Reducing dimensions
2. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping
things based on their order or level.
In this example, the cities New Jersey and Los Angeles are rolled up into the
country USA.
The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively;
they become 2000 after the roll-up.
In this aggregation process, the data in the location hierarchy moves up from city
to country.
In the roll-up process at least one or more dimensions need to be removed. In this
example, the Quarter dimension is removed (see the sketch below).
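A minimal pandas sketch of this roll-up, using the city figures quoted above (440 + 1560 = 2000); the data frame layout itself is an assumption made for illustration.

import pandas as pd

# City-level sales from the example above.
sales = pd.DataFrame({
    "country": ["USA", "USA"],
    "city": ["New Jersey", "Los Angeles"],
    "amount": [440, 1560],
})

# Roll-up: climb the location hierarchy from city to country,
# dropping the city (and, in the text's example, the Quarter) dimension.
rolled_up = sales.groupby("country")["amount"].sum()
print(rolled_up)  # USA    2000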
2) Drill-down
In drill-down data is fragmented into smaller parts. It is the opposite of the rollup
process. It can be done via
3) Slice:
Dice:
This operation is similar to a slice. The difference in dice is you select 2 or more
dimensions that result in the creation of a sub-cube.
4) Pivot
In Pivot, you rotate the data axes to provide a substitute presentation of data.
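The slice, dice, and pivot operations can be imitated on a flat table with pandas; the sketch below uses an invented mini-cube of location, item, and quarter, so the column names and values are assumptions.

import pandas as pd

cube = pd.DataFrame({
    "location": ["East", "East", "West", "West"],
    "item": ["Phone", "TV", "Phone", "TV"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "sales": [100, 150, 200, 250],
})

# Slice: select a single value of one dimension.
q1_slice = cube[cube["quarter"] == "Q1"]

# Dice: select values on two or more dimensions, producing a sub-cube.
sub_cube = cube[(cube["quarter"] == "Q1") & (cube["location"] == "East")]

# Pivot: rotate the axes so items become rows and locations become columns.
pivoted = cube.pivot_table(index="item", columns="location", values="sales", aggfunc="sum")

print(q1_slice, sub_cube, pivoted, sep="\n\n")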
ROLAP
ROLAP works with data that exists in a relational database. Facts and dimension
tables are stored as relational tables. It also allows multidimensional analysis of
data and is the fastest-growing type of OLAP.
MOLAP
Hybrid OLAP
Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers fast computation
of MOLAP and higher scalability of ROLAP. HOLAP uses two databases.
This kind of OLAP helps to economize the disk space, and it also remains
compact which helps to avoid issues related to access speed and
convenience.
HOLAP uses cube technology, which allows faster performance for all types of
data. ROLAP data is updated instantly, and HOLAP users have access to this
real-time, instantly updated data. MOLAP brings cleaning and conversion of data,
thereby improving data relevance. This brings the best of both worlds.
Advantages of OLAP
Disadvantages of OLAP
OLAP software then locates the intersection of dimensions, such as all products
sold in the Eastern region above a certain price during a certain time period, and
displays them. The result is the "measure"; each OLAP cube has at least one to
perhaps hundreds of measures, which are derived from information stored in fact
tables in the data warehouse.
OLAP (online analytical processing) systems typically fall into one of three types:
To facilitate efficient data access, most data warehouse systems support index
structures and materialized views. Two indexing techniques that are popular for
OLAP data are:
Bitmap Indexing
Join Indexing
1) Bitmap Indexing
A bitmap index is a very efficient method for storing sparse data columns.
Sparse data columns are ones which contain data values drawn from a very
small set of possibilities.
In the bitmap index for a given attribute, there is a distinct bit vector for
each value V in the domain of the attribute.
If the domain of the attribute consists of n values, then n bits are needed for
each entry in the bitmap index.
The length of each bit vector is equal to the number of records in the base
table (see the sketch below).
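A plain-Python sketch of how such bit vectors could be built for one low-cardinality attribute; the region values are invented, and real systems store the vectors in compressed form.

# One entry per record of a hypothetical base table.
region = ["East", "West", "East", "North", "West"]

# One distinct bit vector per value V in the attribute's domain.
bitmap = {value: [1 if r == value else 0 for r in region] for value in set(region)}

# Each bit vector is as long as the number of records in the base table.
print(bitmap["East"])   # [1, 0, 1, 0, 0]

# Selections become fast bitwise operations, e.g. region = 'East' OR region = 'West'.
east_or_west = [a | b for a, b in zip(bitmap["East"], bitmap["West"])]
print(east_or_west)     # [1, 1, 1, 0, 1]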
2) Join Indexing
The join indexing method gained popularity from its use in relational database
query processing.
Consider two relations R(RID, A) and S(B, SID) that join on attributes A and B.
Then the join index contains the pairs (RID, SID), where RID and SID are
record identifiers from the R and S relations (see the sketch below).
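A small sketch of building such a join index in plain Python; the relations R and S are invented, and a real system would of course compute and store the pairs inside the database.

# R(RID, A) and S(B, SID) join on A = B; the join index keeps the (RID, SID) pairs.
R = [(1, "x"), (2, "y"), (3, "x")]     # (RID, A)
S = [("x", 10), ("z", 20), ("y", 30)]  # (B, SID)

join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]
print(join_index)  # [(1, 10), (2, 30), (3, 10)]

# Answering the join later only requires looking up the precomputed record-identifier pairs.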
Querying in OLAP
OLAP is a database technology that has been optimized for querying and
reporting, instead of processing transactions. The source data for OLAP is online
transactional processing (OLTP) databases that are commonly stored in data
warehouses.
Online analytical mining (OLAM) integrates online analytical processing (OLAP)
and data mining. It represents a promising direction for mining large databases
and data warehouses.
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − The data mining tools are
required to work on integrated, consistent, and cleaned data. These steps are
very costly in the preprocessing of data. The data warehouses constructed
by such preprocessing are valuable sources of high quality data for OLAP
and data mining as well.
Available information processing infrastructure surrounding data
warehouses − Information processing infrastructure refers to accessing,
integration, consolidation, and transformation of multiple heterogeneous
databases, web-accessing and service facilities, reporting and OLAP
analysis tools.
OLAP−based exploratory data analysis − Exploratory data analysis is
required for effective data mining. OLAM provides facility for data mining
on various subset of data and at different levels of abstraction.
Online selection of data mining functions − Integrating OLAP with
multiple data mining functions and online analytical mining provide users
with the flexibility to select desired data mining functions and swap data
mining tasks dynamically.
OLAM Architecture
An OLAM engine can perform multiple data mining tasks, such as concept
description, association, classification, prediction, clustering, and time-series
analysis. Therefore, it usually consists of multiple integrated data mining modules,
making it more sophisticated than an OLAP engine. There is no fundamental
difference between the data cube required for OLAP and that required for OLAM,
although OLAM analysis might require more powerful data cube construction and
accessing tools.
Precomputation of data cubes can reduce the response time and enhance the
performance of online analytical processing. However, such computation is
challenging since it may require large computational time and storage space. This
section explores efficient methods for data cube computation.
1. Partition the array into chunks: A chunk is a subcube that is small
enough to fit into the memory available for cube computation. Chunking is a
method for dividing an n-dimensional array into small n-dimensional
chunks, where each chunk is stored as an object on disk. The chunks are
compressed so as to remove wasted space resulting from empty array cells.
2. Compute aggregates by visiting cube cells: The order in which cells are
visited can be optimized so as to minimize the number of times each cell
must be revisited, thereby reducing memory access and storage costs.
BUC (Bottom-Up Construction) is an algorithm for the computation of sparse
and iceberg cubes.
Unlike MultiWay, BUC constructs the cube from the apex cuboid towards the
base cuboid. This allows BUC to share data partitioning costs.
This representation of a lattice of cuboids, with the apex at the top and the
base at the bottom, is commonly accepted in data warehousing. It
consolidates the notions of drill-down and roll-up.
Star Cubing: computing iceberg cubes using a dynamic star-tree structure
Star Cubing integrates top-down and bottom-up cube computation and
explores multidimensional aggregation.
It operates from a data structure called a star tree, which performs lossless
data compression, thereby reducing the computation time and memory
requirements.
A key idea behind Star Cubing is the concept of shared dimensions, which is
used to build up this integration.
The order of computation is from the base cuboid upwards towards the apex
cuboid. This order of computation is similar to that of MultiWay.
A data cube may have a large number of cuboids, and each cuboid may contain a
large number of cells. With such an extremely large space, it becomes a burden for
users to just browse a cube. Tools need to be developed to assist users in
intelligently exploring the huge aggregated space of a data cube.
This approach considers variations and patterns in the measure's value across
all of the dimensions to which a cell belongs.
Visual cues, such as background color, are used to reflect the degree of
exception of each cell, based on the pre-computed exception indicators.
3. PathExp: This indicates the degree of surprise for each drill-down path
from the cell.
The data cube approach can be considered a data warehouse-based,
precomputation-oriented, materialized approach.
On the other hand, the attribute-oriented induction approach, at least in its initial
proposal, is a relational database query-oriented, generalization-based, online data
analysis technique.
Some aggregations in the data cube can be computed online, while offline
precomputation of multidimensional space can speed up attribute-oriented
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer; data mining should more appropriately have been named knowledge
mining, which emphasizes mining from large amounts of data. It is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems. The overall goal of the data
mining process is to extract information from a data set and transform it into an understandable
structure for further use. The key properties of data mining are: automatic discovery of patterns,
prediction of likely outcomes, creation of actionable information, and a focus on large data sets
and databases.
The Scope of Data Mining Data mining derives its name from the similarities between searching
for valuable business information in a large database — for example, finding linked products in
gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both
processes require either sifting through an immense amount of material, or intelligently probing
it to find exactly where the value resides. Given databases of sufficient size and quality, data
mining technology can generate new business opportunities by providing these capabilities.
Tasks of Data Mining Data mining involves six common classes of tasks:
together and use this information for marketing purposes. This is sometimes referred to as
market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error.
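A hedged sketch of the classification and regression tasks using scikit-learn; the tiny feature values (number of links and number of suspicious words per e-mail, and the x/y pairs for regression) are invented for illustration.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: label new e-mails as spam (1) or legitimate (0)
# from two made-up features (number of links, number of suspicious words).
X_train = [[0, 0], [1, 0], [5, 4], [7, 6]]
y_train = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[6, 5]]))   # -> [1], i.e. classified as spam

# Regression: find a function that models the data with the least error.
X = [[1], [2], [3], [4]]
y = [2.1, 3.9, 6.2, 8.1]
reg = LinearRegression().fit(X, y)
print(reg.predict([[5]]))      # prediction for a new input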
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds,
and metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures that interact with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
4. User interface:
This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to help
focus the search, and performing exploratory data mining based on the intermediate data mining
results. In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
The data mining process typically consists of the following steps.
1. State the problem and formulate the hypothesis: Most data-based modeling studies are
performed in a particular application domain. Hence, domain-specific knowledge and experience
are usually necessary to come up with a meaningful problem statement. Unfortunately, many
application studies tend to focus on the data-mining technique at the expense of a clear problem
statement. In this step, a modeler usually specifies a set of variables for the unknown dependency
and, if possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the combined expertise of the application domain and of data mining. In practice, it usually means a close
interaction between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire
data-mining process.
2. Collect the data: This step is concerned with how the data are generated and collected. In
general, there are two distinct possibilities. The first is when the data-generation process is under
the control of an expert (modeler): this approach is known as a designed experiment. The second
possibility is when the expert cannot influence the data- generation process: this is known as the
observational approach. An observational setting, namely, random data generation, is assumed in
most data-mining applications. Typically, the sampling distribution is completely unknown after
data is collected, or it is partially and implicitly given in the data-collection procedure. It is very
important, however, to understand how data collection affects its theoretical distribution, since
such a priori knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model and the data
used later for testing and applying a model come from the same, unknown, sampling distribution.
If this is not the case, the estimated model cannot be successfully used in a final application of
the results.
3. Preprocessing the data: In the observational setting, data are usually "collected" from the
existing databases, data warehouses, and data marts. Data preprocessing usually includes at least
two common tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with
most observations. Commonly, outliers result from measurement errors, coding and recording
errors, and, sometimes, are natural, abnormal values. Such nonrepresentative samples can
seriously affect the model produced later. There are two strategies for dealing with outliers: a.
Detect and eventually remove outliers as a part of the preprocessing phase, or b. Develop robust
modeling methods that are insensitive to outliers.
2. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as
variable scaling and different types of encoding. For example, one feature with the range [0, 1]
and the other with the range [−100, 1000] will not have the same weights in the applied
technique; they will also influence the final data-mining results differently. Therefore, it is
recommended to scale them and bring both features to the same weight for further analysis. Also,
application-specific encoding methods usually achieve dimensionality reduction by providing a
smaller number of informative features for subsequent data modeling. These two classes of
preprocessing tasks are only illustrative examples of a large spectrum of preprocessing activities
in a data-mining process. Data-preprocessing steps should not be considered completely
independent from other data-mining phases. In every iteration of the data-mining process, all
activities, together, could define new and improved data sets for subsequent iterations. Generally,
a good preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and encoding.
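As a rough illustration of the two preprocessing tasks above, the following Python sketch (NumPy assumed; the feature values are hypothetical) removes z-score outliers and then applies min-max scaling.

import numpy as np

data = np.array([2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 1.8, 2.3, 2.0, 50.0])  # one feature

# 1. Outlier detection and removal: flag values whose z-score is unusually large.
z = (data - data.mean()) / data.std()
clean = data[np.abs(z) < 2.5]        # drops the nonrepresentative value 50.0

# 2. Scaling: bring the remaining values into the range [0, 1] (min-max scaling),
#    so that features measured on different ranges carry comparable weight.
scaled = (clean - clean.min()) / (clean.max() - clean.min())
print(scaled)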
4. Estimate the model: The selection and implementation of the appropriate data-mining
technique is the main task in this phase. This process is not straightforward; usually, in practice,
the implementation is based on several models, and selecting the best one is an additional task.
5. Interpret the model and draw conclusions: In most cases, data-mining models should help
in decision making. Hence, such models need to be interpretable to be useful because humans are
not likely to base their decisions on complex "black box" models. Note that the goals of accuracy
of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple
models are more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, which is also very important, is considered a separate task, with specific techniques to validate the results. A user does not want hundreds of pages of numeric results. He
does not understand them; he cannot summarize, interpret, and use them for successful decision
making.
Major issues and challenges in data mining include the following.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge. - To guide the discovery process and to express the
discovered patterns, background knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle the
noise, incomplete objects while mining the data regularities. If data cleaning methods are not
there then the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. A pattern is uninteresting if it represents common knowledge or lacks novelty, so the system should evaluate and present only patterns that are interesting to the user.
Efficiency and scalability of data mining algorithms. - To effectively extract information from
huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - The factors such as huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without having to mine the entire data again from scratch.
Association rule mining is a technique used to find interesting and frequent patterns in transactional, spatial, temporal, or other databases and to establish associations or relations among these patterns (also known as itemsets) in order to discover knowledge.
Suppose, as manager of a branch of the HTC company, you would like to learn more about the buying habits of your customers. For example, you may want to know which groups or sets of items customers are likely to purchase on a given trip to the store.
To answer your question, market basket analysis may be performed on the retail
data of customer transactions at your store. The result may be used to plan
marketing or advertising strategies, as well as catalogue design. For instance,
market basket analysis may help managers design different store layouts. In one
strategy, items that are frequently purchased together can be placed in proximity to
further encourage the sale of such items together. If customers who purchase
computers also tend to buy cell phones at the same time, then placing the computer
display close to the cell phone display may help to increase the sales of both items.
In an alternative strategy, placing computers and cell phones at the opposite ends
of the store may attract the customers who purchase such items to pick up other
items along the way.
Market basket analysis can also help retailers to plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
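As a small numeric illustration, the sketch below (plain Python, with made-up transactions) computes the support and confidence of the rule computer => cell phone, the two measures commonly reported for such associations.

transactions = [
    {"computer", "cell phone"},
    {"computer", "printer"},
    {"computer", "cell phone", "printer"},
    {"cell phone"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "cell phone"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / computer    # of the transactions with a computer, how many also have a cell phone
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 40%, confidence = 67%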
Market basket analysis is just one form of association rule mining. In fact, there
are many kinds of association rules. Association rules can be classified in
various ways, based on the following criteria.
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent item sets, an
important property is used called Apriori property which helps by reducing the
search space.
Apriori Property
All non-empty subsets of a frequent itemset must themselves be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (the Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Before working through the algorithm, recall the definitions of support count, candidate set, and frequent itemset.
Consider the following transaction dataset; we will find the frequent itemsets and generate association rules for them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove it. This gives us the itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets must have (K-2) elements in common.
Check whether all subsets of each itemset are frequent, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the support count of each candidate in C2 with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us the itemset L2.
Step-3: K=3
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets must have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
Find the support count of the remaining itemsets by searching the dataset.
Step-4: K=4
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets must have (K-2) elements in common, so here, for L3, the first two elements (items) should match.
Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5}, which is not frequent.) So there is no frequent itemset in C4.
We stop here because no further frequent itemsets are found.
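A compact, illustrative implementation of this level-wise procedure is sketched below in plain Python. The transaction list and min_support are stand-ins (the original table from the notes is not reproduced here); the join and prune steps follow the Apriori property described above.

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support = 2

def support(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets (candidate set C1 filtered by min_support).
items = {i for t in transactions for i in t}
L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 2
while L[-1]:
    # Join step: combine frequent (k-1)-itemsets that share k-2 items.
    candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
    # Keep only the candidates that meet the minimum support count.
    L.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, itemsets in enumerate(L[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in itemsets])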
Multilevel association means mining the data in different levels. For many
applications, it is difficult to find strong associations among data items at low level
of abstraction due to the sparsity of data in multidimensional space.
Strong associations discovered at high concept levels may represent common-sense knowledge. However, what is common sense to one user may seem new or novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction.
The uniform support approach, however, has some difficulties. It is unlikely that
items at lower levels of abstraction will occur as frequently as those at higher
levels of abstraction.
For example, using reduced minimum support at lower levels, the minimum support thresholds for levels 1 and 2 in the figure are 6% and 4%, respectively. In this way, “computer”, “laptop computer”, and “desktop computer” are all considered frequent.
UNIT-IV
I. Overview of classification
Classification is a form of data analysis that extracts models describing
important data classes. Such models, called classifiers, predict categorical
(discrete, unordered) class labels. For example, we can build a classification
model to categorize bank loan applications as either safe or risky. Such
analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine
learning, pattern recognition, and statistics. Most algorithms are memory
resident, typically assuming a small data size. Recent data mining research
has been built on such work, developing scalable classification and
prediction techniques capable of handling large amounts of disk-resident
data. Classification has numerous applications, including fraud detection,
target marketing, performance prediction, manufacturing, and medical
diagnosis.
Data classification is a two-step process: a learning step, in which a model or classifier is constructed to predict class (categorical) labels, and a classification step, in which the model is used to predict class labels for given data.
The data classification process: (a) Learning: Training data are analyzed by
a classification algorithm. Here, the class label attribute is loan decision, and
the learned model or classifier is represented in the form of classification
rules. (b) Classification: Test data are used to estimate the accuracy of the
classification rules. If accuracy is considered acceptable, the rules can be
applied to the classification of new data tuples.
The accuracy of a classifier on a given test set is the percentage of test set
tuples that are correctly classified by the classifier.
A decision tree for the concept buys a computer, indicating whether an All
Electronics customer is likely to purchase a computer. Each internal (non-leaf)
node represents a test on an attribute. Each leaf node represents a class (either buys
computer = yes or buys computer = no).
The left branch leads to a node that represents the “age” attribute. If the person’s
age is less than or equal to 30, the decision tree follows the left branch, and if the
age is greater than 30, the decision tree follows the right branch. The right branch
leads to a leaf node that predicts that the person is unlikely to buy a new car.
The left branch leads to another node that represents the “education” attribute. If
the person’s education level is less than or equal to high school, the decision tree
follows the left branch, and if the education level is greater than high school, the
decision tree follows the right branch. The left branch leads to a leaf node that
predicts that the person is unlikely to buy a new car. The right branch leads to
another node that represents the “credit score” attribute. If the person’s credit score
is less than or equal to 650, the decision tree follows the left branch, and if the
credit score is greater than 650, the decision tree follows the right branch. The left
branch leads to a leaf node that predicts that the person is unlikely to buy a new
car. The right branch leads to a leaf node that predicts that the person is likely to
buy a new car.
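To see a classifier like this in code, here is an illustrative sketch using scikit-learn (assumed available); the feature encoding, the training tuples, and their labels are all invented for demonstration and do not come from the notes.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income level (0=low, 1=medium, 2=high), student (0=no, 1=yes)]
X = [[25, 2, 0], [30, 2, 0], [35, 1, 0], [42, 0, 0],
     [51, 0, 1], [23, 1, 1], [40, 1, 0], [60, 2, 1]]
y = ["no", "no", "yes", "yes", "yes", "yes", "no", "yes"]   # buys_computer label

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "student"]))  # learned rules
print(tree.predict([[28, 1, 1]]))   # classify a new, previously unseen tuple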
Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split. The attribute
selection measure provides a ranking for each attribute describing the given
training tuples.
Info(D) = - sum_{i=1..m} p_i * log2(p_i), where p_i is the probability that an arbitrary tuple in D belongs to class C_i. Info(D) is the average amount of information (entropy) needed to identify the class label of a tuple in D.
The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable.
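The sketch below works through these measures in Python for a hypothetical set D of 14 training tuples (9 "yes" and 5 "no") split three ways by some attribute A; the counts are illustrative and are not taken from the notes.

from math import log2

def info(counts):
    """Expected information (entropy): Info(D) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info([9, 5])                       # entropy of the whole set D

# Suppose attribute A partitions D into subsets with these class counts:
partitions = [[2, 3], [4, 0], [3, 2]]       # e.g. three values of A
n = 14
info_A = sum(sum(p) / n * info(p) for p in partitions)   # expected info after the split
gain = info_D - info_A                      # information gain of A

split_info = info([sum(p) for p in partitions])          # split information of A
gain_ratio = gain / split_info              # gain normalized by split information
print(round(info_D, 3), round(gain, 3), round(gain_ratio, 3))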
True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
True negatives(TN): These are the negative tuples that were
correctly labeled by the classifier. Let TN be the number of true
negatives.
False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys computer =
no for which the classifier predicted buys computer = yes). Let FP be
the number of false positives.
False negatives (FN): These are the positive tuples that were
mislabeled as negative (e.g., tuples of class buys computer = yes for
which the classifier predicted buys computer = no). Let FN be the
number of false negatives.
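From these four counts the usual evaluation measures follow directly; a short Python sketch with made-up counts:

TP, TN, FP, FN = 90, 85, 15, 10   # hypothetical counts from a test set

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # fraction of correctly labeled tuples
precision   = TP / (TP + FP)                    # how many predicted positives are real
recall      = TP / (TP + FN)                    # how many real positives were found
specificity = TN / (TN + FP)                    # how many real negatives were found

print(accuracy, precision, recall, specificity)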
Holdout Method
In the holdout method, the given data are randomly partitioned into two independent sets: a training set and a test set. If the model fits the training set much better than it fits the test set, then overfitting is probably the cause. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set.
Random Subsampling
Random subsampling is a variation of the holdout method in which the holdout method is repeated K times.
Each repetition randomly splits the data into a training set and a test set.
The model is trained on the training set, and the mean squared error (MSE) is obtained from its predictions on the test set.
Because the MSE depends on the particular split, relying on a single split is not recommended; a new split can give a different MSE, which is why the procedure is repeated and the results averaged.
Cross-Validation
K-fold cross-validation is used when only a limited amount of data is available, to achieve an unbiased estimate of the model's performance.
Here, we divide the data into K subsets of equal size.
We build the model K times, each time leaving out one of the subsets from training and using it as the test set.
If K equals the sample size, this is called "leave-one-out" cross-validation.
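A minimal sketch of K-fold splitting in Python (NumPy assumed; the sample size and K are arbitrary):

import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

folds = k_fold_indices(n_samples=12, k=3)
for i, test_idx in enumerate(folds):
    # Train on the other k-1 folds, test on the held-out fold.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train on {len(train_idx)} samples, test on {len(test_idx)} samples")
    # ... fit the model on train_idx here and evaluate it on test_idx ...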
Bootstrapping
Bootstrapping is a technique for making estimates from the data by averaging the estimates obtained from smaller, resampled data samples.
The bootstrapping method involves iteratively resampling a dataset with replacement.
With resampling, instead of estimating a statistic only once on the complete data, we can estimate it many times.
Repeating this multiple times helps to obtain a vector of estimates.
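An illustrative bootstrap in Python (NumPy assumed; the data values and the number of resamples are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])   # toy sample

estimates = []
for _ in range(1000):
    sample = rng.choice(data, size=len(data), replace=True)  # resample with replacement
    estimates.append(sample.mean())                          # statistic on the resample

print(np.mean(estimates), np.std(estimates))   # averaged estimate and its spread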
Data Preprocessing
To make sure that all features are scaled comparably, normalize the data. Techniques like min-max normalization, z-score normalization, or log transformation can be used for this.
Feature Selection
Model Selection
Hyperparameter Tuning
Imbalanced Data
X. Introduction to Clustering
Clustering is the process of grouping a set of data objects into multiple
groups or clusters so that objects within a cluster have high similarity but are
very dissimilar to objects in other clusters. Dissimilarities and similarities
are assessed based on the attribute values describing the objects and often
involve distance measures. Clustering as a data mining tool has its roots in
many application areas such as biology, security, business intelligence, and
Web search.
Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets. Each subset is a cluster, such that
objects in a cluster are like one another, yet dissimilar to objects in other
clusters. The set of clusters resulting from a cluster analysis can be referred
to as a clustering. In this context, different clustering methods may generate
different clustering on the same data set. The partitioning is not performed
by humans, but by the clustering algorithm. Hence, clustering is useful in
that it can lead to the discovery of previously unknown groups within the
data.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
In business intelligence, clustering can be used to organize many customers
into groups, where customers within a group share strongly similar characteristics. This facilitates the development of business strategies for
enhanced customer relationship management. Moreover, consider a
consultant company with many projects. To improve project management,
clustering can be applied to partition projects into categories based on
similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be conducted effectively.
In hierarchical clustering, once a group is split or merged, the step can never be undone, because it is a rigid and inflexible method. Two approaches that can be used to improve the quality of hierarchical clustering in data mining are:
One should carefully analyze the linkages of the object at every partitioning
of hierarchical clustering.
One can integrate hierarchical agglomeration with other clustering approaches. In this approach, first, the objects are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters.
In k-means clustering, the cluster centers (the “points” mentioned above) are called means because they are the mean values of the items assigned to them.
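A short scikit-learn sketch (library assumed available, data invented) showing that the reported cluster centers are exactly these per-cluster means:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],    # one tight group of points
              [8.0, 8.0], [8.5, 8.0], [9.0, 8.5]])   # another tight group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # each center is the mean of the points assigned to it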
eps: The radius that defines the neighborhood of a data point. If the distance between two points is at most eps, they are considered neighbors and may fall in the same cluster. One way to find the eps value is based on the k-distance graph.
MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1, and MinPts should be at least 3.
Find all the neighboring points within eps of each point and identify the core points, i.e., the points with more than MinPts neighbors.
For each core point if it is not already assigned to a cluster,
create a new cluster.
Recursively find all its density-connected points and assign
them to the same cluster as the core point.
Points a and b are said to be density connected if there exists a point c that has enough points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, d is a neighbor of e, and e in turn is a neighbor of a, then b is density connected to a.
Iterate through the remaining unvisited points in the
dataset. Those points that do not belong to any cluster are
noise.
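A brief usage sketch with scikit-learn's DBSCAN (library assumed available; eps, min_samples, and the points are illustrative), where min_samples plays the role of the MinPts parameter described above:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.2], [0.9, 1.1],   # dense region A
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.2], [4.9, 5.1],   # dense region B
              [9.0, 1.0]])                                        # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # two clusters (0 and 1); the label -1 marks the noise point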
1. Tableau
Tableau is a data visualization tool that can be used by data analysts, scientists,
statisticians, etc. to visualize the data and get a clear opinion based on the data
analysis. Tableau is popular because it can take in data and produce the required data visualization output in a very short time, while providing a high level of security and addressing security issues as soon as they arise or are reported by users.
Tableau also allows its users to prepare, clean, and format their data and then
create data visualizations to obtain actionable insights that can be shared with other
users. Tableau is available for individual data analysts or at scale for business
teams and organizations. It provides a 14-day free trial followed by the paid
version.
2. Looker
Looker is a data visualization tool that can go in-depth into the data and analyze it
to obtain useful insights. It provides real-time dashboards of the data for more in-
depth analysis so that businesses can make instant decisions based on the data
visualizations obtained. Looker also provides connections with Redshift, Snowflake, and BigQuery, as well as support for more than 50 SQL dialects, so you can connect to multiple databases without any issues.
Looker data visualizations can be shared with anyone using any tool. Also, you can
export these files in any format immediately. It also provides customer support
wherein you can ask any question and it shall be answered. A price quote can be
obtained by submitting a form.
3. Microsoft Power BI
Microsoft Power BI is Microsoft's business analytics and data visualization platform for creating interactive reports and dashboards. It is available as a free desktop application and as a paid subscription tier, Power BI Pro. It also provides multiple support systems such as FAQs, forums, and live chat support with the staff.