
BI Framework for Data Warehousing

The document outlines the Business Intelligence (BI) component framework, which consists of three layers: Business, Administration and Operation, and Implementation. It details the importance of a sound architecture to address business needs, the roles of casual and power users, and various BI applications and analyses. Additionally, it describes the roles and responsibilities of BI program and project teams in executing BI strategies within organizations.

Unit 3: BI Definitions and Concepts

BI Component Framework

• A sound architecture in an organization / enterprise is necessary to support its technical,
functional and data needs.
• A good architecture helps the organization respond better to the business questions and
queries of its users.
• The BI component framework is divided into three layers:
1. Business layer
2. Administration and Operation layer
3. Implementation layer

Business layer

1. Business Requirements: This is the basic step, which includes identifying requirements, setting
targets and preparing proper plans to achieve the targets. It consists of the following
components:

PTES BCA, BELGAUM Page 1



• Business drivers: These are factors that initiate the need to act. Ex: changing
labour laws, changing economy, changing technology etc.
• Business goals: These are the targets to be achieved in response to business
drivers. Ex: increased productivity, improved market share, improved profits,
good customer satisfaction, cost reduction etc.
• Business strategies: These are the planned actions that will achieve the set goals.
Ex: outsourcing, global delivery model, customer and employee retention
programs, competitive pricing etc.
2. Business Value: Putting the planned strategy into action has a cost in the form of money,
time, effort, information etc. The value finally delivered should justify the cost
involved. Business value is measured in terms of ROI, ROA, TCO and TVO.
• Return on Investment (ROI): It is a performance measure used to evaluate the
efficiency (benefits) of an investment relative to its cost. Ex: a company invests 10% of
its daily revenue in social media to get new clients and increase its prospects. The gain
from the new clients, compared with the amount spent, is the ROI from social media.
• Return on Asset (ROA): It is the earning generated from invested capital or
assets. Ex: If a company’s net income is 1 million and its assets are 5 million
then ROA= (1/5) *100 = 20%
• Total cost of ownership (TCO): It is the purchase price of an asset plus the
costs of operation. Ex: The TCO of a car is not only the purchase price but also
the expenses incurred through its use, such as repairs, insurance and fuel.
• Total value of ownership (TVO): It denotes the total financial value of a service
or product plus some subcategories like stock, undistributed dividends etc.
TVO = Total assets – total liabilities
3. Program management:
• It is the component that ensures people, projects and priorities work in a manner in which
individual processes are compatible with each other, so as to ensure seamless integration
and smooth functioning of the entire program.


• It takes care of business priorities, missions and goals, strategies and risks, cost and
value, infrastructure, business rules, multiple projects etc.
4. Development: The development process consists of the following components:
• Database/ data warehouse: consisting of ETL, data profiling, data cleansing and
database tools
• Data integration system: integration and quality tools
• Business analytics development: about processes and various technologies used
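
The business value measures listed under Business Value above can be worked through numerically. The following is a minimal sketch, not a prescribed method: the function names are hypothetical, ROI is computed with the common definition (gain minus cost, divided by cost), and all figures except the ROA example from the notes are assumed for illustration.

```python
def roi_percent(gain: float, cost: float) -> float:
    """Return on investment: net gain relative to the cost, as a percentage."""
    return (gain - cost) * 100 / cost


def roa_percent(net_income: float, total_assets: float) -> float:
    """Return on assets: net income relative to total assets, as a percentage."""
    return net_income * 100 / total_assets


def tco(purchase_price: float, operating_costs: list[float]) -> float:
    """Total cost of ownership: purchase price plus all costs of operation."""
    return purchase_price + sum(operating_costs)


# ROA example from the notes: net income 1 million, assets 5 million.
print(roa_percent(1_000_000, 5_000_000))   # 20.0

# Hypothetical figures: a campaign costing 10,000 that brings in 12,500.
print(roi_percent(12_500, 10_000))         # 25.0

# Hypothetical car: price 20,000 plus repairs, insurance and fuel.
print(tco(20_000, [1_500, 2_000, 3_000]))  # 26500
```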

Administration and Operation layer

This layer consists of four components

1. BI Architecture

2. BI and DW operations: Data warehouse administration requires the use of various
tools to monitor the performance and usage of the warehouse and to perform
administrative tasks on it. Some of these tools are:
• Backup and restore
• Security
• Configuration management
• Database management
3. Data resource administration:
• Data governance: It is a technique for controlling data quality, which is used to assess,
improve, manage and maintain information. It helps to define standards that are required
to maintain data quality. The distribution of roles for governance of data is as follows
✓ Data ownership
✓ Data stewardship
✓ Data custodianship
• Metadata management: Metadata is data about data. For a data warehouse, the metadata
includes the timestamp at which data was extracted, the data sources from which the data was
extracted, and the missing fields or new columns added during data cleaning or integration.
Metadata management involves tracking, assessment and maintenance of metadata.
Metadata can be divided into four groups:
✓ Business metadata: It captures information such as data definitions, business
structure, ownership characteristics, metrics definitions, subject models, data models,
business rules, data rules, data owners/stewards etc. Ex: which branch is the top
revenue generator, market shares etc.
✓ Process metadata: It consists of source/target maps, transformation rules, data
cleansing rules, extract audit trail, transform audit trail, load audit trail, data quality
audit etc.
✓ Technical metadata: It is about how data is stored in database, rules and technology
used for storing data. It consists of Data locations, Data formats, Technical names,
Data sizes, Data types, Indexing, Data structures etc.

✓ Application metadata: It describes how the applications built on the data warehouse are
used. Ex: a company’s dashboard is built on the data warehouse and is accessible to all the
branch heads and senior executives. Application metadata includes the data access history:
who is accessing the data, how frequently, when and how.
4. Business Applications
• Business applications use technology to produce value for the business by generating
information from data warehouses or data marts.
• Strategic, financial, customer or risk intelligence is generated using BI tools.
• BI applications include DSS, EIS, OLAP, data mining and discovery tools etc.
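
The four metadata groups described under metadata management above can be pictured as fields kept alongside one warehouse table. This is only an illustrative sketch: the class `TableMetadata`, its field names and the sample values are all hypothetical and are not taken from any particular BI tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TableMetadata:
    """Hypothetical record grouping the four kinds of metadata for one table."""
    # Business metadata: definitions and ownership.
    business_definition: str
    data_owner: str
    # Process metadata: where the data came from and when it was extracted.
    source_system: str
    extracted_at: datetime
    # Technical metadata: how the data is stored.
    column_types: dict[str, str]
    # Application metadata: who accessed the data and when.
    access_log: list[tuple[str, datetime]] = field(default_factory=list)

    def record_access(self, user: str, when: datetime) -> None:
        self.access_log.append((user, when))

    def access_frequency(self, user: str) -> int:
        return sum(1 for u, _ in self.access_log if u == user)


sales_meta = TableMetadata(
    business_definition="Daily sales per branch",
    data_owner="Sales department",
    source_system="branch_pos",  # hypothetical source name
    extracted_at=datetime(2024, 1, 1, 2, 0),
    column_types={"branch_id": "INTEGER", "amount": "REAL"},
)
sales_meta.record_access("branch_head_1", datetime(2024, 1, 2, 9, 0))
sales_meta.record_access("branch_head_1", datetime(2024, 1, 3, 9, 0))
print(sales_meta.access_frequency("branch_head_1"))  # 2
```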

Implementation layer

• It consists of technical components required to capture, transform, clean and convert data
into meaningful information and deliver it to meet business goals and bring value to
business. It includes two components

1. Data warehousing

• It is the process that prepares the basic repository / data store from which data is extracted.

• It is built on a dimensional model schema optimized specially for data retrieval.

• The roles involved are intake, integration, distribution, delivery and access.

[Figure: example of a data warehouse for a store]

2. Information services
• It is not only the process of producing information; rather, it involves ensuring
that the information produced is aligned with business requirements and can be
acted upon to produce value for the company.
• Information is delivered in the form of KPIs, reports, charts, dashboards or
scorecards etc., or in the form of analytics.
• Data mining is a practice used to increase the body of knowledge.
• Applied analytics is generally used to drive action and produce outcomes.

BI Users

There are two types of BI users:

Casual users

• They are the consumers of information who use the pre-existing reports created by power
users and make decisions / take actions.

• They do not create reports


• Ex: executives, managers, field/operations workers, customers, suppliers.

Power users

• They are the producers of information.
• They use powerful analytical and authoring tools to access data from data warehouses /
data marts and other sources inside and outside the organization.
• Ex: developers, administrators, business analysts, IT professionals, analytical modelers.
• They make decisions on issues such as:
✓ What information should be placed on a report?
✓ What is the best way to present the information?
✓ Who should see what information?
✓ How should the information be distributed?

BI Applications

BI applications can be divided into:

Technology solutions

• DSS (decision support system): It is an information system that supports business
decision-making activities. Also called a knowledge-based system, a DSS supports the decisions
required to run day-to-day operations. It helps in decision making at the operational and
tactical levels.
• EIS (executive information systems): Supports decision making at senior management
level by providing both internal and external information. It has an easy GUI with strong
reporting tools.
• OLAP (Online Analytical Processing): The important thing about OLAP systems is
multidimensional data. OLAP tools allow slicing and dicing of data. An OLAP
system has three tiers:
➢ The bottom tier is the data warehouse server layer. It houses the enterprise-wide
data warehouse. The data is collected from multiple heterogeneous internal
data sources and a few external data sources. Different tools are used at this
layer to cleanse, transform and load the data into the warehouse. The data warehouse is
refreshed regularly to update the data from the data sources. This layer also holds the
metadata, which stores information about the data warehouse and the sources of data.
➢ The middle tier is the OLAP server layer. This layer has ROLAP and MOLAP
servers, which process the data using OLAP cubes.
➢ The top tier is the front-end layer. It supports different tools for query,
reporting and analysis. This layer is mainly for the users to access the data in the
required form.
• Managed Query and Reporting: This includes standard reports, report wizards and
report designer which are used by developers to create reports. It also has business
rebuilder which is used by the business users to quickly create a report according to the
given report template.
• Data Mining: Data mining is about unravelling hidden patterns and hidden information,
spotting trends etc. For example, on an online shopping site, when the user selects a
particular item, suggestions are shown saying “those who bought this also
bought…”. This is done by analysing customers’ buying behavior.
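
The “those who bought this also bought” suggestion in the data mining bullet above can be sketched by counting which items appear together in past orders. This is a deliberately simple co-occurrence count, not how any particular shopping site does it; the order data and the helper name `also_bought` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order baskets; each set holds the items bought together.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"phone", "case", "screen guard"},
]

# Count how often each pair of items appears in the same order.
pair_counts = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1


def also_bought(item: str, top_n: int = 2) -> list[str]:
    """Items most often bought together with `item`, most frequent first."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]


print(also_bought("phone"))  # ['case', 'charger']
```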

Business Solutions

• Performance Analysis: This analysis helps in making maximum use of employees,
finance, resources etc. The performance analysis of a business provides clear insights into
the areas that need immediate attention. It also helps in rewarding employees
according to their performance.
• Customer Analysis: This helps in capturing data about customer’s behavior and
enabling businesses to make decisions such as direct marketing or improving
relationships with customers. This plays a vital role in predicting the customer’s
behavior, customer buying pattern etc.
• Market Place Analysis: This helps in understanding the market place better. It is about
understanding the customers, the competitors, the products, the changing market dynamics
etc. This analysis helps in making decisions that depend on the current situation, such as
“is it the right time to launch a new product?”, “how will customers respond to a new product?” etc.

• Productivity Analysis: In economics, productivity is defined as the ratio of output
produced to input used. Productivity can be influenced by many internal and
external factors. The analysis helps in increasing the profit of the enterprise
and also evaluates the performance of the organization. To do productivity analysis, the
organization must collect data and compare the actual values with the planned or estimated values.
• Sales Channel Analysis: This analysis helps in deciding the best channel for delivering
products/services to customers. A good sales channel comprises the 4 Ps:
Product, Price, Place and Promotion. Sales channel analysis provides insights that
help in deciding which channel is most profitable.
• Behavioral Analysis: This analysis helps in predicting trends such as purchasing
patterns, online buying patterns etc.
• Supply Chain Analysis: This analysis helps in optimizing the supply chain from
planning → manufacturing → sales. Supply chain analysis helps in spotting trends and
identifying problems and opportunities in supply chain functions such as sourcing,
inventory management, manufacturing, sales, logistics etc.
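
The productivity analysis described above (output per unit of input, with actual values compared against planned values) can be illustrated with a small calculation. The figures and the function name below are hypothetical and only show the arithmetic.

```python
def productivity(output_units: float, input_units: float) -> float:
    """Productivity as output produced per unit of input."""
    return output_units / input_units


# Hypothetical monthly figures: units produced and labour hours used.
planned = productivity(output_units=1200, input_units=400)  # 3.0 units per hour
actual = productivity(output_units=1000, input_units=400)   # 2.5 units per hour

# Variance of actual against planned, as a percentage of the plan.
variance_percent = (actual - planned) / planned * 100
print(planned, actual, round(variance_percent, 1))  # 3.0 2.5 -16.7
```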

BI roles and responsibilities

BI roles are classified into two categories

• Program team roles: The program team prepares the strategy on how the BI project will
execute. They are responsible for integration and coordination.
• Project team roles: This team executes the program team’s strategy


1) BI Program team roles


a) BI program manager

• He is responsible for several projects

• He aligns the BI projects with the organization’s strategic objectives

• He defines metrics to measure and monitor progress on each objective/goal

• He plans and budgets the projects and follows up the progress of each project

• He distributes tasks and allocates / de-allocates resources

• He identifies and measures success / ROI

b) BI data architect:

• He is accountable for enterprise data

• Optimizes current data usage and takes care of future data needs (design and content)

• Ensures proper definition, storage, distribution, archiving and management of data.

c) BI ETL architect:

• He determines the best way to obtain data from different operational sources/platforms.

• Trains the ETL specialists on data acquisition, transformation and loading.

d) BI technical architect:

• Interfaces with operations staff, technical staff, DBA staff

• Selects and evaluates BI tools (ETL, reporting etc.)

• Assesses current technical architecture and estimates system capacity for long term
processing needs.

• Defines BI strategy or process for technical architecture, hardware, DBMS, middleware,
network requirements, server and client configurations.

• Defines strategy for data backup and recovery and disaster recovery.


e) Metadata manager

• Keeps track of the structure of technical metadata

• Keeps track of the level of detail of data

• Records when the ETL audit was performed

• Records when the data warehouse was updated

• Records who accessed the application metadata, when, and how frequently

f) BI Administrator:

• He is the designer and architect of the entire BI environment

• He is the architect of the metadata layer

• Monitors the progress and security of the entire BI environment

• Monitors all scheduled jobs, such as ETL jobs and reports for business users

• Tunes the performance of the entire BI environment

• Maintains version control of all objects in the BI environment

2) BI project team roles


a) Business Manager:

• Monitors the project from user group perspective

• Monitors activities of project team

• Addresses the business issues identified by project managers

b) BI business specialists:

• They have a good understanding of the business area of focus.

• They identify suitable data usage and structure for the business functional area.

• They lead in data stewardship and quality.

• They ensure that the information is identified correctly at all levels and can be accessed in all modes.

c) BI project Manager:

• He leads the project and ensures delivery of all project needs and assesses risks.

• Translates business needs into technical terms

• Ensures all business standards and BI processes are followed

• Analyzes DSS and EIS to understand their functionality

• Predicts what users may/ will want

• Motivates and evaluates and communicates with team members

• Understands technical and information architecture

• Coordinates with program managers and other project managers

• Implements warehousing specific standards

d) Business Requirement Analyst:

• He communicates between end-users and the BI project team and performs requirement
gathering.

• Questions the end users to determine requirements

• Transforms requirements into technical specifications, working along with technical
architects

• Documents requirements

• Helps in identifying potential data sources

• Validates that BI meets requirements and service level agreements

• Coordinates prototype reviews and gathers feedback.


e) Decision Support Analyst:

• He is an expert on issues related to business objectives and problems and provides the
required data to address these issues.

• Analyzes business information and discovers business translation rules.

• Designs training infrastructure and material, trains BI users and educates users on
warehousing capabilities.

• Defines and makes agreements with the users and also writes user manuals for services.

• Maps and validates business requirements and production data.

• Classifies business users by type.

• Develops necessary reports, DSS and EIS.

• Develops security rules and standards.

• Executes the product/service acceptance test plan.

• Plans and executes acceptance tests and helps users find the right information.

• Works with process teams regarding business process reengineering.

f) BI Designer:

• He interprets the requirements and designs the data structure for optimal access,
performance and integration

• Creates the subject area model and business enterprise model.

• Creates a logical, structural and physical staging area model

• Creates a logical, structural and physical distribution model

• Creates a logical, structural and physical relational model

• Creates a logical, structural and physical dimensional model

• Validates models with production data

g) ETL specialist:

• Determines and implements the best technique for data extraction

• Understands and maps source and target BI systems

• Identifies and assesses data sources

• Applies business rules as transformations

• Implements navigation methods / applications

• Designs and specifies the data detection and extraction process

• Designs and develops transformation code/logic/programs for the environment

• Designs and develops the data transport and population process for the environment

• Builds and unit tests the data transformation process

• Builds and unit tests the source data transport and population process

• Designs the data cleanup process

• Works with production data to enhance data conditions and quality

• Adapts ETL processes to accommodate changes in source systems

• Defines and captures ETL metadata and rules

• Coordinates with the program-level ETL architect

h) Database administrator:

• He is the guardian of data and data warehouse

• Keeps check of the physical data appended to BI environment in current project cycle

• Designs, implements and tunes database schemas

• Manages storage space and memory

• Creates, optimizes and administers physical tables, triggers and partitions

• Implements all models and indexing strategies

• Logs technical action reports

• Documents configuration and integration with applications and network resources

• Maintains backup and recovery documentation

Need For Data Warehouse

• Data from several heterogeneous data sources can be extracted and brought together in a
data warehouse.
• Even when DIIT expands into several branches in multiple cities, it can still have one data
warehouse to support the information needs of the institution.
• Data anomalies can be corrected through an ETL package.
• Missing or incomplete records can be detected and duly corrected.
• Uniformity can be maintained over each attribute of a table.
• Data can be conveniently retrieved for analysis and for generating reports.
• Fact-based decision making can be easily supported by a data warehouse.
• Ad hoc queries can be easily supported.

Some of issues/concerns of data usage

• Lack of information sharing: Even though information is available, it is not being
shared between various departments. Because of this, cross-selling opportunities cannot be
realized and customer relationship management (CRM) suffers.
• Lack of information credibility: Since the data is stored at different locations in
different formats, there is no data integrity or consistency. This results in a lot of
confusion.
• Reports take a longer time to be prepared: OLTP systems remove older data from
transaction processing systems to keep response times within expected limits. (This old data
is then archived to the data warehouse, if one is available.) Hence it is difficult to prepare a
report based on the state of the data at a previous point in time.

• Little or no scope for ad hoc querying or queries that require historical data:
Operational databases do not archive historical data. Hence queries that require usage and
analysis of historical data either cannot be satisfied or take a long time.

Definition of Data Warehouse

According to William (Bill) Inmon:

“A data warehouse is a subject oriented, integrated, time variant and non-volatile collection of
data in support of management’s decision making process”.

• Subject oriented: A data warehouse collects data on subjects like “customers”,
“suppliers”, “sales”, “products” etc. spread across the organization.
• Integrated: A data warehouse brings together the data from disparate sources (different
formats and content) after cleansing and transforming it into a unified format to serve the
needs of the enterprise.
• Time variant: The data warehouse keeps historical data, so older data can be retrieved
easily. Ex: it can keep a record of a customer’s addresses over the
last five years.
• Non-volatile: A data warehouse is a separate physical store of data transformed from the
application data found in the operational environment. Once loaded, the data is not changed
by day-to-day transactions; it is only added to and read.

Data Mart

• A data mart holds data and aggregations about a single subject area / domain, which can
be used for analysis, reporting or decision support.
• It concentrates on integrating information from a given subject area or set of source
systems.
• It is built on a dimensional model using a star schema.
• Data marts are restricted in their scope and business purpose.
• They might not ensure a single version of truth.
• There are two types of data marts:
➢ Independent data mart: It is sourced directly from one or more operational
systems, external sources, or data generated within a department or unit.
➢ Dependent data mart: It is sourced from the enterprise data warehouse.
ODS (Operational data store)

• An ODS processes the operational data that is fed into the data warehouse and provides a
homogeneous, unified view which can be used for analysis and reporting.
• An ODS stores current and very recent operational data.
• It does not store historical data.
• An ODS is used as a staging area for the data warehouse.
• The data warehouse regularly takes current processed data from the ODS and adds it
to its own historical data.

Ralph Kimball’s Approach vs. Inmon’s Approach

Ralph Kimball’s approach | Inmon’s approach
-------------------------|------------------
A data warehouse is made up of all the data marts in an enterprise. | A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.
Bottom-up approach. | Top-down approach.
It has been found that small organizations will benefit by building the data warehouse this way. | It is able to achieve a “single version of truth” in larger organizations.
Kimball’s approach is faster, cheaper and less complex. | Inmon’s approach is more expensive and is a slower, time-consuming process involving several complexities.
The single version of truth might be compromised in Kimball’s approach. | It is worth the investment of time and effort.

[Figure: Ralph Kimball’s approach]

[Figure: Inmon’s approach]

Goals of a Data warehouse

• Information accessibility: Data in a data warehouse must be easy for users
and developers to understand. It must be properly labeled for easy access. Users must be able to
slice and dice the data.
• Information credibility: The data must be credible, complete, consistent and of the desired
quality.
• Flexible to change: The data warehouse must be adaptable to changes in business situations,
user requirements, technology, access tools etc. Adding new data must not
invalidate the existing data.


• Support for fact based decision making: The data warehouse must have enough data to
support more precise decision making. The data should be relevant and easily accessible to
business users.
• Support for data security: There must be security mechanisms so that
confidential data is accessible only to authorized users.
• Information consistency: The information provided to the users should maintain
single/consistent version of truth.
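
Slicing and dicing, named under information accessibility above, means fixing one value on a dimension (slice) or picking a sub-range on several dimensions (dice). The following is a minimal plain-Python sketch; the sales rows and the helper names `slice_`, `dice` and `total` are hypothetical.

```python
# Hypothetical sales facts with three dimensions: region, product and quarter.
sales = [
    {"region": "North", "product": "Pen", "quarter": "Q1", "amount": 100},
    {"region": "North", "product": "Mug", "quarter": "Q1", "amount": 40},
    {"region": "South", "product": "Pen", "quarter": "Q1", "amount": 70},
    {"region": "South", "product": "Pen", "quarter": "Q2", "amount": 90},
    {"region": "North", "product": "Pen", "quarter": "Q2", "amount": 60},
]


def dice(rows, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    return [r for r in rows if all(r[dim] in values for dim, values in allowed.items())]


def slice_(rows, dimension, value):
    """A slice fixes a single value on one dimension."""
    return dice(rows, **{dimension: {value}})


def total(rows):
    return sum(r["amount"] for r in rows)


print(total(slice_(sales, "quarter", "Q1")))                  # 210
print(total(dice(sales, region={"North"}, product={"Pen"})))  # 160
```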

Constituents of data Warehouse

Data from operational systems flows into the staging area where it undergoes transformation and
is placed in the presentation area and then the users can access the data using data access tools.

• Operational source systems: These systems maintain transactional/operational data.
They are outside the data warehouse and maintain little historical data. On querying, they
return a result set / record set.
• Data staging area: A staging area, or landing zone, is an intermediate temporary storage
area used for data processing and data quality work during the extract, transform and load
(ETL) process. Its contents are erased prior to running the ETL process or immediately
following successful completion of the ETL process. This area is off limits to business
users; it does not answer queries and does not offer presentation services.
• Data presentation area: It is the interface or front face of the data warehouse with
which the business users interact using data access tools. It is a collection of integrated
data marts.
• Data access tools: It consists of tools for ad hoc queries, reporting, data modeling
/mining applied over data presentation area.


Data Integration

• It is the ability to integrate data from several different sources for providing a unified
view of the data.
• It is the ability to consolidate data while maintaining the integrity and reliability of
data.
• Ex: in a library scenario, the data about library items (books, CDs, magazines etc.) and
students is maintained in Excel files, whereas the transaction data such as
issue/return is maintained in an Access database.
• Hence at some point it is required to integrate all the above data and present it to the
user in a unified view. This can be done using data integration techniques.


Two main Approaches to data integration

1. Schema integration: It develops a unified representation of semantically similar
data which is structured and stored differently in individual databases.
• Multiple data sources provide data on the same entity type. Schema integration gives
applications a transparent view and the ability to query the data as if it came from one
uniform data source.
• Data in different databases is stored using different schemas; the analyst uses the metadata
to map them onto a single target schema.
• Consider a retail outlet with two branches that store transaction data using different
schemas.
• The schemas from both branches are integrated by mapping the respective columns,
looking up the metadata of the schemas such as column names, type, length,
constraints, domain of values, and NULL, zero and blank values.
• The integrated schema is the target schema.

2. Instance integration: It identifies and integrates all the instances of the data items that
represent the same real-world entity, as distinct from schema integration.
• Data integration from multiple heterogeneous data sources has become a high-priority task
in many large enterprises.
• Here the information is derived directly from the data, to get accurate semantic
information on the data content.
• Ex: a company has almost 1000 employees. It does not have an
ERP system, but it has many enterprise applications which store information about the
employees.
• Instance integration is used to consolidate the data present in the different applications
into a single data warehouse.
• It can be done using a primary key.
• In this example, all the records are matched on “EmployeeNo” or
“SocialSecurityNo”, and all the variants of an employee’s name are replaced with one common
value.
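
The two approaches above can be sketched together: schema integration maps differently named source columns onto one target schema, and instance integration then merges the records that share a key. Everything in this sketch is assumed for illustration, including the two source applications, their column names, the target column names and the rule of keeping the first name seen as the common value.

```python
# Hypothetical records for the same employees held by two applications
# that use different column names (schemas).
payroll_rows = [
    {"EmployeeNo": 101, "EmpName": "A. Kumar", "Salary": 50000},
    {"EmployeeNo": 102, "EmpName": "S. Patil", "Salary": 42000},
]
attendance_rows = [
    {"emp_no": 101, "name": "Kumar, A", "days_present": 22},
    {"emp_no": 102, "name": "Patil S.", "days_present": 20},
]

# Schema integration: map each source column to a column of the target schema.
PAYROLL_MAP = {"EmployeeNo": "employee_no", "EmpName": "employee_name", "Salary": "salary"}
ATTENDANCE_MAP = {"emp_no": "employee_no", "name": "employee_name",
                  "days_present": "days_present"}


def to_target(row, column_map):
    """Rename the columns of one source row to the target schema."""
    return {column_map[col]: value for col, value in row.items()}


# Instance integration: records sharing the key describe the same real-world
# employee, so they are merged and the name is replaced by one common value
# (here, simply the first one seen).
sources = [(r, PAYROLL_MAP) for r in payroll_rows] + \
          [(r, ATTENDANCE_MAP) for r in attendance_rows]
employees = {}
for row, column_map in sources:
    record = to_target(row, column_map)
    merged = employees.setdefault(record["employee_no"], {})
    for column, value in record.items():
        merged.setdefault(column, value)  # keep the first value seen per column

print(employees[101])
# {'employee_no': 101, 'employee_name': 'A. Kumar', 'salary': 50000, 'days_present': 22}
```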

Need and Advantages for data Integration

• It helps decision makers to quickly access and query information based on key
variables and gain meaningful insights.
• It helps to reduce costs, overlaps and redundancies and to minimize risks.
• It allows better monitoring of key variables like trending patterns and customer behavior
across geographies, which leads to reduced R&D costs.

Common Approaches of Data Integration

Federated databases (FDBS)

• FDBS was first defined by McLeod and Heimbigner. According to them, an FDBS defines the
architecture and interconnects the databases by supporting partial sharing and
coordination.
• A federated database is a system in which several databases (multiple and disparate)
appear to function as a single entity. Each component database in the system is
completely self-sustained and functional and is connected via a computer network.
• When an application queries the federated database, the system figures out which of its
component databases contains the data being requested and passes the request to it.
• It is also called a virtual database.

• It provides a uniform user interface through data abstraction, which enables users
and clients to retrieve data from multiple non-contiguous databases using a single query.
• The FDBS can decompose larger/complex queries into sub-queries before
submitting them to the component DBMSs, after which it combines the individual result sets
of the sub-queries. It applies wrappers to the sub-queries so that they can be translated into
the corresponding query languages used by the component DBMSs.
• It provides a unified way to look at data and avoids duplicate data and multiple queries.
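
The decompose-and-combine behaviour described above can be sketched with two independent in-memory SQLite databases standing in for the component databases. This is a toy illustration under stated assumptions: the branch databases, their table and column names, and the `WRAPPERS` list that holds one translated sub-query per component are all hypothetical, and a real FDBS would work across a network.

```python
import sqlite3

# Two independent component databases (hypothetical branch databases).
branch_a = sqlite3.connect(":memory:")
branch_a.execute("CREATE TABLE sales (item TEXT, amount REAL)")
branch_a.executemany("INSERT INTO sales VALUES (?, ?)", [("Pen", 50.0), ("Mug", 30.0)])

branch_b = sqlite3.connect(":memory:")
branch_b.execute("CREATE TABLE orders (product TEXT, total REAL)")
branch_b.executemany("INSERT INTO orders VALUES (?, ?)", [("Pen", 20.0), ("Book", 80.0)])

# Each wrapper translates the common request into the component's own query.
WRAPPERS = [
    (branch_a, "SELECT item, SUM(amount) FROM sales GROUP BY item"),
    (branch_b, "SELECT product, SUM(total) FROM orders GROUP BY product"),
]


def federated_sales_by_item():
    """Run one sub-query per component database and combine the result sets."""
    combined = {}
    for connection, sub_query in WRAPPERS:
        for item, amount in connection.execute(sub_query):
            combined[item] = combined.get(item, 0.0) + amount
    return combined


print(sorted(federated_sales_by_item().items()))
# [('Book', 80.0), ('Mug', 30.0), ('Pen', 70.0)]
```

No new database is created here; the caller sees one combined answer while the data stays in the component databases.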

Data warehousing

• Data integration is an important, fundamental part of data warehousing, which creates a
robust, manageable and highly informative data source to deliver BI solutions.
• Data integration includes the following activities:
➢ Acquire data from sources (Extract)
➢ Transform and cleanse data (Transform)
➢ Load the transformed data into the warehouse or data mart (Load)
• Using the data warehousing method, data is pulled (extracted) from various data sources,
converted into a common format (transformed) and then loaded into its own database.
When the user runs a query, the required data is located, retrieved and presented in an
integrated view.
• The primary concepts used in data warehousing are:
➢ ETL (Extract, Transform, Load)
➢ Component-based (data mart)
➢ Dimensional models and schemas
➢ Metadata-driven
• The diagram below shows that data from several different data sources is extracted,
transformed and loaded into a single database, which can be queried as a single schema.
• Here the information about sales, customers and products is extracted from different
sources, the ETL process is applied to it in the staging area, and it is then loaded into
the data warehouse.
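
The extract, transform and load steps can be sketched as below. This is a toy illustration under stated assumptions: the two sources, their field names and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Extract: two hypothetical sources that store the same kind of data differently.
source_a = [{"cust": "asha", "amount": "120.50"}, {"cust": "Ravi ", "amount": "80"}]
source_b = [("MEENA", 4000)]  # (customer, amount in cents)

def transform(records_a, records_b):
    """Convert both sources to one common format: (customer name, amount)."""
    rows = [(r["cust"].strip().title(), float(r["amount"])) for r in records_a]
    rows += [(name.strip().title(), cents / 100) for name, cents in records_b]
    return rows

def load(conn, rows):
    """Load the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(source_a, source_b))

# Once loaded, the data can be queried as a single schema.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
customers = [r[0] for r in warehouse.execute("SELECT customer FROM sales ORDER BY customer")]
print(customers, total)
```

In practice the transform step runs in a staging area and is usually built with an ETL tool rather than hand-written code.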


Memory mapped data structure:

• It is used for in-memory data manipulation and when the data structure is large.
• It is used mainly on the .NET platform, with C# and [Link]
• It is a faster way of accessing data.
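
The notes refer to the .NET platform; the same idea can be illustrated with Python's standard `mmap` module, which is a substitute chosen here for illustration and not the API the notes describe. The temporary file and its contents are made up for the sketch.

```python
import mmap
import os
import tempfile

# Create a small file to stand in for a large on-disk data structure.
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789" * 100)
os.close(fd)

with open(path, "r+b") as f:
    # Map the whole file into memory; it can now be sliced like a byte array.
    with mmap.mmap(f.fileno(), 0) as mm:
        first = mm[:10]      # read without an explicit read() call
        mm[0:4] = b"ABCD"    # in-place update, same length as the slice
        mm.flush()           # push the change back to the file
        size = len(mm)

with open(path, "rb") as f:
    head = f.read(10)
os.remove(path)

print(first, head, size)
```

Only the parts of the file that are actually touched need to be brought into memory, which is why the approach suits large data structures.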

Federated database | Data warehouse
Preferred when the databases are present across various locations over a large area (geographically decentralized) | Preferred when the source information can be taken from one location
Requires high speed network connection | Requires no network connection
It is easier to create as compared to data warehouse | Its creation is not as easy as that of the federated database
Requires no creation of new database | Data warehouse must be created from scratch
Requires network experts to set up the network connection | Requires database experts such as data steward

Data Integration Technologies

1. Data interchange
• It is the structured transmission of organizational data between two or more
organizations through electronic means; it is used for the transfer of electronic documents
from one computer to another.
• Data interchange must not be seen merely as email. For instance, an organization might
want to do away with bills of lading (or even checks) and use an appropriate
EDI message instead.
2. Object Brokering
• An ORB (Object Request Broker) is a kind of middleware software. It gives
programmers the freedom to make calls from one computer to another over a computer
network.
• It handles the transformation of in-process data structures to and from byte sequences.
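
The last point, turning an in-process data structure into a byte sequence and back, can be sketched as follows. This is only an illustration: the `Order` class and the `marshal`/`unmarshal` helpers are hypothetical, and JSON is used as a stand-in encoding, whereas real ORBs use their own binary wire formats.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Order:
    order_id: int
    item: str
    quantity: int

def marshal(obj: Order) -> bytes:
    """In-process object -> byte sequence that can travel over a network."""
    return json.dumps(asdict(obj)).encode("utf-8")

def unmarshal(data: bytes) -> Order:
    """Byte sequence received from the network -> in-process object."""
    return Order(**json.loads(data.decode("utf-8")))

original = Order(order_id=7, item="widget", quantity=3)
wire_bytes = marshal(original)
restored = unmarshal(wire_bytes)
print(wire_bytes, restored == original)
```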


3. Modeling techniques
a. ER Modeling
• It is a logical design technique whose main goal is to reduce data redundancy and
hence solve the problems in insert, delete and update operations.
• It is used for transaction capture and helps in the initial stages of data warehouse
construction.
• Problems in ER modeling:
➢ Leads to the creation of a huge number of tables with lots of joins
➢ Difficult for end users to understand and traverse
➢ Not much software is available to query it
➢ Cannot be used for a data warehouse, where the focus is on performance and ad hoc/
unanticipated queries
• Steps to draw an ER model:
➢ Identify entities
➢ Identify relationships between entities
➢ Identify key attributes
➢ Identify other relevant attributes for entities
➢ Draw the ER diagram
➢ Review the ER diagram with business users and get their sign-off

b. Dimensional modeling


• It is a logical design technique whose goal is to present data in a standard format to
the end users.
• It is used for data warehouses with a star or snowflake schema.
• It consists of one large table called the fact table and a number of relatively smaller
tables called dimension tables.
• Each fact table has a multi-part primary key and each dimension table has a single-part
primary key. The fact table contains its own primary key, the primary keys of the dimension
tables (as foreign keys) and a few numeric, additive facts.
• The dimension tables consist of textual data.
• The fact table maintains many-to-many relationships.
• Benefits of dimensional modeling:
➢ easy to understand and navigate by end users
➢ quick response to ad hoc queries
➢ a lot of tools available
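
A small star schema makes the fact/dimension split concrete. The sketch below is an assumption-laden toy: the table names (`fact_sales`, `dim_product`, `dim_date`), columns and rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-part primary key plus descriptive, textual attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month TEXT);

-- Fact table: multi-part key built from the dimension keys, plus additive facts.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL,
    PRIMARY KEY (product_id, date_id)
);

INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mouse', 'Electronics');
INSERT INTO dim_date VALUES (1, 2024, 'Jan'), (2, 2024, 'Feb');
INSERT INTO fact_sales VALUES
    (1, 1, 10, 50.0), (2, 1, 5, 100.0), (3, 2, 2, 40.0), (1, 2, 4, 20.0);
""")

# An ad hoc question needs only one join per dimension involved.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

Because every dimension joins directly to the fact table, queries stay short and predictable, which is the ease-of-navigation benefit listed above.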

• Steps to transfer data from Excel/Access to SQL Server:
➢ Profiling the source data (identifying key attributes, data anomalies and corrections to
be made)
➢ Resolving the anomalies and corrections and noting the results
➢ Choosing an appropriate ETL tool and creating ETL packages to transform data
from the source DB to the destination DB
➢ Identifying and implementing major rules, transformations and data cleaning
activities
➢ a) Populating dimension tables with data
b) Populating fact tables with data

ER Modeling | Dimensional Modeling
Optimized for transactional data. | Optimized for query ability and performance.
Eliminates redundant data. | Does not eliminate redundant data, where appropriate.
Highly normalized (even at a logical level). | It aggregates most of the attributes and hierarchies of a dimension into a single entity.
It is a complex maze of hundreds of entities linked with each other. | It has a logically grouped set of star schemas.
Useful for transactional systems. | Useful for analytical systems.
It is split as per the entities. | It is split as per the dimensions and facts.

Data Quality

Data is often duplicated, inconsistent, ambiguous and incomplete. All of this data
needs to be collected in one place and cleaned up. The obvious reason is that bad data leads
to bad decisions, and bad decisions lead to bad business.

Need For Data Quality

➢ Plan and prioritize data
➢ Parse data
➢ Standardize, Correct and Normalize data
➢ Verify and Validate data
➢ Apply business rules


Data integrity: It is the degree to which the attributes of data associated with a certain
entity accurately describe the occurrence of that entity.

Examples:

➢ Primary key: a column or a collection of columns designated as the primary key imposes
the “unique” and “not null” constraints.
➢ Foreign key: a column defined as a foreign key can hold only values that are present
in the primary key column of the table it refers to, which may be the same table or a
different one. A foreign key can have NULL or duplicate values.
➢ Not Null: It means that it is mandatory to enter values in a column which has been
defined as Not Null. A space, zero, carriage return or line feed is not considered a NULL
value.
➢ Check Constraint: It allows imposing business rules on a column or a collection of
columns.
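
The four constraints can be tried out directly. The sketch below is illustrative only: the `department` and `employee` tables, the salary rule and the `violates` helper are hypothetical, and SQLite is used because it ships with Python (note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL CHECK (salary > 0),                    -- business rule on a column
    dept_id INTEGER REFERENCES department(dept_id)      -- foreign key
);
INSERT INTO department VALUES (10, 'Sales');
INSERT INTO employee VALUES (1, 'Asha', 30000, 10);
""")

def violates(sql):
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

results = {
    "duplicate primary key": violates("INSERT INTO employee VALUES (1, 'Ravi', 25000, 10)"),
    "null in NOT NULL column": violates("INSERT INTO employee VALUES (2, NULL, 25000, 10)"),
    "check constraint": violates("INSERT INTO employee VALUES (3, 'Meena', -5, 10)"),
    "unknown foreign key": violates("INSERT INTO employee VALUES (4, 'Kiran', 25000, 99)"),
    "null foreign key": violates("INSERT INTO employee VALUES (5, 'Divya', 25000, NULL)"),
}
print(results)
```

The last case is accepted, matching the statement above that a foreign key may hold NULL.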

Data quality

• Data quality is measured with reference to appropriateness of purpose, as defined by
business users, and conformance to enterprise data quality standards, as defined by
system architects and admins.
• Data quality has a wider scope and is rooted in the business.
• Data integrity is local to the domain of database technology.

Dimensions of Data Quality

1. Correctness / accuracy:
• It is the degree to which the captured data correctly reflects or describes the
real-world entity, object or event.
• Examples:
The address of a customer maintained in the customer database is the real address
The temperature recorded by the thermometer is the real temperature
The age of a patient in the hospital database is his real age


2. Consistency :
• It is about a single version of truth.
• The data throughout the enterprise must be in sync.
• Example of consistent data: an employee has left a company, so his company email ID is
made inactive.
• Example of inconsistent data: a customer has cancelled and surrendered his credit card,
but his billing status still reads “due”.
3. Completeness :
• It is the extent to which the expected attributes of data are provided.
• Examples of data completeness:
data of all students of a university is available
data of all patients of a hospital is available
data of all clients of an IT company is available
• Example of complete yet inaccurate data:
a customer provides his address details at a restaurant, but those details may be incorrect.

4. Timeliness :
• It is important to provide the right data at the right time to the right people in the
business.
• Data supplied late becomes inconsequential and useless.
• Examples of timely data:
airlines must provide the most recent data to passengers
quarterly results must be published at the end of each quarter
• Example of non-timely data: the population census results are published two years after
the census survey is completed.
5. Metadata :
• It is data about the data.
• It helps in determining data usage


• Ex 1: for a relational database, the schema is its metadata
• Ex 2: the logical and conceptual models are also considered as metadata

Maintaining Data Quality

• Clean up the data by standardizing it using rules.
• Use algorithms to detect duplicates.
Ex: “ICS” and “informatics comp sys” may be duplicates if they have the same address,
the same number of employees, etc.
• Make use of external data to clean up the data for de-duplication.
Ex: the US postal department releases a CD of every valid address in the US. This can be
used to convert all the address data to a standard format.
• Perform data integration to remove inconsistent data.
Ex: in a retail outlet that has many branches, a customer’s name (Aleck Stevenson) is
stored differently at all branches
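
Rule-based standardization followed by duplicate detection can be sketched as below. The matching rule (same standardized address and same number of employees) follows the “ICS” example above; the record layout, the addresses and the helper names are hypothetical, and real tools use far more elaborate matching.

```python
import re
from itertools import combinations

def standardize(text):
    """Rule-based clean-up: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical records; only the company names come from the example in the notes.
records = [
    {"id": 1, "name": "ICS", "address": "12, Main Road,  Belgaum", "employees": 250},
    {"id": 2, "name": "Informatics Comp Sys", "address": "12 main road belgaum", "employees": 250},
    {"id": 3, "name": "Aleck Stevenson Ltd", "address": "4 Park Street, Hubli", "employees": 40},
]

def likely_duplicates(rows):
    """Flag pairs whose standardized address and employee count both match."""
    pairs = []
    for a, b in combinations(rows, 2):
        same_address = standardize(a["address"]) == standardize(b["address"])
        if same_address and a["employees"] == b["employees"]:
            pairs.append((a["id"], b["id"]))
    return pairs

duplicates = likely_duplicates(records)
print(duplicates)
```

The two names look nothing alike, so the match comes entirely from the other attributes once they have been standardized.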

Key Areas of Study in Data Integration

1. Data governance
• It includes implementing the policies that govern the use of data in an organization.
• It also ensures compliance with standards.
• It is used to secure data against hackers and to prevent data from leaking out.
2. Metadata management
• It is a collection of definitions and relationships that describe the information stored.
• It was earlier known as the “data dictionary”.
3. Data architecture and Design
• Includes overall architecture like data storage, ETL process design, BI architecture
etc.
4. Database management
• Includes optimizing performance of database, backup & recovery, integrity
management etc.
5. Data security
• Who should have access? What data needs to be kept private?
6. Data quality
• Need to ensure single version of truth in data warehouse
• Inconsistent data must be transformed using software
7. Data warehousing and BI
• Used to measure, monitor and manage performance

Data Profiling

• Data profiling is the process of statistically examining and analyzing the content in a
data source, and hence collecting information about that data.
• It checks the data for accuracy and completeness and hence assesses data quality
• Helps understand the structure, content and relationships of the data and helps discover
anomalies.
• Helps understand issues/challenges in a database project
• Helps assess risks associated with data integration
• Also helps assess and validate metadata

When To Conduct Data Profiling

• At the discovery/requirements gathering phase:
➢ As soon as the source data systems have been identified and the business requirements
are specified, an initial data profiling can be done to check for anomalies in the data
source or any drawbacks that do not satisfy the given requirements.
➢ It is more of data quality profiling.
➢ Since it is done in the early stages, it avoids corrections and rework.
• Just before the dimensional modeling process:
➢ It involves more of database profiling and some data quality profiling.
➢ A decision is made on the most appropriate schema design for the dimensional data
warehouse.
➢ A decision is made on the best method to use for conversion of the source data system
to the dimensional model.
• During ETL package design:
➢ It involves more of data quality profiling.
➢ Helps identify errors that creep in due to ETL transformations applied to data
➢ Identifies what data to extract and what filters to apply
➢ Checks whether transformations are applied correctly

How to Conduct Data Profiling

• Data quality: analyze the quality of data at the data source. Ex: a column containing
[Link] must be numeric. Hence remove any non-numeric characters in the field.
• NULL values: look for the number of null values in an attribute.
• Candidate keys: to select a candidate key, analyze the extent to which certain
columns are distinct.
• Primary key selection: check that the candidate key does not violate the NOT NULL and
UNIQUE constraints.
• Empty string values: check a column for null values or empty strings, since they create
problems during cube creation.
• String length: analyzing the largest, average and shortest string length helps decide
what data type is appropriate for that column.
• Numeric length and type: assessing the max and min possible values for a numeric
column helps decide what data type is suitable for that column.
• Identification of cardinality: the cardinality relationships are important for inner and
outer joins with respect to several BI tools. They are also important for the design of the
fact-dimension relationship.
• Data format: changing the data formats to make them more user friendly. Ex: marital
status from “M” & “S” to “married” and “single”
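
Several of the checks above (null counts, empty strings, distinct values, string lengths, candidate keys) can be sketched for one column at a time. The sample rows, column names and the `profile` helper are hypothetical; profiling tools compute the same kinds of statistics at scale.

```python
# Hypothetical source rows; None stands for a NULL value.
rows = [
    {"emp_no": "101", "name": "Asha", "marital_status": "M"},
    {"emp_no": "102", "name": "", "marital_status": "S"},
    {"emp_no": "103", "name": "Ravi Kumar", "marital_status": None},
    {"emp_no": "104", "name": "Asha", "marital_status": "M"},
]

def profile(rows, column):
    """Collect simple statistics about one column of string data."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "nulls": len(values) - len(non_null),
        "empty_strings": sum(1 for v in non_null if v == ""),
        "distinct": len(set(non_null)),
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        # A candidate key must have no NULLs and no repeated values.
        "is_candidate_key": len(non_null) == len(values) and len(set(values)) == len(values),
    }

report = {col: profile(rows, col) for col in rows[0]}
for col, stats in report.items():
    print(col, stats)
```

Here only `emp_no` qualifies as a candidate key: `name` repeats a value and contains an empty string, and `marital_status` contains a NULL.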

Common Data Profiling Software

• Trillium Enterprise Data Quality: Powerful, yet very user-friendly software.
➢ It scans all the data systems you require it to manage.
➢ Automatically runs periodic scans to check whether all the data is
consistently updated.
➢ Removes duplicate records.
➢ Provides for the separation of data into categories to allow easier data management.
➢ Generates statistical reports about the data systems regularly.
• Datiris Profiler: This tool is very flexible and can manage your data without inputs from
the user.
➢ A powerful metric system.
➢ Very good compatibility with other applications.
➢ Domain validation.
➢ Command-line interface.
➢ Pattern analysis.
➢ Real-time data viewing.
➢ Profile templates and spreadsheets.
• Talend Data Profiler: It is a free, open-source software solution to data profiling, which
is now slowly becoming popular. It is good enough for small businesses and non-profit
organizations.
• IBM Infosphere Information Analyzer: A powerful profiling tool developed by IBM, it
does a deep-scan of the system in a very short time-period.
➢ IBM Infosphere security framework.
➢ Scanning scheduler.
➢ Reports.
➢ Source system profiling and analysis.
➢ Rules analysis.
• SSIS Data Profiling Task: This data profiling tool is not an independent software tool. It
is integrated into the ETL software called SQL Server Integration Services (SSIS)
provided by Microsoft. It can provide useful statistics about the source data as well as the
transformed data that is being loaded into the destination system.
• Oracle Warehouse Builder: Oracle Warehouse Builder is not strictly a data-profiling
software tool. It has the necessary functionality to let a person with zero programming
knowledge build a data warehouse from scratch. One of its features is data profiling
functionality, which helps analyze source systems and hence provide clean data.
