BI Framework for Data Warehousing
BI Component Framework
Business layer
• Business drivers: These are factors that initiate the need to act. Ex: changing
labour laws, changing economy, changing technology etc.
• Business goals: These are the targets to be achieved in response to business
drivers. Ex: increased productivity, improved market share, improved profits,
good customer satisfaction, cost reduction etc.
• Business strategies: These are the planned actions that will achieve the set goals.
Ex: outsourcing, global delivery model, customer and employee retention
programs, competitive pricing etc.
2. Business Value: Implementing the planned strategy requires cost in the form of money, time,
effort, information etc. The final value should be worthwhile when compared to the cost
involved. Business value is measured in terms of ROI, ROA, TCO and TVO.
• Return on Investment (ROI): It is a performance measure used to evaluate the
efficiency (benefits) of an investment. Ex: a company invests 10% of its daily
revenue in social media to get new clients and increase its prospects; the resulting
benefit is the ROI from social media.
• Return on Asset (ROA): It is the earnings generated from invested capital or
assets. Ex: If a company’s net income is 1 million and its assets are 5 million,
then ROA = (1/5) * 100 = 20%
• Total cost of ownership (TCO): It is the purchase price of an asset plus the
costs of operation. Ex: The TCO of a car is not only the purchase price but also
the expenses incurred through its use, such as repairs, insurance and fuel.
• Total value of ownership (TVO): It denotes the total financial value of a service
or product plus some subcategories like stock, undistributed dividends etc.
TVO = Total assets – total liabilities
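The four metrics above can be sketched as simple formulas. The figures below are hypothetical (the ROA numbers mirror the example above) and the function names are our own, since the notes give no code.

```python
# A minimal sketch of the four business-value metrics; all figures are
# invented for illustration, not taken from any real company.

def roi(gain, cost):
    """Return on Investment as a percentage of the cost."""
    return (gain - cost) / cost * 100

def roa(net_income, total_assets):
    """Return on Assets as a percentage."""
    return net_income / total_assets * 100

def tco(purchase_price, operating_costs):
    """Total Cost of Ownership: purchase price plus all running costs."""
    return purchase_price + sum(operating_costs)

def tvo(total_assets, total_liabilities):
    """Total Value of Ownership, per the formula above."""
    return total_assets - total_liabilities

print(roa(1_000_000, 5_000_000))          # the ROA example above: 20.0
print(tco(20_000, [1_500, 800, 1_200]))   # car: price + repairs, insurance, fuel
```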
3. Program management:
• It is the component that ensures people, projects and priorities work in a manner in which
individual processes are compatible with each other, so as to ensure seamless integration
and smooth functioning of the entire program.
• It takes care of business priorities, missions and goals, strategies and risks, cost and
value, infrastructure, business rules, multiple projects etc.
4. Development: The development process consists of following components
• Database/ data warehouse: consisting of ETL, data profiling, data cleansing and
database tools
• Data integration system: integration and quality tools
• Business analytics development: about processes and various technologies used
1. BI Architecture
Implementation layer
• It consists of technical components required to capture, transform, clean and convert data
into meaningful information and deliver it to meet business goals and bring value to
business. It includes two components
1. Data warehousing
• It is the process that prepares basic repository / data store from which data is extracted.
2. Information services
• It is not only the process of producing information; rather, it involves ensuring
that the information produced is aligned with business requirements and can be
acted upon to produce value for the company.
• Information is delivered in the form of KPI’s, reports, charts, dashboards or
scorecards, etc., or in the form of analytics.
• Data mining is a practice used to increase the body of knowledge.
• Applied analytics is generally used to drive action and produce outcomes.
BI Users
Casual users
• They are the consumers of information who use the pre-existing reports created by power
users and make decisions / take actions.
Power users
• They are the advanced users who create the reports and perform deeper analysis, which
casual users then consume.
BI Applications
Technology solutions
• DSS (Decision Support System): It is an information system which supports business
decision-making activities. Also called a knowledge-based system, a DSS supports the
decision making required to run day-to-day operations. It helps in decision making at the
operational and tactical levels.
• EIS (executive information systems): Supports decision making at senior management
level by providing both internal and external information. It has an easy GUI with strong
reporting tools.
• OLAP (Online Analytical Processing): The important thing about OLAP systems is
multidimensional data; OLAP tools allow slicing and dicing of this data. An OLAP
system has 3 tiers, as described below.
➢ The bottom tier is the data warehouse server layer. This tier houses the enterprise-
wide data warehouse. The data is collected from multiple heterogeneous internal
data sources and a few external data sources. Different tools are used at this
layer to cleanse, transform and load the data into the warehouse. The data warehouse
is refreshed regularly to pick up updates from the data sources. This layer also has
metadata, which stores information about the data warehouse and the sources of its data.
➢ The middle tier is the OLAP server layer. This layer has ROLAP and MOLAP
servers, which process the data using OLAP cubes.
➢ The top tier is the front-end layer. This layer supports different tools for query,
reporting and analysis. This layer is mainly for the users to access the data in the
required form.
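Slicing and dicing can be illustrated without an OLAP server. The sketch below uses a plain Python dictionary as a toy three-dimensional cube; the products, regions, quarters and figures are all made up.

```python
# A toy 3-dimensional "cube": (product, region, quarter) -> sales.
cube = {
    ("Laptop", "North", "Q1"): 120, ("Laptop", "North", "Q2"): 140,
    ("Laptop", "South", "Q1"):  90, ("Laptop", "South", "Q2"): 110,
    ("Phone",  "North", "Q1"): 200, ("Phone",  "South", "Q2"): 180,
}

def slice_cube(cube, dim_index, value):
    """Slice: fix one dimension to a single value."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, selections):
    """Dice: keep cells whose coordinates fall in the given sub-ranges.
    `selections` maps dimension index -> allowed set of values."""
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in selections.items())}

q1 = slice_cube(cube, 2, "Q1")                             # all Q1 cells
north_laptops = dice_cube(cube, {0: {"Laptop"}, 1: {"North"}})
print(sum(q1.values()))                                    # total Q1 sales: 410
```

Real ROLAP/MOLAP servers do the same selections over indexed or pre-aggregated storage rather than a dictionary scan.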
• Managed Query and Reporting: This includes standard reports, report wizards and a
report designer, which are used by developers to create reports. It also has a report
builder, which business users can use to quickly create a report from a given report
template.
• Data Mining: Data mining is about unravelling hidden patterns and hidden information,
spotting trends etc. For example, on any online shopping site, when the user selects a
particular item, suggestions are shown saying “those who bought this also
bought…”. This is done by analysing customers’ buying behaviour.
Business Solutions
• Program team roles: The program team prepares the strategy on how the BI project will
execute. They are responsible for integration and coordination.
• Project team roles: This team executes the program team’s strategy
• He plans and budgets the projects and follows up the progress of each project
b) BI data architect:
• Optimizes current data usage and takes care of future data needs( design and content)
c) BI ETL architect:
• He determines the best way to obtain data from different operational sources/platforms.
d) BI technical architect:
• Assesses current technical architecture and estimates system capacity for long term
processing needs.
• Defines strategy for data backup and recovery and disaster recovery.
e) Metadata manager
• Tracks who accessed the application metadata, when, and with what frequency.
f) BI Administrator:
• Monitors all scheduled jobs, such as ETL jobs and reports for business users
b) BI business specialists:
• Identifies suitable data usage and structure for the business functional area.
• Ensures that the information is identified correctly at all levels and accessed at all modes.
c) BI project Manager:
• He leads the project and ensures delivery of all project needs and assesses risks.
• Documents requirements
• Designs training infrastructure & material, trains BI users and educates them on
warehousing capabilities.
• Defines and makes agreements with the users and also writes user manuals for services.
• Plans and executes acceptance tests and helps users find the right information.
f) BI Designer:
• He interprets the requirements and designs the data structure for optimal access,
performance and integration
g) ETL specialist:
• Designs and develops the data transport and population process for the environment
• Builds and unit-tests the source data transport and population process
h) Database administrator:
• Keeps track of the physical data appended to the BI environment in the current project cycle
• Creates, optimizes and administers physical tables, triggers and partitions
• Data from several heterogeneous data sources can be extracted and brought together in a
data warehouse.
• Even when DIIT expands into several branches in multiple cities, it can still have one data
warehouse to support the information needs of the institution.
• Data anomalies can be corrected through an ETL package.
• Missing or incomplete records can be detected and duly corrected.
• Uniformity can be maintained over each attribute of a table.
• Data can be conveniently retrieved for analysis and generating reports
• Fact-based decision making can be easily supported by a data warehouse.
• Ad hoc queries can be easily supported.
• Little or no scope for ad hoc querying or queries that require historical data:
Operational databases do not archive historical data. Hence queries that require the usage
and analysis of historical data cannot be satisfied, or take a long time.
“A data warehouse is a subject oriented, integrated, time variant and non-volatile collection of
data in support of management’s decision making process”.
Data Mart
• A data mart holds data and aggregations about one single subject area/ domain which can
be used for analysis, reporting or decision support.
• Concentrates on integrating information from a given subject area or set of source
systems
• Is built focused on a dimensional model using a star schema.
• Data marts are restricted in their scope and business purpose
• They might not ensure single version of truth
• There are two types of data marts
➢ Independent data mart: They are sourced directly from one or more operational
systems, external sources, data generated from within a department or unit.
➢ Dependent data mart: They are sourced from the enterprise data warehouse
Operational Data Store (ODS)
• ODS processes the operational data that is fed into data warehouse and provides a
homogeneous unified view which can be used for analysis and reporting.
• ODS stores current and very recent operational data.
• They don’t store historical data.
• ODS are used as staging area for data warehouse.
• The data warehouse on regular basis takes current processed data from ODS and adds it
to its own historical data.
Kimball’s approach vs Inmon’s approach:
• Kimball: A data warehouse is made up of all the data marts in an enterprise.
Inmon: A data warehouse is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management’s decisions.
• Kimball: It has been found that small organizations will benefit by building the D.W this way.
Inmon: It is able to “achieve a single version of truth” and suits larger organizations.
• Kimball: Kimball’s approach is faster, cheaper and less complex.
Inmon: Inmon’s approach is more expensive and is a time-consuming, slower process
involving several complexities.
• Kimball: The single version of truth might be compromised in Kimball’s approach.
Inmon: It is worth the investment of time and effort.
Inmon’s Approach
• Support for fact based decision making: The data warehouse must have enough data to
support more precise decision making, data should be relevant and easily accessible to
business users.
• Support for data security: There must be security mechanisms enabled so that the
confidential data must be accessible only to valid authorized users.
• Information consistency: The information provided to the users should maintain
single/consistent version of truth.
Data from operational systems flows into the staging area where it undergoes transformation and
is placed in the presentation area and then the users can access the data using data access tools.
Data Integration
• It is the ability to integrate data from several different sources for providing a unified
view of the data.
• It is the ability to consolidate data while maintaining the integrity and reliability of
data.
• Ex: in a library scenario, the data about library items (books, CDs, magazines etc.)
and students is maintained in Excel files, whereas the transaction data, such as
issue/return records, is maintained in an Access database.
• Hence at some point it is required to integrate all the above data and present it to the
user in a unified view. This can be done using data integration techniques.
2. Instance Integration: It identifies and integrates all the instances of data items that
represent the same real-world entity, as distinct from schema integration.
• Data integration from multiple heterogeneous data source has become a high-priority task
in many large enterprises.
• Here the information is directly derived from the data to get accurate semantic
information on data content.
• A corporate company has almost 1000 employees. It does not have any ERP system,
but it has many enterprise applications which store information about the
employees.
• Instance integration is used to consolidate the data present in the different applications
into a single data warehouse.
• It can be done using the concept of a primary key.
• In the following example, all the records are matched on “EmployeeNo” or
“SocialSecurityNo”, and all the values of the employee name are replaced with one
common value.
• It helps decision makers quickly access and query information based on key
variables to gain meaningful insights
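The matching step described above can be sketched as follows, assuming two hypothetical applications (payroll and helpdesk) that both key their records on EmployeeNo; the names and figures are invented.

```python
# Two hypothetical applications store the same employees under different
# name spellings; EmployeeNo is the primary key used to match instances
# and merge them into one consolidated record per employee.
payroll = [
    {"EmployeeNo": 101, "Name": "R. Kumar",    "Salary": 50_000},
    {"EmployeeNo": 102, "Name": "S. Iyer",     "Salary": 62_000},
]
helpdesk = [
    {"EmployeeNo": 101, "Name": "Ravi Kumar",  "Tickets": 4},
    {"EmployeeNo": 102, "Name": "Sunita Iyer", "Tickets": 9},
]

def integrate(primary, secondary, key):
    """Match instances on `key`; the `primary` source wins on conflicting
    fields, so one common value (e.g. the name) is kept per employee."""
    by_key = {rec[key]: dict(rec) for rec in primary}
    for rec in secondary:
        merged = by_key.setdefault(rec[key], {})
        for field, value in rec.items():
            merged.setdefault(field, value)   # keep primary's value if present
    return list(by_key.values())

unified = integrate(payroll, helpdesk, "EmployeeNo")
print(unified[0]["Name"], unified[0]["Tickets"])   # R. Kumar 4
```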
• FDBS was first defined by McLeod and Heimbigner; according to them, an FDBS defines
the architecture and interconnects the databases by supporting partial sharing and
coordination.
• A federated database is a system in which several databases (multiple and disparate)
appear to function as a single entity. Each component database in the system is
completely self-sustained and functional, and is connected via a computer network.
• When an application queries the federated database, the system figures out which of its
component databases contains the data being requested and passes the request to it.
• It is also called a virtual database.
• It consists of a Uniform User Interface through data abstraction, which enables the users
and clients to retrieve data from multiple non-contiguous databases using a single query.
• The FDBS has the ability to decompose larger/complex queries into sub-queries before
submitting them to the component DBMSs, after which it combines the individual result
sets of the sub-queries. It applies wrappers to the sub-queries so that they can be
translated into the corresponding query languages used by the component DBMSs.
• It provides a unified way to look at data, avoids duplicate data and multiple queries.
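As a rough illustration of the federated idea, SQLite's ATTACH command lets a single query span several component databases. The schema and data below are invented, and this single-process sketch omits the wrappers and query decomposition a real FDBS performs.

```python
import sqlite3

con = sqlite3.connect(":memory:")                 # the "federation" layer
con.execute("ATTACH DATABASE ':memory:' AS sales")  # component database 1
con.execute("ATTACH DATABASE ':memory:' AS hr")     # component database 2

con.execute("CREATE TABLE sales.orders (emp_id INTEGER, amount REAL)")
con.execute("CREATE TABLE hr.employees (emp_id INTEGER, name TEXT)")
con.executemany("INSERT INTO sales.orders VALUES (?, ?)",
                [(1, 250.0), (2, 400.0)])
con.executemany("INSERT INTO hr.employees VALUES (?, ?)",
                [(1, "Asha"), (2, "Ravi")])

# One query spanning both component databases, as if they were one entity.
rows = con.execute(
    "SELECT e.name, SUM(o.amount) FROM hr.employees e "
    "JOIN sales.orders o ON e.emp_id = o.emp_id "
    "GROUP BY e.name ORDER BY e.name"
).fetchall()
print(rows)   # [('Asha', 250.0), ('Ravi', 400.0)]
```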
Data warehousing
• It is used for in-memory data manipulation and when the data structure is large
• It is used basically on the .NET platform using C# and [Link]
• It is a faster way of accessing data
1. Data interchange
• It is the structured transmission of organizational data between two or more
organizations through electronic means; it is used for the transfer of electronic
documents from one computer to another.
• Data interchange must not be seen merely as email. For instance, an organization might
want to do away with bills of lading (or even checks) and use appropriate
EDI messages instead.
2. Object Brokering
• An ORB (Object Request Broker) is a variety of middleware software. It gives
programmers the freedom to make calls from one computer to another over a computer
network.
• It handles the transformation of in-process data structures to and from the byte sequence.
3. Modeling techniques
a. ER Modeling
• It is a logical design technique whose main goal is to reduce data redundancy and
hence solve the problems in insert, delete and update operations.
• It is used for transaction capture and helps in the initial stages of data warehouse
construction
• Problems in ER modeling:
➢ Leads to the creation of a huge number of tables with lots of joins
➢ Difficult for end users to understand and traverse
➢ Not much software is available to query it
➢ Cannot be used for a data warehouse, where the focus is on performance and ad hoc/
unanticipated queries.
• Steps to draw an ER model:
➢ Identify entities
➢ Identify relationships between entities
➢ Identify key attributes
➢ Identify other relevant attributes for entities
➢ Draw ER diagram
➢ Review the ER diagram with business users and get their sign off.
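The steps above can be illustrated by the tables an ER model typically produces. The Student/Course entities and their many-to-many Enrollment relationship below are a hypothetical example, realised here with SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Entities and their key/other attributes (steps 1, 3, 4)
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- A many-to-many relationship (step 2) becomes its own table
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")
con.execute("INSERT INTO student VALUES (1, 'Asha')")
con.execute("INSERT INTO course VALUES (10, 'Databases')")
con.execute("INSERT INTO enrollment VALUES (1, 10)")

# Traversing the model requires joins -- the very cost the notes flag
# as a problem for ad hoc warehouse queries.
row = con.execute(
    "SELECT s.name, c.title FROM enrollment e "
    "JOIN student s ON s.student_id = e.student_id "
    "JOIN course  c ON c.course_id  = e.course_id"
).fetchone()
print(row)   # ('Asha', 'Databases')
```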
b. Dimensional modeling
An ER model is split as per the entities; a dimensional model is split as per the dimensions
and facts.
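A minimal star-schema sketch, with one fact table and two dimension tables; all keys, names and sales figures here are invented for illustration.

```python
# Dimension tables: surrogate key -> descriptive attributes.
dim_product = {1: {"name": "Laptop", "category": "Computers"},
               2: {"name": "Phone",  "category": "Mobiles"}}
dim_date    = {20240101: {"quarter": "Q1"}, 20240401: {"quarter": "Q2"}}

# Fact table: foreign keys into the dimensions plus numeric measures.
fact_sales = [
    {"product_key": 1, "date_key": 20240101, "amount": 1200.0},
    {"product_key": 1, "date_key": 20240401, "amount":  900.0},
    {"product_key": 2, "date_key": 20240101, "amount": 1500.0},
]

def sales_by(dimension_lookup):
    """Aggregate the fact table along one dimension attribute."""
    totals = {}
    for row in fact_sales:
        label = dimension_lookup(row)
        totals[label] = totals.get(label, 0.0) + row["amount"]
    return totals

by_category = sales_by(lambda r: dim_product[r["product_key"]]["category"])
by_quarter  = sales_by(lambda r: dim_date[r["date_key"]]["quarter"])
print(by_category)   # {'Computers': 2100.0, 'Mobiles': 1500.0}
```

The point of the layout is that every analytical query is one hop from fact to dimension, unlike the multi-join traversals of a normalized ER model.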
Data Quality
Data is often duplicated, inconsistent, ambiguous, and incomplete. All of this data needs to be
collected in one place and cleaned up. The obvious reason is that bad data leads to bad
decisions, and bad decisions lead to bad business.
Data integrity: It is the degree to which the attributes of data, associated with a certain entity
accurately describes the occurrence of that entity.
Data quality
1. Correctness / accuracy:
• It is the degree to which the captured data correctly reflects/describes the real-world
entity/object/event.
• Examples:
The address of a customer maintained in the customer database is the real address
The temperature recorded by a thermometer is the real temperature
The age of a patient in a hospital database is his real age
2. Consistency:
• It is about a single version of truth
• The data throughout the enterprise must be in sync
• Ex of consistent data: an employee has left the company, so his company email id is
made inactive
• Ex of inconsistent data: a customer has cancelled and surrendered his credit card, but
his billing status still reads “due”.
3. Completeness :
• It is the extent to which the expected attributes of data are provided
• Example on data completeness:
A customer provides his address details at a restaurant, but those details may be incomplete.
4. Timeliness :
• It is important to provide the right data at the right time to the right people in business
• Delayed supply of data becomes inconsequential and useless
• Ex of timely data: daily sales figures are available to store managers the next morning
• Ex of non-timely data: the population census results are published two years after the
census survey is completed
5. Metadata :
• It is data about the data.
• It helps in determining data usage
1. Data governance
• It includes implementing the policies that govern the use of data in an organization.
• It also ensures compliance with standards
• It is used to secure data from hackers and to prevent data from leaking out
2. Metadata management
• It is a collection of definitions and relationships that describe the information stored.
• It was earlier known as “data dictionary “
3. Data architecture and Design
• Includes overall architecture like data storage, ETL process design, BI architecture
etc.
4. Database management
• Includes optimizing performance of database, backup & recovery, integrity
management etc.
5. Data security
• Includes controlling access to data and protecting it from unauthorized use
Data Profiling
• Data profiling is the process of statistically examining and analyzing the content in a
data source, and hence collecting information about that data.
• It checks the data for accuracy and completeness and hence assesses data quality
• Helps understand structure, content, relationships about the data and helps discover
anomalies.
• Helps understand issues/challenges in a database project
• Helps assess risks associated with data integration
• Also helps assess and validate metadata
• Data quality: analyze the quality of data at the data source. Ex: a column containing
[Link] must be numeric; hence remove any characters in the field.
• NULL values: look for the number of null values in an attribute
• Candidate keys: to select a candidate key, analysis of the extent to which certain
columns are distinct is done.
• Primary key selection: check if the candidate key does not violate NOT NULL and
UNIQUE constraint.
• Empty string values: check a column for null values or empty strings, since they create
problems while cube creation.
• String length: analyzing the largest, average and shortest string length helps decide
what data type is appropriate for that column
• Numeric length and type: assessing the max and min possible values for a numeric
column helps decide what datatype is suitable for that column.
• Identification of cardinality: The cardinality relationships are important for inner and
outer joins with respect to several BI tools. They are also important for the design of
fact-dimension relationships.
• Data format: Changing the data formats to make them more user friendly. Ex: marital
status from “M” & “S” to “married” and “single”
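A few of the checks above (null/empty counts, distinctness for candidate keys, string lengths) can be sketched in plain Python; the customer table below is invented for illustration.

```python
# A tiny hypothetical customer table with typical quality problems:
# a missing phone number (None) and an empty-string phone.
rows = [
    {"cust_id": "C1", "name": "Asha", "phone": "9876543210"},
    {"cust_id": "C2", "name": "Ravi", "phone": None},
    {"cust_id": "C3", "name": "Asha", "phone": ""},
]

def profile(rows):
    """Per-column profile: nulls/empties, distinct count, key candidacy,
    and the longest string length (to suggest a column width)."""
    report = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v not in (None, "")]
        strings = [v for v in present if isinstance(v, str)]
        report[col] = {
            "nulls_or_empty": len(values) - len(present),
            "distinct": len(set(present)),
            "candidate_key": len(set(present)) == len(values),
            "max_len": max((len(s) for s in strings), default=0),
        }
    return report

rep = profile(rows)
print(rep["cust_id"]["candidate_key"])   # True: fully distinct, no nulls
print(rep["phone"]["nulls_or_empty"])    # 2 (one NULL, one empty string)
```

Dedicated profiling tools, such as those listed below, run the same kind of statistics at scale.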
➢ Automatically runs continual scans periodically to check whether all the data is
consistently updated.
➢ Removes duplicate records.
➢ Provides for the separation of data into categories to allow easier data management.
➢ Generates statistical reports about the data systems regularly.
• Datiris Profiler: This tool is very flexible and can manage your data without inputs from
the user. Its features include:
➢ A powerful metric system.
➢ Very good compatibility with other applications.
➢ Domain validation.
➢ Command-line interface.
➢ Pattern analysis.
➢ Real-time data viewing.
➢ Profile templates and spreadsheets.
• Talend Data Profiler: It is a free, open-source software solution to data profiling, which
is now slowly becoming popular. It is good enough for small businesses and non-profit
organizations.
• IBM Infosphere Information Analyzer: A powerful profiling tool developed by IBM, it
does a deep scan of the system in a very short time period. Its features include:
➢ IBM Infosphere security framework.
➢ Scanning scheduler.
➢ Reports.
➢ Source system profiling and analysis.
➢ Rules analysis.
• SSIS Data Profiling Task: This data profiling tool is not an independent software tool;
it is integrated into the ETL software SQL Server Integration Services (SSIS)
provided by Microsoft. It can provide useful statistics about the source data as well as the
transformed data being loaded into the destination system.
• Oracle Warehouse Builder: Oracle Warehouse Builder is not strictly a data-profiling
tool. It has the necessary functionality to let a person with zero programming knowledge
build a data warehouse from scratch. One of its features is data profiling functionality,
which helps analyze source systems and hence provide clean data.