
Ateneo de Manila University

CSCI113i / CSCI213 / CS129.15: Business Intelligence

Chapter 2: Data Management


Based on the Balanced Scorecard system, there is ideally top-down organisational alignment
on strategy translated from the vision and mission statements. Once this strategy has been
laid out, management then designates certain objectives and metrics to monitor. Through
this monitoring, management can quickly see whether the decisions it makes bring about the
desired changes. At the same time, customer-facing employees can see how their day-to-day
actions affect the organisation’s pursuit of its objectives. For the organisation to see these metrics, and to see whether it is achieving its objectives, it requires data. This data must be gathered from daily business processes and then stored in a way that allows it to be easily accessed later on.

To allow for this quick feedback system, data must be structured in a way that
prioritises simplicity, understandability, and performance. While sophisticated technologies
and methodologies are impressive, if they are too complicated for the business user, then
the system will remain unused. This means data will fail to drive action and business impact.

In this chapter, we first go through a revision of relational databases and related concepts such as normalisation and entity-relationship diagrams (ERDs). After these
foundational topics in databases, we tackle data warehousing using the dimensional
modelling method (sometimes called the Kimball method) [1]. We then end with a case
study of dimensional modelling in a retail business.

I. Database Foundations
In the past, data was usually stored in multiple files (imagine numerous Excel or CSV files).
This resulted in multiple issues that made it difficult to manage data [2].

1. Concurrency – it was difficult to manage the files if multiple users had to access the
same data, especially if they also had to modify it.
2. Organisation – it became difficult to manage systems with large numbers of files.
3. File Formatting – occasionally, more than one application would have to access the
files holding the data. For example, one application may be used to analyse it while
another is used to create visualisations. This meant all the applications had to agree
on a common format so that the data could be read properly.

Due to these difficulties, logically related collections of data called databases were created.
To create, maintain, and provide access to these databases, database management systems
(DBMS) were created alongside them. While there are other kinds of DBMS (graph databases, document-based databases, etc.), we restrict our discussion here to relational database management systems (RDBMS). Examples of RDBMS are MySQL, PostgreSQL, and Microsoft SQL Server.

In a relational database, each database is composed of a number of tables. Below is an example of what a table might look like.


customer_id  name            age  member?
1            Michael John    25   true
2            Alexandra May   27   false
3            Philip Charles  32   true
4            Andrew George   45   false
5            Louis Anne      18   true
Table 1: Sample database table containing customers

These tables may then be queried for data using structured query language (SQL). While SQL is useful to learn, we limit our discussion here to the structure of these databases, which makes sense even without knowing SQL.

These databases are described as relational because tables are related to each other
using keys. To illustrate this, observe the two tables below.

Table Name: instructors

instructor_id (PK)  name            age  position
1                   Louis Charles   32   Lecturer
2                   Mary Elizabeth  42   Instructor
3                   Andrew Edward   56   Professor
4                   Beatrice Anne   45   Senior Lecturer
5                   William John    28   Lecturer
Table 2: Sample table with instructor data

Table Name: classes

class_id (PK)  class_name                units  instructor_id (FK)
1              Business Intelligence     3      5
2              Data Analytics            6      2
3              Network Science           3      1
4              Calculus II               3      4
5              Computational Statistics  6      1
Table 3: Sample table with class data in the same DB as the instructors table above

The first column of each table has been underlined and labelled with (PK) to denote it as the primary key, a unique identifier for each row in the table. Note that there must be a one-to-one correspondence between primary keys and rows in the same table. For example, in the instructors table, only Louis Charles may have the instructor_id of 1.

In the table containing different classes, the last column instructor_id is labelled (FK), designating it as a foreign key. This column consists of primary keys from another table (here, the instructors table). Since this is a foreign key, it may appear in the table more than once. Its use indicates a particular relationship between rows of the two tables, depending on the context. For example, classes here are taught by instructors. Instructor 1, “Louis Charles”, teaches both Network Science and Computational Statistics since these rows have 1 as their instructor_id, the primary key of “Louis Charles” in the instructors table.
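
For readers who know SQL, the two tables above could be declared roughly as follows. This is only a sketch: the column types and sizes are assumptions, since the text specifies only the table structure.

    CREATE TABLE instructors (
        instructor_id INT PRIMARY KEY,   -- unique identifier for each instructor
        name          VARCHAR(100),
        age           INT,
        position      VARCHAR(50)
    );

    CREATE TABLE classes (
        class_id      INT PRIMARY KEY,   -- unique identifier for each class
        class_name    VARCHAR(100),
        units         INT,
        instructor_id INT,               -- foreign key pointing to instructors
        FOREIGN KEY (instructor_id) REFERENCES instructors(instructor_id)
    );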


Relationships between tables can be visualised by matching rows in one table to rows
in another table using the primary and foreign keys. This matching is called a join in SQL.
Below is what the table would look like if the instructors and classes tables were joined on the
instructor_id key.

class_id  class_name                units  instructor_id  name            age  position
1         Business Intelligence    3      5              William John    28   Lecturer
2         Data Analytics           6      2              Mary Elizabeth  42   Instructor
3         Network Science          3      1              Louis Charles   32   Lecturer
4         Calculus II              3      4              Beatrice Anne   45   Senior Lecturer
5         Computational Statistics 6      1              Louis Charles   32   Lecturer
Table 4: Joined version of instructors and classes tables
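
In SQL, this joined view could be produced with a query like the following sketch, assuming the table and column names shown above.

    SELECT c.class_id, c.class_name, c.units,
           i.instructor_id, i.name, i.age, i.position
    FROM classes AS c
    JOIN instructors AS i
      ON c.instructor_id = i.instructor_id;   -- match each class to its instructor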

These database joins can be computationally expensive, especially when multiple tables must be joined together. Depending on the DBMS used, there are multiple ways to optimise common joins that are expected to happen frequently. To keep things simple, we leave these optimisation methods as further reading.

One might then ask why we need joins at all, instead of simply storing the table in its already joined form. There are two main reasons for this.

1. Data Redundancy – note that in the joined table above, the name “Louis Charles,”
and the other fields pertaining to the instructor have been repeated. This duplicated
data can take up quite a lot of storage space in large databases. If the tables were
separated, it is only the instructor_id foreign key that is repeated instead of the entire
row of data.
2. Data Anomalies – because of the data redundancy involved, anomalies might be
introduced into the system. For example, if “Louis Charles” were to age one year, two
different rows in the joined table above would have to be updated, changing the age
value from 32 to 33. Since the database must be modified at two separate points, a
failed operation due to an intermittent connection to the database might modify one
row, but not the other. This would mean that one row containing data on “Louis
Charles” would have him at 32 years old, and the other at 33. By avoiding this data
redundancy through separate tables, only one row in the instructors table has to be
updated, preventing the possibility of data anomalies.

To avoid these two issues, databases often undergo normalisation, a process that changes
the structure of the database to some standard form that avoids data redundancy [2]. This
process essentially disentangles the different entities or objects represented by the tables.
However, before tackling how this process is done, we first introduce a simpler way of
describing the structure of a database.


An entity-relationship diagram (ERD) is a visual representation of the structure (also called a schema) of the database. Each table is represented as an entity (drawn as a
rectangle), and relationships between tables are drawn as various connectors depending on
their nature. Below is a simple ERD representing the classes and instructors tables shown
earlier.

Figure 1: Instructors-classes ERD

In this ERD, there are two entities, classes and instructors, each corresponding to a database
table. Inside each entity are the fields corresponding to each of the columns in the original
table. The primary and foreign keys here are then labelled with PK and FK (with the bold
emphasis being optional). Between the two entities is an exactly-one to zero-or-many
connector. This means that each class must have exactly one instructor, but each instructor may have zero or more classes. The nature of the relationship is indicated by the connector heads used. Below are some other examples of connector heads.

Connector Head  Description
[symbol]        One-or-zero (0 or 1)
[symbol]        Exactly-one (exactly 1)
[symbol]        Zero-or-many (≥ 0)
[symbol]        One-or-many (≥ 1)
Table 5: Connector heads and their descriptions (connector symbols not reproduced here)

Now that we know how these diagrams are read, we can proceed to a discussion of the
normalisation process. Suppose we have a database with only the single table shown below.

student_id  student_name    major  course_code  course_title  instructor_name  room  grade
187         Andrew Michael  Math   CS 122       Databases     Sofia Anne       F115  B
                                   CS 110       Algorithms    Felipe Luis      F114  A
323         Henry George    DS     MA 111       Calculus      Gustaf John      B204  C+
                                   MA 127       Statistics    Carl William     B102  B+
                                   MA 151       Geometry      Sarah May        K304  B
Table 6: Sample database table to be normalised


In general, the normalisation process fits the database into standard structures called normal forms. Usually we normalise into the first, then the second, and finally the third normal form. There are other forms past this, but they are outside the scope of this course.

First Normal Form (1NF)

In the first normal form, we ensure that rows in the table cannot be divided further. This means splitting rows that contain multiple values, repeating the common fields [2]. Placing the above table into 1NF results in the following.
student_id  student_name    major  course_code  course_title  instructor_name  room  grade
187         Andrew Michael  Math   CS 122       Databases     Sofia Anne       F115  B
187         Andrew Michael  Math   CS 110       Algorithms    Felipe Luis      F114  A
323         Henry George    DS     MA 111       Calculus      Gustaf John      B204  C+
323         Henry George    DS     MA 127       Statistics    Carl William     B102  B+
323         Henry George    DS     MA 151       Geometry      Sarah May        K304  B
Table 7: First Normal Form (1NF) of sample table

Here we have split rows with multiple values into separate rows, duplicating the common
fields in the process. Since the student Henry George takes three different courses, for
example, we split the row into three, with one course in each.

Second Normal Form (2NF)

Once a table is in 1NF, we can then move it into 2NF. This means removing partial functional dependencies so that every non-key attribute depends on the whole key and not just part of it [2]. This is best illustrated through an example. We assume three things here.

1. A student cannot have multiple majors.
2. A student cannot repeat a course.
3. Each course has only one instructor.

Observe Table 7 in 1NF. Recall that a primary key has to be a unique identifier for
each row of the table. No single field in the table appears to be a good primary key (since
student_id has been duplicated, and if more than one student takes a certain course,
course_code would also be repeated).

We can make use of a compound key (also called a composite key), a primary key made up of a combination of other fields. In this case, we can use {student_id, course_code} as a candidate primary key. Note that if we want to find out the grade of a specific student in some class, we require both the student_id and the course_code. Since we require the entire candidate key, this is called a full functional dependency. However, a few of the other non-key fields do not require the entire candidate key. Below are the fields that can be retrieved using only one part of the key, either student_id or course_code.

student_id → student_name, major
course_code → course_title, instructor_name, room


These cases are called partial functional dependencies. This is usually a sign that the table
contains data on two different objects, in this case students and courses. In general, we can
convert this into the 2NF by splitting the table based on the different objects it seeks to
represent. We show this below in the form of an ERD.

Figure 2: ERD showing the database in 2NF

In this new form, we now have a database composed of three tables. Observe that the fields
in each of the entities can be retrieved using the entire primary key of the entity. Aside from
splitting the student and course information into separate tables, we also created a third one
called registrations. This table contains the fields that require the primary key of both the
students and courses tables, namely the grade field. We connect students and courses to
registrations using one-to-many connectors because a student can be registered into
multiple courses, and a course may have multiple students registered, but there is only one
student and one course in each registration.

Third Normal Form (3NF)

The 3NF is often the stopping point of the normalisation process (at least for our purposes). To get to this form, we take our 2NF database and remove transitive dependencies, cases where a non-key attribute determines another non-key attribute. Suppose that in our example, we have these two assumptions [2].

1. Each instructor only teaches in one classroom.
2. Instructor names are unique in the system and can be used as a primary key.

Given these assumptions, there is the following transitive dependency.

instructor_name → room

Since we assume that instructors only teach in one room, knowing the instructor of a course also tells us what room the course is taught in. Since instructor_name is a non-key attribute (not the primary key or part of some composite key), this creates a transitive dependency that we should remove by once again splitting the table.


Figure 3: ERD showing database in 3NF given the earlier mentioned assumptions

Here we add another entity called instructors with the instructor_name field as its primary key (under the assumption that names are unique identifiers; otherwise, we could add a new instructor_id field). Since the room field is part of a transitive dependency determined by the instructor (given that instructors only teach in one room), it is also moved to the instructors entity. By adding instructor_name as a foreign key in the courses table, we are still able to retrieve the room and instructor of each course using table joins. With this, we have now placed the database in 3NF.
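
For concreteness, the final 3NF structure could be declared with SQL along these lines. The column types are assumptions, and the registrations table uses the composite key discussed earlier.

    CREATE TABLE students (
        student_id   INT PRIMARY KEY,
        student_name VARCHAR(100),
        major        VARCHAR(50)
    );

    CREATE TABLE instructors (
        instructor_name VARCHAR(100) PRIMARY KEY,  -- assumed unique, per assumption 2
        room            VARCHAR(10)                -- moved here to remove the transitive dependency
    );

    CREATE TABLE courses (
        course_code     VARCHAR(10) PRIMARY KEY,
        course_title    VARCHAR(100),
        instructor_name VARCHAR(100),
        FOREIGN KEY (instructor_name) REFERENCES instructors(instructor_name)
    );

    CREATE TABLE registrations (
        student_id  INT,
        course_code VARCHAR(10),
        grade       VARCHAR(2),
        PRIMARY KEY (student_id, course_code),     -- composite key
        FOREIGN KEY (student_id) REFERENCES students(student_id),
        FOREIGN KEY (course_code) REFERENCES courses(course_code)
    );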

II. Data Warehousing


What we have tackled so far applies to most standard databases such as those associated
with operational source systems [1]. These systems serve as the backend of many integral
business processes such as transactional data from checkout counters or employee
databases in the human resource department. Databases that these systems use, called
operational databases, are structured to ensure the smooth operation of daily business
processes. For example, the database used by a point-of-sales (POS) system will likely allow
for fast updates so that new sales transactions can easily be entered into the system.

Since these operational databases are built around business processes, the data of
the organisation may be split across multiple databases often owned by different
departments. This can cause a problem sometimes called siloing, where data is concentrated within each department and not shared with the whole organisation, often leading to issues such as inconsistencies (sales having different numbers from operations, for example). Aside from this, when departments do not know what other departments know, a lot of data goes unused in decision making.

This kind of set-up, with multiple databases owned by various departments, makes
data analysis and visualisation quite challenging, especially since this can involve multiple
clients accessing the data across sources with various client applications. To get around this issue, one should ideally create a data warehouse, a specially structured database built to support the complex calculations necessary for data analysis and to ease data visualisation through business reporting and dashboarding applications. Key to creating these data warehouses is an integration step called Extract-Transform-Load (ETL).

1. Extract – data is pulled from various operational databases into a data staging area.
2. Transform – data in the staging area is transformed to adhere to the data warehouse
structure and format. At this stage, it is also cleaned to ensure correctness and
consistency. Any conflict between different source databases should be resolved
here.
3. Load – transformed data is then loaded into the data warehouse. After loading, the
data may then be accessed by the different client applications.

Figure 4: Diagram showing movement of data to and from the data warehouse
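
To make the Transform and Load steps concrete, below is a small SQL sketch that moves cleaned rows from a hypothetical staging table into a warehouse table. All table and column names here are invented for illustration.

    -- Transform and load: standardise values from staging, then insert
    INSERT INTO warehouse_sales (sale_date, store_code, product_code, amount)
    SELECT CAST(s.txn_date AS DATE),          -- unify date formats from source systems
           UPPER(TRIM(s.store_code)),         -- resolve formatting conflicts between sources
           s.product_code,
           s.amount
    FROM staging_sales AS s
    WHERE s.amount IS NOT NULL;               -- drop records that failed cleaning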

Data warehouses should contain clean, analysis-ready data in a specified format so that they can act as a single source of truth for the organisation. With this, the data warehouse has six main goals.

1. Make data easily accessible – much easier to get data from one source rather than
having it scattered across systems or departments.
2. Present information consistently – ETL should ideally have cleaned data such that
inconsistencies between data sources have been resolved.
3. Adapt and be resilient to change – the structure of the data warehouse allows it to adapt better to the different queries business users may throw at it.
4. Protect information assets – source systems usually do not keep a large amount of
historical data. The data warehouse, on the other hand, keeps a lot of this past data.
Thus, it is often secured quite well and has measures in place to prevent data loss
from things such as hardware malfunction.
5. Foundation for improved decision making – the data warehouse is structured in a
way that makes data analysis and visualisation much simpler to carry out.
6. Business community acceptance – the business users must accept and use the data warehouse. No matter how sophisticated the technology, if it does not fit the needs of the business users, then the data warehouse has failed in its goals.

III. Dimensional Modelling


One of the most well-known methods for structuring data warehouses is dimensional
modelling (sometimes called the Kimball method, named after the author of the book that
helped popularise it) [1]. At the core of this method is the desire to keep the data warehouse
simple and easy to understand while maintaining good query performance so data can be
retrieved easily. Data is also stored in its most granular level in such a way that it can easily
adapt to a wide range of queries business users can throw at it.

A dimensionally modelled data warehouse is made up of multiple data marts, each representing a business process. Note that marts are based on processes, such as invoicing or order processing, and not on departments. If a certain process involves multiple departments, then their data must be pooled together. Doing this prevents data from being stored in silos within departments and allows it to be shared across the whole organisation. These individual marts are then connected to each other to form the data warehouse (we discuss how this connection is made later). Each data mart then has two main parts: (1) fact tables, and (2) dimension tables.

Fact Tables

The fact table contains the measurements that result from the business process and takes up the most storage space in the data warehouse [1]. Ideally, data is stored in its most granular form to allow for the most query flexibility. Each row of the fact table is one measurement event, such as the scanning of a barcode or the entry of an item into the warehouse inventory. It is important to define the grain of this measurement event: what each row really corresponds to in the business process. For example, one could say that each row in the fact table represents a single product sold in some transaction.

At the core of the fact table are the facts, the business measures taken during the measurement event, such as the sales price of an item scanned at a grocery checkout counter. In general, there are three kinds of facts [1].

1. Additive Facts
Usually the most informative facts are those that are additive across the different dimensions. We talk more about dimensions later on, but an example of an additive fact is the sales price mentioned earlier. With this fact recorded, one can sum it to compute a sales figure over a certain time period, a certain brand, or even a store shelf (see the query sketch after this list).


2. Semi-additive Facts
Some facts are additive except in the time dimension. For example, bank account
balances can be added across category or product type (sum of balances for Junior
Savings Accounts vs. Regular Accounts), but they cannot be summed across the time
dimension. If Account A has Php 10,000 on day 1, and Php 15,000 on day 3, summing
them to Php 25,000 does not make sense since they are balances of the same account at
two different points in time.

3. Non-additive Facts
Some facts really cannot be added across any of the dimensions. For example, the
discount percentage that a certain product was sold at cannot be added. If Product A
was sold at 15% off, and Product B was sold at 25% off, adding them up to 40% off does
not make sense. Given the number of products a store usually sells, this percentage can
easily go above 100% if summed. Other kinds of ratios also fall into this category. While
they cannot be summed, knowing the mean value of these non-additive facts may be
useful.
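
The query sketch promised above illustrates the difference: an additive fact such as sales price can be summed across a dimension, while a non-additive fact such as a discount percentage should be averaged instead. The sales_fact and product_dim table and column names here are hypothetical.

    -- Additive: total sales price summed by brand (a dimension attribute)
    SELECT d.brand_name, SUM(f.sales_price) AS total_sales
    FROM sales_fact AS f
    JOIN product_dim AS d ON f.product_key = d.product_key
    GROUP BY d.brand_name;

    -- Non-additive: a discount percentage should be averaged, never summed
    SELECT d.brand_name, AVG(f.discount_pct) AS mean_discount
    FROM sales_fact AS f
    JOIN product_dim AS d ON f.product_key = d.product_key
    GROUP BY d.brand_name;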

Below is an example of an entity representing a fact table in a hypothetical retail database [1].

Figure 5: Sample retail fact table

Note that there are numerous foreign keys in the fact table. Each of these foreign keys indicates a relationship to some dimension table. The primary key in this case is a composite key made up of some subset of the fields that together uniquely identify each row of the fact table. For example, the combination of Transaction# and Product Key can serve as the primary key.
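
As a rough sketch of how such a fact table could be declared, the DDL below uses invented dimension and column names in the spirit of Figure 5; it is not the exact field list from the figure.

    CREATE TABLE retail_sales_fact (
        date_key       INT,            -- FK to the date dimension
        product_key    INT,            -- FK to the product dimension
        store_key      INT,            -- FK to the store dimension
        promotion_key  INT,            -- FK to the promotion dimension
        transaction_no INT,            -- transaction number from the receipt
        sales_quantity INT,
        sales_price    DECIMAL(10, 2),
        PRIMARY KEY (transaction_no, product_key),  -- composite key uniquely identifies a row
        FOREIGN KEY (date_key)      REFERENCES date_dim(date_key),
        FOREIGN KEY (product_key)   REFERENCES product_dim(product_key),
        FOREIGN KEY (store_key)     REFERENCES store_dim(store_key),
        FOREIGN KEY (promotion_key) REFERENCES promotion_dim(promotion_key)
    );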

Dimension Tables

Dimension tables contain descriptors of the facts in the fact table, which may serve as
different query constraints. They represent the various ways we can slice the data, each following the word “by” [1]. For example, we can see the sales data by brand, by product category, by
customer age, or by date (or even some combination of these).

Dimension tables take up much less storage space compared to the fact tables. They
tend to have many columns due to the large number of descriptors in the dimension, but
they usually have fewer rows. Below is an example of a product dimension containing the
different descriptors a hypothetical product may have [1].

Figure 6: Sample product dimension

There are quite a number of descriptors for products in the hypothetical store, each
descriptor being represented by a column in the dimension table. By including a lot of the
(often textual) descriptors, the data warehouse is ready for a wide range of possible queries
from business users. For example, one can slice the data by weight and by shelf height to see
if heavy products on higher shelves sell less. Different combinations of the dimension
table columns allow us to view the data in different ways, maximising its potential to drive
business action.

Star Schema

Putting the dimension and fact tables together to form the whole data mart results in what is sometimes called a star schema due to its shape. In the centre is the fact table, and the rays of the star are the dimension tables. Below is an example of a database with a star schema.


Figure 7: Sample star schema

In this case, dimension tables have a one-to-many relationship with the fact table (a row can
only have one date, but a date may be in multiple rows). With this star schema, there may be
numerous ways to slice the data with a combination of different descriptors from various
dimensions. For example, using this structure, we can generate reports of sales per brand in
Makati City in the month of December.
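
To make that report concrete, a query along the following lines could produce it. The table and column names here are hypothetical stand-ins for the entities in Figure 7.

    -- Sales per brand in Makati City for December
    SELECT p.brand_name, SUM(f.sales_price) AS total_sales
    FROM sales_fact AS f
    JOIN product_dim AS p ON f.product_key = p.product_key
    JOIN store_dim   AS s ON f.store_key   = s.store_key
    JOIN date_dim    AS d ON f.date_key    = d.date_key
    WHERE s.city = 'Makati City'
      AND d.month_name = 'December'
    GROUP BY p.brand_name;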

Note, however, that the star schema is not normalised. For example, one could create
a separate product or store category entity to prevent the repetition of this data across
multiple rows in the dimension table. Observe the ERD below where we have normalised a
few of the structures in the original star schema.


Figure 8: Example of a snowflake schema

This normalised star is sometimes called a snowflake schema since the rays of the star branch out, looking like the shape of a snowflake. A key thing to note here is that star schemas
often include hierarchical data. For example, a barangay is inside a city, which is inside a
province that belongs to a region. In an operational database, this would usually be
normalised as a long chain similar to the structure above. This makes it easy to move a city
from one province to another, for example, since one would only have to change one row
(the province key of a specific city). In the denormalised structure, one would have to change
the province for all the different barangays in the city that moved.

There are, however, two main reasons for keeping the denormalised structure. First, since dimension tables only make up a small part of the storage space required by the database, switching to a normalised version to prevent repetition of data only results in
negligible space savings. Second, the main goal of the model is to be easy to understand and fast to query. Normalising the schema would make it much harder to read by introducing multiple new tables, and queries would then require many more join operations, which can hurt performance, especially in large databases. While normalisation helps prevent data anomalies, these issues can instead be avoided through proper data management in the ETL phase.

Data Warehouse Bus

As mentioned earlier, a dimensionally modelled data warehouse will contain multiple data
marts, each corresponding to a business process. These are all connected to each other
through a data warehouse bus [1]. A bus is a common structure to which many things can connect and from which they draw power. The most popular example of this is the universal serial bus (USB), which connects printers, mice, keyboards, storage devices, and other devices to a computer.

In a data warehouse bus, fact tables in different data marts are connected to each
other through common conformed dimensions [1]. Even if each mart is developed by a
different team, by plugging into this structure of standardised dimensions, the data
warehouse as a whole becomes easier to manage and interpret. Below is an illustration
showing this bus.

Figure 9: Illustration of Data Warehouse Bus from [1]

Each of the conformed dimensions is visualised as a lane in the bus. The fact tables of the
data marts are then connected to specific lanes in the bus corresponding to the dimensions
that they use. By structuring things in this way, all the data marts can use a common set of
keys. Product 143 in the Store Inventory data mart will be the same product in the Store
Sales data mart. If a new product is introduced and loaded into the Product dimension table,
then it becomes available for use across all of the marts attached to its lane in the bus. Doing
this makes the data warehouse easier to maintain (data only has to be loaded once for the
whole warehouse instead of once per mart) and ensures a consistent view of the data across
the different business processes.

IV. Retail Example


In this last section, we work through one example of developing a data mart for a retail sales process. We first lay out the hypothetical context and then apply a Four-Step Dimensional Design Process [1]. Afterward, we present the design through an ERD and discuss some of its nuances.

Suppose you are creating a data warehouse for Loyola Grocer, a chain of grocery
stores with over 100 branches across the Philippines. Each store has multiple sections selling
a variety of product categories such as dairy, meat, produce, and wine. Each store stocks
more than 20,000 products, also called stock-keeping units (SKUs).

Loyola Grocer
1234 Katipunan Ave.
Quezon City, Metro Manila
(02)000-0000

Store: 0132
Cashier: 000213148/Juan dela Cruz

1x 0013984058 Cup Noodle Pro, 400g 99.99

1x 2148548359 Shiny Teeth Toothpaste 199.99


Saved Php 50 from 249.99

1x 0238040400 The Fat Cow Milk, 1L 85.99

1x 3048580034 Eskimo Ice Cream, 400ml 299.99

TOTAL 685.96

AMOUNT TENDERED
CASH 700.00
CHANGE 14.04

ITEM COUNT 4

--------------------------------------------------------------------------------
Transaction: 723 05/05/2020 11:07AM
--------------------------------------------------------------------------------

Thank you for shopping at Loyola Grocer!


Have a great day!

003483925849350340234895060

Figure 10: Sample receipt from Loyola Grocer based on [1]


While multiple business processes are involved in the grocery store and these all
generate some sort of data, we focus specifically on the point-of-sales system where the
cashier scans different products to be purchased. Figure 10 above is an example of a receipt
from one of the Loyola Grocer stores. From this we can immediately glean some information
about the data that we have to model. An important thing to note is that promotions can
modify the price of a specific product. This means that one product may be sold at different
prices at various points in time.

After a cursory glance at the receipt, we can then move on to laying out the
dimensional model. We follow here a Four-Step Dimensional Design Process as a guide [1].

1. Select the Business Process

As mentioned earlier, data marts are based on business processes, thus the first step is
choosing a process to model. This usually involves an understanding of what the business
needs and what data is available. In this case, we have already chosen the retail sales
transaction process of the grocery store.

2. Declare the Grain

Once the business process to model has been identified, we must then define the grain: what each fact table row is supposed to represent. This will dictate the level of detail available to users of the data warehouse, and it should be as granular as possible.

In this case, we take the grain to be each individual product per point-of-sales
transaction since this is the most atomic grain we can get. This allows for maximum flexibility
when performing queries. For example, since we also know where stores are located, we can
view sales of milk across different cities. If we took the grain to be per transaction or receipt,
we would lose the ability to query about specific products, which could have yielded a lot of
insight for the grocery business. The best we could do would have been to count the sales of
all products per city, but not specifically milk.

3. Identify the Dimensions

Once a grain has been defined, we can move on to finding the different descriptors one
might use to view the data by. In this case, we can see that we would need a Product, Date,
Store, Cashier, and Payment Method dimension. Depending on how data is collected by the
operational system, it would also be wise to include a Promotion dimension to account for
cases when products are purchased under special conditions. Lastly, we can create a
Transaction dimension to easily pull products that were purchased on one receipt. We will
discuss later how this is a special type of dimension called a degenerate dimension. At this
stage, we need not fill out all the descriptors in each of these dimensions yet.

4. Identify the Facts

Finally, we then determine the facts that would be in the fact table. Since we have already
defined the grain, what each measurement event actually measures should fall out naturally.
In this case, what are we measuring about the individual products in each transaction? We
know the sales quantity, regular unit price (price before discount), discount unit price
(discount amount in pesos), and net unit price (price after discount).

Some other facts can easily be computed from this set of facts. For example, total
regular sales price would just be the regular unit price times the sales quantity. A similar
operation can be done to compute the total discount price and total net sales price. While
these can easily be computed by a querying application, there may be circumstances when it
is best to precompute them and store them in the fact table. While this will take up more
storage space, it may be worth it to prevent possible computational errors by users of the
data warehouse.
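
As an illustration, the ETL step could derive these extended amounts when loading the fact table. This is a sketch with assumed staging and fact table names, assuming the fact table has been extended with columns for the precomputed totals.

    -- Precompute extended amounts so users never have to multiply them themselves
    INSERT INTO retail_sales_fact
        (transaction_no, product_key, sales_quantity,
         regular_unit_price, net_unit_price,
         total_regular_price, total_net_price)
    SELECT s.transaction_no, s.product_key, s.quantity,
           s.regular_unit_price,
           s.net_unit_price,
           s.quantity * s.regular_unit_price AS total_regular_price,
           s.quantity * s.net_unit_price     AS total_net_price
    FROM staging_pos_lines AS s;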

Putting Things Together

After finishing the four-step process, we can visualise the structure of our data mart (without the complete descriptors for now).

Figure 11: Data mart for retail sales process at Loyola Grocer

Depending on the data captured by the operational source system, these dimensions may be
filled with different textual descriptors. Figure 7 shows some ways to fill in the Store and
Product Dimensions. Due to time constraints, and to keep things simple, we focus only on
three of these dimensions: Date, Promotion, and Transaction (a unique dimension we discuss
more later on).

Date Dimension

The date dimension is special in that it is almost guaranteed to be in every dimensional model [1]. Almost all business processes will involve some sort of time dimension (when the process is started or completed). One of the unique things about this dimension is that it can be built out in advance since we already have calendars ready for the future. Assuming we
are storing dates per day, 20 years’ worth of days is only around 7,300 rows in the database.
Below is an example of what the date dimension table could look like.

Figure 12: Sample date dimension for Loyola Grocer

The sample table Date Dimension in Figure 12 includes a lot of the basic textual descriptors
to view the data from certain perspectives. For example, using the Day of the Week field, we
can easily compare product sales across different days of the week. Included here also are a
few flag fields, indicators of a certain status. Given the Holiday Flag field, we would be able to
see if sales of specific products are better on holidays than on non-holidays. Which fields to put in the Date Dimension will depend heavily on the business, though. If many of the processes operate on a different calendar (such as a fiscal calendar), one could include fields for that calendar to easily filter data based on it.
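
Since the calendar is known in advance, the date dimension can be generated ahead of time. Below is one possible sketch using a recursive common table expression; the exact syntax varies by RDBMS (this form follows PostgreSQL), and the column names are assumptions.

    -- Generate roughly 20 years of dates and derive descriptors from each one
    WITH RECURSIVE calendar AS (
        SELECT DATE '2020-01-01' AS d
        UNION ALL
        SELECT d + 1 FROM calendar WHERE d < DATE '2039-12-31'
    )
    INSERT INTO date_dim (date_key, full_date, day_of_week, month_name, year)
    SELECT CAST(TO_CHAR(d, 'YYYYMMDD') AS INT),  -- surrogate key in YYYYMMDD form
           d,
           TO_CHAR(d, 'Day'),
           TO_CHAR(d, 'Month'),
           EXTRACT(YEAR FROM d)
    FROM calendar;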

While time-of-day and date usually go together as a single timestamp, creating a dimension row per minute instead of per day quickly explodes the table to millions of rows. This can seriously bog down the query performance of an important dimension used in multiple data marts. Because of this, time-of-day may instead be included as a datetime fact in the fact table. This also allows us to compute (with some work) how long a specific transaction takes.

Promotion Dimension

The Promotion Dimension can be very interesting to business users because it is a causal
dimension, a dimension whose descriptors are thought to affect the facts being measured
(in this case, retail sales) [1]. For the most part, analysts are interested in finding out how
successful promotions are. Below are just some of the factors that can be used to measure
how successful sales were during a promotion.


1. Lift – gains in sales over the promotion period. Only measurable if a baseline sales
value is set (to be able to compute the gain).
2. Cannibalisation – when some products are promoted, such as through price reductions, people purchase them instead of other products they would have otherwise bought. For example, if detergent brand A is on sale, one might buy it instead of brand B, increasing the sales of A but decreasing those of B. For a retailer like Loyola Grocer, this may not always be a good thing since it could lead to large unsold stock of detergent B (which incurs storage costs).
3. Profitability – this measures the gain in profit of a promotion over a certain set
baseline, taking into account the other factors such as cannibalisation and cost of the
promotions.

Promotions are rarely constrained to a single type. For example, if there is a price reduction
for a certain product, there is usually an accompanying ad, display, or coupon. Depending on
how the business thinks about things, each of these could be a separate dimension, but for
simplicity, we assume here that promotions come in these packages. Below is an example of
what the Promotion Dimension might look like in this case.

Figure 13: Sample promotion dimension for Loyola Grocer

One thing to note, though, is that the vast majority of products will likely be sold with no promotion. It is important to keep the fact table free of null foreign keys as these can be confusing and lead to issues when making queries. To get around this, a special type of promotion called “No Promotion” with a Promotion Key of 0 or -1 may be created, as sketched below.
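
A minimal sketch of this convention, with assumed column names:

    -- Special row so fact rows without a promotion never carry a NULL foreign key
    INSERT INTO promotion_dim (promotion_key, promotion_name, promotion_type)
    VALUES (0, 'No Promotion', 'None');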

Transaction Dimension

Notice how each row of the fact table includes a Transaction# key, but there is no Transaction Dimension table. Nonetheless it is a useful dimension since it allows us to pull all
the products purchased within the same transaction. This can be used to measure things
such as the total transaction time, for example.


Though it looks like a dimension key, since there are no descriptors associated with it,
it does not make much sense to create a separate table. Because of this, it is called a
degenerate dimension, a dimension with no corresponding dimension table. These
dimensions often play a part in creating the composite key that serves as the primary key of
the fact table. In the case of Loyola Grocer, a combination of the Transaction# and Product
Key is sufficient to create a unique combination that can serve as the primary key.

V. References
[1] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Third Edition. John Wiley & Sons, Inc.

[2] Logical Design. CS122. DISCS 2014.
