Chapter 2 Data Management
To allow for this quick feedback system, data must be structured in a way that
prioritises simplicity, understandability, and performance. While sophisticated technologies
and methodologies are impressive, if they are too complicated for the business user, then
the system will remain unused. This means data will fail to drive action and business impact.
I. Database Foundations
In the past, data was usually stored in multiple files (imagine numerous Excel or CSV files).
This resulted in multiple issues that made it difficult to manage data [2].
1. Concurrency – it was difficult to manage the files if multiple users had to access the
same data, especially if they also had to modify it.
2. Organisation – it became difficult to manage systems with large numbers of files.
3. File Formatting – occasionally, more than one application would have to access the
files holding the data. For example, one application may be used to analyse it while
another is used to create visualisations. This meant all the applications had to agree
on a common format so that the data may be read properly.
Due to these difficulties, logically related collections of data called databases were created.
To create, maintain, and provide access to these databases, database management systems
(DBMS) were created alongside them. While there are other kinds of DBMS (GraphDBs,
Document-based DBs, etc.) we restrict our discussion here only to relational database
management systems (RDBMS). Examples of these RDBMS are MySQL, PostgreSQL, and
Microsoft SQL Server.
Relational databases organise data into tables made up of rows and columns. These tables may then be queried using structured query language (SQL). While SQL is useful to learn, we limit our discussion here to the structure of these databases, which can be understood even without knowing SQL.
These databases are described as relational because tables are related to each other
using keys. To illustrate this, observe the two tables below.
The first column of each table has been underlined and labelled with (PK) to denote it as the
primary key, a unique identifier for each row in the table. Note that there must be a one-to-
one correspondence between primary keys and rows in the same table. For example, in the
instructor table, only Louis Charles may have the instructor_id of 1.
In the table containing different classes, the last column instructor_id is labelled (FK)
designating it as a foreign key. This column consists of primary keys from another table
(here from the instructors table). Since this is a foreign key, it may appear in the table more
than once. A foreign key indicates a relationship between rows of the two tables; what that relationship means depends on the context. For example, classes here are taught by instructors. Instructor 1,
“Louis Charles”, teaches both Network Science and Computational Statistics since these rows
have 1 as their instructor_id, the primary key of “Louis Charles” in the instructors table.
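While we keep SQL itself out of scope, a brief sketch may help readers picture these structures. The declarations below are a hypothetical rendering of the two example tables; the column names and types are assumptions, not a prescribed design.

```sql
-- A sketch of the two example tables (names and types are assumed).
CREATE TABLE instructors (
    instructor_id   INT PRIMARY KEY,  -- (PK) unique per instructor
    instructor_name VARCHAR(100),
    age             INT
);

CREATE TABLE classes (
    class_id      INT PRIMARY KEY,    -- (PK) unique per class
    class_name    VARCHAR(100),
    instructor_id INT REFERENCES instructors (instructor_id)  -- (FK)
);
```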
Relationships between tables can be visualised by matching rows in one table to rows
in another table using the primary and foreign keys. This matching is called a join in SQL.
Below is what the table would look like if the instructors and classes tables were joined on the
instructor_id key.
These database joins can be computationally expensive, especially when multiple tables must
be joined together. Depending on the DBMS used, there are multiple ways to optimise
common joins that are expected to happen frequently. To keep things here simple, we leave
these methods to the reader for further reading.
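For the curious, the join described above might be written along these lines (a sketch, reusing the assumed table and column names from earlier):

```sql
-- Match each class to its instructor using the instructor_id key,
-- producing one row per class with the instructor's details attached.
SELECT c.class_name, i.instructor_name, i.age
FROM classes AS c
JOIN instructors AS i
    ON c.instructor_id = i.instructor_id;
```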
One might then ask why joins are needed at all, instead of simply storing the table in its already joined form. There are two main reasons for this.
1. Data Redundancy – note that in the joined table above, the name “Louis Charles,”
and the other fields pertaining to the instructor have been repeated. This duplicated
data can take up quite a lot of storage space in large databases. If the tables were
separated, it is only the instructor_id foreign key that is repeated instead of the entire
row of data.
2. Data Anomalies – because of the data redundancy involved, anomalies might be
introduced into the system. For example, if “Louis Charles” were to age one year, two
different rows in the joined table above would have to be updated, changing the age
value from 32 to 33. Since the database must be modified in two separate points, a
failed operation due to an intermittent connection to the database might modify one
row, but not the other. This would mean that one row containing data on “Louis
Charles” would have him at 32 years old, and the other at 33. By avoiding this data
redundancy through separate tables, only one row in the instructors table has to be
updated, preventing the possibility of data anomalies.
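To make this concrete, with separate tables the birthday update is a single statement touching a single row, while the denormalised table needs every duplicated row changed (a sketch; the joined table name is hypothetical):

```sql
-- Separate tables: one row to update, so no anomaly is possible.
UPDATE instructors SET age = 33 WHERE instructor_id = 1;

-- Denormalised table: every duplicated row must be updated, and a
-- failure midway leaves the two copies inconsistent.
UPDATE classes_with_instructors
SET age = 33
WHERE instructor_name = 'Louis Charles';
```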
To avoid these two issues, databases often undergo normalisation, a process that changes
the structure of the database to some standard form that avoids data redundancy [2]. This
process essentially disentangles the different entities or objects represented by the tables.
However before tackling how this process is done, we first introduce a simpler way of
describing the structure of a database.
An entity-relationship diagram (ERD) is a visual that draws out the structure (also
called a schema) of the database. Each table is represented as an entity (drawn as a
rectangle), and relationships between tables are drawn as various connectors depending on
their nature. Below is a simple ERD representing the classes and instructors tables shown
earlier.
In this ERD, there are two entities, classes and instructors, each corresponding to a database
table. Inside each entity are the fields corresponding to each of the columns in the original
table. The primary and foreign keys here are then labelled with PK and FK (with the bold
emphasis being optional). Between the two entities is an exactly-one to zero-or-many
connector. This means that each class must have exactly one instructor, but each instructor
may have 0 or more classes. The type of connector depends on the type of connector head
used. Below are some other examples of connector heads.
Exactly-one   – exactly 1
Zero-or-many  – 0 or more (≥ 0)
One-or-many   – 1 or more (≥ 1)
Table 5: Connector heads and their descriptions
Now that we know how these diagrams are read, we can proceed to a discussion of the
normalisation process. Suppose we have a database with only the single table shown below.
In general, the normalisation process fits the database into standard structures called normal
forms. Usually we normalise into first, then second, and finally the third normal form. There
are other forms past this, but they are outside the scope of this course.
In the first normal form (1NF), we ensure that each field in the table holds a single, indivisible value. This means splitting rows that contain multiple values and repeating the common fields [2]. Placing the table above into 1NF results in the following.
student_id  student_name    major  course_code  course_title  instructor_name  room  grade
187         Andrew Michael  Math   CS 122       Databases     Sofia Anne       F115  B
187         Andrew Michael  Math   CS 110       Algorithms    Felipe Luis      F114  A
323         Henry George    DS     MA 111       Calculus      Gustaf John      B204  C+
323         Henry George    DS     MA 127       Statistics    Carl William     B102  B+
323         Henry George    DS     MA 151       Geometry      Sarah May        K304  B
Table 7: First Normal Form (1NF) of sample table
Here we have split rows with multiple values into separate rows, duplicating the common
fields in the process. Since the student Henry George takes three different courses, for
example, we split the row into three, with one course in each.
Once a table is in the 1NF, we can then move it into the 2NF. This means removing partial functional dependencies so that every non-key attribute can only be retrieved with the whole key and not just part of it [2]. This is best illustrated through an example. We assume three things here: that a student's ID alone determines their name and major; that a course's code alone determines its title, instructor, and room; and that a grade is determined only by the combination of student and course.
Observe Table 7 in 1NF. Recall that a primary key has to be a unique identifier for
each row of the table. No single field in the table appears to be a good primary key (since
student_id has been duplicated, and if more than one student takes a certain course,
course_code would also be repeated).
We can propose making use of a composite key (also called a compound key), a primary key made up of a combination of other fields. In this case, we can use {student_id, course_code} as a candidate primary key. Note that if we want to find out the grade of a specific student in some class, we require both the student_id and course_code. Since we require the entire candidate key, this is called a full functional dependency. However, a few of the other non-key fields do not require the entire candidate key. Below are the fields that can be retrieved using only one part of the key, either student_id or course_code.

student_id → student_name, major
course_code → course_title, instructor_name, room
These cases are called partial functional dependencies. This is usually a sign that the table
contains data on two different objects, in this case students and courses. In general, we can
convert this into the 2NF by splitting the table based on the different objects it seeks to
represent. We show this below in the form of an ERD.
In this new form, we now have a database composed of three tables. Observe that the fields
in each of the entities can be retrieved using the entire primary key of the entity. Aside from
splitting the student and course information into separate tables, we also created a third one
called registrations. This table contains the fields that require the primary key of both the
students and courses tables, namely the grade field. We connect students and courses to
registrations using one-to-many connectors because a student can be registered into
multiple courses, and a course may have multiple students registered, but there is only one
student and one course in each registration.
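Declared in SQL, this 2NF split might look as follows (a sketch; the types are assumptions):

```sql
CREATE TABLE students (
    student_id   INT PRIMARY KEY,
    student_name VARCHAR(100),
    major        VARCHAR(50)
);

CREATE TABLE courses (
    course_code     VARCHAR(10) PRIMARY KEY,
    course_title    VARCHAR(100),
    instructor_name VARCHAR(100),
    room            VARCHAR(10)
);

CREATE TABLE registrations (
    student_id  INT REFERENCES students (student_id),
    course_code VARCHAR(10) REFERENCES courses (course_code),
    grade       VARCHAR(2),
    PRIMARY KEY (student_id, course_code)  -- the composite key
);
```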
The 3NF is often the stopping point of the normalisation process (at least for our purposes).
To get to this form, we take our 2NF database and then remove transitive dependencies,
cases when a non-key attribute determines another attribute. Suppose that in our example,
we have these two assumptions [2].

course_code → instructor_name
instructor_name → room
Since we have the assumption that instructors only teach in one room, then knowing the
instructor of a course also allows us to find out what room the course is taught in. Since
instructor_name is a non-key attribute (not the primary key or part of some composite key),
this creates a transitive dependency that we should remove by once again splitting the table.
Figure 3: ERD showing database in 3NF given the earlier mentioned assumptions
Here we add another entity called instructors with the instructor_name field as its primary key
(under an assumption that names are unique identifiers, otherwise we can just add a new
instructor_id field). Since the room field is part of a transitive dependency determined by the
instructor (given that instructors only teach in one room) then it is also moved to the
instructors entity. By adding the instructor_name as a foreign key in the courses table, we are
still able to retrieve the room and instructor of each course using table joins. By doing this,
we have now placed the database in the 3NF.
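A sketch of the same change in SQL, continuing from the 2NF tables above:

```sql
-- New instructors entity; room moves here since instructor_name → room.
CREATE TABLE instructors (
    instructor_name VARCHAR(100) PRIMARY KEY,
    room            VARCHAR(10)
);

-- courses keeps instructor_name only as a foreign key.
ALTER TABLE courses DROP COLUMN room;
ALTER TABLE courses
    ADD FOREIGN KEY (instructor_name) REFERENCES instructors (instructor_name);
```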
II. Data Warehousing
Since these operational databases are built around business processes, the data of the organisation may be split across multiple databases, often owned by different departments. This can lead to a problem sometimes called siloing, where data is concentrated within each department and not shared with the whole organisation, often causing inconsistencies (sales having different numbers from operations, for example). Moreover, when departments do not know what other departments know, a lot of data goes unused in decision making.
This kind of set-up, with multiple databases owned by various departments, makes data analysis and visualisation quite challenging, especially since this can involve multiple clients accessing the data across sources with various client applications. To get around this issue, one should ideally create a data warehouse, a specially structured database built to support both the complex calculations needed for data analysis and the ease of use needed for data visualisation through business reporting and dashboarding applications. Key to creating these data warehouses is an integration step called Extract-Transform-Load (ETL).
1. Extract – data is pulled from various operational databases into a data staging area.
2. Transform – data in the staging area is transformed to adhere to the data warehouse structure and format. At this stage, it is also cleaned to ensure correctness and consistency. Any conflict between different source databases should be resolved here (a sketch of such a cleaning step follows Figure 4).
3. Load – transformed data is then loaded into the data warehouse. After loading, the
data may then be accessed by the different client applications.
Figure 4: Diagram showing movement of data to and from the data warehouse
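As an illustration of the Transform step, a cleaning rule might standardise inconsistent values in the staging area before loading. The staging table and the values below are purely hypothetical:

```sql
-- Resolve inconsistent spellings of the same city across source systems
-- so the warehouse presents information consistently.
UPDATE staging_customers
SET city = 'Quezon City'
WHERE city IN ('QC', 'Quezon Cty', 'quezon city');
```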
Data warehouses should contain clean, analysis-ready data in a specified format so that the warehouse can
act as a single source of truth for the organisation. With this, the data warehouse has six
main goals.
1. Make data easily accessible – much easier to get data from one source rather than
having it scattered across systems or departments.
2. Present information consistently – ETL should ideally have cleaned data such that
inconsistencies between data sources have been resolved.
3. Adapt to and be resilient to change – the structure of the data warehouse allows it
to adapt better to the different queries business users may throw at it.
4. Protect information assets – source systems usually do not keep a large amount of
historical data. The data warehouse, on the other hand, keeps a lot of this past data.
Thus, it is often secured quite well and has measures in place to prevent data loss
from things such as hardware malfunction.
5. Foundation for improved decision making – the data warehouse is structured in a
way that makes data analysis and visualisation much simpler to carry out.
6. Business community acceptance – the business users must accept and use the data
warehouse. No matter how sophisticated the technology, if it is not fit for the needs
of the business users then the data warehouse would have failed at its goals.
III. Dimensional Modelling
To keep the data warehouse simple to use and fast to query, it is usually organised through dimensional modelling, in which the warehouse is composed of data marts, each built around a single business process [1]. Each data mart consists of a central fact table surrounded by dimension tables, which we describe in turn.
Fact Tables
The fact table contains the measurements that result from the business process, and it takes up the most storage space in the data warehouse [1]. Ideally this is stored in its most granular form to allow for the most query flexibility. Each row of the fact table is one measurement event
such as the scanning of a barcode or the entry of an item into the warehouse inventory. It is
important to define the grain of this measurement event, what each row really corresponds
to in the business process. For example, one could say that each row in the fact table
represents a single product sold in some transaction.
At the core of the fact table are the facts, the business measures that are taken
during the measurement event such as the sales price of an item scanned at a grocery
checkout counter. In general, there are three kinds of facts [1].
1. Additive Facts
Usually the most informative facts are those that are additive across the different dimensions. We talk more about dimensions later on, but an example of an additive fact would be the earlier-mentioned sales price. With this fact recorded, one can sum it to compute total sales over a certain time period, over a certain brand, or even over a store shelf.
2. Semi-additive Facts
Some facts are additive except in the time dimension. For example, bank account
balances can be added across category or product type (sum of balances for Junior
Savings Accounts vs. Regular Accounts), but they cannot be summed across the time
dimension. If Account A has Php 10,000 on day 1, and Php 15,000 on day 3, summing
them to Php 25,000 does not make sense since they are balances of the same account at
two different points in time.
3. Non-additive Facts
Some facts really cannot be added across any of the dimensions. For example, the
discount percentage that a certain product was sold at cannot be added. If Product A
was sold at 15% off, and Product B was sold at 25% off, adding them up to 40% off does
not make sense. Given the number of products a store usually sells, this percentage can
easily go above 100% if summed. Other kinds of ratios also fall into this category. While
they cannot be summed, knowing the mean value of these non-additive facts may be
useful.
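To make these distinctions concrete, below is a sketch of queries against hypothetical fact tables (all table and column names are assumptions):

```sql
-- Additive: sales amounts may be summed across any dimension.
SELECT store_key, SUM(sales_amount) AS total_sales
FROM sales_fact
GROUP BY store_key;

-- Semi-additive: balances sum sensibly across accounts on one date...
SELECT SUM(balance) AS total_balances
FROM balance_fact
WHERE date_key = 20200505;

-- ...but across time they should be averaged (or use a closing value),
-- never summed.
SELECT account_key, AVG(balance) AS avg_daily_balance
FROM balance_fact
GROUP BY account_key;
```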
Note that a fact table typically contains numerous foreign keys. Each of these foreign keys indicates a relationship to some dimension table. The primary key in this case is a composite key made up of some subset of the fields that together uniquely identify each row of the fact table. For example, a combination of Transaction# and Product Key can serve as the primary key.
Dimension Tables
Dimension tables contain descriptors of the facts in the fact table, which may serve as
different query constraints. They represent various ways we can slice the data following the
word “by” [1]. For example, we can view the sales data by brand, by product category, by customer age, or by date (or even some combination of these).
Dimension tables take up much less storage space compared to the fact tables. They
tend to have many columns due to the large number of descriptors in the dimension, but
they usually have fewer rows. Below is an example of a product dimension containing the
different descriptors a hypothetical product may have [1].
There are quite a number of descriptors for products in the hypothetical store, each
descriptor being represented by a column in the dimension table. By including a lot of the
(often textual) descriptors, the data warehouse is ready for a wide range of possible queries
from business users. For example, one can slice the data by weight and by shelf height to see
if heavy products on higher shelves sell less. Different combinations of the dimension
table columns allow us to view the data in different ways, maximising its potential to drive
business action.
Star Schema
Putting the dimension and fact tables together to form the whole data mart results in what is sometimes called a star schema due to its shape. In the centre is the fact table and the rays
of the star are the dimension tables. Below is an example of a database with a star schema.
In this case, dimension tables have a one-to-many relationship with the fact table (a row can
only have one date, but a date may be in multiple rows). With this star schema, there may be
numerous ways to slice the data with a combination of different descriptors from various
dimensions. For example, using this structure, we can generate reports of sales per brand in
Makati City in the month of December.
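The December report just described could be expressed as a query along these lines (a sketch; the table and column names are assumptions about the schema):

```sql
-- Sales per brand in Makati City for December: the fact table is
-- joined to each dimension used to slice or filter the data.
SELECT p.brand, SUM(f.net_sales_price) AS total_sales
FROM sales_fact  AS f
JOIN product_dim AS p ON f.product_key = p.product_key
JOIN store_dim   AS s ON f.store_key   = s.store_key
JOIN date_dim    AS d ON f.date_key    = d.date_key
WHERE s.city = 'Makati City'
  AND d.month_name = 'December'
GROUP BY p.brand;
```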
Note, however, that the star schema is not normalised. For example, one could create
a separate product or store category entity to prevent the repetition of this data across
multiple rows in the dimension table. Observe the ERD below where we have normalised a
few of the structures in the original star schema.
This normalised star is sometimes called a snowflake schema since the rays of the star branch out, looking like the shape of a snowflake. A key thing to note here is that star schemas
often include hierarchical data. For example, a barangay is inside a city, which is inside a
province that belongs to a region. In an operational database, this would usually be
normalised as a long chain similar to the structure above. This makes it easy to move a city
from one province to another, for example, since one would only have to change one row
(the province key of a specific city). In the denormalised structure, one would have to change
the province for all the different barangays in the city that moved.
There are two main reasons for keeping this denormalised structure though. First, since dimension tables only make up a small part of the storage space required by the data warehouse, normalising them yields only negligible storage savings. Second, the extra joins that a snowflaked schema requires make queries more complex and slower, working against the simplicity and performance that the warehouse is meant to provide.
Data Warehouse Bus
As mentioned earlier, a dimensionally modelled data warehouse will contain multiple data
marts, each corresponding to a business process. These are all connected to each other
through a data warehouse bus [1]. A bus is a common structure to which many things can connect and from which they can draw power. The most popular example of this is the universal serial bus (USB), which connects printers, mice, keyboards, storage devices, and other peripherals to a computer.
In a data warehouse bus, fact tables in different data marts are connected to each
other through common conformed dimensions [1]. Even if each mart is developed by a
different team, by plugging into this structure of standardised dimensions, the data
warehouse as a whole becomes easier to manage and interpret. Below is an illustration
showing this bus.
Each of the conformed dimensions is visualised as a lane in the bus. The fact tables of the
data marts are then connected to specific lanes in the bus corresponding to the dimensions
that they use. By structuring things in this way, all the data marts can use a common set of
keys. Product 143 in the Store Inventory data mart will be the same product in the Store
Sales data mart. If a new product is introduced and loaded into the Product dimension table,
then it becomes available for use across all of the marts attached to its lane in the bus. Doing
this makes the data warehouse easier to maintain (data only has to be loaded once for the
whole warehouse instead of once per mart) and ensures a consistent view of the data across
the different business processes.
IV. Dimensional Modelling Case Study: Loyola Grocer
Suppose you are creating a data warehouse for Loyola Grocer, a chain of grocery stores with over 100 branches across the Philippines. Each store has multiple sections selling a variety of product categories such as dairy, meat, produce, and wine. Each store stocks more than 20,000 products, also called stock-keeping units (SKUs).
Loyola Grocer
1234 Katipunan Ave.
Quezon City, Metro Manila
(02)000-0000
Store: 0132
Cashier: 000213148/Juan dela Cruz
ITEM COUNT 4
TOTAL 685.96
AMOUNT TENDERED
CASH 700.00
CHANGE 14.04
--------------------------------
Transaction: 723 05/05/2020 11:07AM
--------------------------------
003483925849350340234895060
Figure 10: Sample receipt from a Loyola Grocer store
While multiple business processes are involved in the grocery store and these all
generate some sort of data, we focus specifically on the point-of-sales system where the
cashier scans different products to be purchased. Figure 10 above is an example of a receipt from one of the Loyola Grocer stores. From this we can immediately glean some information about the data that we have to model. An important thing to note is that promotions can
modify the price of a specific product. This means that one product may be sold at different
prices at various points in time.
After a cursory glance at the receipt, we can then move on to laying out the
dimensional model. We follow here a Four-Step Dimensional Design Process as a guide [1].
As mentioned earlier, data marts are based on business processes, thus the first step is
choosing a process to model. This usually involves an understanding of what the business
needs and what data is available. In this case, we have already chosen the retail sales
transaction process of the grocery store.
Once the business process to model has been identified, we must then define the grain, what
each fact table row is supposed to represent. This will dictate the level of detail available to
users of the data warehouse. This should be as granular as possible.
In this case, we take the grain to be each individual product per point-of-sales
transaction since this is the most atomic grain we can get. This allows for maximum flexibility
when performing queries. For example, since we also know where stores are located, we can
view sales of milk across different cities. If we took the grain to be per transaction or receipt,
we would lose the ability to query about specific products, which could have yielded a lot of
insight for the grocery business. The best we could do would have been to count the sales of
all products per city, but not specifically milk.
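As a sketch of what the atomic grain buys us (names again assumed), a product-level question becomes a direct query only because each fact row is one product within one transaction:

```sql
-- Units of milk sold per city; impossible at a per-receipt grain.
SELECT s.city, SUM(f.sales_quantity) AS milk_units_sold
FROM sales_fact  AS f
JOIN product_dim AS p ON f.product_key = p.product_key
JOIN store_dim   AS s ON f.store_key   = s.store_key
WHERE p.category = 'Milk'
GROUP BY s.city;
```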
Once a grain has been defined, we can move on to finding the different descriptors one
might use to view the data by. In this case, we can see that we would need a Product, Date,
Store, Cashier, and Payment Method dimension. Depending on how data is collected by the
operational system, it would also be wise to include a Promotion dimension to account for
cases when products are purchased under special conditions. Lastly, we can create a
Transaction dimension to easily pull products that were purchased on one receipt. We will
discuss later how this is a special type of dimension called a degenerate dimension. At this
stage, we need not fill out all the descriptors in each of these dimensions yet.
Finally, we then determine the facts that would be in the fact table. Since we have already
defined the grain, what each measurement event actually measures should fall out naturally.
In this case, what are we measuring about the individual products in each transaction? We
know the sales quantity, regular unit price (price before discount), discount unit price
(discount amount in pesos), and net unit price (price after discount).
Some other facts can easily be computed from this set of facts. For example, total
regular sales price would just be the regular unit price times the sales quantity. A similar
operation can be done to compute the total discount price and total net sales price. While
these can easily be computed by a querying application, there may be circumstances when it
is best to precompute them and store them in the fact table. While this will take up more
storage space, it may be worth it to prevent possible computational errors by users of the
data warehouse.
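Such derived facts might be precomputed at load time with arithmetic like the following (a sketch; the staging table name is an assumption):

```sql
-- Extended amounts computed once during ETL, sparing every downstream
-- user from redoing (and possibly botching) the multiplication.
SELECT sales_quantity,
       sales_quantity * regular_unit_price  AS total_regular_price,
       sales_quantity * discount_unit_price AS total_discount,
       sales_quantity * net_unit_price      AS total_net_price
FROM staging_sales_lines;
```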
After finishing the four-step process, we can visualise the structure of our data mart (without the complete descriptors for now).
Figure 11: Data mart for retail sales process at Loyola Grocer
Depending on the data captured by the operational source system, these dimensions may be
filled with different textual descriptors. Figure 7 shows some ways to fill in the Store and
Product Dimensions. Due to time constraints, and to keep things simple, we focus only on
three of these dimensions: Date, Promotion, and Transaction (a unique dimension we discuss
more later on).
Date Dimension
The Date Dimension is typically quite small: since we are storing dates per day, 20 years' worth of days is only around 7,300 rows in the database. Below is an example of what the date dimension table could look like.
The sample table Date Dimension in Figure 12 includes a lot of the basic textual descriptors
to view the data from certain perspectives. For example, using the Day of the Week field, we
can easily compare product sales across different days of the week. Included here also are a
few flag fields, indicators of a certain status. Given the Holiday Flag field, we would be able to
see if sales of specific products are better on holidays than on non-holidays. What fields to
put in the Date Dimension will depend heavily on the business though. If many of the
processes operate based on a different calendar (such as a fiscal calendar), one could include
fields for those to easily filter data based on them.
While time-of-day and dates usually go together as a single time stamp, if we create
a row per minute instead of per day, the size of the table quickly explodes to millions of
rows. This can seriously bog down query performance of an important dimension used in
multiple data marts. Because of this, time-of-day may instead be included as a datetime fact
in the fact table. This also allows us to compute (with some work) how long a specific
transaction takes.
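For instance, if each fact row carries a hypothetical scan_time timestamp, the transaction duration can be derived as follows (a PostgreSQL-style sketch):

```sql
-- Time from the first to the last scanned item in each transaction.
SELECT transaction_number,
       MAX(scan_time) - MIN(scan_time) AS duration
FROM sales_fact
GROUP BY transaction_number;
```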
Promotion Dimension
The Promotion Dimension can be very interesting to business users because it is a causal
dimension, a dimension whose descriptors are thought to affect the facts being measured
(in this case, retail sales) [1]. For the most part, analysts are interested in finding out how
successful promotions are. Below are just some of the factors that can be used to measure
how successful sales were during a promotion.
1. Lift – gains in sales over the promotion period. Only measurable if a baseline sales
value is set (to be able to compute the gain).
2. Cannibalisation – when some products are promoted (through price reductions, for example), customers purchase those instead of other products they would otherwise have bought. For example, if detergent brand A is on sale, one might buy it instead of brand B, increasing the sales of A but decreasing those of B. For a retailer like Loyola Grocer, this may not always be a good thing since it could lead to large unsold stock of detergent B (which incurs storage costs).
3. Profitability – this measures the gain in profit of a promotion over a certain set
baseline, taking into account the other factors such as cannibalisation and cost of the
promotions.
Promotions are rarely constrained to a single type. For example, if there is a price reduction
for a certain product, there is usually an accompanying ad, display, or coupon. Depending on
how the business thinks about things, each of these could be a separate dimension, but for
simplicity, we assume here that promotions come in these packages. Below is an example of
what the Promotion Dimension might look like in this case.
One thing to note though is that a vast majority of products sold will likely be sold
with no promotion. It is important to keep the fact table free from null foreign keys as these
can be confusing and lead to issues when making queries. To get around this, a special type
of promotion called “No Promotion” with a Promotion Key of 0 or -1 may be created.
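Creating that special row is a one-time insert (a sketch; the column names are assumptions):

```sql
-- A placeholder row so fact rows never carry a NULL promotion key.
INSERT INTO promotion_dim (promotion_key, promotion_name, promotion_type)
VALUES (0, 'No Promotion', 'None');
```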
Transaction Dimension
Notice how each row of the fact table includes a Transaction# field, but there is no Transaction Dimension table. Nonetheless, it is a useful dimension since it allows us to pull all
the products purchased within the same transaction. This can be used to measure things
such as the total transaction time, for example.
Though it looks like a dimension key, since there are no descriptors associated with it,
it does not make much sense to create a separate table. Because of this, it is called a
degenerate dimension, a dimension with no corresponding dimension table. These
dimensions often play a part in creating the composite key that serves as the primary key of
the fact table. In the case of Loyola Grocer, a combination of the Transaction# and Product
Key is sufficient to create a unique combination that can serve as the primary key.
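As a final sketch (names assumed), pulling every line item on the receipt in Figure 10 needs only the degenerate dimension itself, with no dimension table to join:

```sql
-- All products scanned in transaction 723.
SELECT f.product_key, f.sales_quantity, f.net_unit_price
FROM sales_fact AS f
WHERE f.transaction_number = 723;
```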
V. References
[1] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Third Edition. John Wiley & Sons, Inc.