Database Design Fundamentals Explained
UNIT I
Entity Set
We can represent the entity sets in an ER Diagram but we can't represent individual entities
because an entity is like a row in a table, and an ER diagram shows the structure and
relationships of data, not specific data entries (like rows and columns). An ER diagram is a
visual representation of the data model, not the actual data itself.
Types of Entity
There are two main types of entities:
1. Strong Entity
A Strong Entity is a type of entity that has a key Attribute that can uniquely identify each
instance of the entity. A Strong Entity does not depend on any other Entity in the Schema for
its identification. It has a primary key that ensures its uniqueness and is represented by a
rectangle in an ER diagram.
2. Weak Entity
A Weak Entity cannot be uniquely identified by its own attributes alone. It depends on a
strong entity, called its identifying entity, for its identification. A weak entity is
represented by a double rectangle.
The participation of a weak entity type in its identifying relationship is always total. The
relationship between the weak entity type and its identifying strong entity type is called the
identifying relationship, and it is represented by a double diamond.
Example:
A company may store the information of dependents (Parents, Children, Spouse) of an
Employee. But the dependents can't exist without the employee. So dependent will be a Weak
Entity Type and Employee will be identifying entity type for dependent, which means it is
Strong Entity Type.
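The dependency of Dependent on Employee can be sketched in a relational schema using Python's built-in sqlite3 module. The table names, column names (emp_id, dep_name, relation) and sample rows below are assumptions made for illustration only. The weak entity borrows the owner's key as part of its own primary key, and ON DELETE CASCADE removes the dependents together with their employee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

conn.executescript("""
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,           -- key attribute of the strong entity
    name   TEXT NOT NULL
);
CREATE TABLE dependent (
    emp_id   INTEGER NOT NULL
             REFERENCES employee(emp_id) ON DELETE CASCADE,
    dep_name TEXT NOT NULL,               -- identifies a dependent only within one employee
    relation TEXT NOT NULL,
    PRIMARY KEY (emp_id, dep_name)        -- owner key + the dependent's own partial key
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO employee VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO dependent VALUES (?, ?, ?)",
    [(1, "Ravi", "Child"), (1, "Meena", "Spouse")],
)

# Deleting the owner removes its dependents as well.
conn.execute("DELETE FROM employee WHERE emp_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM dependent").fetchone()[0]
```

In this sketch, a dependent row cannot be inserted for an employee that does not exist, which mirrors the total participation of the weak entity in its identifying relationship.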
Attributes in ER Model
Attributes are the properties that define the entity type. For example, for a Student entity
Roll_No, Name, DOB, Age, Address, and Mobile_No are the attributes that define entity
type Student. In ER diagram, the attribute is represented by an oval.
Attribute
Types of Attributes
1. Key Attribute
The attribute which uniquely identifies each entity in the entity set is called the key attribute.
For example, Roll_No will be unique for each student. In ER diagram, the key attribute is
represented by an oval with an underline.
Key Attribute
2. Composite Attribute
An attribute composed of several other attributes is called a composite attribute. For example,
the Address attribute of the Student entity type consists of Street, City, State, and Country.
In an ER diagram, the composite attribute is represented by an oval connected to the ovals of
its component attributes.
Composite Attribute
3. Multivalued Attribute
An attribute that can hold more than one value for a given entity is called a multivalued
attribute. For example, Phone_No (there can be more than one for a given student). In an ER
diagram, a multivalued attribute is represented by a double oval.
Multivalued Attribute
4. Derived Attribute
An attribute that can be derived from other attributes of the entity type is known as a derived
attribute, e.g., Age (which can be derived from DOB). In an ER diagram, the derived attribute is
represented by a dashed oval.
Derived Attribute
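The four attribute types can also be sketched in code. The following is a minimal illustration using Python dataclasses; the class names, field names, and the age_on helper are assumptions chosen for this example, not part of the ER notation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Address:
    """Composite attribute: one attribute made of several parts."""
    street: str
    city: str
    state: str
    country: str


@dataclass
class Student:
    roll_no: int                      # key attribute: unique per student
    name: str
    dob: date
    address: Address                  # composite attribute
    phone_nos: list[str] = field(default_factory=list)  # multivalued attribute

    def age_on(self, today: date) -> int:
        """Derived attribute: computed from dob instead of being stored."""
        had_birthday = (today.month, today.day) >= (self.dob.month, self.dob.day)
        return today.year - self.dob.year - (0 if had_birthday else 1)
```

Age is exposed as a method here so that it always follows from DOB and the date it is asked for; storing it would risk the two values drifting apart.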
The Complete Entity Type Student with its Attributes can be represented as:
Entity-Relationship Set
A set of relationships of the same type is known as a relationship set. The following
relationship set depicts S1 as enrolled in C2, S2 as enrolled in C1, and S3 as enrolled in C3.
Relationship Set
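1. Unary Relationship: When only ONE entity set participates in a relationship, the
relationship is called a unary relationship.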
2. Binary Relationship: When there are TWO entity sets participating in a relationship, the
relationship is called a binary relationship. For example, a Student is enrolled in a Course.
Binary Relationship
3. Ternary Relationship: When there are three entity sets participating in a relationship, the
relationship is called a ternary relationship.
4. N-ary Relationship: When there are n entity sets participating in a relationship, the
relationship is called an n-ary relationship.
Cardinality in ER Model
The maximum number of times an entity of an entity set participates in a relationship set is
known as cardinality.
Cardinality can be of different types:
1. One-to-One
When each entity in each entity set can take part only once in the relationship, the cardinality
is one-to-one. Let us assume that a male can marry one female and a female can marry one
male. So the relationship will be one-to-one.
2. One-to-Many
In a one-to-many mapping, an entity in one entity set can be related to more than one entity in
the other entity set, while each entity in the other set is related to at most one entity in the
first. Let us assume that one surgery department can accommodate many doctors. So the
cardinality will be 1 to M. It means one department has many doctors.
3. Many-to-One
When entities in one entity set can take part only once in the relationship set and entities in
the other entity set can take part more than once in the relationship set, the cardinality is
many-to-one.
Let us assume that a student can take only one course but one course can be taken by many
students. So the cardinality will be n to 1: for one course there can be n students,
but for one student there will be only one course.
4. Many-to-Many
When entities in all entity sets can take part more than once in the relationship, the
cardinality is many-to-many. Let us assume that a student can take more than one course and one
course can be taken by many students. So the relationship will be many-to-many.
In this example, student S1 is enrolled in C1 and C3, and course C3 is taken by S1, S3,
and S4. So it is a many-to-many relationship.
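The four cardinality types can be told apart by counting how often each entity appears in a relationship set. The function below is a minimal sketch under one assumption: it classifies the pairs it is given (an instance), whereas cardinality in an ER model is a rule about every possible instance. The function name and the pair representation are illustrative choices.

```python
from collections import Counter


def cardinality(pairs):
    """Classify a binary relationship set given as (left, right) pairs."""
    pairs = set(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    # True when some left entity is related to several right entities.
    left_has_many = any(count > 1 for count in left_counts.values())
    # True when some right entity is related to several left entities.
    right_has_many = any(count > 1 for count in right_counts.values())
    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"


# Enrolment pairs from the many-to-many example: (student, course).
enrolled = {("S1", "C1"), ("S1", "C3"), ("S3", "C3"), ("S4", "C3")}
kind = cardinality(enrolled)
```

Here S1 appears with two courses and C3 appears with three students, so the pairs are classified as many-to-many.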
1. Key Constraints
• Definition:
A key constraint limits the number of relationships in a relationship set that a single entity
can take part in.
• Main Idea:
o On the constrained side, each entity is associated with at most one entity on
the other side.
• Example:
o Student – Enrolled – Course
▪ A student can enroll in many courses → No key constraint.
▪ But if we say each student has exactly one ID card, then → 1:1 key
constraint.
• Types:
o 1:1 (One-to-One) → One entity of A maps to one entity of B.
Example: Person ↔ Passport.
o 1:N (One-to-Many) → One entity of A maps to many entities of B.
Example: Department ↔ Employees.
o M:N (Many-to-Many) → Many entities of A map to many entities of B.
Example: Student ↔ Course.
• ER Diagram Notation:
o Arrow (→) from entity to relationship shows key constraint.
Total Participation
Partial Participation
In database design, partial participation, also known as optional participation, means that an
entity is not required to take part in a relationship. The schema does not require every entity
in the set to be linked to an entity on the other side. Consider a database at a university, for
instance. Partial participation of students in an enrollment relationship means that some
students are enrolled in classes while others are not enrolled in any. This flexibility is
crucial because real-world entities are not always related: some are connected to one another,
while others stand alone.
In the diagram below, the participation of an entity set E in relationship set R is said to be
partial if only some entities in E participate in relationships in R.
The participation of entity set A in the relationship set is partial because only some entities
of A participate in the relationship set, while the participation of entity set B in the
relationship set is total because every entity of B participates in the relationship set.
Partial Participation
Example:
Suppose an entity set Student related to an entity set Course through Enrolled relationship
set.
The participation of entity set Course in the Enrolled relationship set is partial because a
course may or may not have students enrolled in it. It is possible that only some of the course
entities are related to the student entity set through the Enrolled relationship set.
The participation of entity set Student in the Enrolled relationship set is total because every
student is expected to be related to at least one course through the Enrolled relationship set.
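The Student and Course example can be checked mechanically: participation is total when every entity of the set appears in at least one relationship, and partial otherwise. The sketch below uses hypothetical entity names and a hypothetical helper, participation; like any check on sample data, it describes one instance rather than the schema rule itself.

```python
def participation(entities, pairs, position):
    """Return 'total' if every entity occurs in some pair, else 'partial'.

    position is 0 for the left member of each pair and 1 for the right one.
    """
    participating = {pair[position] for pair in pairs}
    return "total" if set(entities) <= participating else "partial"


students = {"S1", "S2"}
courses = {"C1", "C2", "C3"}
enrolled = {("S1", "C1"), ("S2", "C1"), ("S2", "C3")}  # (student, course)

student_side = participation(students, enrolled, 0)
course_side = participation(courses, enrolled, 1)
```

With this data every student is enrolled somewhere, so the student side is total, while course C2 has no students, so the course side is partial.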
Conclusion
In conclusion, participation constraints, both total and partial, are used to design a robust
and efficient database schema. While partial participation permits optional involvement, total
participation requires each entity in one set to take part in a relationship with another set.
These constraints protect the integrity of the data, enforce business rules, and reflect
real-world requirements in the database. By specifying the minimum and maximum participation of
entities in relationships, database designers can avoid errors and inconsistencies, which
results in a dependable and efficient database management system.
Example-3:
The bank account of a particular bank cannot exist if the bank no longer
exists.
Example-4:
A company may store the information of dependents (Parents, Children, Spouse)
of an Employee. But the dependents cannot exist without the employee.
So Dependent will be a weak entity type and Employee will be the identifying
entity type for Dependent.
1. Class Name:
• The name of the class is typically written in the top compartment of the
class box and is centered and bold.
2. Attributes:
• Attributes, also known as properties or fields, represent the data
members of the class. They are listed in the second compartment of the
class box and often include the visibility (e.g., public, private) and the
data type of each attribute.
3. Methods:
• Methods, also known as functions or operations, represent the behavior
or functionality of the class. They are listed in the third compartment
of the class box and include the visibility (e.g., public, private), return
type, and parameters of each method.
4. Visibility Notation:
• Visibility notations indicate the access level of attributes and methods.
Common visibility notations include:
o + for public (visible to all classes)
o - for private (visible only within the class)
o # for protected (visible to subclasses)
o ~ for package or default visibility (visible to classes in the
same package)
Parameter Directionality
• In class diagrams, parameter directionality refers to the indication of the flow of
information between classes through method parameters.
• It helps to specify whether a parameter is an input, an output, or both. This
information is crucial for understanding how data is passed between objects
during method calls.
There are three main parameter directionality notations used in class diagrams:
• In (Input):
o An input parameter is a parameter passed from the calling object
(client) to the called object (server) during a method invocation.
o It is represented by an arrow pointing towards the receiving class (the
class that owns the method).
• Out (Output):
o An output parameter is a parameter passed from the called object
(server) back to the calling object (client) after the method execution.
o It is represented by an arrow pointing away from the receiving class.
• InOut (Input and Output):
o An InOut parameter serves as both input and output. It carries
information from the calling object to the called object and vice
versa.
o It is represented by an arrow pointing towards and away from the
receiving class.
Relationships between classes
In class diagrams, relationships between classes describe how classes are connected or
interact with each other within a system. Here are some common types of relationships in
class diagrams:
1. Association
An association represents a bi-directional relationship between two classes. It indicates that
instances of one class are connected to instances of another class. Associations are typically
depicted as a solid line connecting the classes, with optional arrows indicating the direction
of the relationship.
2. Directed Association
A directed association in a UML class diagram represents a relationship between two classes
where the association has a direction, indicating that one class is associated with another in a
specific way.
3. Aggregation
Aggregation is a specialized form of association that represents a "whole-part" relationship. It
denotes a stronger relationship than a plain association, where one class (the whole) contains
or is composed of another class (the part). Aggregation is represented by a hollow diamond on
the side of the whole class. In this kind of relationship, the part (child) class can exist
independently of its whole (parent) class.
4. Composition
Composition is a stronger form of aggregation, indicating a more significant ownership or
dependency relationship. In composition, the part class cannot exist independently of the
whole class. Composition is represented by a filled diamond shape on the side of the whole
class.
5. Generalization (Inheritance)
Inheritance represents an "is-a" relationship between classes, where one class (the subclass or
child) inherits the properties and behaviors of another class (the superclass or parent).
Inheritance is depicted by a solid line with a closed, hollow arrowhead pointing from the
subclass to the superclass.
6. Realization (Interface Implementation)
Realization indicates that a class implements the features of an interface. It is often used in
cases where a class realizes the operations defined by an interface. Realization is depicted by
a dashed line with a closed, hollow arrowhead pointing from the implementing class to the
interface.
7. Dependency Relationship
A dependency exists between two classes when one class relies on another, but the
relationship is not as strong as association or inheritance. It represents a more loosely coupled
connection between classes.
8. Usage (Dependency) Relationship
A usage dependency relationship in a UML class diagram indicates that one class (the client)
utilizes or depends on another class (the supplier) to perform certain tasks or access certain
functionality. The client class relies on the services provided by the supplier class but does
not own or create instances of it.
• In UML class diagrams, usage dependencies are typically represented by a dashed
arrowed line pointing from the client class to the supplier class.
• The arrow indicates the direction of the dependency, showing that the client class
depends on the services provided by the supplier class.
Subclasses
A subclass is a class derived from the superclass. It inherits the properties of the superclass and
also contains attributes of its own. An example is:
Car, Truck and Motorcycle are all subclasses of the superclass Vehicle. They all inherit
common attributes from Vehicle, such as speed and colour, while they also differ in other
attributes; e.g., the number of wheels is 4 for a Car and 2 for a Motorcycle.
Superclasses
A superclass is the class from which many subclasses can be created. The subclasses inherit
the characteristics of a superclass. The superclass is also known as the parent class or base
class.
In the above example, Vehicle is the Superclass and its subclasses are Car, Truck and
Motorcycle.
Inheritance
Inheritance is the process of basing a class on another class, i.e., building a new class on an
existing class. The new class contains all the features and functionality of the old class in
addition to its own.
The class which is newly created is known as the subclass or child class and the original class
is the parent class or the superclass.
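The Vehicle example can be written as a short class hierarchy. This is a minimal sketch; the constructor parameters and the wheels attribute are illustrative assumptions based on the attributes named in the text.

```python
class Vehicle:
    """Superclass: holds the attributes common to every vehicle."""

    def __init__(self, speed, colour):
        self.speed = speed
        self.colour = colour


class Car(Vehicle):
    """Subclass: inherits speed and colour, adds its own wheel count."""
    wheels = 4


class Motorcycle(Vehicle):
    """Subclass: same inherited attributes, different wheel count."""
    wheels = 2


car = Car(speed=120, colour="red")
bike = Motorcycle(speed=90, colour="black")
```

Both objects carry speed and colour without redefining them, and each is still a Vehicle, which is the "is-a" relationship that inheritance expresses.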
First, we discuss constraints that apply to a single specialization or a single generalization. For
brevity, our discussion refers only to specialization even though it applies
to both specialization and generalization. Then, we discuss differences between
specialization/generalization lattices (multiple inheritance) and hierarchies (single
inheritance), and elaborate on the differences between the specialization and generalization
processes during conceptual database schema design.
▪ In some specializations we can determine exactly the entities that will become members
of each subclass by placing a condition on the value of some attribute of the superclass.
Such subclasses are called predicate-defined (or condition-defined) subclasses. For
example, if the EMPLOYEE entity type has an attribute Job_type, as shown in Figure
8.4, we can specify the condition of membership in the SECRETARY subclass by the
condition (Job_type = ‘Secretary’), which we call the defining predicate of the
subclass. This condition is a constraint specifying that exactly those entities of
the EMPLOYEE entity type whose attribute value for Job_type is ‘Secretary’ belong
to the subclass. We display a predicate-defined subclass by writing the predicate
condition next to the line that connects the subclass to the specialization circle.
✓ Disjoint, total
✓ Disjoint, partial
✓ Overlapping, total
✓ Overlapping, partial
➢ Of course, the correct constraint is determined from the real-world meaning that applies
to each specialization. In general, a superclass that was identified through
the generalization process usually is total, because the superclass is derived from the
subclasses and hence contains only the entities that are in the subclasses.
➢ Deleting an entity from a superclass implies that it is automatically deleted from all the
subclasses to which it belongs.
➢ Inserting an entity in a superclass implies that the entity is mandatorily inserted in
all predicate-defined (or attribute-defined) subclasses for which the entity satisfies the
defining predicate.
➢ The reader is encouraged to make a complete list of rules for insertions and deletions
for the various types of specializations.
• Entities of a category inherit only the attributes of the superclass they belong to.
o Example: If OWNER is a PERSON → inherits Person attributes (Name,
DOB).
o If OWNER is a COMPANY → inherits Company attributes (RegNo,
Address).
Generalization groups the common properties of multiple classes into a single, generalized
class. This makes models cleaner and easier to understand. For example, imagine we’re
designing a system for managing clients. Both companies and individuals are clients. Instead
of repeating shared attributes, we create a general “Client” class. This class connects to
“Company” and “Person” classes with lines and arrowheads pointing to the generalized
“Client” class.
Example:
• Abstract Generalization: The system must allow creating clients.
• Corresponding Specializations:
1. The system must allow creating companies.
2. The system must allow creating persons.
If “Client” is abstract (displayed in italics), it cannot have direct instances. This means users
can only create “Company” or “Person” objects. However, if “Client” is not abstract, users
can create generic client objects.
Modeling of a UML Generalization
Generalization Sets and Constraints
UML introduces generalization sets to group subtypes logically. These sets often include
constraints that define relationships between subtypes. Let’s break them down:
• Incomplete: Not all subtypes are listed. For instance, new roles like “Manufacturer”
can be added later.
• Complete: All possible subtypes are covered. No new ones can exist.
• Disjoint: An instance belongs to only one subtype. For example, a contact is either a
“Person” or a “Company.”
• Overlapping: An instance can belong to multiple subtypes. For example, a contact
may be both a “Client” and a “Supplier.”
Example: Let’s expand the client example:
1. Generalization Set: “Contact Type” – {Complete, Disjoint}: “Person” and
“Company.”
2. Generalization Set: “Contact Kind” – {Incomplete, Overlapping}: “Client,”
“Supplier,” and “Interested Party.”
With these constraints, the system ensures accurate categorization. While every contact must
be either a person or a company, they can simultaneously serve as clients, suppliers, or both.
Consider an e-commerce company implementing a CRM. They use UML to model their
contact management system. All contacts share basic attributes like “Name” and “Email.”
Subtypes such as “Person” and “Company” add specialized fields. For example, a
“Company” contact might include “Tax ID,” while a “Person” contact has a “Date of Birth.”
• Constraint Application:
• “Contact Type” (Complete, Disjoint): A contact must be either a person or
a company.
• “Contact Kind” (Incomplete, Overlapping): A contact can be a client, a
supplier, or both.
This approach avoids duplication and ensures consistency. By grouping shared attributes into
a generalized “Contact” class, the company simplifies its database design.
Heuristics for Identifying Generalizations
The concepts of UML generalization and specialization are invaluable for organizing
complex systems. These concepts promote clarity and prevent redundancy. By applying
constraints wisely, you can create models that are both flexible and precise. Whether you’re
building a CRM or another application, UML ensures your design aligns with business goals.
TOPIC 9: Data Abstraction, Knowledge Representation, and Ontology Concepts
• Classification and Instantiation − Grouping similar objects into classes for better
management.
• Identification − Creating unique identifiers. It is needed for distinguishing and linking
objects.
• Specialization and Generalization − Refining or unifying concepts for better
representation of data.
• Aggregation and Association − Combining related entities to form higher-level concepts.
Each abstraction method plays a critical role in managing and interpreting complex data
effectively. Let us now understand these four concepts of data abstraction with examples.
Classification and Instantiation: Grouping Similar Objects
Classification organizes entities into groups based on shared attributes. For instance, in
a Company database −
By grouping applicants and companies into separate classes, it becomes easier to describe and
analyze the data. Instantiation, on the other hand, focuses on individual members. Instantiation
refers to the creation of specific instances from these classes, such as a job applicant named "John
Doe" or a company called "TechCorp".
Example − ER diagrams often illustrate this structure. Classification allows class-level properties
like "Company Type," while Instances might include a "Startup" or "Multinational".
• A person in a PERSON entity might be identified by their Name, Ssn, and Address.
• The same person could also appear in a STUDENT entity, identified by a Student
ID and Course.
Without clear identifiers, we cannot link or cross-reference related instances across entities.
Database designers and administrators must implement effective identification mechanisms to
maintain consistency.
Specialization and Generalization
Specialization refines a broader class into specific subclasses. Generalization, on the other
hand, unifies subclasses into a broader superclass. These processes help capture hierarchical
relationships. For example −
Such classifications allow databases to handle both shared and unique attributes effectively.
An Interview can be modeled as a composite of Company, Applicant, and attributes like Date
and Contact Person. Associating Interview with Job Offer must be done carefully to avoid
incorrect assumptions (e.g., assuming every interview results in a job offer).
What is Knowledge Representation?
Building on data abstraction, Knowledge Representation (KR) is about capturing the structure
and relationships within a knowledge domain. It goes beyond data modeling by
supporting reasoning and inference.
Unlike traditional databases, KR systems mix schemas with data instances, enabling intelligent
reasoning over the stored information.
An ontology defines −
For example, suppose a company is hiring. In this context, an ontology might define terms like
"Applicant", "Interview", and "Job Offer" and their interconnections.
Example − A semantic job portal might use ontologies to link job requirements with applicant
profiles, even when the data is stored in different formats and structures.
2. Update Anomalies
• If data is stored in multiple places, updating one copy but not the others leads to
inconsistency.
• Example: If a customer’s phone number is updated in one table but not in all
occurrences, the database becomes inconsistent.
3. Insertion Anomalies
4. Deletion Anomalies
5. Data Inconsistency
• Redundancy makes it harder to enforce integrity constraints (like primary key, foreign
key).
TOPIC 2: Decompositions
Decomposition in DBMS
Types of Decomposition
There are two types of Decomposition:
• Lossless Decomposition
• Lossy Decomposition
Lossless Decomposition
Lossless decomposition is a decomposition in which the original relation R can be regained by
joining the relations formed after decomposition. It is used to remove redundant data from the
database while retaining the useful information. A lossless decomposition ensures the
following:
• While regaining the original relation, no information should be lost.
• If we perform join operation on the sub-divided relations, we must get the
original relation.
Example:
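There is a relation called R(A, B, C)
|A |B |C |
|----|----|----|
|55 |16 |27 |
|48 |52 |89 |
R1(A, B)
|A |B |
|----|----|
|55 |16 |
|48 |52 |
R2(B, C)
|B |C |
|----|----|
|16 |27 |
|52 |89 |
After performing the Join operation we get the same original relation
|A |B |C |
|----|----|----|
|55 |16 |27 |
|48 |52 |89 |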
Lossy Decomposition
As the name suggests, lossy decomposition means that when we perform a join operation on the
sub-relations, it does not result in the same relation that was decomposed. After the join
operation, we find some extraneous (spurious) tuples. These extra tuples make it difficult
for the user to identify the original tuples.
Example:
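We have a relation R(A, B, C)
|A |B |C |
|----|----|----|
|1 |2 |1 |
|2 |5 |3 |
|3 |3 |3 |
We decompose R into R1(A, C) and R2(B, C), whose only common attribute is C.
R1(A, C)
|A |C |
|----|----|
|1 |1 |
|2 |3 |
|3 |3 |
R2(B, C)
|B |C |
|----|----|
|2 |1 |
|5 |3 |
|3 |3 |
The natural join of R1 and R2 on C gives five tuples, two more than R contains. The tuples
(2, 3, 3) and (3, 5, 3) are spurious:
|A |B |C |
|----|----|----|
|1 |2 |1 |
|2 |5 |3 |
|2 |3 |3 |
|3 |5 |3 |
|3 |3 |3 |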
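Whether a decomposition loses information for a given table can be tested by projecting, joining, and comparing with the original. The sketch below does this for the relation R(A, B, C) with tuples (1, 2, 1), (2, 5, 3), (3, 3, 3). The helper names (project, natural_join, is_lossless) are hypothetical, and the check applies to this one instance only; a schema-level guarantee needs the functional dependencies.

```python
def project(rows, attrs):
    """Projection onto attrs; duplicates collapse because a relation is a set."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]


def natural_join(r1, r2):
    """Join two non-empty relations on the attributes they have in common."""
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]


def is_lossless(rows, attrs1, attrs2):
    """True if joining the two projections gives back exactly the original rows."""
    joined = natural_join(project(rows, attrs1), project(rows, attrs2))
    as_set = lambda rel: {tuple(sorted(t.items())) for t in rel}
    return as_set(joined) == as_set(rows)


R = [{"A": 1, "B": 2, "C": 1},
     {"A": 2, "B": 5, "C": 3},
     {"A": 3, "B": 3, "C": 3}]

split_on_b = is_lossless(R, ("A", "B"), ("B", "C"))  # common attribute B
split_on_c = is_lossless(R, ("A", "C"), ("B", "C"))  # common attribute C
```

For this data, splitting so that the common attribute is B rejoins exactly, because every B value is unique. Splitting so that the common attribute is C does not, because C = 3 occurs twice and the join pairs every A with every B for that value.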
Properties of Decomposition
1. Loss of Information
|A |B |C |
|----|----|----|
|1 |X |P |
|1 |Y |P |
|2 |Z |Q |
Suppose we decompose R into R1(A,B) and R2(A,C).
R1(A, B):
|A |B |
|----|----|
|1 |X |
|1 |Y |
|2 |Z |
R2(A, C):
|A |C |
|----|----|
|1 |P |
|2 |Q |
(The projection produces (1, P) twice; since a relation is a set of tuples, it is kept only once.)
Now, if we take the natural join of R1 and R2 on attribute A, we get back the original relation
R. Therefore, this is a lossless decomposition.
• Once tables are decomposed, certain functional dependencies might not be preserved,
which can lead to the inability to enforce specific integrity constraints.
• Example: If you have the functional dependency `A → B` in the original table, but in
the decomposed tables, there is no table with both `A` and `B`, this functional
dependency can't be preserved.
Example: Let's consider a relation R with attributes A,B, and C and the following functional
dependencies:
A→B
B→C
Now, suppose we decompose R into two relations:
R1(A,B) with FD A → B
R2(B,C) with FD B → C
In this case, the decomposition is dependency-preserving because all the functional
dependencies of the original relation R can be found in the decomposed relations R1 and R2.
We do not need to join R1 and R2 to enforce or check any of the functional dependencies.
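Note that a derived dependency such as A → C is still preserved here: it follows from A → B and
B → C by transitivity, so enforcing those two on R1 and R2 enforces it as well. By contrast, if
R were decomposed into R1(A,B) and R2(A,C), the dependency B → C could not be checked in either
relation without joining them, and that decomposition would not be dependency-preserving.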
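Dependency preservation can be tested without computing any join. The sketch below uses the standard textbook test: for each dependency X → Y, grow X using only what each decomposed relation can "see", and check that Y is reached. The function names are hypothetical, and attributes are assumed to be single letters so that a string such as "AB" stands for the set {A, B}.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves(fds, schemas):
    """True if every FD can be enforced using the decomposed schemas alone."""
    for lhs, rhs in fds:
        reached = set(lhs)
        changed = True
        while changed:
            changed = False
            for schema in schemas:
                # Only attributes inside one schema may be combined at a time.
                gained = closure(reached & set(schema), fds) & set(schema)
                if not gained <= reached:
                    reached |= gained
                    changed = True
        if not set(rhs) <= reached:
            return False
    return True


fds = [("A", "B"), ("B", "C")]
good = preserves(fds, ["AB", "BC"])  # R1(A, B) and R2(B, C)
bad = preserves(fds, ["AB", "AC"])   # hypothetical alternative: R1(A, B) and R2(A, C)
```

Under these assumptions, good is True and bad is False: in the second decomposition no single relation contains both B and C, and nothing that can be checked locally implies B → C.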
3. Increased Complexity
4. Redundancy
• Incorrect decomposition might not eliminate redundancy, and in some cases, can even
introduce new redundancies.
5. Performance Overhead
• An increased number of tables, while aiding normalization, can also lead to more
complex SQL queries involving multiple joins, which can introduce performance
overheads.
A functional dependency occurs when the value of one attribute (or a set of attributes)
uniquely determines the value of another attribute. This relationship is denoted as:
X→Y
Here, X is the determinant, and Y is the dependent attribute. This means that for each
unique value of X, there is precisely one corresponding value of Y.
Example:
If each student has a unique StudentID, and this ID determines the student's name, we can
express this functional dependency as:
StudentID → StudentName
This indicates that knowing the StudentID allows us to determine the StudentName.
|StudentID |StudentName |StudentAge |
|----|----|----|
|101 |Rahul |23 |
|102 |Ankit |22 |
|103 |Aditya |22 |
|104 |Sahil |24 |
Functional Dependency
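A functional dependency can be checked against the rows of a table: X → Y is violated as soon as two rows agree on X but differ on Y. The helper below, fd_holds, is a hypothetical name for this check, applied to the student table above. A table can only refute a dependency; a dependency that happens to hold in the sample data is not thereby proven for the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return False if two rows agree on lhs but differ on rhs, else True."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


students = [
    {"StudentID": 101, "StudentName": "Rahul", "StudentAge": 23},
    {"StudentID": 102, "StudentName": "Ankit", "StudentAge": 22},
    {"StudentID": 103, "StudentName": "Aditya", "StudentAge": 22},
    {"StudentID": 104, "StudentName": "Sahil", "StudentAge": 24},
]

id_to_name = fd_holds(students, ["StudentID"], ["StudentName"])
age_to_name = fd_holds(students, ["StudentAge"], ["StudentName"])
```

StudentID → StudentName holds in this table, while StudentAge → StudentName does not, because age 22 is shared by Ankit and Aditya.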
How to represent functional dependency in DBMS?
• A functional dependency is written with an arrow between two sets of attributes. For
example, if we have an employee record with fields "EmployeeID", "FirstName" and
"LastName" we can specify the dependency as follows:
EmployeeID -> FirstName, LastName
• The notation has two parts: the left-hand side (LHS), which is the determinant, and the
right-hand side (RHS), which is the dependent, separated by the arrow (->).
• For example, if we have a table with attributes "X", "Y" and "Z" and the
attribute "X" determines the values of the attributes "Y" and "Z", we write:
X -> Y, Z
• This notation indicates that the value of attribute "X" determines the values of
attributes "Y" and "Z". So if you know the value of "X", you can also determine
the values of "Y" and "Z".
Types of Functional Dependency in DBMS
The following are some important types of FDs in DBMS:
Trivial Functional Dependency
The dependency of an attribute on a set of attributes is known as trivial functional
dependency if the set of attributes includes that attribute.
Multivalued Dependency
A multivalued dependency happens when there are at least three attributes (let us say X, Y
and Z), and for a value of X there is a well defined set of values of Y and a well defined set
of values of Z. However, the set of values of Y is independent of set Z and vice versa.
Reflexivity: If A is a set of attributes and B is a subset of A, then A -> B holds.
Augmentation: If A -> B holds, then AC -> BC also holds for any set of attributes C; adding
the same attributes to both sides does not invalidate the dependency.
Transitivity: If X → Y and Y → Z both hold, then X → Z also holds.
Reasoning Steps
Step 1: Closure of a Set of Attributes
R(A, B, C, D)
FDs: A → B, B → C, C → D
Find A⁺
Solution:
A⁺ = {A}
A → B ⇒ {A, B}
B → C ⇒ {A, B, C}
C → D ⇒ {A, B, C, D}
So A⁺ = {A, B, C, D}
A is a key.
Example:
FDs: A → B, B → C
Check if A → C holds?
A⁺ = {A, B, C}
Since C ∈ A⁺, A → C holds.
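The closure computation and the membership test used in the two examples above can be sketched as follows. The function names are hypothetical, and attributes are assumed to be single letters so that a string such as "CD" stands for the set {C, D}.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:                      # repeat until no FD adds anything new
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """lhs -> rhs follows from fds exactly when rhs lies inside the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)


# R(A, B, C, D) with A -> B, B -> C, C -> D
a_plus = closure("A", [("A", "B"), ("B", "C"), ("C", "D")])

# Does A -> C follow from A -> B and B -> C?
a_determines_c = implies([("A", "B"), ("B", "C")], "A", "C")
```

The loop applies each dependency whose left-hand side is already covered, which is the same step-by-step expansion shown in the worked solution.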
A candidate key is a minimal set of attributes that can determine all other attributes.
Example:
R(A, B, C)
FDs: A → B, B → C
Find A⁺ = {A, B, C}
So A is a candidate key.
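Candidate keys can be found by brute force: try attribute sets from smallest to largest and keep those whose closure covers the whole relation and that contain no smaller key. The sketch below uses hypothetical function names and assumes single-letter attributes. It examines every subset, so it is only practical for small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """All minimal attribute sets whose closure is the full attribute set."""
    attributes = sorted(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(attributes):
                keys.append(combo)
    return keys


# R(A, B, C) with A -> B, B -> C
keys = candidate_keys("ABC", [("A", "B"), ("B", "C")])
```

For this relation the only candidate key found is A; sets such as AB also determine everything but are skipped because they contain A.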
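Example:
FDs: A → BC, B → C
Step 1: Split the right-hand sides → A → B, A → C, B → C
Step 2: No left-hand side has an extraneous attribute (each is a single attribute).
Step 3: A → C is redundant, because it follows from A → B and B → C by transitivity, so it is
removed.
Minimal cover = {A → B, B → C}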
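The three steps of the minimal cover computation can be sketched as follows. The function names are hypothetical and attributes are assumed to be single letters. A set of dependencies can have more than one minimal cover; this sketch returns the one reached by processing the dependencies in the order given.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    # Step 1: split every right-hand side into single attributes.
    cover = list(dict.fromkeys(
        (frozenset(lhs), a) for lhs, rhs in fds for a in rhs))

    # Step 2: remove extraneous attributes from left-hand sides.
    for i in range(len(cover)):
        for attr in sorted(cover[i][0]):
            lhs, a = cover[i]
            reduced = lhs - {attr}
            if reduced and a in closure(reduced, cover):
                cover[i] = (reduced, a)
    cover = list(dict.fromkeys(cover))  # step 2 may have created duplicates

    # Step 3: remove every FD that the remaining FDs already imply.
    for fd in list(cover):
        rest = [f for f in cover if f != fd]
        if fd[1] in closure(fd[0], rest):
            cover = rest
    return [("".join(sorted(lhs)), a) for lhs, a in cover]


cover = minimal_cover([("A", "BC"), ("B", "C")])
```

For A → BC and B → C, the third step removes A → C because the closure of A under the remaining dependencies A → B and B → C already contains C.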
🧮 Practice Problems
Problem 1: Compute Closure
R(A, B, C, D)
FDs: A → B, B → CD
Find A⁺
Solution:
A⁺ = {A}
A → B ⇒ {A, B}
B → CD ⇒ {A, B, C, D}
A⁺ = {A, B, C, D}
A⁺ = {A, B, C}
So A → C holds by transitivity.
A⁺ = {A, B, C, D}
So A is a candidate key.
Problem 4: Find Minimal Cover
FDs: A → BC, B → C, A → B
Given:
X → Y and Y → Z
By Transitivity,
⇒X→Z
By Augmentation,
⇒ XW → YW
By Union,
⇒ X → YZ
Normal forms are a set of progressive rules (or design checkpoints) for relational schemas
that reduce redundancy and prevent data anomalies. Each normal form - 1NF, 2NF, 3NF,
BCNF, 4NF, 5NF - is stricter than the previous one: meeting a higher normal form implies
the lower ones are satisfied. Think of them as layers of cleanliness for your tables: the
deeper you go, the fewer redundancy and integrity problems you’ll have.
Benefits of using Normal Forms:
• Reduce duplicate data and wasted storage.
• Prevent insert, update, and delete anomalies.
• Improve data consistency and integrity.
• Make the schema easier to maintain and evolve.
The Diagram below shows the hierarchy of database normal forms. Each inner circle
represents a stricter level of normalization, starting from 1NF (basic structure) to 5NF
(most refined). As you move inward, data redundancy reduces and data integrity improves.
Each level builds upon the previous one to ensure a cleaner and more efficient database
design.
❖ First Normal Form (1NF)
First Normal Form (1NF) ensures that the structure of a database table is organized in a way
that makes it easier to manage and query.
• A relation is in first normal form if every attribute in that relation is a single-valued
(atomic) attribute, that is, it does not contain any composite or multi-valued attribute.
• It is the first and essential step in reducing redundancy, improving data integrity,
and reducing anomalies in relational database design.
A relation (table) is said to be in First Normal Form (1NF) if:
• All the attributes (columns) contain only atomic (indivisible) values.
• Each column contains values of a single type.
• Each record (row) is unique, meaning it can be identified by a primary key.
• There are no repeating groups or arrays in any row.
Rules for First Normal Form (1NF) in DBMS
To follow the First Normal Form (1NF) in a database, these simple rules must be followed:
Every Column Should Have Single Values
Each column in a table must contain only one value in a cell. No cell should hold multiple
values. If a cell contains more than one value, the table does not follow 1NF.
• Example: A table with columns like [Writer 1], [Writer 2], and [Writer 3] for
the same book ID is not in 1NF because it repeats the same type of information
(writers). Instead, all writers should be listed in separate rows.
All Values in a Column Should Be of the Same Type
Each column must store the same type of data. You cannot mix different types of
information in the same column.
• Example: If a column is meant for dates of birth (DOB), you cannot use it to
store names. Each type of information should have its own column.
Every Column Must Have a Unique Name
Each column in the table must have a unique name. This avoids confusion when retrieving,
updating, or adding data.
• Example: If two columns have the same name, the database system may not
know which one to use.
The Order of Data Doesn’t Matter
In 1NF, the order in which data is stored in a table doesn’t affect how the table works. You
can organize the rows in any way without breaking the rules.
Example:
Consider the COURSES relation below:
In the above table, Courses is a multi-valued attribute, so the relation is not in 1NF. To bring
the table into 1NF, we remove the multi-valued attribute by storing each value in its own row, as given below:
1NF
Now the table is in 1NF as there is no multi-valued attribute present in the table.
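The conversion to 1NF can be sketched as follows. The rows and names are hypothetical, invented for this illustration. The idea is simply that a cell holding several courses is expanded into one row per course.

```python
# Hypothetical rows: the third field holds several courses in one cell,
# which breaks 1NF.
unnormalized = [
    (1, "Asha", "C, C++"),
    (2, "Ravi", "Java"),
    (3, "Meena", "C, DBMS"),
]


def to_1nf(rows):
    """Emit one row per (id, name, course) so that every cell is atomic."""
    flat = []
    for student_id, name, courses in rows:
        for course in courses.split(","):
            flat.append((student_id, name, course.strip()))
    return flat


for row in to_1nf(unnormalized):
    print(row)
# (1, 'Asha', 'C')
# (1, 'Asha', 'C++')
# (2, 'Ravi', 'Java')
# (3, 'Meena', 'C')
# (3, 'Meena', 'DBMS')
```

After the conversion, the student id alone no longer identifies a row; the pair (id, course) does.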
❖ Second Normal Form (2NF)
Second Normal Form (2NF) is based on the concept of fully functional dependency. It is a
way to organize a database table so that it reduces redundancy and ensures data consistency.
Fully Functional Dependency means a non-key attribute depends on the entire primary key,
not just part of it.
For a table to be in 2NF, it must meet the following requirements:
1. Meet 1NF Requirements: The table must first satisfy First Normal Form (1NF),
meaning:
• All columns contain single, indivisible values.
• No repeating groups of columns.
2. Eliminate Partial Dependencies: A partial dependency occurs when a non-prime
attribute (one that is not part of any candidate key) depends on only a part of a composite
candidate key, rather than on the entire key.
By ensuring these steps, a table in 2NF is more efficient and less prone to errors during
updates, inserts, and deletes.
What is Partial Dependency?
The FD (functional dependency) A->B is a partial dependency if B is functionally
dependent on A, and B can also be determined by some proper subset of A.
In other words, if you have a composite key (a primary key made up of more than one
attribute), and an attribute depends on only a subset of that composite key, rather than the
entire key, that is considered a partial dependency.
A partial dependency would occur whenever a non-prime attribute depends functionally on
a part of the given candidate key.
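The definition above can be turned into a small check. The sketch below is an illustration under stated assumptions: the ENROLLMENT relation, its attributes, and the function name partial_dependencies are hypothetical, and the check inspects only the FDs it is given, not every FD they imply.

```python
def partial_dependencies(candidate_keys, fds):
    """Return the FDs in which a proper subset of some candidate key
    determines a non-prime attribute (a 2NF violation)."""
    prime = set().union(*candidate_keys)
    found = []
    for lhs, rhs in fds:
        non_prime_rhs = set(rhs) - prime
        is_part_of_key = any(set(lhs) < set(key) for key in candidate_keys)
        if is_part_of_key and non_prime_rhs:
            found.append((set(lhs), non_prime_rhs))
    return found


# Hypothetical relation ENROLLMENT(student_id, course_id, course_fee, grade)
keys = [{"student_id", "course_id"}]
fds = [
    ({"student_id", "course_id"}, {"grade"}),  # depends on the whole key
    ({"course_id"}, {"course_fee"}),           # depends on only part of the key
]
print(partial_dependencies(keys, fds))
# [({'course_id'}, {'course_fee'})]
```

Moving course_fee into a separate relation keyed by course_id would remove the partial dependency.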
Example:
1. Doesn't Handle Transitive Dependencies: 2NF ensures that non-prime attributes are
fully dependent on the entire primary key, but it doesn't address transitive dependencies. In
a transitive dependency, an attribute depends on another non-key attribute.
For example, if A → B and B → C, then A indirectly determines C. This can lead to further
redundancy and anomalies.
4. Not Sufficient for Some Use Cases: While 2NF is useful for reducing redundancy in
some situations, in real-world applications where data integrity and efficiency are crucial,
additional normalization (like 3NF) might be needed to address more complex
dependencies and optimize data storage and retrieval.
❖ Third Normal Form (3NF)
The Third Normal Form (3NF) builds on the First (1NF) and Second (2NF) Normal Forms.
Achieving 3NF ensures that the database structure is free of transitive dependencies, reducing
the chances of data anomalies. Even though tables in 2NF have reduced redundancy
compared to 1NF, they may still encounter issues like update anomalies.
A relation is in Third Normal Form (3NF) if it satisfies the following two conditions:
1. It is in Second Normal Form (2NF): This means the table has no partial
dependencies (i.e., no non-prime attribute is dependent on a part of a candidate
key).
2. There is no transitive dependency for non-prime attributes: In simpler terms,
no non-key attribute should depend on another non-key attribute. Instead, all
non-key attributes should depend directly on the primary key.
Understanding Transitive Dependency
To fully grasp 3NF, it’s essential to understand transitive dependency. A transitive
dependency occurs when one non-prime attribute depends on another non-prime attribute
rather than depending directly on the primary key. This can create redundancy and
inconsistencies in the database.
For example, if we have the following relationship between attributes:
• A -> B (A determines B)
• B -> C (B determines C)
This means that A indirectly determines C through B, creating a transitive dependency.
3NF eliminates these transitive dependencies to ensure that non-key attributes are directly
dependent only on the primary key.
Conditions for a Table to be in 3NF
A table is in Third Normal Form (3NF) if, for every non-trivial functional dependency
X→Y, at least one of the following holds:
• X is a superkey: This means that the attribute(s) on the left-hand side of the
functional dependency (X) must be a superkey (a key that uniquely identifies a
tuple in the table).
• Y is a prime attribute: This means that every element of the attribute set Y
must be part of a candidate key (i.e., a prime attribute).
To remove the transitive dependency and ensure the relation is in 3NF, we decompose the
original CANDIDATE relation into two separate relations:
1. CANDIDATE: This will store information about the candidates, including
their CAND_NO, CAND_NAME, CAND_STATE, and CAND_AGE:
CANDIDATE (CAND_NO, CAND_NAME, CAND_STATE, CAND_AGE)
2. STATE_COUNTRY: This relation will store information about the states and
their respective countries:
STATE_COUNTRY (CAND_STATE, CAND_COUNTRY)
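The two relations can be tried out with Python's built-in sqlite3 module. This is a sketch only: the column types and the sample rows are made up for illustration, and it assumes the transitive dependency being removed is CAND_STATE -> CAND_COUNTRY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STATE_COUNTRY (
        CAND_STATE   TEXT PRIMARY KEY,
        CAND_COUNTRY TEXT NOT NULL
    );
    CREATE TABLE CANDIDATE (
        CAND_NO    INTEGER PRIMARY KEY,
        CAND_NAME  TEXT,
        CAND_STATE TEXT REFERENCES STATE_COUNTRY(CAND_STATE),
        CAND_AGE   INTEGER
    );
    -- Sample rows are invented for this example.
    INSERT INTO STATE_COUNTRY VALUES ('Haryana', 'India'), ('Punjab', 'India');
    INSERT INTO CANDIDATE VALUES
        (1, 'Asha',  'Haryana', 24),
        (2, 'Ravi',  'Punjab',  22),
        (3, 'Meena', 'Haryana', 25);
""")

# Each state's country is stored once; a join recovers the original wide view.
rows = conn.execute("""
    SELECT c.CAND_NO, c.CAND_NAME, c.CAND_STATE, s.CAND_COUNTRY, c.CAND_AGE
    FROM CANDIDATE AS c
    JOIN STATE_COUNTRY AS s ON s.CAND_STATE = c.CAND_STATE
    ORDER BY c.CAND_NO
""").fetchall()
for row in rows:
    print(row)
```

Two candidates share the state Haryana, yet its country appears in only one row of STATE_COUNTRY, so changing it is a single update.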
❖ Boyce-Codd Normal Form (BCNF)
Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF used to reduce redundancy
in databases. It ensures that for every non-trivial functional dependency, the left side must
be a superkey. This helps create cleaner and more consistent database designs, especially
when there are multiple candidate keys.
Rules for BCNF
• Rule 1: The table should be in the 3rd Normal Form.
• Rule 2: X should be a super-key for every functional dependency (FD) X−>Y in
a given relation.
Note: To test whether a relation is in BCNF, we identify all the determinants and make sure
that they are candidate keys.
Key Notes:
1. To verify BCNF, identify all determinants (left side of FDs) and check whether each is a
candidate key.
2. If a relation is in BCNF, it is automatically in 3NF, 2NF, and 1NF as well.
The normal forms become stricter as we move from 1NF → 2NF → 3NF → BCNF:
• 1NF: Each field must hold atomic (indivisible) values.
• 2NF: No partial dependency on a primary key.
• 3NF: No transitive dependency on a primary key.
• BCNF: Every determinant must be a candidate key.
This progression ensures better structure and removes redundancy at each level.
Why Do We Need BCNF?
• 2NF and 3NF may allow anomalies if a functional dependency exists where the
determinant is not a superkey.
• BCNF handles edge cases where 3NF fails to remove all redundancy, especially
in tables with multiple candidate keys.
• Prevents update, insert, and delete anomalies by ensuring every determinant is a
superkey.
• Makes database design more robust and easier to maintain over time.
• Improves data consistency and clarity by removing hidden or indirect
dependencies.
The following examples illustrate the properties of BCNF.
Example 1
Consider a relation R with attributes (student, teacher, subject).
FD: { (student, teacher) -> subject, (student, subject) -> teacher, teacher -> subject }
• Candidate keys are (student, teacher) and (student, subject).
• The above relation is in 3NF: the only FD whose left side is not a superkey is
teacher -> subject, and its right side, subject, is a prime attribute. A relation
R is in BCNF if, for every non-trivial FD X->Y, X is a superkey.
• The above relation is not in BCNF because, in the FD teacher -> subject,
teacher is not a key. As a result, the relation suffers from anomalies.
• For example, if we delete the student Tahira, we also lose the information
about which teacher teaches C. This issue occurs because teacher is a determinant
but not a candidate key.
R is divided into two relations R1(Teacher, Subject) and R2(Student, Teacher).
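The effect of this split can be checked on a small data set. The rows and the teacher labels T1, T2, T3 below are hypothetical; the sketch projects R onto R1 and R2, rejoins them on teacher, and confirms that nothing is lost or invented.

```python
# Hypothetical rows of R(student, teacher, subject); each teacher teaches one subject.
R = {
    ("Tahira", "T1", "C"),
    ("Tahira", "T2", "Java"),
    ("Amit",   "T1", "C"),
    ("Sara",   "T3", "Java"),
}

# Project R onto the two BCNF relations.
R1 = {(teacher, subject) for _, teacher, subject in R}  # R1(Teacher, Subject)
R2 = {(student, teacher) for student, teacher, _ in R}  # R2(Student, Teacher)

# A natural join on teacher rebuilds R exactly, so the decomposition is lossless.
rejoined = {
    (student, teacher, subject)
    for student, teacher in R2
    for t, subject in R1
    if t == teacher
}
print(rejoined == R)  # True

# Deleting Tahira's rows from R2 no longer deletes the fact that T2 teaches Java.
R2 = {(s, t) for s, t in R2 if s != "Tahira"}
print(("T2", "Java") in R1)  # True
```

The trade-off is that (student, subject) -> teacher can no longer be checked inside a single relation after the split.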
How to Satisfy BCNF?
To bring this table into BCNF, we have to decompose it into further tables. The procedure
below transforms the table into BCNF. Let us first divide the main
table into two tables, Stu_Branch and Stu_Course.
Stu_Branch Table
Stu_ID Stu_Branch
101 201
101 202
102 401
102 402
Candidate Key for this table: {Stu_ID, Stu_Course_No}.
After this decomposition, every table is in BCNF, because in each functional dependency
X−>Y that holds in a table, X is a super key.
Example 3
Find the highest normal form of a relation R(A, B, C, D, E) with FD set as:
{ BC->D, AC->BE, B->E }
Explanation:
• Step-1: As we can see, (AC)+ = {A, C, B, E, D}, but none of its proper subsets can
determine all attributes of the relation, so AC is a candidate key. A and C
can't be derived from any other attribute of the relation, so there is only 1
candidate key {AC}.
• Step-2: Prime attributes are those attributes that are part of candidate key {A,
C} in this example and others will be non-prime {B, D, E} in this example.
• Step-3: The relation R is in 1st normal form as a relational DBMS does not
allow multi-valued or composite attributes.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not a proper
subset of candidate key AC) and AC->BE is in 2nd normal form (AC is candidate key) and
B->E is in 2nd normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is
a prime attribute) and in B->E (neither B is a super key nor E is a prime attribute), whereas
to satisfy 3rd normal form, either the LHS of an FD should be a super key or the RHS should be a
prime attribute. So the highest normal form of the relation is the 2nd Normal form.
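The steps above can be automated for small schemas. The sketch below is a brute-force illustration, not an efficient algorithm: it enumerates candidate keys by trying attribute subsets, then applies the 2NF, 3NF, and BCNF tests in turn. All function names are assumptions made for this example, and 1NF is taken for granted.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Brute force: minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in map(set, combinations(sorted(attrs), size)):
            # Supersets of a key found earlier are superkeys but not minimal.
            if any(key <= combo for key in keys):
                continue
            if closure(combo, fds) == attrs:
                keys.append(combo)
    return keys


def highest_normal_form(attrs, fds):
    """Return '1NF', '2NF', '3NF' or 'BCNF' (1NF is assumed to hold)."""
    keys = candidate_keys(attrs, fds)
    prime = set().union(*keys)

    def is_superkey(x):
        return closure(x, fds) == attrs

    # 2NF: no proper subset of a candidate key determines a non-prime attribute.
    in_2nf = all(
        not (closure(set(part), fds) - set(part) - prime)
        for key in keys
        for size in range(1, len(key))
        for part in combinations(sorted(key), size)
    )
    # 3NF: each FD has a superkey on the left or only prime attributes on the right.
    in_3nf = in_2nf and all(is_superkey(l) or (r - l) <= prime for l, r in fds)
    # BCNF: each non-trivial FD has a superkey on the left.
    in_bcnf = in_3nf and all(is_superkey(l) or r <= l for l, r in fds)

    if in_bcnf:
        return "BCNF"
    if in_3nf:
        return "3NF"
    return "2NF" if in_2nf else "1NF"


R = set("ABCDE")
FDS = [(set("BC"), set("D")), (set("AC"), set("BE")), (set("B"), set("E"))]
print(candidate_keys(R, FDS))       # a single key containing A and C
print(highest_normal_form(R, FDS))  # 2NF
```

The subset enumeration is exponential in the number of attributes, which is acceptable only for textbook-sized relations.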
Note: A prime attribute cannot be transitively dependent on a key in a BCNF relation.
Consider these functional dependencies of some relation R
AB ->C
C ->B
AB ->B
From the given functional dependencies, the candidate keys of relation R are AB and AC.
On close observation, we see that B depends transitively on AB through C, making it a
transitive dependency.
• The first and third dependencies are in BCNF as their left sides are candidate
keys.
• The second dependency is not in BCNF, but it's in 3NF since the right side is a
prime attribute.
So, the highest normal form of relation R is 3NF.
Example 4
For example consider relation R(A, B, C)
A -> BC,
B -> A
A and B both are super keys so the above relation is in BCNF.
Note: A dependency-preserving BCNF decomposition is not always possible; a lossless-join
BCNF decomposition, however, always exists. For example, consider relation R (V, W, X, Y,
Z), with functional dependencies:
V, W -> X
Y, Z -> X
W -> Y
A BCNF decomposition that first splits on W -> Y, into (W, Y) and (V, W, X, Z), separates Y
from X and Z, so Y, Z -> X can no longer be checked within a single relation.
Note: Redundancies are sometimes still present in a BCNF relation as it is not always
possible to eliminate them completely.
❖ Fourth Normal Form (4NF)
Multivalued Dependency
A multivalued dependency occurs in a relation when one attribute determines a set of
values of another attribute, independently of the remaining attributes. A multivalued
dependency always requires at least three attributes because it consists of at least two
attributes that are dependent on a third.
For a dependency A ->> B, if multiple values of B exist for a single value of A, then the
table may have a multivalued dependency. For A ->> B to be a multivalued dependency, the
table should have at least three attributes A, B, and C, and B and C should be independent
of each other.
Example: A course can have multiple instructors, and a course can also have multiple textbook
authors, but instructors and authors are independent of each other. This creates two
independent multivalued dependencies: Course ->-> Instructor and Course ->-> TextBook_Author.
If stored in the same table, this creates redundant combinations and data anomalies. A
multivalued dependency is a generalization of a functional dependency, but they are not the
same.
Properties
A relation R is in 4NF if and only if the following conditions are satisfied:
1. It should be in the Boyce-Codd Normal Form (BCNF).
2. The table should not have any non-trivial multivalued dependency whose left side is not a
superkey.
Key Idea: 4NF eliminates redundancy caused by multivalued dependencies by separating
independent one-to-many relationships into different tables.
A table with a multivalued dependency violates the normalization standard of the Fourth
Normal Form (4NF) because it creates unnecessary redundancies and can contribute to
inconsistent data. To bring this up to 4NF, it is necessary to break this information into two
tables.
Example: Consider a relation R that records, for each course, its instructors and the
authors of its textbooks.
Table R:
Course Instructor TextBook_Author
Management X Churchill
Management Y Peters
Management Z Peters
Finance A Weston
Finance A Gilbert
Problem:
• Each Course has multiple Instructors.
• Each Course has multiple TextBook_Author.
• But Instructor and TextBook_Author are not related to each other.
• This causes repetition of combinations, violating 4NF.
Solution: To remove the MVDs and bring the relation to Fourth Normal Form, we split the
original table into two separate tables, each handling one multivalued dependency. This
improves data integrity and removes redundancy.
Table R1:
Course Instructor
Management X
Management Y
Management Z
Finance A
Table R2:
Course TextBook_Author
Management Churchill
Management Peters
Finance Weston
Finance Gilbert
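The split into R1 and R2 is a pair of projections, which can be reproduced directly. The sketch below uses the rows of Table R as listed; the variable names are illustrative. It also joins the two projections back on Course to show how many combinations a single table would need in order to hold both multivalued dependencies.

```python
# Table R as listed above: (Course, Instructor, TextBook_Author).
R = [
    ("Management", "X", "Churchill"),
    ("Management", "Y", "Peters"),
    ("Management", "Z", "Peters"),
    ("Finance",    "A", "Weston"),
    ("Finance",    "A", "Gilbert"),
]

# Project onto each independent fact; sets drop the duplicate (Finance, A) pair.
R1 = sorted({(course, instructor) for course, instructor, _ in R})
R2 = sorted({(course, author) for course, _, author in R})
print(R1)
print(R2)

# Joining R1 and R2 on Course pairs every instructor with every author of
# that course: 3 x 2 rows for Management plus 1 x 2 rows for Finance.
joined = [(c1, i, a) for c1, i in R1 for c2, a in R2 if c1 == c2]
print(len(R1), len(R2), len(joined))  # 4 4 8
```

The join has 8 rows, more than the 5 listed in Table R, because storing both multivalued dependencies in one table requires every instructor-author pairing for each course.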
Benefits of Decomposition:
1. No repetition of unrelated attribute combinations.
• In a non 4NF table, if two attributes are independently related to a third, their
combinations get repeated unnecessarily.
• This leads to a cartesian product effect, lots of rows just to represent all
combinations.
Example: If a course has 3 instructors and 2 textbook authors, we get 3 × 2 = 6 rows, even
though there's no link between instructors and authors.
After 4NF decomposition: Instructors and authors are stored in separate tables, so:
• Instructors: 3 rows, Authors: 2 rows. No redundant pairings between them.
• Each table contains data with a single multivalued dependency.
• Both tables are now in BCNF and 4NF.
• Ensures cleaner design, efficient storage and no anomalies.
2. Each Table Contains Data with a Single Multivalued Dependency
• Every decomposed table focuses on only one multivalued relationship.
• There is one clear dependency per table (e.g., Course ->-> Instructor OR
Course ->-> Textbook_Author), not both.
Why it's important:
• It simplifies understanding, querying and maintaining the data.
• Each relation represents one fact, reducing logical complexity.
• This aligns with principle of separation of concerns - one table, one purpose.
3. Both Tables Are Now in BCNF and 4NF
After decomposition:
• There are no partial, transitive or multivalued dependencies.
• All attributes are functionally dependent only on the whole key.
Result: The structure now meets
• Boyce-Codd Normal Form (BCNF) as Every determinant is a candidate key.
• Fourth Normal Form (4NF) as No non-trivial MVDs exist.
• Tables are well-structured, normalized and reliable.
4. Ensures Cleaner Design, Efficient Storage and No Anomalies
• Each table is focused and easier to read.
• Developers and DBAs can understand the schema without confusion.
Efficient Storage:
• Redundant rows are eliminated.
• Fewer rows mean less storage space and, often, faster queries.
No Anomalies:
• Insertion anomaly: You can insert a new instructor without needing a textbook.
• Deletion anomaly: Deleting a textbook doesn't remove the instructor.
• Update anomaly: An update happens in one place only, so there is no risk of mismatched
data.
Note: Decomposing tables to eliminate multivalued dependencies isn't just about "following
rules"; it's about making your data model more logical, efficient and future-proof.
Theory
Consider a schema R(A,B,C,D) and functional dependencies A->B and C->D which
is decomposed into R1(AB) and R2(CD)
Example
Let a relation R(A,B,C,D) and a set of FDs F = { A -> B , A -> C , C -> D} be given.
A relation R is decomposed into –
One common approach is normalization, which divides large tables into smaller, related
tables to eliminate redundancy and ensure consistency. This process reduces anomalies such
as update, insertion, and deletion issues. However, it may increase the complexity of queries
due to the need for joins.
Another approach is denormalization, which adds redundant data to improve query
performance by reducing the number of joins. While it simplifies data access and speeds up
queries, it can lead to data inconsistency and increased storage requirements if not managed
carefully.
Vertical partitioning splits a table into smaller tables based on columns, improving query
performance by reducing I/O operations. This approach is useful when queries frequently
access specific columns. However, it can complicate the schema if queries require data from
multiple tables.
Horizontal partitioning divides a table into smaller tables based on rows, enhancing
scalability and query performance by reducing the amount of data scanned. This is
particularly effective for large datasets but may complicate queries that span multiple
partitions.
Schema refinement also involves applying constraints to enforce data integrity. Examples
include:
• Primary Key Constraint: Ensures each record in a table is unique.
• Foreign Key Constraint: Maintains consistency between related tables.
• Unique Constraint: Ensures all values in a column are distinct.
• Not Null Constraint: Prevents null values in specific columns.
• Check Constraint: Enforces specific conditions on column values.
• Default Constraint: Assigns default values to columns when none are provided.
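The constraints listed above can be demonstrated with Python's built-in sqlite3 module. The department and employee tables, their columns, and the sample values are hypothetical. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON is issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        salary  REAL CHECK (salary >= 0),
        status  TEXT DEFAULT 'active',
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department (dept_id, name) VALUES (1, 'Sales');
    INSERT INTO employee (emp_id, email, salary, dept_id)
        VALUES (10, 'a@example.com', 5000, 1);
""")

# The DEFAULT constraint filled in the status that the insert left out.
status = conn.execute("SELECT status FROM employee WHERE emp_id = 10").fetchone()[0]
print(status)  # active

bad_rows = [
    (11, "a@example.com", 100, 1),   # duplicate email  -> UNIQUE
    (12, None, 100, 1),              # missing email    -> NOT NULL
    (13, "c@example.com", -5, 1),    # negative salary  -> CHECK
    (14, "d@example.com", 100, 99),  # unknown dept     -> FOREIGN KEY
    (10, "e@example.com", 100, 1),   # reused emp_id    -> PRIMARY KEY
]
rejected = 0
for row in bad_rows:
    try:
        conn.execute(
            "INSERT INTO employee (emp_id, email, salary, dept_id) VALUES (?, ?, ?, ?)",
            row,
        )
    except sqlite3.IntegrityError as exc:
        rejected += 1
        print("rejected:", exc)
print(rejected)  # 5
```

Each bad row violates exactly one constraint, so the database rejects all five and the table keeps its single valid row.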
Schema refinement should also focus on performance optimization. This includes choosing
appropriate data types, creating indexes, and partitioning data effectively. These steps help
improve query performance and reduce database overhead.
Conclusion
Effective schema refinement is essential for building a reliable and efficient database. By
normalizing data, applying constraints, and optimizing for performance, you can create a
schema that minimizes redundancy, ensures data integrity, and supports scalable operations.
This process lays the foundation for a robust database that meets both current and future
application needs.
Example
Condition-1 for MVD
t1[a] = t2[a] = t3[a] = t4[a]
Finding from table,
t1[a] = t2[a] = t3[a] = t4[a] = Geeks
So, condition 1 is Satisfied.
Condition-2 for MVD
t1[b] = t3[b]
And
t2[b] = t4[b]
Finding from table,
t1[b] = t3[b] = MS
And
t2[b] = t4[b] = Oracle
So, condition 2 is Satisfied.
Condition-3 for MVD
∃c ∈ R-(a ∪ b) where R is the set of attributes in the relational table.
t1[c] = t4[c]
And
t2[c]=t3[c]
Finding from table,
t1[c] = t4[c] = Reading
And
t2[c] = t3[c] = Music
So, condition 3 is Satisfied. All conditions are satisfied, therefore,
a --> --> b
According to table we have got,
name --> --> project
And for,
a --> --> c
We get,
name --> --> hobby
Hence, we know that MVD exists in the above table and it can be stated by,
name --> --> project
name --> --> hobby
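The three conditions amount to a swap test: whenever two rows agree on a, the row that takes b from one and the remaining attributes from the other must also be in the table. The sketch below applies that test; the four rows are reconstructed from the values quoted in the conditions above, and the function name mvd_holds is an assumption made for this example.

```python
def mvd_holds(rows, x, y):
    """Check X ->-> Y on `rows` (a list of dicts): for every pair of rows that
    agree on X, the row taking Y from the first and everything else from the
    second must also be present."""
    attrs = list(rows[0].keys())
    present = {tuple(r[a] for a in attrs) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                swapped = tuple(t1[a] if a in y else t2[a] for a in attrs)
                if swapped not in present:
                    return False
    return True


# Rows t1..t4, reconstructed from the values used in the conditions above.
table = [
    {"name": "Geeks", "project": "MS",     "hobby": "Reading"},
    {"name": "Geeks", "project": "Oracle", "hobby": "Music"},
    {"name": "Geeks", "project": "MS",     "hobby": "Music"},
    {"name": "Geeks", "project": "Oracle", "hobby": "Reading"},
]
print(mvd_holds(table, ["name"], ["project"]))      # True
print(mvd_holds(table, ["name"], ["hobby"]))        # True
print(mvd_holds(table[:2], ["name"], ["project"]))  # False: two pairings are missing
```

With only the first two rows, the pairings (MS, Music) and (Oracle, Reading) are absent, so the dependency fails.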
Conclusion
• Multivalued Dependency (MVD) is a form of data dependency where two or
more attributes, other than the key attribute, are independent of
each other, but each of these attributes depends on the key.
• Data errors and redundancies may result from Multivalued Dependency.
• We can normalize the database to 4NF in order to get rid of Multivalued
Dependency.
(OR)
Multivalued dependencies (MVDs) are a type of data dependency in relational databases where
one attribute determines multiple independent values of another attribute. However, there are
some misconceptions or incorrect statements about MVDs that need clarification.
1. MVDs Do Not Imply Functional Dependency: It is incorrect to assume that
multivalued dependencies imply functional dependency. While functional dependency
ensures a unique mapping between attributes, MVDs allow multiple independent
values for a single attribute without violating data consistency.
2. MVDs Do Not Always Lead to Data Redundancy: While MVDs can cause redundancy
in certain cases, they do not inherently lead to redundancy unless the database is not
normalized to the Fourth Normal Form (4NF). Proper normalization can eliminate
redundancy caused by MVDs.
3. MVDs Are Not Limited to Two Attributes: It is a misconception that MVDs only
involve two attributes. They can involve multiple attributes, as long as the conditions
for MVD are satisfied.
4. MVDs Are Not Errors in Design: MVDs are not inherently errors or flaws in database
design. They are natural occurrences in certain data relationships and can be
managed effectively through normalization.
5. MVDs Do Not Violate Database Integrity: When properly handled, MVDs do not
compromise data integrity. They are a logical representation of certain attribute
relationships and can be normalized to maintain consistency.
Understanding these points ensures a clearer perspective on MVDs and their role in database
design and normalization.