Review of Basic Database Concepts
CPS 296.1 Topics in Database Systems

What's a database system?
According to Oxford Dictionary
Database: an organized body of related information
Database system, DataBase Management System, or DBMS: a software system that facilitates the creation and maintenance and use of an electronic database
More precisely, a DBMS should support
Efficient and convenient querying and updating of large amounts of persistent data
Safe, multi-user access

Two important questions
What is the right API for a DBMS?
Data model
How is the data structured conceptually?
Query language
How do users ask queries about the data?
How does the DBMS support the API?
Query processing and optimization
What is the most efficient way to answer a query?
Transaction processing
How are atomicity, consistency, isolation, and durability of transactions ensured?

Entity-relationship (E/R) diagram
Entities: students and courses
Relationships: students enroll in courses
[Figure: E/R diagram. Entity Student (attributes SID, name, age, GPA) is connected through relationship Enroll to entity Course (attributes CID, title).]
Widely used for database design by humans
DBMS does not need a graphical data model

Before the relational revolution
Hierarchical and network data models
Relationships are modeled as pointers
Queries require explicit pointer following
Example: a simplified CODASYL query
    Student.GPA := 4.0
    FIND Student RECORD BY CALC-KEY
    FIND OWNER OF CURRENT Student-Course SET
    IF Course.CID = 'CPS 296' THEN PRINT Student.name
Assume that we can quickly find student records by GPA
Assume there is a pointer from students to courses
How about navigating from courses to students?

Physical data independence
Problems with hierarchical and network data models
Access to data is not declarative
Whenever data is reorganized, applications must be reprogrammed!
Physical data independence:
Applications should not need to worry about how data is physically structured and stored
Applications should work with a logical data model and declarative query language
Leave the implementation details and optimization to DBMS

Relational data model
A database is a collection of relations (or tables)
Each relation has a list of attributes (or columns)
Each relation contains a set of tuples (or rows)
Duplicates not allowed

Student
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
857   Lisa       8     4.3
456   Ralph      8     2.3
...   ...        ...   ...

Course
CID       title
CPS 296   Topics in Database Systems
CPS 216   Advanced Database Systems
CPS 116   Intro. to Database Systems
...       ...

Enroll
SID   CID
142   CPS 296
142   CPS 216
123   CPS 296
857   CPS 296
857   CPS 116
456   CPS 116
...   ...

Schema versus instance
Schema (metadata)
Structure and constraints over data
Student (SID integer, name string, age integer, GPA float)
Course (CID string, title string)
Enroll (SID integer, CID string)
Student.SID is a key, Enroll.SID is a foreign key referencing Student.SID, etc.
Changes infrequently
Instance
Actual contents that conform to the schema
{ <142, Bart, 10, 2.3>, <123, Milhouse, 10, 3.1>, ...}
{ <CPS 296, Topics in Database Systems>, ...}
{ <142, CPS 296>, <142, CPS 216>, ...}
Changes frequently
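
The schema/instance split can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not part of the original slides: the SQLite column types (INTEGER, TEXT, REAL) stand in for the slide's integer/string/float, and the helper name `enroll_is_accepted` is made up for illustration. Only the constraints the slide states (Student.SID is a key, Enroll.SID references it) are declared.

```python
import sqlite3

# Schema (metadata): structure and constraints. Expected to change rarely.
SCHEMA = """
CREATE TABLE Student (SID INTEGER PRIMARY KEY, name TEXT, age INTEGER, GPA REAL);
CREATE TABLE Course  (CID TEXT, title TEXT);
CREATE TABLE Enroll  (SID INTEGER REFERENCES Student(SID), CID TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(SCHEMA)

# Instance: the actual rows, which must conform to the schema. Changes often.
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)",
                 [(142, "Bart", 10, 2.3), (123, "Milhouse", 10, 3.1)])
conn.executemany("INSERT INTO Course VALUES (?, ?)",
                 [("CPS 296", "Topics in Database Systems"),
                  ("CPS 216", "Advanced Database Systems")])
conn.executemany("INSERT INTO Enroll VALUES (?, ?)",
                 [(142, "CPS 296"), (142, "CPS 216")])
conn.commit()


def enroll_is_accepted(sid, cid):
    """Try an Enroll insert, then undo it; report whether the schema allowed it."""
    try:
        conn.execute("INSERT INTO Enroll VALUES (?, ?)", (sid, cid))
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.rollback()  # leave the instance unchanged either way
```

Calling `enroll_is_accepted(999, "CPS 296")` is rejected because no student 999 exists, which is the foreign-key constraint from the schema at work.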

Relational algebra
Core set of operators:
Selection, projection, cross product, union, difference, and renaming
Additional, derived operators:
Join, etc.
[Figure: operator diagram]

Selection
Notation: $\sigma_p(R)$
p is called a selection condition/predicate
Output: only rows that satisfy p
Example: Students with GPA higher than 3.0
$\sigma_{GPA > 3.0}(Student)$

Input (Student):
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
857   Lisa       8     4.3
456   Ralph      8     2.3

Output:
SID   name       age   GPA
123   Milhouse   10    3.1
857   Lisa       8     4.3
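
Selection can be mimicked in a few lines of Python. The sketch below assumes a relation is represented as a list of dicts (one dict per row); the function name `select` and that representation are illustrative choices, not something the slides prescribe.

```python
def select(pred, relation):
    """sigma_pred(relation): keep only the rows that satisfy pred."""
    return [row for row in relation if pred(row)]


STUDENT = [
    {"SID": 142, "name": "Bart", "age": 10, "GPA": 2.3},
    {"SID": 123, "name": "Milhouse", "age": 10, "GPA": 3.1},
    {"SID": 857, "name": "Lisa", "age": 8, "GPA": 4.3},
    {"SID": 456, "name": "Ralph", "age": 8, "GPA": 2.3},
]

# Students with GPA higher than 3.0
high_gpa = select(lambda row: row["GPA"] > 3.0, STUDENT)
```

The predicate is an ordinary function of one row, so any selection condition can be passed in without changing `select` itself.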

Projection
Notation: $\pi_L(R)$
L is a list of columns in R
Output: only the columns in L
Duplicate rows are removed
Example: age distribution of students
$\pi_{age}(Student)$

Input (Student):
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
857   Lisa       8     4.3
456   Ralph      8     2.3

Output:
age
10
8

Cross product
Notation: $R \times S$
Output: for each row r in R and each row s in S, output a row rs (concatenation of r and s)
Example: $Student \times Enroll$

Student:
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
...   ...        ...   ...

Enroll:
SID   CID
142   CPS 296
142   CPS 216
123   CPS 296
...   ...

Output:
SID   name       age   GPA   SID   CID
142   Bart       10    2.3   142   CPS 296
142   Bart       10    2.3   142   CPS 216
142   Bart       10    2.3   123   CPS 296
123   Milhouse   10    3.1   142   CPS 296
123   Milhouse   10    3.1   142   CPS 216
123   Milhouse   10    3.1   123   CPS 296
...   ...        ...   ...   ...   ...
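
Projection and cross product can be sketched the same way. The code below assumes rows are dicts and, because Student and Enroll both have a SID column, qualifies every output column of the cross product with its table name; the function names and that naming scheme are assumptions made for this illustration.

```python
def project(columns, relation):
    """pi_columns(relation): keep only the listed columns and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:  # relations are sets, so duplicates are removed
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


def cross(r, r_name, s, s_name):
    """R x S: concatenate every row of r with every row of s."""
    return [
        {**{f"{r_name}.{k}": v for k, v in a.items()},
         **{f"{s_name}.{k}": v for k, v in b.items()}}
        for a in r
        for b in s
    ]


STUDENT = [
    {"SID": 142, "name": "Bart", "age": 10, "GPA": 2.3},
    {"SID": 123, "name": "Milhouse", "age": 10, "GPA": 3.1},
    {"SID": 857, "name": "Lisa", "age": 8, "GPA": 4.3},
    {"SID": 456, "name": "Ralph", "age": 8, "GPA": 2.3},
]
ENROLL = [
    {"SID": 142, "CID": "CPS 296"},
    {"SID": 142, "CID": "CPS 216"},
    {"SID": 123, "CID": "CPS 296"},
]

ages = project(["age"], STUDENT)                           # two rows: age 10 and age 8
pairs = cross(STUDENT[:2], "Student", ENROLL, "Enroll")    # 2 x 3 = 6 rows
```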

Derived operator: join
Notation: $R \bowtie_p S$ (shorthand for $\sigma_p(R \times S)$)
p is called a join condition/predicate
Example: students and CIDs of their courses
$Student \bowtie_{Student.SID = Enroll.SID} Enroll$

Student:
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
...   ...        ...   ...

Enroll:
SID   CID
142   CPS 296
142   CPS 216
123   CPS 296
...   ...

Output (rows of the cross product that satisfy Student.SID = Enroll.SID):
SID   name       age   GPA   SID   CID
142   Bart       10    2.3   142   CPS 296
142   Bart       10    2.3   142   CPS 216
123   Milhouse   10    3.1   123   CPS 296
...   ...        ...   ...   ...   ...

Union and difference
Union
Notation: $R \cup S$
R and S must have identical schema
Output:
Same schema as R and S
Contains all rows in R and all rows in S, with duplicates eliminated
Difference
Notation: $R - S$
R and S must have identical schema
Output:
Same schema as R and S
Contains all rows in R that are not found in S
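
Since join is defined as a selection over a cross product, it can be written exactly that way. The following sketch assumes rows are plain tuples with positional columns (Student as (SID, name, age, GPA), Enroll as (SID, CID)); `theta_join`, `union`, and `difference` are illustrative names.

```python
def theta_join(r, s, pred):
    """R join_p S, computed literally as sigma_p(R x S)."""
    return [a + b for a in r for b in s if pred(a, b)]


def union(r, s):
    """R union S: all rows of both inputs, duplicates eliminated."""
    return list(dict.fromkeys(r + s))  # dict keys keep first-seen order


def difference(r, s):
    """R - S: rows of r that are not found in s."""
    s_rows = set(s)
    return [row for row in r if row not in s_rows]


STUDENT = [(142, "Bart", 10, 2.3), (123, "Milhouse", 10, 3.1)]
ENROLL = [(142, "CPS 296"), (142, "CPS 216"), (123, "CPS 296")]

# Students and CIDs of their courses: Student.SID = Enroll.SID
student_courses = theta_join(STUDENT, ENROLL, lambda st, en: st[0] == en[0])
```

Evaluating the join this way touches every pair of rows, which is fine for a definition but is the cost that real join algorithms try to avoid.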

Renaming
Notation: $\rho_S(R)$, or $\rho_{S(A_1, A_2, \ldots)}(R)$
Purpose: rename a table and/or its columns
No real processing involved
Used to avoid confusion caused by identical column names
Example: all pairs of (different) students
$\rho_{Student1(SID1, name1, age1, GPA1)}(Student) \bowtie_{SID1 \neq SID2} \rho_{Student2(SID2, name2, age2, GPA2)}(Student)$

Relational algebra example
Names of students in CPS 296 with 4.0 GPA
$\pi_{name}\left(\sigma_{GPA = 4.0}(Student) \bowtie_{Student.SID = Enroll.SID} \sigma_{CID = \text{'CPS 296'}}(Enroll)\right)$
Compare this query to the CODASYL version!
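
The relational algebra query for "names of students in CPS 296 with 4.0 GPA" can be evaluated bottom-up by composing small operator functions. This is a sketch under assumptions: rows are positional tuples, the helper names are invented here, and the student Martin with a 4.0 GPA is a hypothetical row added so that the result is non-empty (no student in the slides' sample data has exactly 4.0).

```python
def select(pred, relation):
    return [t for t in relation if pred(t)]


def join(r, s, pred):
    return [a + b for a in r for b in s if pred(a, b)]


def project(positions, relation):
    out = []
    for t in relation:
        row = tuple(t[i] for i in positions)
        if row not in out:  # eliminate duplicates
            out.append(row)
    return out


# Columns: (SID, name, age, GPA). Martin is a hypothetical row.
STUDENT = [(142, "Bart", 10, 2.3), (123, "Milhouse", 10, 3.1), (999, "Martin", 10, 4.0)]
# Columns: (SID, CID)
ENROLL = [(142, "CPS 296"), (142, "CPS 216"), (123, "CPS 296"), (999, "CPS 296")]

top_students = select(lambda t: t[3] == 4.0, STUDENT)             # sigma GPA = 4.0
in_cps296 = select(lambda t: t[1] == "CPS 296", ENROLL)           # sigma CID = 'CPS 296'
joined = join(top_students, in_cps296, lambda a, b: a[0] == b[0])  # join on SID
names = project([1], joined)                                       # pi name
```

Nothing here says how to find the rows; each step only states which rows are wanted, which is the contrast with the pointer-following CODASYL program.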

SQL
SQL (Structured Query Language)
Pronounced "S-Q-L" or "sequel"
The query language of every commercial DBMS
Simplest form:
    SELECT A1, A2, ..., An
    FROM R1, R2, ..., Rm
    WHERE condition;
Also called an SPJ (select-project-join) query
Equivalent (more or less) to relational algebra query
$\pi_{A_1, A_2, \ldots, A_n}\left(\sigma_{condition}(R_1 \times R_2 \times \cdots \times R_m)\right)$
Unlike relational algebra, SQL preserves duplicates by default

SQL example
Names of students in CPS 296 with 4.0 GPA
    SELECT Student.name
    FROM Student, Enroll
    WHERE Enroll.CID = 'CPS 296'
      AND Enroll.SID = Student.SID
      AND Student.GPA = 4.0;
Compare this query to the CODASYL version!
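
The SQL example can be run as-is against an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for "a DBMS", the column types are SQLite's, and the student Martin (GPA 4.0) is a hypothetical row added so the query returns something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (SID INTEGER PRIMARY KEY, name TEXT, age INTEGER, GPA REAL);
CREATE TABLE Enroll  (SID INTEGER, CID TEXT);
""")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [(142, "Bart", 10, 2.3), (123, "Milhouse", 10, 3.1),
     (999, "Martin", 10, 4.0)],  # Martin is hypothetical
)
conn.executemany(
    "INSERT INTO Enroll VALUES (?, ?)",
    [(142, "CPS 296"), (142, "CPS 216"), (123, "CPS 296"), (999, "CPS 296")],
)

QUERY = """
SELECT Student.name
FROM Student, Enroll
WHERE Enroll.CID = 'CPS 296'
  AND Enroll.SID = Student.SID
  AND Student.GPA = 4.0;
"""
names = [row[0] for row in conn.execute(QUERY)]
```

The query text says nothing about indexes, pointers, or join order; those decisions are left to the DBMS.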

More SQL features
    SELECT [DISTINCT] list_of_output_columns
    FROM list_of_tables
    WHERE where_condition
    GROUP BY list_of_group_by_columns
    HAVING having_condition
    ORDER BY list_of_order_by_columns;
Operational semantics
FROM: take the cross product of list_of_tables
WHERE: apply where_condition
GROUP BY: group result tuples according to list_of_group_by_columns
HAVING: apply having_condition to the groups
SELECT: apply list_of_output_columns (preserve duplicates)
DISTINCT: eliminate duplicates
ORDER BY: sort the result by list_of_order_by_columns

SQL example with aggregation
Find the average GPA for each age group with at least three students
    SELECT age, AVG(GPA)
    FROM Student
    GROUP BY age
    HAVING COUNT(*) >= 3;

Student:
SID   name       age   GPA
142   Bart       10    2.3
857   Lisa       8     4.3
123   Milhouse   10    3.1
456   Ralph      8     2.3
789   Jessica    10    4.2

After GROUP BY (two groups, age 10 and age 8):
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
789   Jessica    10    4.2

857   Lisa       8     4.3
456   Ralph      8     2.3

After HAVING (only the age 10 group has at least three rows):
SID   name       age   GPA
142   Bart       10    2.3
123   Milhouse   10    3.1
789   Jessica    10    4.2

After SELECT:
age   AVG(GPA)
10    3.2
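
To check the operational semantics against a real engine, the sketch below runs the aggregation query in SQLite and also replays the GROUP BY, HAVING, and SELECT steps by hand in Python. The use of SQLite and the variable names are assumptions for illustration; the data is the five-student table from the example.

```python
import sqlite3
from collections import defaultdict

STUDENT = [
    (142, "Bart", 10, 2.3),
    (857, "Lisa", 8, 4.3),
    (123, "Milhouse", 10, 3.1),
    (456, "Ralph", 8, 2.3),
    (789, "Jessica", 10, 4.2),
]

# 1. Let the DBMS evaluate the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (SID INTEGER, name TEXT, age INTEGER, GPA REAL)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", STUDENT)
sql_rows = conn.execute(
    "SELECT age, AVG(GPA) FROM Student GROUP BY age HAVING COUNT(*) >= 3"
).fetchall()

# 2. Replay the operational semantics step by step.
groups = defaultdict(list)                 # GROUP BY age
for _sid, _name, age, gpa in STUDENT:
    groups[age].append(gpa)
kept = {age: gpas for age, gpas in groups.items() if len(gpas) >= 3}  # HAVING
manual_rows = [(age, sum(gpas) / len(gpas)) for age, gpas in kept.items()]  # SELECT
```

Both routes produce a single row for age 10 whose average is 3.2 up to floating-point rounding, so compare the averages with a tolerance instead of `==`.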

Summary: relational query languages
Not your general-purpose programming language
Not expected to be Turing-complete
Not intended to be used for complex calculations
Amenable to much optimization
More declarative than languages for hierarchical and network data models
No explicit pointer following
Replaced by joins that can be easily reordered
Next: How do we support relational query languages efficiently?

Access paths
Store data in ways to speed up queries
Heap file: unordered set of records
B+-tree index: disk-based balanced search tree with logarithmic lookup and update
Linear/extensible hashing: disk-based hash tables that can grow dynamically
Bitmap indexes: potentially much more compact
And many more
One table may have multiple access paths
One primary index that stores records directly
Multiple secondary indexes that store pointers to records
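
The difference between a primary and a secondary access path can be illustrated in memory. In this toy sketch Python dicts stand in for disk-based B+-trees or hash tables, the "pointer" stored in the secondary index is simply the SID, and the heap file is kept only so a full scan can be compared with an index lookup; all names are made up for the example.

```python
from collections import defaultdict

# Heap file: an unordered collection of records (SID, name, age, GPA).
HEAP_FILE = [
    (142, "Bart", 10, 2.3),
    (123, "Milhouse", 10, 3.1),
    (857, "Lisa", 8, 4.3),
    (456, "Ralph", 8, 2.3),
]

# Primary index on SID: the index entries hold the records themselves.
primary_by_sid = {rec[0]: rec for rec in HEAP_FILE}

# Secondary index on age: the entries hold pointers (here, SIDs) to records.
secondary_by_age = defaultdict(list)
for rec in HEAP_FILE:
    secondary_by_age[rec[2]].append(rec[0])


def scan_by_age(age):
    """No index: examine every record in the heap file."""
    return [rec for rec in HEAP_FILE if rec[2] == age]


def lookup_by_age(age):
    """Secondary index: fetch pointers, then follow each one to its record."""
    return [primary_by_sid[sid] for sid in secondary_by_age.get(age, [])]
```

Both functions return the same rows; what differs is how much data has to be touched, and on disk the extra hop per pointer in `lookup_by_age` is where the I/O cost of a secondary index comes from.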

Query processing methods
The same query operator can be implemented in many different ways
Example: $R \bowtie_{R.A = S.B} S$
Nested-loop join: for each tuple of R, and for each tuple of S, join
Index nested-loop join: for each tuple of R, use the index on S.B to find joining S tuples
Sort-merge join: sort R by R.A, sort S by S.B, and merge-join
Hash join: partition R and S by hashing R.A and S.B, and join corresponding partitions
And many more

Motivation for query optimization
The same query can have many different execution plans
Example:
    SELECT Student.name
    FROM Student, Enroll
    WHERE Enroll.CID = 'CPS 296'
      AND Enroll.SID = Student.SID
      AND Student.GPA = 4.0;
Plan 1: evaluate $\sigma_{GPA = 4.0}(Student)$; for each result SID, find the Enroll tuples with this SID and check if CID is 'CPS 296'
Plan 2: evaluate $\sigma_{CID = \text{'CPS 296'}}(Enroll)$; for each result SID, find the Student tuple with this SID and check if GPA is 4.0
Plan 3: evaluate both $\sigma_{GPA = 4.0}(Student)$ and $\sigma_{CID = \text{'CPS 296'}}(Enroll)$, and join them on SID
And many more
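
Two of the join methods listed for R.A = S.B can be compared side by side. The sketch below is an in-memory simplification: tuples are Python tuples, `a` and `b` are column positions, and the partition count of 4 is an arbitrary assumption. A real hash join would write partitions to disk and read them back.

```python
from collections import defaultdict


def nested_loop_join(r, s, a, b):
    """For each tuple of r, and for each tuple of s, output the pair if r.a = s.b."""
    return [(x, y) for x in r for y in s if x[a] == y[b]]


def hash_join(r, s, a, b, num_partitions=4):
    """Partition r and s by hashing the join columns, then join matching partitions."""
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for x in r:
        r_parts[hash(x[a]) % num_partitions].append(x)
    for y in s:
        s_parts[hash(y[b]) % num_partitions].append(y)
    out = []
    for p, r_part in r_parts.items():
        # Tuples with equal join values always land in the same partition number,
        # so only corresponding partitions need to be compared.
        out.extend(nested_loop_join(r_part, s_parts.get(p, []), a, b))
    return out


STUDENT = [(142, "Bart", 10, 2.3), (123, "Milhouse", 10, 3.1), (857, "Lisa", 8, 4.3)]
ENROLL = [(142, "CPS 296"), (142, "CPS 216"), (123, "CPS 296"), (857, "CPS 116")]

nl_result = nested_loop_join(STUDENT, ENROLL, 0, 0)
hj_result = hash_join(STUDENT, ENROLL, 0, 0)
```

The two functions return the same set of pairs, possibly in a different order; the optimizer's job is to decide which implementation is cheaper for the data at hand.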

Query optimization
A huge number of possible execution plans
With different access methods, join order, join methods, etc.
Query optimizer's job
Enumerate candidate plans
Query rewrite: transform queries or query plans into equivalent ones
Estimate costs of plans
Use statistics such as histograms
Pick a plan with reasonably low cost
Dynamic programming
Randomized search

Optimizing for I/O
Location    Cycles    Location          Time
Registers   1         My head           1 min.
Memory      100       Washington D.C.   1.5 hr.
Disk        10^6      Pluto             2 yr.
(source: AlphaSort paper, 1995)
I/O costs dominate database operations
DBMS typically optimizes the number of I/Os
Example: Which of the following is a more efficient way to process SELECT * FROM R ORDER BY R.A;?
Use an available secondary B+-tree index on R.A: follow leaf pointers, which are already ordered by R.A
Just sort the table
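
One way to reason about the ORDER BY question is a back-of-envelope I/O count. The sketch below relies on common textbook cost formulas that the slides do not state, so treat them as assumptions: following a secondary index costs roughly one page I/O per record (plus the leaf pages), and an external merge sort reads and writes every page once per pass. The table size, buffer size, and leaf-page count in the example are made-up numbers.

```python
import math


def sort_passes(num_pages, buffer_pages):
    """Passes of an external merge sort using buffer_pages pages of memory."""
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    runs = math.ceil(num_pages / buffer_pages)       # pass 0 creates sorted runs
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffer_pages - 1))  # each pass merges B-1 runs at a time
        passes += 1
    return passes


def external_sort_ios(num_pages, buffer_pages):
    """Assumed cost: every pass reads and writes each page once."""
    return 2 * num_pages * sort_passes(num_pages, buffer_pages)


def secondary_index_scan_ios(num_records, leaf_pages):
    """Assumed worst case: one data-page I/O per record, plus reading the leaves."""
    return leaf_pages + num_records


# Hypothetical table: 1,000,000 records on 10,000 pages, 101 buffer pages,
# and a B+-tree with 2,500 leaf pages.
sort_cost = external_sort_ios(num_pages=10_000, buffer_pages=101)
index_cost = secondary_index_scan_ios(num_records=1_000_000, leaf_pages=2_500)
```

Under these assumptions sorting needs far fewer I/Os than chasing a pointer per record, which suggests why "just sort the table" can win even though the index is already ordered. With a small table or a highly selective query the balance can shift, so the numbers are only illustrative.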