100% found this document useful (2 votes)
1K views

A First Course in Database Systems (3rd Edition)

Uploaded by

Tayyaba Asif
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
100% found this document useful (2 votes)
1K views

A First Course in Database Systems (3rd Edition)

Uploaded by

Tayyaba Asif
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 294
TE a ee ee 3 “aRETaRUHE | | CSUR - SBSH | A First Course in Database Systems : (Third Edition) Qaere wwe English reprint edition copyright © 2008 by Pearson Education Asia Limited ‘and China Machine Press, Original English language title: A First Cours Ei IBN 978.0-13-502176-7) by Jeffrey D. ‘Copyright © 2008, 2002, 1997 Pearson Education, Ine All rights reserved, Published by arrangement with the original publisher, Pearson Education, In., publishing as Prentice Hall For sale and distribution in the People’s Republic of China exclusively (except mng Kong SAR and Macau SAR). in Database Systems, Third wan and Jennifer Widom, FIM) HHT. Ac {5H Mii Pearson Education RRA. RUA. BALE. TREMP RIERS HF APHIS. BF. 01-2008-3272 ACHES TA) BOLD D PRE, JE aK BUT 8 (Ullman, J. D.), GFL, 2008.8 st Course in Database Systems, Third Ealition ISBN 978-7-111-24733.3 1 Di Oph. ede Ee Hea IW. T3113 + IR 37SENBE 247333 eth: 45.00% FURGARTS, MA TL T. AOT, A RL TAB ZRLLNGIALR, (010) 68326294 ARS BIS KEM MUME, TIRES AE DE Re RE, ART BITTEN, IEEE AEH. (HH LEN EA RAAT SAF de AEH BOTS PS AE Be. ORE RI TORSTEN, a TEAR SME RM. LAA EE, ROMA T AE A iit AEE, (EER ALE AANI HED Fale PEMA A AER AL J AL RAR TR A ee A TA PES TE PRPS OER A RRO ind SRA TSAR, ST RG FREE EBL, PERLE FRADE Be BCH 5 5 EF BAPE EZ Me BBR. 1B SRL ERD SP HG LB A AG tL LR Fae PRA He a PD, i SS eR IEA HR — RE He BLT Ae HRA HER a EA HE TR” LOOREE TERE. SERA HRA T (RACAL CTE Takk | EGUERL SNL Bot bk. Sit SMASH, FVGPearson, McGraw-Hill, Elsevier, MIT, John Wiley & Sons, Cengage tJ % #4 time 2s altar 7 RAE AY Ee PE MA A ATT A A i tak HH Andrew S. Tanenbaum, Bjarne Stroustrup, Brain W. Kernighan, Dennis Ritchie, Jim Gray, Afred V. Aho, John E. Hoperoft, Jeffrey D. Ullman, Abraham Silberschatz, William Stallings, Donald E. Knuth, John L, Hennessy, Larry L. Peterson Aiti # RASH Se feta. EAE BLEEP, CREA, IRR. KAY A, IER TMA BAA A “UE PL BLES A” AY A fe 9 TD hE RET BD BE ATS Te ALS, ee ALE T FUERA A «i RASS PE A tA Sek IE HR CE A 4 ECR WIN BR EHES ed, “URL RES” BE THART AH, RSDRERE tHe TREO, HRS RRMA ERS MBA. FORCE “AARNE” HE ‘i th He EER BURMITE RS . SBMAH OM HRI. PATS Ai, GMETAD RACAL AS i TAI E. Bi Pe LAE Se AG AE BY AN FE BE ADA A RA, BATE RL EEL 9G LE A — AB, BLATT FE iS Ak — A BRA EY, A MELB THRE, BUNA hzedu@ hzbook.com Uy 10) 68995264 Fee we AALS oun tea4@55. 100037 SAH Bats Preface At Stanford, we are on the quarter system, and as a re database instruction is divided into two courses. The firs for students who will use database systems but not is prerequisite for C8245, w to DBMS implementation. Students wishing to go further in the database field then take CSD45(opecal topics), (S046 (DBMS implementation project), and ied for CS145, and Database System Impl "S346. Because many schools are on the semester system wo kinds of database instructi rductory course, here was a need to prod asa single volume, Database Systems: The Co point somewhat. Today, there are two imp relational and semistructured (XML). We have decide object-oriented databases, except in the contexts of systems, herefore to downplay and object-relational What’s New in the Third Edition in Chapters addition to the moved to Chapter 4 a shorter version of the material on ODL, treat vi PREFACE, 2 design language for relation treatment of ODL and OQL is a ‘The material on functional and multivalued dependencies has been mod- ified and remains in Chapter 3. We have changed our viewpoint, 50 that a fenctional dependency i assumed to have a set of attributes om the right. We normal form to include the 3NF synthesis algorithm and to make clear what the tradeoff between 3NF and BCNF is. Chapter 5 contains the coverage of relational algebra fi joined by the treatment of Datalog from recursion in Datalog chapter, number 8, and hhas been augmented with a discs important new topics, including materialized views, and automatic selection of Indexes. ‘The new Chapter 9 is based on the old Chapter 8 (embedded SQL). It is introduced by a new section on 3-tiered architecture. It also includes an expanded discussion of JDBC and new coverage of PHP. Chapter 10 collects a number of advanced SQL topics. The discussion of authorization from the old Chapter 8 has been moved here, as has the discussion the old Chapter 10. Most of the chapter is devoted to the (from the old Chapter 4) and object-relational features wapters 11 and 12 cover XML and systems based on XML. Ex- ial at the end of the old Chapter 4, which has bi ter 12 is devoted to programming, and and XSLT. ludes sections on XPath, XQuery, Use of the Book ‘There is adequate material in this volume for a one-semester course on database ‘modeling and programming. For a one-quarter course, you will probably have to omit some of the topics. We regard Chapters 2-7 as the core of the course. ‘The remaining five chapters contain material from which it is safe to select at will, although we believe that every student should get some exposure to the issues of embedding SQL in standard host languages from one of the sections in Chapter 9. PREFACE vii If, as we do in CS145, you give students a substan fabase-application ‘material such as dependencies, although students need normalization for design, Prerequisites ‘We have used the book at the undergraduates and beginnin oftware systems, software engineer Exercises ‘The book contains extensive exercises, with some for almost every section, We indicate harder exercises or parts of exercises with an exclamation point. "The hhardest exercises have a double exclamation pol Gradiance On-Line Exercises ‘There is an accompanying set of on-line homeworks using a technology devel- corks to their us class” that get a perfect score. ce package for the book includes programming ex- ed for correctness, and if queries that only respond correctly iples is not sufficient to got credit for the problem. Gradiance service can be purchased at http://www. prenhall..con/goal. ir classes should contact their Instructors who want to use the system in Prentice-Hall representative. vill PREFACE Support on the World Wide Web ‘The book's home page is http: //wwy-db. stanford, edu/"ullnan/fedb.htal ‘You will find errata as we learn of them, and backup materials. We are making available the notes for each offer ‘works, projects and exarns. We the second edition that have Mapping the Second Edition to the Third Here is a table giving the correspondence between old sections and new. Old_| New || Old_| New Old_| New Vi [ir fiz fiz 22 [a2 23 [43 [24 [44 33 | 46 34 [31 35 | 32 41 | Web 42 [49 [4s | 49 46 fit a7 |u2|sa | 22 sa [52 38 |25 jes | oa 64 | 64 65/65 |e | 23 r2 | 72 73 [73 |ira | ras 83 [92 84 jos jas [oo 9.1 | Web 8.2. | web | 9.3 | Web 10.1 | 53 102 {54 | 10.3 | web Acknowledgements ‘We wonld like o thank Donald Kossmann for helpful discussions, especially con- cerning XML and its associated programming systems. Also, Bobbie Cochrane assisted us in understanding trigger semantics for a earlier edi A large number of people have helped us, either with the development of this book or its predecessors, or by contacting us with errata in the books and/or other Web-based materials, It is our pleasure to acknowledge them all here. H. Adamski, Brad Adelberg, Gleb Ashimo, Don- Babeock, Bruce Baker, Y Phillip Butler, Karen Butler, Mike Carey, Also Per Christensen, Ed Surajit Chi Chirkova, Nitin Chopra, Lewis Church, Jr. B Tom Di chika, Navier Faz, Greg Fichtenholtz, Bart Fisher, Simon Frettloeh, Jas! PREFACE 7 Meredith Gyn: Gerard Luis Gravano, itzman, Also Sajid Hussai Table of Contents ‘The Worlds of Database Systems 1.1 The Evolution of Database Systems Early Database Managenent Stes 11.3 Smaller and Smaller Systems - 1.14 Bigger and Bigger Systems 1.1.5 Information Integration 1.2 Overview of a Databas 121 Data-Defini 1.2.3 Storage and Buffer Management 1.24 Transaction Processing Relational Database Modeling ‘The Relational Model of Data 2.1 An Overview of Data Models 2.1.1 What is a Data Model? 2.14 The Semistructured Model in Brief . 2.15 Other Data Models 22.1 Attributes 222 Schemas 223° Tuples 224 Domai 2.2.5 Equivalent Representations of a Relation TABLE OF CONTENTS 2.2.6 Relation Instances 22.7 Keys of Relations . . 228 An Example Database Schema 22.9 Exercises for Section 2.2 23 Defining a Relation Schema in SQL 23.1 Relations in SQL 23,7 Exercises for Section 23. . 24. An Algebraic Query Language 2.4.1 Why Do We Need a Special Query Language? - 24.2 What is an Algebra’ 24.3 Overview of Relational Algebra 244 ons on Relations 245 7 246 2A jan Product 24.8 Natural Joins 249 24.12 Relationships Among Operations 24.19 A Liner Netaton for Algsrale Exprendone 24.14 Exercises for Section 2.4 25 Constraints on Relations . 25.1 Relational Algebra. 25.2 Referential Integrity Constraints 25.3 Key Constraints. 254 Additional Constraint Examples. 25.5 Exercises for Section 2.5, 26 Summary of Chapter 2 2.7 References for Chapter 2 Design Theory for Relational Databases 3.1 Functional Dependencies 3.11 Definition of Functional Dependency « 3.1.2 Keys of Rel Superkeys . . 3.2. Rules About Functional Dependencies . 3.2.1 Reasoni 3.2.2 The Spl 1g/Combining Rule ‘a Constraint Vane About Functional Dependencies xii TABLE OF CONTENTS ‘TABLE OF CONTENTS 323 ™ 4 High-Level Database Models 324 Z 4.1 The Entity/Relationship Model 411 325 7 mee a a Relationships 327 80 Entity-Relationship Diagrams 328 al Instances of an E/R Diagram 329 83 Binary E/R Rel 3.3 Design of Relational Database Schemas 85 33.1 Anomalies eee a6) 3.3.2 Decomposing Relations 86 3.3.3 Boyce-Codd Normal Form sees 88 3.34 Decomposition into BCNF.. . . ‘ sees 89 3.3.5 Exercises for Section 3.3 cee 3.4 Decomposition: The Good, Bad, and Ugly 93 34.1 Recovering Information from a Decomposition 4 34.2 The Chase Test for Lossless Join cee 96 343 Why the Chase Works 7 . 99 344 Dependency Preservation aes 00) 34.5 Exercises for Section 34... . 2 102 3.5 Third Normal Form... 102 35.1 Definition of Third Normal Form 102 35.2 The Synthesis Algorithm for 3NF Schemas 103 3.5.3 Why the 3NF Synthesis Algorithm Works we 14 354 = 105 36 = 105; nate Independence and Its Consequent. Redundancy 106 of Multivalued Dependencies... . 107 1g About Multivalued Dependencies 108 Fourth Normal Form : 38 39 1 xiv TABLE OF CONTENTS 4.7.1 UML Classes 472 473 ara § ars 476 : 277) Aegregation and Conposions 4.7.8 Bxercises for 4.8 From UML Diagram 48.1 UML-to-Relatic 482 From UML Si 4.8.3 From Aggregations and Comp 484 The UML Analog of Week Entity Sts 48.5 Exercises for Object Definition Li 49.1 49.2 Attributes in ODL 4.9.3 Relationships in ODL 4.9.4 Inverse Relationships 49.5 Multiplicity of Relationships 496 49.7 Subclasses in ODL 4.98 Declaring Keys in ODL 4.9.9 Bxercises for Section 4.9 4.10 From ODL Designs to Relational Designs. 4.10.1 From ODL Classes to Relations . . 4.10.2 Complex Attributes in Classes |... 4.10.3 Representing Set-Valued Attributes |. 4.10.4 Representing Other Type Constructors 10.5. Representing ODL Relationships 4.10.6 Exercises for Section 4.10 4.11 Summary of Chapter 4. . 4.12 References for Chapter 4 4 Relational Database Programming 5 Algebraic and Logical Query Languages 5.1 Relational Operations on Bags 51 Why Bags? 2... . 5.12 Union, Intersection, and Difference of Bags 5.1.3 Projection of Bags 5.14 Selection on Bags 5.15. Product of Bags 5.16 Joins of Bags 172 173 173 175 205 207 209 210 210 TABLE OF CONTENTS 5.17 Exercises for Section 5.1 . Extended Operators of Relational Algebra 5.2.1 Duplicate Elimination 5. 5.2.2 Aggregation Operators 523 Grouping 5.24 The Grouping Operator . . 5.2.5 Extending the Projection Operator 5.26 The Sorting Operator Outerjoins - 5.34 Meaning of Datalog Rules 5.3.5 Extensional and Intensional Predi 5.36 Datalog Rules Applied to Bags 5.3.7 Exercises for Section 5.3 5A Relational Algebra and Datalog 5A2 Projection 54.3. Selection 544 Product, 545 Joins 5.4.6 Simulating Multiple Operations with Datalog, 5417 Comparison Between Datlog and Relational Algebra 5A8 Exercises for Section 5.4 5.5 Summary of Chapter 5 5.6 References for Chapter 5 base Language SQL Nall Values and Comperisons Involving MULL ‘The Truth: Ordering the Output Exercises for Section 6.1 6.2 Queries Involving More Than One Relation 6.2.2 Disambiguating Attributes 62.3 Tuple Variables xvi TABLE OF CONTENTS 624 Interpret 625 mn Queries Difference of Queries 63.1 63.2 633 634 635 636 637 638 63.9 64. Full-Rel 641 642 643° Grouping and Aggregation in SQL G44 Agsregation Operators 645 646 647 GAS Exercises for Si 6.5 Database Modifications 6.5.1 Insertion. 652 Deletion 653. Updates 6.5.4 Bxercises for Section 6.5 6.6 ‘Transaction 66.1 662 66.3 Transactions 6.64 Read-Only Transactions 6.65 Dirty Reads . : 6.6.6 Other Is Levels 6.6.7 Exercises for Section 6.6 6.7 Summary of Chapter 6 68 References for Chapter 6 and Differences, Null Constraints and Triggers 7.1. Keys and Foreign Keys i 7.1.1 Declaring Foreign-Key Constraints 7.1.2 Maintaining Referential Integrity 7.1.3 Deferred Checking of Constraints TA Exercises for Section 7.1 7.2. Constraints on Attributes and Tuples 262 265 267 268 269 270 mm 273 274 275 276 27 279 281 281 282 283 284 285 287 288 289 201 291 292 204 295 296 296 298 299 300 302 TABLE OF CONTENTS 73 74 82 83 84 85 7.38 Exercises for Section 7.3, Assertions TAL Creating Assertions TA2_ Using Assertions 743. Exercises for Section 7.4 ‘Triggers a 75.1 Triggers in SQL. - 7.5.2 The Options for Trigger Design 7.5.3 Exercises for Section 7.5 Summary of Chapter 7 References for Chapter 7 and Indexes Virtual Views BLA Exercises for Section 8.1 Modifying Views 8.21 View Removal 8.2.2 Updatable Views 8.2.3 Instead-Of Triggers on Views 8.2.4 Exercises for Section 8.2 Indexes in SQL 83.1 Motivation for Indexes 83.2 Declaring Indexes 83.3 Exercises for Section 8.3 Selection of Indexes. - 84.1 A Simple Cost Model 34.2 Some Useful Indexes 843 Caleulat 844 Automatic Selection of Indexes to Create ceeises for Section 84 xvii 319 320 321 323 823 325 325 326 327 328 328 329 30 332 832 334 337 39) 341 341 341 343 343 344 3a 345 345 37 349 350 351 352 352 - 352 2353 355 387 359 359 360 362 TABLE OF CONTENTS xix sli TABLE OF CONTENTS JDBC aa m tic Creation of Materialized Views 364 °8 9.6.1 Introduction to JDBC bert ises for Section 8.5 - 365, 9.6.2 Creating Statements in JDBC . 413. 8.6 Summary of Chapter 8. . . 366 9.6.3 Cursor Operations in JDBC 415 8.7 References for Chapter 8 367, 9.64 Parameter Passing - - fae 416 for Section ria 9 SQU in a Server Environment 360 oo ee +s 9.1 The Three-Tier Architecture = 369 "97.1 PHP Basics aie 1 The Web-Server Tier . . ...370 9.7.2 Arrays - - a8 1.2. The Application Tier 371 9.73 The PEAR DB Library ud 3 The Database Tier i 372 9.74 Creating a Database Connection Using DB 419 92 The SQL Environment . . . 372 Executing SQL Statements 419 921 Environments : 373 ‘Cursor Operations in PHP : : 420 9.2.2 Schemas i 374 Dynamic SQL in PHP ie 9.23 Catalogs 375 9. 78 Exercises for Section 9.7 Lead 824 Clans and Servers in the SOL Envionment 11 375 9.8 Summary of Chapter 9 = 2.5 Connections: ter bead 926. Senora : an ot eeoreer 9.2.7 Modules . 2 878 ‘Advanced Topics in Relational Databases 425, 83 The SQL/Host Language Interface 373 ae peer e gener al 25, 9.3.1 The Impedance Mismatch Problem 380 10. Pr rd 9.3.2 Connecting SQ) lost Language 380 10.1.2 Cree oe 9.3.3 The DECLARE Section - ‘381 10.1.3. The Privilege-Checking Process 428 9.34 Using Shared Variables. . + 382 = fai 9.3.5 Single-Row Select Statements 383 = pee ie 93.6 Cursors Privileg 937 Moen, by Corso 3s Execs or Seton 101 : 7 938 Protecting Against Concurveat Updates 387 squ on 939 Dynamic SQL 338 ing Recursive elations in SQL 437 9.3.10 Exercises for Section 9.3 390, matic ‘Expressions in Recursive SQL 440 94° Stored Procedures 391 ines for Section 10.2 me 941 is and Procedures 391 ie 9.4.2 Some Forms 392, ae 94.3 Branc 394 rae 944 Queri - 3 10.3.3 References - : on 243. Loopetn PSM ee Toad Object Oriented Versus Object Relational 40 946 For-Loops 398, at 94.7 Exceptions in PSM. . 400 cr 402 os 482 405 or . 407 10.4.5 References bia 9.5.3. Fetching Data From a Query Result 408, 10.4.6 Creating Object ID's for Tables . 455 95.4 Passing Parameters to Queries 10 10A.7 Exercises for Section 10.4 : 7 9.5.5 Exercises for Section 9.5 412 Chapter 1 The Worlds of Database Systems 2 CHAPTER 1, THE WORLDS OF DATABASE SYSTEMS query the data (a “query” is database lingo for lata) and modify sing an appropriate ‘a query language or data-manipulation language. language, often 3. Support the storage of very large amounts of data — many terabytes or over a long period of Jing efficient access to the data 1.1.1 Early Database Management Systems ‘The first commercial database management systems appeared ‘These systems evolved from filesystems, which provide some of file systems store data over a long period of time, an lata. However, file systems do not snerally guarantee that port efficient access 1. Banking systems: maintaining accounts and making sure that system failures do not cause money to disappear. systems: these, like banking systems, require assurance be lost, and they must accept very large volumes of small actions by customers, 3. Corporate record keeping: employment and tax records, inventories, sales records, and a great variety of other types of information, much of it ‘The early DBMS's required the programmer to visualize data much as it was stored. These database systems used several different data models for 11. THE EVOLUTION OF DATABASE SYSTEMS 3 n a database, chief among them based model and the graph-based “network” model sd In the late 1960's through a report of CODASYL ment, through able effort. needed to te such programs, even for very simple q 1.1.2. Relational Database Systems ng a famous paper written by Ted Codd in 1970,? database systems changed si Codd proposed that database systems should present the user with a view of data organized as tables called relations. Behind ¢ scenes, there might be a complex data structure that allowed rapid response to a variety of queries. But, unlike the programmers for earlier database sys- toms, the programmer of a relational system would not be concerned with the storage structure, Queries could be expressed in a very high-level language, which greatly increased the efficiency of database programmers. We shall cover the relational model of database systems throughout most of this book, SQL ‘Structured Query Language”), the most important query language based on relational model, is covered extensively. By 1990, relational database systems were the norm. Yet the database field continues to evolve, and new issues and approaches to the manager surface regularly Some of the largest databases are organized rather differently from those using, relational methodology. In the balance of this section, we shall consider soi of the modern trends in database systems, 1.1.3 Smaller and Smaller Systems DBMS's were large, expensive software systems run sm. Today, hundreds of gigabytes fit on a s in a DBMS on a personal computer. TI 4 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS serve as a database, and the methods of querying and manipulating them are different from those relational systems. 1.14 Bigger and Bigger Systems On the other hand, a gigabyte is not that much data any more. Corporate databases routinely store terabytes (10? bytes). Yet there are many databases that store petabytes (10! bytes) of data and serve it all to users. Some impor- tant examples: 1, Google holds petabytes of data gleaned fro data is not held in a traditional DBMS, but optimized for search-engine queries crawl of the Web. This specialized structures Satellites send down petabytes of information for storage in specialized systems, 3. A picture is actually worth way more than a thousand words, You can store 1000 words in five or six thousand bytes. Storing a picture typi- cally takes much more space. Repositories such as Flickr store mi of pictures and support search of those pictures. Even a database Ii Amazon's has millions of pictures of products to serve, ‘And ifstill pictures consume space, movies consume much more. An hour of video requires atleast a gigabyte, Sites such as YouTube hold hundreds of thousands, or millions, of movies and make them available easly. 5, Peer-to-peer file-sharing systems use large networks of conventional com- fers to store and distribute data of various kinds. Although each node the network may only store a few hundred gigabytes, together the database they embody is enormous. 1.1.5 Information Integration To a great extent, the old problem of building and ning databases has become one of information integration: joining the information contained in ‘many related databases into a whole, For example, a large company has many divisions. Bach di own database of products ployee records indepe ‘to mean the same thing or the same term to mean different things. To make matters worse, the existence of legacy applications using each of these databases makes it almost impossible to scrap them, ever. ‘Asa result, it has become necessary with increasing frequency to tures on top of existing databases, with the goal of integrating ld struc- information 1.2. OVERVIEW OF A DATABASE MANAGEMENT SYSTEM 5 distributed among. One popular approach is the creation of data ware- ‘many legacy databases is copied periodically, with the appropriate translation, to a central database. Another approach is the implementation of a medi idleware,” whose function is to sup- pport an integrated model of the data of the various databases, while translating ‘between this model and the actual models used by each database. 1.2 Overview of a Database Management System at the top, to the DBMS: 1, Conventional users and application programs that ask for data or modify data, A database administrator: a person or persons responsible for the struc- ture or schema of the database. 1.2.1 Data-Definition Language Commands ‘The second kind of command is the simpler to process, and we show its trail beginning at the upper right side of Fig. 1.1. For example, the database admin- ‘or, or DBA, for a university registrar's database might decide that there ‘a student, a course the student hat course. The DBA might also red by the DBA, who needs special authority to ex- cute schema-altering commands, since these can have profound effects on the database. ‘These schema-altering data-definition language (DDL) commands are parsed by a DDL processor and passed to the execution engine, which then ‘goes through the index/file/record manager to alt ‘metadata, that schema information for the database. , the: 1.2.2 Overview of Query Processing The great majority of interactions llow the path on the left side of Fig. 1-1. A user or an ay ates some action, using ‘the data-manipulation language (DML). ‘This command does not affect the schema of the database, but may affect the content of the database (if the CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS Bxecuton reo reuc Indewiteree= Figure 1.1; Database management system components OVERVIEW OF A DATABASE MANAGEMENT SYSTEM 7 ‘a modification command) or will extract data from the database (ifthe ‘a query). DML statements are handled by two separate subsystems, Answering the Query pptimized by a query compiler. The resul is the DBMS wi ‘The query is parsed plan, ot sequence of passed to the engine. The exe requests for s s of data, typically resource manager that knows about dat tnd sie of records in those files, and index fle, which lp fi data files quickly. ‘The requests for data are pas ager’s task is to bring appropriate px (disk) where it page or “disk block’ is the unit of transfer between ‘The buffer manager communicates with a storage manager to get data disk. ‘The storage manager migh more typically, the DBMS issues to the buffer manager. The buffer man- ‘Transaction Processing Queries and other DML actions are grouped into transactions, which are tion of transactions must be durable, meaning that the effect of any completed transaction must be preserved even if the system fails in some way right after completion of the transaction, We divide the transaction processor into two major parts: 1. A concurrency-control manager, or scheduler, responsible for assuring, atomicity and isolation of transactions, and 2. A logging and recovery manager, responsible for the durability of trans: actions, 1.2.8 Storage and Buffer Management ‘The data of a database normally resides in secondary storage; in today’s com- puter systems “secondary storage” generally means magnetic disk. However, to perform any useful operation on data, that data must be in main memory. Tt is the job of the storage manager to control the placement of data on disk and its movement between disk and main memory. a simple database storage manager might be nothing more than the file system of the underlying operating system. However, for efficiency OVERVIEW OF A DATABASE MANAGEMENT SYSTEM 9 8 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS purposes, DBMS's normally control storage on the disk directly, at least under The ACID Properties of Transactions some circumstances. ‘The storage manager keeps track of the location of files on the disk and obtains the block or blocks containing a file on request from Properly implemented transactions are commonly said to meet the “ACID the buffer manager. test,” where: ‘The buffer manager is responsible for par ory into buffers, which are page-sized regions ing the available main mem- 10 which disk blocks can be © “AY stands for “atomicity,” the all-or-nothing execution of trans- all DBMS components that need information from the disk actions, ‘© "T” stands for “isolation,” the fact that each transaction must appear to be executed as if no other transaction is executing at the same need include: 1. Data: the contents of the database itself stands for “dur 2. Metadata: the database schema that describes the structure of, and con- ee ee straints on, the database. ‘has completed. The remaini 7 stands for “consistency.” That is, all databases 2 fay evr, formation abut recent changes to the dati thee Helped panera lence data elements (e.g., account balances may not be negative after a trans- died ean gees er bos eae flo hen rast epee opr te cna Eee erred poe components of the database. Indeses: data structures that support efficient access to the data, sei icc at once, Thus, the scheduler (concurrency-control manager) must assure Fi that the indi 1.2.4 Transaction Processing iat the net effect is the same as if the transactions had in i in their entirety, one-at-a-time. A typical scheduler does normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation its work by maintaining locks on certain pieces of the database. These from other transactions, In addition, a DBMS offers the guarantee of durability locks prevent two transactions from accessing the same piece of data in ‘that the work of a completed transaction will never be lost. The transaction ‘ways that interact badly. Locks are generally stored in a main-memory lock table, as suggested by Fig, 1.1. The scheduler affects the execution of 1er database operations by forbidding the execution er locked parts of the database. ‘manager therefore accepts transaction commands from an application, which tell the transaction manager when tran begin and end, as well as infor- ‘mation about the exps (some may not wish to require atomicity, for example). The transaction processor performs the following tasks: locks that the scheduler grants, they can get into a situation where not ccan proceed because each needs something another transaction has. The transaction manager has the respons : -vene and cancel (“roll- back” or “abort”) one or more transactio the others proceed, 1. Logging: In order to assure durat logged separately on disk. ‘The log ‘designed to assure that no matter wh a recovery manager will be change in the database is sone of several policies system failure or “crash” occurs, the log of changes and restore the database to some consistent, state. writes the log in buffers and negotiates with the buffer manager to make sure that buffers are written to disk (where data can survive a crash) at appropriate 1.2.5 The Query Processor times. ‘The portion of the DBMS that most affects the performance that the user sees isthe query proceser Ta Fig. 1 the query broweor represented Dy eo 2. Concurrency contro: Transaction must appear to execu in sl i te gery Ye LL the ery vecented by But in most systems, there will in truth be many transactions executing. 10 CHAPTER 1, THE WORLDS OF DATABASE SYSTEMS ranslates the query into an internal form called uence of operations to be performed on ‘a query plan are implementations of 1. The query compiler, which (a) A query parser, which builds a tree structure from the textual form of the query. A query preprocessor, which performs seman (eg, making sure all relat query pl best available sequence of operations on the actual data, 's about the data to decide fastest. For example, the The query compiler uses metadata and stat which sequence of operations is likely to be th access to data, given values for one or more components of that dat riake one plan much faster than another. 2. The execution engine, which has the responsi the steps in the chosen query ‘most of the other component must get the data accessing data that is locked, and all database changes are properly logged. 1.3 Outline of Database-System Studies We divide the study of databases into five parts. This section is an out each of these units ‘model is essential for a study of database systems. After ex- amining the basic concepts, we delve into the theory of relational databases. ‘That study includes functional dependencies, a formal way of stating that one kind of data is uniquely determined by another. Tt also includes normalization, the process whereby functional dependencies and other formal dependencies are used to improve the design of a relational database ‘We also consider high-level design notations, These mechanisms include the Entity-Relationship (E/R) model, Unified Modeling Language (UML), and Ob- ject Definition Language (ODL). Their purpose isto allow informal exploration ‘of design issues before we implement the design using a relational DBMS. 13. OUTLINE OF DATABASE-SYSTEM STUDIES ul Part II: Relational Database Programming We then take up the matter of how relational databases are queried and modi- fied. After an introduction to abstract programming languages based on algebra and logic (Relational Algebra and Datalog, respectively), we turn our atten- tion to the standard language for relational databases: SQL. We study both the basics and important special topics, including constraint specifications and triggers (active database elements), indexes and other structures to enhance performance, forming SQL into transactions, and security and privacy of data in SQL. We also discuss how SQL is used in complete systems. It is typical to combine SQL with a conventional or host language and to pass data between the database and the conventional program via SQL calls. We discuss a number of ways to make this connection, including embedded SQL, Persistent Stored ‘Modules (P: (DBO), and Part III: Semistructured Data Modeling and Programming ‘The pervasiveness of the Web has put a premium on the management of hierar- chically structured data, because the standards for the Web are based on nested, tagged elements (semistructured data). We defining notations: Document Type Definitions (DTD) and XML Schema. We also examine three query languages for XML: XPATH, XQuery, and Extensible Stylesheet Language Transform (XSLT) Part IV: Database System Implementation We begin with a study of storage management: how disk-based storage can be organized to allow efficient access to data. We explain the commonly used B- a balanced tree of disk blocks and other specialized schemes for managing mensional data, ‘We then turn our attention to query processing. There are two patts to this study. First, we need to learn query execution: the algorithms used to implement the operations from which queries are built. Since data is typically fon disk, the algorithms are somewhat different from what one would expect were they to study the same problems but assuming that data were in main ‘memory. The second step is query compiling. Here, we study how to select an ‘efficient query plan from among all the possible ways in which a given query can be executed ‘Then, we study transaction processing. There are several threads to foliow. ‘One concerns logging: maintaining reliable records of what the DBMS is doing, in order to allow recovery in the event of a crash. Another thread is scheduling controlling the order of events in transactions to assure the ACID properties. We also consider how to deal with deadlocks, and the modificatio rithms that are needed when a transaction is distributed over many 12 CHAPTER 1. THE WORLDS OF DATABASE SYSTEMS sites. Part V: Modern Database System Issues In this part, we take up a number of the ways in which database-system tech- nology is relevant beyond the realm of conventional, relational DBMS's. We ‘consider how search engines work, and the specialized data structures that make their operation possible. We look at information integration, and methodolo- ‘gies for making databases share their data seamlessly. Data mining is a study that includes a number of interesting and important algorithms for processing large amounts of data in complex ways. Data-stream systems deal with data arrives at the system continuously, and whose queries are answered contin= jously and in a timely fashion. Peer-to-peer systems pri for management of distributed data held by independent 1.4 References for Chapter 1 in our citations, but rather shall mention only the papers of historical impor- tance and major secondary sources or useful surveys. A searchable index of je references from many fields. directory of many indexes relevan implementations of database systems widely known are the System R shaped the database field ar ‘The 2003 “Lowell report’ database-system research and dire of this type. 1 most recent in a series of reports on jons. It also has references to earlier reports than is covered ‘can find more about the th here from [2] and [8] y of database 1. $. Abiteboul et al, “The Lowell database research self-assessment," ACM 48:5 (2008 18. http: //research.microsoft .com/"gray ‘Noveli/Love1IDetebaseResearchSelfhesessnont.hta 2. S. Abiteboul, R. Hull, and V, Vianu, Foundations of Databases, Addison- ‘Wesley, Reading, MA, 1995, 3. http: //Linwww. ira.uka.de/bibliography/Database 1.4. REFERENCES FOR CHAPTER 1 3 4. M. M. Astrahan et al., “System R: a relational approach to database management,” ACM Trans. on Database Systems 1:2, pp. 97-137, 1976. hetp://www. informatik.uni-trier.de/“ley/db/index.btal . A mir~ ror site is found at http: //uww.acn.org/signod/dblp/db/ index.html M. Stonebraker and J. M. Hellerstein (eds.), Readings in Database Sys- tems, Morgan-Kaufmann, San Francisco, 1998. M, Stonebraker, E. Wong, P. Kreps, and G. Held, “The design and imple- mentation of INGRES,” ACM Trans. on Database Systems 1:3, pp. 189- 222, 1976. 8. J.D. Ullman, Principles of Database and Knowledge-Base Systems, Vol- umes I and I, Computer Science Press, New York, 1988, 1989. Part I Relational Database Modeling Chapter 2 The Relational Model of Data ‘This chapter introduces the most important model of data: the two-dimensional table, or “relation.” We begin with an overview of data models in general, We nology for relations and show how the model can be used to represent typical forms of data. We then introduce a portion of the language SQL — that part ns and their structure. The chapter closes with an int algebra. We see how this notation serves as both a query language — the aspect of a data model that enables us to ask questions about the data — and as a constraint language — the aspect of a data model that lets us restrict the data in the database in vatious ways. 2.1 An Overview of Data Models jon of a “data model” is one of the most fundamental in the study of systems. In this brief summary of the concept, we define some basic terminology and mention the most important data models. 2.1.1 What is a Data Model? A data model is a notation for describing data or information. The description generally consists of throe parts: 1, Structure of the data, You may be familiar with tools in programming languages such as C or Java for describing the structure of the data used by ‘program: arrays and structures (“structs”) or objects, for example. The data structures used to implement data in the computer are sometimes referred to, in discussions of database systems, as a physical data model, although in fact they are far removed from the gates and electzons that truly serve as the physical implementation of the data, In the database ra 8 CHAPTER 2. THE RELATIONAL MODEL OF DATA world, data models are at a somewhat higher Ie fand are sometimes referred to as a conceptual difference in level. We shall see examples shortly. 2. Operations on the de data are generally anyt! ‘models, there is usually limi We are generally allowed to perform that retzieve information) and modifications (op database). This limi operati ata very high level, yet have the operations efficiently. In cor ‘optimize programs in con Inefficient algorithm (e (eg, quicksort. the data. Database data models us 2.1.2 Important Data Models ‘Today, the two data models of preeminent importance for database systems are 1, The relational model, including object-relational extensions. including XML and related standards, 2. The semistructured-data model ‘The first, which Is present in all com is the subject of this chapter. The semistructured model the primary manifestation, is an added feature of most appears in a number of other contexts as well. We t starting in Chapter 11 2.1.8. The Relational Model in Brief minutes, and the genre of the movie. We show three particular should imagine that there are many more rows to this table — one row for each movie ever made, perhaps. ‘The structure portion of the relational model might appear to resemble an array of structs in C, where the column headers are the field names, and each 2.1, AN OVERVIEW OF DATA MODELS 19 tithe |_vear | length | genre Gone With the Wind] 1939] 231 | drane Star Vars 397 | 124 | sciFi Wayne’s World 1992| 98 | conedy Figure 2.1: An example relation of the rows represent the values of one struct in the array. However, it must be ‘emphasized that this physical implementation is only one possible way the table could be implemented in physical data structures. In fact, it is not the normal ‘way to represent relations, and a large portion of the study of database systems addresses the right ways to implement such tables. Much of the distinction comes from the scale of relations — they are not normally implemented as main-memory structures, and their proper physical implementation must take into account the need to access relations of very large size that are resident on disk, ‘The operations normally associated with the relational model form the “re- lational algebra,” which we discuss beginning in Section 2.4. These operations are table-oriented. As an example, we can ask for all those rows of a relatio that have a certain value in a certain column. For example, we can ask of the table in Fig. 2.1 for all the rows where the genre is “comedy. ‘The constraint portion of the relational data model will be touched upon briefiy in Section 2.5 and covered in more detail in Chapter 7. However, as a brief sample of what kinds of constraints are generally used, we could decide that there is a fixed list of genres for movies, and that the last column of every row must have a value that is on this list. Or we might decide (incorrectly, it turns out) that there could never be two movies with the same title, and constrain the table so that no two rows could have the same string in the fist ‘component. 2.1.4 The Semistructured Model in Brief Semistructured data resembles trees or graphs, rather than tables or arrays. The principal manifestation of this viewpoint today is XML, a way to represent ally nested tagged elements. The tags, similar to those used might appear in an XML “document” as i ‘The operations on semistructured data usually involve following paths in the implied tree from an element to one or more of its nested subelements, then to subelements nested those, and so on, For example, starting at the the entire document in Fig. 2.2), we might move to tag Movie? and ing tag, and from each element to its nested outer element 0 CHAPTER 2, THE RELATIONAL MODEL OF DATA Movie: ‘title="Gone With the Wind"> >1939 sngth>231 Movie titles"Star Wars"> 1977 124 aciFi Movie title="Vayne’s World"> ‘1992 95 conedy Figure 2.2: Movie data as XML element, to see which movies belong to the “comedy” genre. Constraints on the structure of data in this model often involve the data type of values associated with a tag, For instance, are the values associated with the tag integers or can they be arbitrary Fig. 2.2 might be used within xe be more than one gente for a movie? These and in Seetion 11.2. other matters will be taken 2.1.5 Other Data Models ‘There are many other models that are, or have been, associated with DBMS's. ‘A modern trend is to add object-oriented features to the relational model. There are two effects of object-orientation on relations: 1, Values can have structure, rather than being elementary types such as integer or strings, as they were in Fig, 2.1 2. Relations can have associated methods. In a sense, these extensions, called the object-relational mode the way structs in C were extended to objects in C++, We she ‘object-relational model in Section 10.3. nalogous to 22, BASICS OF THE RELATIONAL MODEL a are even database models of the purely object-oriented kind. In these, longer the principal data-structuring concept, but becomes ‘only one option among many structures, We discuss an object-oriented database ‘model in Section 4.9. ‘There are several other mor but that have now is that were used in some ofthe earlier DBMS's, ality of graphs was built directly into the network model, rather t trees as these other models do. 2.1.6 Comparison of Modeling Approaches Even from our brief exat sppears that semistructured models have more flexibility than relations. This difference becomes even more apparent when e productivity of programmers who use the data. Surprisin goals can be achieved with a model, particularly the relational m 1. Provides a simple, limited approach to structuring data, yet is reasonably versatile, so anything can be modeled. 2. Provides a limited, yet useful, on of operations on data, ‘Together, these limitations turn into features. They allow us to impleme languages, such , that enable the programmer to express their wishes at a very high level. A few lines of SQL can do the work of thousands of lines of , or hundreds of lines of the eode that had to be written to access dat earlier models such as network or hierarchical. Yet the short SQL ps because they use a strongly limited sets of operations, can be optimized to run as fast, or faster than the code written in alternative languages, 2.2 Basics of the Relational Model jonal model gives us a single le called a relation. Figure 2. y to represent data: as a two-dimen- which we copy here as Fig. 2, is an Movies. The rows each represent a 2 CHAPTER 2. THE RELATIONAL MODEL OF DATA ‘movie, and the columns each represent a property of movies. In this section, ‘we shall introduce the most important terminology régarding relations, and illustrate them with the Movies relation. length | genre Gone With the Wind | 1989] 281 | arama Star Wars 1977 | 124 | scaré Wayne’s World 1992| 95 | comedy Figure 2.3: The relation Movies 2.2.1 Attributes ‘The columns ofa relation are named year, length, and genre. AU ally, an attribute describes the me: instance, the column with attribute length holds t each movie. 2.2.2 Schemas ‘The name of a relation and the set of attributes for a relat schema for that relation. We show the schema for the relation name followed by a patenthesized list ofits attributes. Thus, the schema for relation Movies of Fig. 2.3 is Wovies(title, year, length, genre) ‘The attributes in a relation schema are a set, not a list. However, in order to talk about relations we often must specify a “standard” order for the attributes. ‘Thus, whenever we introduce a relation schema with a list of attributes, as above, we shall take this ordering to be the standard order whenever we display the relation or any ofits rows. In the relational model, a database consists of one or more relations. The set of schemas for the relations of a database is called a relational database ‘scherna, or just a database schema. 2.2.3 Tuples ‘The rows of a relation, other than the header row containing the attribute names, are called ‘A tuple has one component for each attribute of ‘the relation. For instance, the first of the three tuples in Fig. 2.3 has the four components Gone With the Wind, 1939, 231, and drama for attributes ‘title, year, length, and genre, respectively. When we wish to write a tuple 22, BASICS OF THE RELATIONAL MODEL 2 Conventions for Relations and Attributes We shall generally follow the convention that relat capital letter, and att names begin with a names begin with a lower-case letter. However, talk of relations in the abstract, where the names In that case, we shall use single capital letters for both relations and attributes, e.g., R(A,B,C) for a generic rel with three attributes. in isolation, not as part of a relation, we normally use commas to separate components, and we use parentheses to surround the tuple, For example, (Gone With the Wind, 1999, 291, draza) 2.3, Notice that when a tuple appears in isolation, the 10 some indication of the relation to which the tuple We shall always use the order in which the attributes wore listed in the relation schema, 2.2.4 Domains ‘The relational model requires that each component of each ‘must be of some elementary type such as integer or ted for a value to be a record structure, set, list, arra It is further assumed that. as domain, that isa partes elementary type. The components of any tuple of the relation must have, in each ecmpone1 the corresponding column. For ex schema, We shall do so by es. For example, we could represe iding a colon and a ‘schema for the Movies Novies(title:string, year:integer, length: integer, genre:string) 2.2.5 Equivalent Representations of a Relation jons are sets of tuples, tuples of a relation are three tuples of Fig. 2.3 in “the same” as Fig, 2.3. six possible orders, and the 4 CHAPTER 2. THE RELATIONAL MODEL OF DATA Moreover, we can reorder the attributes of the relation as we choose, without changing the relation. However, when we reorder the relation schema, we must be careful to remember that the attributes are column headers. Thus, when we change the order of the attributes, we also change the order of their columns, When the columns move, the components of tuples change their order as well. ‘The result is that each tuple has its components permuted in the same way as the attributes are permuted. For example, Fig, 2.4 shows one ‘ered “the same.” More precisely, these two tables are different presentations of the same relation, year 1977 | sciFi | Star Ware 124 1992 | comedy | Wayne 95 1939 | drama | Gone With the Wind | 231 Figure 2.4: Another presentation of the relation Novies 2.2.6 Relation Instances A relation abo movies is not static; rather, relations change over time. We uples for new movies, as these appear. We also expect changes to existing tuples if we get revised or corrected information about a movie, and perhaps deletion of tuples for movies that are expelled from the database for some reason js less common for the schema of a relation to change. However, there are situations where we might want to add or delete attributes. Schema changes, while possible in commercial database systems, can be very expensive, because cach of perhaps millions of tuples needs to be rewritten to add or delete com- ponents. Also, if we add an attribute, it may be difficult or even impossible to appropriate values for the new component in the existing tuples. e shall calla set of tuples for a given relation an instance of that relation For example, the three tuples shown in Fig, 2.3 form an instance of relation Movies. Presumably, the relation Movies has changed over time and will con- true to change over time. For instance, in 1990, Movies did not contain the ‘tuple for Vayne’s World. However, a conventional database system maintains aly one version of any relation: the set of tuples that are in the relation “now.” ‘This instance of the relation is called the current instance.) “Databases chat maintain historical versions of data at it exited in past times are called temporal databases 2.2. BASICS OF THE RELATIONAL MODEL 2% 2.2.7 Keys of Relations ‘There are many constraints on relations that the relational model allows us to place on database schemas. We shall defer much of the discussion of constraints until Chapter 7. However, one kind of constraint is so fundamental that we shall if we do not allow two tuples in a relation instance to have the same values all the attributes of the key. We can declare that the Example 2, of the two at jon Movies has a key consisting that title by itself does not form a key, since sometimes “remakes” of a movi appear. For example, there are three movies named King Ke ‘a different year. It should also be obvious that year by its ‘here are usually many movies made in the same year. dicate the attribute or attributes that form a key for a relation by if the Key attribute(s), For instance, the Hevies relation could have its schema written as Movi (eitle, year, length, genre) Remember that the statement that a set of attributes forms a key for a relation is a statement about all possible instances of the relation, not a state- ‘ment about a single instance, For example, looking only at the tiny relation of Fig. 2.3, we might imagine that genre by itself forms a key, since we do not see two tuples that agree on the value oftheir genre components, Hoxever, we can imagine that if the relat be many dramas, tuples that agreed ‘a consequence, it would be Incorrect to assert that genre is a key for the relation Movies. While we might be sure that title and year can serve as a key for Novies, id databaces use artificial keys, doubting that it is safe to make about the values of attributes outside from all others, the employee-ID at en if there are several employees with the same name. Thus, te can serve as a key for a relation about employees, 0 serve as a key for employees. Note that there is key, as there would be for and Social-Security numbers. ‘hose purpose is to serve as a key is quite w employve ID's, we find student ID's to distinguish widespread. In ad 26 CHAPTER 2. THE RELATIONAL MODEL OF DATA students in a university. We find drivers’ license numbers and automobile reg- istration numbers to distinguish drivers and automobiles, respectively. You ‘undoubtedly can find more examples of attributes created forthe primary pure pose of serving as keys Movies genre:string, studiotane:string, producerC#: integer > Moviestar nane:string, address:etring, gender:char, direndave:aate > Starsin( cort#: integer, netorth: integer > Studio( hage:string, aadre presC#: integer ) Figure 2.5: Example database schema about movies 2.2.8 An Example Database Schema ‘We shall close this seetion with an example of a te database schema, ‘The topic is movies, and it builds on the relation Movies that has appeared so shown in Fig. 2.5. Here are the things of this schema, 2.2, BASICS OF THE RELATIONAL MODEL 7 Movies This relation is an extension of the example rel so far. Remember that its key is title and year together. We have added two new attributes; studiolame tells us the studio that owns the movie, and producerC# is an integer that represents the producer of the movie in a way that we shall discuss when we talk about the relation MovieExec below. we have been discussing MovieStar ‘This relation tells us something about stars. The key is name, the name of the movie star. It is not usual to assume names of persons are unique and therefore suitable as a key. However, movie stars are different; one would never take a name that some other movie star had used. Thus, we shall use the convenient. tional approach would -security numbers, so that wwe could assign each individual a unique number and use that attribute as the key. We take that a se. Another interesting point abo two new data types. The gender can be a si date,” which might be a character string ofa special frm. Starein ‘This relation connects movies to the stars of that movie, and likewise connects a star to the movies in which they appeared. Notice that movies are represented by the key for Movies — the title and year — although we have chosen differ- are necessary to form a key. It is pet Starsin could have two distinct tuples that agree in any two of the three ibutes. For instance, a star might appear in two movies in one year, giving rise to two tuples that agreed in movieYear and starNane, but disagreed in novieTitle. MovieExec This relation tells us about movie executives. It contains their name, address, and networth as data about the ‘and studio presidents (as app low). These are integers; a different one is assigned to each exec 28 CHAPTER 2. THE RELATIONAL MODEL OF DATA acctNo | type __| balance 12345 | savings | 12000 23486 | checking | 1000 34867 ings | 25 The relation Accounts frstName | tastName account Robbie | Banks 901-209 12945 Lena | Hana | 905-299 | 12365, tena | ana | 905-293 | 23456 ‘The relation Customers Figure 2.6: Two relations of a banking database Studio ‘This relation tells about movie st same name, and therefore use ni address of the stu ‘We assume that the st appears in NovieExec. ios. We rely on no two studios having the as the key. ‘The other at are the he certificate number for the president of the studio, president is surely a movie executive and therefore 2.2.9 Exercises for Section 2.2 Exercise 2.2.1 part of a ban| In Fig. 2.6 are instances: database. In: 8) The aetributes ofeach relation by The tuples of each reation ¢) The components of one tuple fom each relation, 4) The relation schema foreach relation, ©) The database schema £) A suitable domain for each attribute. 48) Another equivalent way to present each relation 23. DEFINING A RELATION SCHEMA IN SQL 29 Exercise 2.2.2: In Section 2.2.7 we suggested that there are many examples of attributes that are created for the purpose of serving as keys of relations. Give some additional examples. Exercise 2.2.3: How many different wi attributes) are there to represent a relation considering orders of tuples and tance if that instance has: 8) Three attributes and three tuples, lke the relation Accounts of Fig. 2.67 ) Four attributes and five tuples? ©) mattributes and m tuples? 2.3 Defining a Relation Schema in SQL ‘SQL (pronounced “sequel”) is the principal language used to describe and ma- te relational databases, There is a current standard for SQL, called SQL- fost commercial database management systems implement something sim- jut. not identical to, the standard. There are two aspects to SQL: 1. The Data-Definition sublanguage for declaring database schemas and 2. The Data-Manipulation sublanguage for querying (askiug questions a out) databases and for modifying the database. ‘The distinction between these two sublangu eg., C or Java have portions that declar ‘executable code, These correspond to data respectively. i section we shall begin a discussion of the data-definition portion of SQL. There is more on constraints on date, ‘The Chapter 6. languages; mas that are anipulation, 2.3.1 Relations in SQL SQL makes a distinction between three kinds of relations: 1, Stored relations, which are called tables. These are the kind of relation ‘we deal with ordinarily — a relation that exists in the database and that can be modified by changing its tuples, as well as queried. Views, which are relations defined by a computation. These relations are not stored, but are constructed, in whole or in part, when needed. They are the subject of Section 8.1, 30 CHAPTER 2. THE RELATIONAL MODEL OF DATA ‘3, Temporary tables, which are constructed by the SQL language processor when it performs its job of executing queries and data modifications, ‘These relations are then thrown away and not stored. shall learn how to declare tables. We do not treat the dee- n of views here, and temporary tables are never declared, and their data types. Tt also allows {or a relation. There are many other including many forms of constraints that can be declared, and the declaration of indezes (data structures that speed ‘up many operations on the table) but we shall leave those for the appropriate time. 2.3.2 Data Types jive data types that are supported by SQL ist have a data type. 1, Character strings of fixed or varying length, The type CHAR(n) denotes a fixed-length string of up to n characters. VARCHAR(A) also denotes a string of up to n characters. The difference is implementation-dependent; used. SQL 8 reasonable coercions between values of characterstring types. Normally, a string is padded by trail of a component that is ample, the string "foo" of type CHAR( llowing the second o). varying-length chats rer than characters while BIT VARYING(a) de ‘The type BOOLEAN denotes an gical. The possi- are TRUE, FALSE, and — although it would — ONKNOY. permitted may be less, depending on the types int and ehort int in C), 2.3, DEFINING A RELATION SCHEMA IN SQL 31 Dates and Times in SQL Different SQL implementations may tions for dates and times, but the if is the SQL standard repre- sentation. A date value is the keyword DATE followed by a quoted string of a special form. For example, DATE ° 1948-05-14? follows the required form. The first four characters are digits representing the year. Then come a hyphen and two digits representing the month. Finally there is another hyphen and two digits representing the day. Note that single-digit months and days are padded with a leading 0. AA time value is the keyword TIME and a quoted string. This string has n the military (24-hour) clock. Then come a color two digits for the minute, a ind two digits for the second, If fractions ofa second are desired, we may continue with a decimal point and. ‘as many significant digits as represents the time at which all students many different representa: 5. Floating-point numbers can be represented in a vari use the type FLOAT or REAL (these are synonyms) for decimal point assumed to be d positior is a possible value of type DECTHAL(6 for DECIMAL, although there are possibie implementation-dependent dif ferences, 6. Dates and times can be represented by the data types DATE and TIME, 2.3.3 Simple Table Declarations ‘The simplest form of declaration of a relation schema consists of the words CREATE TABLE followed by the rame of th mn and a parent attribute names and their types. of Example 2.2: The relati declared as in Fig. 27. TI 2 CHAPTER 2, THE RELATIONAL MODEL OF DATA CREATE TABLE Movii title yoar length INT, genre ‘CHAR (10), atudiollame CHAR(30), producerct INT Figure 2.7: SQL declaration of the table Movies: ‘The year and length attributes are each integers, and the genre is a string of (up éo) 10 characters, The decision to allow up to 100 characters for a title is arbitrary, but we don't want to limit the lengths of titles too strongly, or long titles would be truncated to fit. We have assumed that 10 characters are enough to represent a genre of movie; again, that is an arbitrary choice, one wwe could regret if we had a genre with a long name, Likewise, we have chosen 30 characters as sufficient for the studio name. The certificate number for the produeer of the movie is another integer. 0. Example 2.8: Figure 2.8 is # SQL declaration ofthe relation NovieStar from Fig. 2.5, It illustrates some new options for data types. The name of this table is MovieStar, and it has four attributes. The first two attributes, nane and address, have each been declared to be character strings. However, with the name, we have made the decision to use a fixed-length string of 30 characters, padding a name out with blanks at the end if necessary and truncating a name to 30 characters if itis longer. In contrast, we have declared addresses to be variable-length character strings of up to 255 characters It is not lear that these two choices are the best possible, but we use them to illustrate the two major kinds of string data types. CREATE TABLE MovieStar ( ‘CHAR(3O) , VARCHAR(255) , gender CHAR(1), birthdate DATE d Figure 2.8: Declaring the relation schema for the NovieStar relation strings, however. 28. DEPINING A RELATION SCHEMA IN SQL 33 ‘The gender attribute has values that are a single letter, M or F. Thus, we ccan safely use a single character as the type of this attribute. Finally, the birthdate attribute naturally deserves the data type DATE. 2.3.4 Modifying Relation Schemas table, including all of its current change the schema by adding or deleting attributes. We can delete a relation R by the SQL statement: DROP TABLE R; Relation Ris no longer part of the database schema, and we can no longer ‘access any of its tuples. ‘More frequently than we would drop a relation that is part of a long-lived database, we may need to modify the schema of an existing relation. These ‘modifications are done by a statement that begins with the keywords ALTER TABLE and the name of the relation. We then have several options, the most important of which are 1. ADD followed by an attribute name and its data type. 2. DROP followed by an attribute name. Example 2.4: Thus, for instance, we could modify the MovieStar relation by adding an attribute phone with: ALTER TABLE MovieStar ADD phone CHAR(16) ; As a result, the MovieStar schema now has five attributes: the four mentioned in Fig. 2.8 and the attribute phone, which is a fixed-length string of 16 bytes. In the actual relation, tuples would all have components for phone, but we know of no phone numbers to put there. Thus, the value of to the special null value, NULL. In Section 2. hhow itis possible to choose another “default” value to be used instead of NULL for unknown values. As another example, the ALTER TABLE statement: ALTER TABLE MovieStar DROP birthdate; deletes the birthdate attribute. As a result, the schema for ovieStar no longer has that attribute, and all tuples of the current MovieStar instance hhave the component for birthdate deleted, 2 M4 CHAPTER 2. THE RELATIONAL MODEL OF DATA 2.3.5 Default Values ‘When we create or modify tuples, we sometimes do not have values for all components. For instance, we mentioned in Example 2.4 that when we add a in to 8 relation schema, the existing tuples do not have a known value, and it was suggested that NULL could be used in place of a “real” value. However, there are times when we would prefer to use another choice of default value, the value that appears in a column if no other value is known, In general, any place we declare an attribute and its data type, we may add the keyword DEFAULT and an appropriate value. That value is either NULL or iat are provided by the system, such as the et us consider Example 2.3. We might wish to use the char- acter ? as the default for an unknown gender, and we might also wish to use the earliest possible date, DATE 1000-00-00" for an unknown birthdate, We could replace the declarations of gender and birthdate in Fig. 2.8 by: gender CHAR(1) DEFAULT 17”, birthdate DATE DEFAULT DATE 0000-00-00" As another example, we could have declared the default value for new at- tribute phone to be "unlisted? when we added this attribute in Example 2.4 Tin that case, ALTER TABLE MovieStar ADD phone CHAR(16) DEFAULT ‘unlisted’; would be the appropriate ALTER TABLE statement, 2.3.6 Declaring Keys ‘There are two ways to declare an attribute or set of attributes to be @ key in the CREATE TABLE statement that defines a stored relation. 1. We may declare one attribute to be a key when that attribute is listed in the relation schema, 2, We may add to the in the schema (which so far have only been attributes) an additional declaration that says a particular attribute or set of attributes forms the key. If the key consists of more than one attribute, we have to use method (2). If the key is a single attribute, either method may be used. ‘There are two declarations that may be used to indicate keyness: 18) PRIMARY KEY, or by owzgue, 23. DEFINING A RELAT IN SCHEMA IN SQL 35 ‘The effect of declaring a set of attributes S to be a Key for relation R either using PRIMARY KEY or UNIQUE is the following: n R cannot agree on of them is HULL. Any att this rule causes the DBM: IEPRIMARY KEY is used, value for their compot ted by the system. NULL Is permitted ifthe set S is declared UNIQUE. however. A DBMS make make other distinctions between wishes. other duplicate values for this attribute, CREATE TABLE MovieStar ( ‘name CHAR(SO) PRIMARY KET, address VARCIAN(256) gender CHAR(.) , birthdate DATE Figure 2.0 \g nane the key vely, we can use a separate definition of the key. ‘The resulting would look like Fig. 2.10. Again, UNIQUE could replace PRIMARY KEY, Ol CREATE TABLE MovieStar ( ‘name CHAR(30) , address VARCHAR (255), ‘gender CHAR(1) birthdate DATE PRIMARY KEY (name) % Figure 2.10: A separate declaration of the key 36 CHAPTER 2. THE RELATIONAL MODEL OF DATA Example 2.7: In Example 2.6, the form of either Fig. 2.9 or Fig. 2.10 is acceptable, because the key is a single attribute. However, in a situation where the key has more than one attribute, we must use the style of Fig. 2.10. For instance, the relation Movie, whose key isthe pair of attributes title and year, must be declared as in Fig. 2.11, However, as usual, UNIQUE is an option to replace PRIMARY KEY. 0 CREATE TABLE Movies ( title cuAR(100) , year nt, length INT, genre CHAR(10), studiolane CHAR(30), producerc# INT, PRIMARY KEY (title, year) Figure 2.11: Making title and year be the key of Movies Exercises for Section 2.3 2.8.1: In this exercise wo introduce one of our running examples of tabase schema. The database schema consists of four relations, ‘whose schemas are Product (maker, model, type) PO(model, speed, ram, hd, price) Laptop(nodel, speed, ram, hd, screen, price) Printer (model, color, type, price) The Product laptop, or print imbers are uniq is not realistic, and a real as part of the model number. The that is a PC the speed (of the processo megabytes), the size of the hard dish relation is si iches) is also included. The Printer relation records for each printer model whether the printer produces color output (true, if $0), the process type (laser or ink-jet, typically), and the price. ves the manufacturer, model number and type (PC, wus products. We assume for convenience that model manufacturers and product types; the price. The Laptop the following declarations: 4) A suitable schema for relation Product. 2.3. DEFINING A RELATION SCHEMA IN SQL a7 ») A suitable schema for relation Pc. ©) A suitable schema for relation Laptop. 4) A suitable schema for relation Printer. ) An alteration to your Printer schema from (4) to delete the attribute color. f) An alteration to your Laptop schema from (c) to add the attribute od (optical-disk type, eg., ed or dvd). Let the default value for this attribute bbe ‘none? if the laptop does not have an optical disk Exercise 2.3. World War II capital ships. ‘roduces another running example, concerning. lves the following relations: Classes(class, type, country, nunGuns, bore, displacement) Ships(name, class, launched) date) ‘ship, battle, result) ‘Ships ate built in “classes” from the same design, and the class js usually named of that class. The records the name of the yb? for battleship or 7 ‘main guns, and the displacement 1e name of the ship, the name was launched. Relation Battles gives the m: 8, and relation Outcones gives the each battle. ing declarations: ble schema for relation Classes. bb) A suitable schema for relation Ships. ©) A suitable schema for relation Battles. 4) A suitable schema for relation Outcomes. ) An alteration to your Clas: bore. relation from (a) to delete the attribute ) An alteration to your Ships relation from (b) to include the attribute ‘yard giving the shipyard where the ship was built. 38 CHAPTER 2. THE RELATIONAL MODEL OF DATA 2.4 An Algebraic Query Language In this section, we introduce the data- lation aspect of the relational icture; it needs a way to query f operations on relations, algebra, that jons from given ‘When the given relations are stored data, then the constructed relations ean be answers to queries about this data. Relational algebra is not used today as a query language in commercial DBMS's, although some of the early prototypes did use this algebra directly Rather, the “real” query language, SQL, incorporates relational algebra at its center, and many SQL programs are really “syntactically sugared” expressions of relational algebra. Further. when a DBMS processes queries, the first thing that happens to a SQL query is that it gets translated into relational algebra. or a very similar internal representation. ‘Thus, there are several good reasons to start out learning this algebra, 2.4.1 Why Do We Need a Special Query Language? Before introducing the operations of relational algebra, one should ask why, or ‘whether, we need a new kind of programming languages for databases, Won't conventional languages like C or Java suffice to ask and answer any computable about relations? After all, we can represent a tuple of a relation by @ in C) oF an object (in Java), and we can represent relations by arrays of these elements ‘The surprising answer is that relational algebra is useful because it is less powerful than C or Java. That is, there are computation: odd. By limiting what we can say or do in our query language, we get two huge rewards — ease of programming and the abil highly optimized code — that we discussed 2.4.2 What is an Algebra? general, consists of operators and atomic operands. For in- algebra of arithmetic, the atomic operands are variables 15. The operators are the usual arithmetic ones: addition, and division. Any algebra allows us to build perators to atomic operands and/or other expresio of the algebra. Usually, parentheses are needed to group operators and theit operands, For instance, in arithmetic we have expressions such as (2 + y) +2 of ((2+7)/(y-3)) +2. 24, AN ALGEBRAIC QUERY LANGUAGE 39 Relational algebra is another example of an algebra. Its atomic operands 1. Variables that stand for relations. 2. Constants, which are finite relations. We shall next see the operators of relational algebra, 2.4.3 Overview of Relational Algebra ‘The operations of the traditional relational algebra fall into four broad classes: 4a) The usual set operations — union, intersection, and difference ~~ applied to relations. b) Operations that remove parts of a relation: “selection” eliminates some rows (tuples), and “projection” eliminates some columns, luding “Cartesian possible ways, and ly pair tuples from two ©) Operations that combine the tuples of two rel product,” which pairs the tuples of two rel various kinds of “joii” operations, which select relations. 4) Am operation called “renaming” that does not affect the tuples of a re- lation, but changes the relation schema, ie., the names of the attributes and/or the name of the relation itself. We generally shall refer to expressions of relational algebra as queries. 2.4.4 Set Operations on Relations ‘The three most common operations on sets are union, intersection, and differ ‘ence. We assume the reader is familiar with these operations, which are defined as follows on arbitrary sets R and 5: ‘* RUS, the union of R and S. is the set of elements that are in Ror $ or both. An element appears only once in the union even present in both Rand S. © RNS, the intersection of R and S, is the set of elements that are in both Rand S. © R~S, the difference of R and S, is the set of elements that are in R but not in S. Note that ~ 5 is different from 5 — R; the latter is the set of elements that are in S$ but not in R. When we apply these operations to relations, we need to put some conditions on Rand S: 40 CHAPTER 2, THE RELATIONAL MODEL OF DATA 1. Rand S must have schemas with identical sets of and the types (domains) for each attribute must be the same in R and S. 2. Before we compute the set-theoretic union, intersection, or difference of sets of tuples, the columns of R and S must be ordered so that the order of attributes is the same for both relations. Sometimes we woul relations that have the same but that use different na ‘operator to be discusse: relations and give them with corresponding domains, tes. If $0, we may use the renaming to change the schema of one or both attributes. gender | birthdate F 9/9/99 Carrie Fisher | 123 vaple St., Hollywood Mark Hamill | 456 Oak Rd., Brentwood |m | o/@/ee Relation R name address | gender | birthdate Carrie Fisher | 123 Maple St., Hollywood [F [9/9/09 Harrison Ford | 789 Palm Dr., Beverly Hilts | | 7/7/77 Relation $ Figure 2.12: Two relations ‘Example 2.8: Suppose we have the two relations R and S, whose schemas are both that of relation MovieStar Section 2.2.8. Current instances of Rand S are shown in Fig. 2.12. Then the union RU gender | birthdate Carrie Fisher | 123 Maple St., Hollywood | F 9/9/99 Mark Haaill | 456 Oak Ra., Brentwood " 3/8/88 Harrison Ford | 789 Paim Dr., Beverly Hille | wrt Note that the two tuples for Carrie Fisher from the two relations appear of he result ‘The intersection RA S is name address gender | birthdate 723 Waple St., Hollywood | F ly the Carrie Fisher tuple appears, because only itis in both relations. ifference R— Sis 24, AN ALGEBRAIC QUERY LANGUAGE a gender | birthdate Wark Haniii | 456 Oak Rd, Brentwood | M 3787/88 Rand thus are candidates for Sand so is not in R-S. 0 ‘That is, the Fisher and Hamil RS. However, the Fisher tuples appeat le also appes 2.4.5 Projection ‘The projection operator is used to produce from a relation R a new that has only some of R's columns. The value of expression r,,43 a relation that has only the columns for attributes A,,Az,... dn of R. The schema for the resulting value is the set of attributes (4), 42,.-- An}, which wwe conventionally show in the order listed title | wear | tength | genre Sear Ware 1977 | 198 | eciF Galaxy Quest | 1999 | s08 Wayne's Worle | 1992 | 95 srtengtn Hovis) ‘The resulting relat year _{ length Galary Quest | 1999 | 104 Wayne's World | 1992 | 95 As another example, we can project onto the attribute genre with the ex- pression tyenre(Movies). The result is the single-column relation genre aciFi comedy "Notice that there are only two tuples in the resulting relator tuples of Fig. 2.13 have the same value in th and in the relational algebra of set, duplicate 2 CHAPTER 2. THE RELATIONAL MODEL OF DATA A Note About Data Quality While we have endeavored to make example data as accurate as possible, ‘we have used bogus values for addresses and other personal information about movie stars, in order to protect the privacy of members of the acting profession, many of whom are shy individuals who shun publicity 2.4.6 Selection ‘The selection operator, applied to a relation R, produces a new relation with a subset of R's tuples. The tuples in the resulting relation are those that satisfy some condition C that involves the attributes of R. We denote this operation relation is the same as R's schema, and 1es in the same order as we use for R. ional expression of the type we are familiar from onal programming languag¢ onal expressions eyword if in programming languag as C or Java. Th js that the operands in condition C are of R. We apply C to each tuple ¢ of R by subst appearing in condition C, the component of t fo tuting for each attribute of C the condition C that appear Example 2.10: Let the tioName | producerO# seiFi | Fox 12545 | conedy | Dreaavorks | 67890 ‘The frst tuple satisfies the condition length > 100 because when we substitute for length the value 124 found in the component of the fist tuple for attribute Length, the condition becomes 124 > 100. The latter condition is true, so we accept the first tuple. The same argument explains why the second tuple of Fig. 2.13 is in the The Length component 95. Thus, when we substitute for length we get the condition 95 > 100, which is false. Hence the last tuple of Fig. 2.13 is not in the result. Example 2.11: Suppose we want the set of tuples in the relation Movies that represent Fox movies at ‘2 more complicated cor expression is tength>100 AND studioNamen "Fox? (MOViCS) 24, AN ALGEBRAIC QUERY LANGUAGE 43 ‘The tuple title year | length | inColor | studioName | producerC# ‘Sear ware | 1977 | 124 | true | Fox 12345 is the only one in the resulting relation, 2.4.7 Cartesian Product ‘The Cartesian product (or cross-product, or just praduet) of two sets Rand S is the set of pairs that can be formed by choosing the first element of the pair to be any element of Rand the second any element of S. This product Is denoted Rx S. When R and $ are relations, the product is essentially the saine. However, since the members of R and $ are tuples, usually consisting of more than one componé from $ is a longer tuple, ‘constituent tuples. By convent precede the components from Sin order for the result. ‘The relation schema for the resulting relation is the union of the schemas Rand § should happen to have some attributes in ‘common, then we need to invent new names for at least one of each pair of ‘To disambiguate an attribute A that is in the schemas of use R.A for the attribute from R and S.A for the at yonent for each of the components of che components from FR (the left operand) Example 2.12: For conciseness, let us use an abstract example that illustrates the product ope relations R and S have the schemas and tuples in Fig. 2.14 ‘Then the product R x § consists of the six shown in Fig. 2.14(c). Note how we have paired each of the two tuples of 2.4.8 Natural Joins More often than we want to take the product of two relations, we find a need to join them by pairing only those tuples that match in some way. The simplest Sort of match is the natural join of two relations R and S, denoted R ve S, in which we pair only those tuples from R and $ that agree in whatever attributes are common to the schemas of R and $. More preci Ais Any »An be all the attributes that are in both the schema of R and the schema of S. Then tuple r from R and a tuple s from S are successfully paired if and only if r ‘and s agree on each of the attributes Ay, Any... Ap If the tuples r and # are successfully paired in the join Roe S, then the result of the pairing is a tuple, called the joined tuple, with one component for ‘each of the attributes in the union of the schemas of and S$. The joined tuple “4 CHAPTER 2. THE RELATIONAL MODEL OF DATA Alp T]2 a4 (@) Relation R Bic|p 2/5 6 a}7 fa 9 | 10 | 1 (b) Relation RB|SB|C|D rire ai7 fe 1[2 |o Jrofas ala |2 |s |e ala ja |7 |e 3}4 [9 || (o) Result Rx S Figure 2.14: Two relations and their Cartesian product agrees with tuple r in each attribute in the schema of R, and it agrees with sin each the schema of S. Since r and s are successfully paired, ble to agree with both these tuples on the attri ‘The construction of the joined However, the order of the attributes need not be that convenient; the a of Rand $ can appear in any order. Example 2.13: The natural join of the relations A and S$ from Fig. 21 and (b) is mmmon to Rand S is B. Thus, to pair succes n their B components. Ifo, the resulting ponents for attributes A (from R), B (Grom ether R ot §),C'( (from $) ly, tuples le has com- S), and D 24. AN ALGEBRAIC QUERY LANGUAGE 45 R s joined wple Figure 2.15: Joining tuples In this example, the first tuple of R successfully pairs with only the first tuple of ; they share the value 2 on their common attribute B. This pairing yields the first tuple of the result: (1,2,5,6). ‘The second tuple of R pairs successfully only with the second tuple of S, and the pairing yields (3,4,7,). Note that the third tuple of $ does not pair with any tuple of F and thus has no effect on the result of Rea S. A tuple that fails to pair with any tuple of the other relation in a join is said to be a dangling tuple, Example 2.14: The previous example does not illustrate all the possibi inherent in the natural join operator. For example, ’h more than one tuple, and there was only one attri ‘wo relation schemas. In Fig, 2.16 we see two other relat share two attributes between their schemas: B and C. We also show an instance in which one tuple joins with several tuples, For tuples to pair successfully, they must agree in both the B and C’ com- ponents. Thus, the first tuple of U' joins with the first two tuples of V, while the second and third tuples of U join with the third tuple of V. The result of ‘these four pairings is shown in Fig. 2.16( 2.4.9 Theta-Joins cother basis. For that purpose, we have a related not join. Historically, the “theta” refers to an arbitrary co Tepresent by C rather than 8. ‘The notation for a theta-join of relations R and $ based on condition C is Roag S. The result of this operation is constructed as follows: 1, Take the product of R and 5. 2, Select from the product only those tuples that satisfy the condition C. 46 CHAPTER 2. THE RELATIONAL MODEL OF DATA cory. Tae ee ala (@) Relation U lb sa tof ee ala 0 (b) Relation V Bic 2/3 aia 7\a 7I\8 (©) Result Use ¥ Figure 2.16: Natural join of relations ‘As with the product operation, the schema for the result is the union of the schemas of R and S, with “R." or “S8." prefixed to attributes if necessary to indicate from which Schema the attribute came Example 2.15: Consider the operation Us acp V, where U and V are the relations from Fig. 2.16(a) and (b). We must consider all nine pairs of tuples, , and see whether the A component from the U-tuple is less than the D component of the V-tuple. The first tuple of U, with an A hh each of the tuples from V. However, the .ples, constructed from the five successful pairings. This relation ig. 217. 0 Notice that the schema for the result in Fig. 2.17 consists of all six attributes, with U and V prefixed to their respective occurrences of at distinguish them. Thus, the theta-join contrasts with natus latter common attributes are merge 24, AN ALGEBRAIC QUERY LANGUAGE a UB LUC {VB IVC |D 2 |3 ]2 13 14 2 |3 |2 5 2 |3 |r 10 fee) Gaal, 10 7 Ja |r 10 Figure 2.17: Result of U > aco V do so in the case of the natural join, since tuples don’t pair unless they agree in ‘heir common attributes. In the case of a theta-join, there is no guarantee that ‘compared attributes will agree in the result, si with = Example 2.16: Here is a theta-join on the same relations U and V that has ‘4 more complex condition: Vedac a usa. V ‘That is, we require for successful pairing not only that the A component of the U-tuple be less than the D component of the V-tuple, but that the two tuples disagree on their respective B components. The tuple A|UB|UC|VB|VC|D TT? 13 |? 1s |1 is the only one to satisfy both conditions, so this relation is the result of the theta-join above. 0 2.4.10 Combining Operations to Form Queries If all we could do was to write single operations on one or two relations as ‘queries, then relational algebra would not be nearly as useful as it is. However, relational algebra, lke all algebras, allows us to form expressions of arbitrary complexity by applying operations to the result of other operations. ‘One can construct expressions of relational algebra to subexpressions, using parentheses when necessary to operands. It is latter often are easier for us to read, although they are less convenient as a machine-readable notation, Suppose we want to know, from our running Movies relation, itles and years of movies made by Fox that are at least 100 * One way to compute the answer to this query is 1. Sclect those Movies tuples that have length > 100. 48 CHAPTER 2, THE RELATIONAL MODEL OF DATA 2. Select those Movies tuples that have studioName = ’Fox? 3. Compute the intersection of (1) and (2). 4. Project the relation from (3) onto attributes title and year. i 2 engi >= 100 rox! “T Movies Movies Figure 2.18: Expression tree for a relational algebra expression In Fig. 2.18 we see the above steps represented as an expression tree. Ex: pression trees are evaluated bottom-up by the operator at an interior ‘The two selection nodes correspond to step corresponds to step (3), and the projection Alternatively, we could represent the same express linear notation, with parentheses. The formula in a conventional, Reateseor (2 (Hovies) M Syudionamen'Fox? (Hovies)) represents the same expression Incidentally, ther is often more than one relational algebra expression that represents the same computation. For instance, the above query could also be written by replacing the intersection by logical AND within a single selection aperation. That i, 1+ (Stengin2100 AND studioNamen"Fox? (Novies)) is an equivalent form of the query. 0 24. AN ALGEBRAIC QUERY LANGUAGE 49 Equivalent Expressions and Query Optimization All database systems have a query-answering system, and many of them are based on a language that is similar in expressive power to relational algebra. Thus, the query asked by a user may have many equivalent e2- pressions (expressions that produce the same answer whenever they are given the same relations as operands), and some of these may be much ‘more quickly evaluated. An important job of the query “optimizer” dis- ‘cussed briefly in Section 1.2.5 is to replace one expression of relational algebra by an equivalent expression that is more efficiently evaluated, 2.4.11 Naming and Renaming is often convenient to shall use the operator oreover, the at- ofthe result relation Sate named Ay, a,--. Any in order from the og peer tnt nn ere rere are attributes an they ae in Ry we can jst sey ps(P) and (b) and used the convention that when an attribute appears in both operands, it is renamed by prefixing the relation name to it. Suppose, however, that we do not wish to call the two versions of B by names RB and S.B; rather we want to continue to use the name B for the attribute that comes from R, and we want to use X as the name of the attribute B coming from S. We can rename the attributes of S so the fist is called X. The is a relation named § that looks just like first column has attribute X instead of B. Aleixic |p result of the expression ps the relation $ from Fig. 2. B TT2]2 aj2]a 1}2]o9|1o}a alal2 aja la 3|4]o [10] a Figure 2.19: R x pscx,c,p)(S) 50 CHAPTER 2. THE RELATIONAL MODEL OF DATA When we take the product of R with this new relation, there is no conflict the of names among the attributes, so no further renaming is done. That result of the expression Rx psix,c,p)(S) is the relation 2x S from Fig. 2 ‘except that the five columns are labeled A, B, X, C, and D, from the let. This relation is shown in Fig. 2.19. ‘As an altemnative, we could take the product without renaming, as we did in Example 2.12, and then rename the result. The expression prs(a.e.x.c.0)(R x $) yields the same relation as in Fig. 2.19, with the same set of attributes. But this relation has a name, RS, while the result relation in Fig. 2.19 has no name. 0 2.4.12 Relationships Among Operations ‘Some of the operations that we have described in Section 2.4 can be expressed in terms of other relational-algebra operations. For example, intersection can be expressed in terms of set difference: RNS=R-(R-S) ‘That is, if R and $ are any two relations with the same schema, the intersection of R and S can be computed by first subtracting S from R to form a relation T consisting of all those tuples in R but not S. We then subtract T from R, leaving only those tuples of F that are also in S. The two forms of join are also expressible in terms of other operations. ‘Theta-join can be expressed by product and selection: Roag $= a0(Rx S) ‘The natural join of R and 5 can be expressed by starting with the product Fx S. We then apply the selection operator with a condition C of the form RA = S.Ay AND R.Ap = S.Ap AND: --AND Ry = S.A where Ay, A; and S. Fins An ate all the attributes appearing in the schemas of both R /, we must project out one copy of each of the equated attributes Rea =n1(2o(R x5) Example 2.19: The natural join of the relations U and V from Fig. 2.16 can be written in terms of product, selection, and projection as: raveuco(sunev.s ax ucevclY x) 24, AN ALGEBRAIC QUERY LANGUAGE a ‘That i, we take the product Ux V. Then we select for equality between each pair of attributes with the same name — B and C in this example. Finally, wwe project onto all the attributes except one of the B’s and one of the C's; we have chosen to climinate the attributes of V whose names also appear in the schema of U. For another example, the theta-join of Example 2.16 can be written aco an u.pev.n(U xV) is, we take the product of the relations U7 and V and then apply the condition that appeared in the theta-join. ‘The rewriting rules mentioned in this section are the only “redundancies” ‘among the operations that remaining operations — union, difference, selection, » product, and renaming — form an in- dependent set, none of which can be written in terms of the other five. 2.4.13 A Linear Notation for Algebraic Expressions In Section 2.4.10 we used an expression tree to represent @ complex expression of relational algebra. An alternative is to invent names for the temporary relations that correspond to the interior nodes of the tree and write a sequence ‘of assignments that create a value for each. The order of the assignments is flexible, as long as the children of a node N have had their values created before wwe attempt to create the value for NV itself. ‘The notation we shall use for assignment statements is: 1. A relation name and parenthesized list of attributes for that relation. The ‘name Answer will be used conv for the result of the final step; ice, the name of the relation at the root of the expression tree. ‘The assignment symbol cn the right. We can choose to use only one in which case each interior node of the tree gets is also permissible to combine if it is convenient to do so. Any algebraic express ‘operator per assignin its own assignment statement. Howeve several algebraic operations in one right side Example 2.20: Consider the tree of Fig. 2.18. One possible sequence of as- signments to evaluate this expression is: = Ctength> 100 (Movies) 52 CHAPTER 2. THE RELATIONAL MODEL OF DATA step computes the relation of the interior node labeled ovengehi00 in , and the second step computes the node labeled otudictamew For iat we get renaming “for free,” since we can use any attributes and relation name we wish for the left side of an assignment. The last two steps compute the intersection and the projection in the obvious way. tis also permissible to combine some of the steps. For instance, we could ‘combine the last two steps and write: = Siength2100 (Movies) sexs (Movies) ns) We could even substitute for R and S in the last line and write the entire expression in one line. 2.4.14 Exercises for Section 2.4 Exercise 2.4.1: This exercise builds upon the products schema of Exercise 2.3.1. Recall that the database schema consists of four relations, whose schemas Product (maker, model, type) speed, ran, hd, price) ‘speed, ram, hd, screen, price) Printer (model, color, type, price) Some sample data for the relation Product is shown in Fig. 2.20. Sample data for the other three relations is shown in Fig. 2.21. Manufacturers and ‘model numbers have been “sanitized,” but the data is typical of products on sale at the beginning of 2007. ‘Write expressions of relational algebra to answer the following queries. You ‘may use the linear notation of Section 2.4.13 if you wish. For the data of Figs. 2.20 and 2.21, show the result of your query. However, your answer should work for arbitrary data, not just the data of these figures. a) What PC models have a speed of at least 3.007 ) Which manufacturers make laptops with a hard disk of at least 10GB? ©) Find the model number and price ofall products (of any type) made by manufacturer B. 4) Find the model numbers of all color laser printers. ¢) Find those manufacturers that sell Laptops, but not PC's. 1!) Find those hard-disk sizes that occur in two or more PC's. 4. AN ALGEBRAIC QUERY LANGUAGE 53 maker | model | type 1001 | pe 1002 | pe 1003 | pe 2004 | Laptop 2005 | Laptop 2006 | laptop 1004 | pe 1005 | pe 1008 | pe 2007 | laptop 1007 | pe 1008 | pe 1009 | pe 1010 | pe 3004 | printer 3005 | printer 1011 | pe pe 1013 | pe 2001 | laptop 2002 | laptop 2003 | laptop 3001 | printer 3002 | printer 3003 | printer 2008 | laptop 2009 | laptop 2010 | 1aptop 3006 3007 Figure 2.20: Sample data for Product

You might also like