Textbook(s)
Main textbook, available at the bookstore:
Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom
Almost identical, and also available at the bookstore:
A First Course in Database Systems, Jeff Ullman and Jennifer Widom
Database System Implementation, Hector Garcia-Molina, Jeff Ullman and Jennifer Widom
Other Texts
Database Management Systems, Ramakrishnan
very comprehensive
Database System Concepts, A. Silberschatz, H. F. Korth and S. Sudarshan, 6th Ed., McGraw-Hill, ISBN 978-007-132522, International Edition, 2011
Fundamentals of Database Systems, Elmasri, Navathe
very widely used
Course Focus
The main focus of this course is Database System Implementation, i.e. how one builds a DBMS. This subject can in turn be divided into 3 parts:
1. Storage Management: how secondary storage is used efficiently to hold data and allow it to be accessed quickly.
2. Query Processing: how queries expressed in a very high-level language such as SQL can be executed efficiently.
3. Transaction Management: how to support transactions with the ACID properties.
Course Overview
Physical data storage
Blocks on disks, records in blocks, fields in records
Query Processing
Methods to execute SQL queries efficiently
Crash Recovery
Failures, stable storage, logging policies, ...
Concurrency Control
Correctness, locks, ...
Transaction Processing
Logs, deadlocks, ...
Distributed Databases
Interoperation, distributed recovery, ...
Syllabus Chapter-wise
Ch. [11] Hardware
Ch. [12] File and System Structure
Ch. [13] Indexing and Hashing
Ch. [14] Indexing and Hashing
Ch. [15] Query Processing
Ch. [16] Query Optimization
Ch. [17] Crash Recovery
Ch. [18] Concurrency Control
Ch. [19] Transaction Processing
Ch. [20] Information Integration
Review
Unit 1
Syllabus
Data Storage and File Structure: The Memory Hierarchy, Disks, Using Secondary Storage Effectively, Accelerating Access to Secondary Storage, Disk Failures, Recovery from Disk Crashes, RAID, Representing Data Elements
Indexing and Hashing: Indexes on Sequential Files, Secondary Indexes, B-Trees, Hash Tables, Multidimensional and Bitmap Indexes
Query Execution: Introduction to Physical-Query-Plan Operators, One-Pass Algorithms for Database Operations, Nested Loop Joins, Two-Pass Algorithms Based on Sorting and Hashing, Index-based Algorithms, Buffer Management, Algorithms Using More than 2 Passes, Parallel Algorithms for Relational Operations
Query Optimization: Parsing, Estimating the Cost of Operations, Query Optimization
Advanced Transaction Management: Transactions in SQL
Coping with System Failures: Models for Resilient Operation, Undo Logging, Redo Logging, Undo/Redo Logging, Protecting Against Media Failures
Concurrency Control: Serial and Serializable Schedules, Conflict-Serializability, Enforcing Serializability by Locks, Locking Systems, Tree Protocol, Concurrency Control by Timestamps and Validation
Advanced Transaction Processing: Resolving Deadlocks, Distributed Databases, Commit and Locking, Long-duration Transactions
Database System Architectures: Data Models: Review of Relational Data Model and Object-based Model, Semi-structured Data, XML and its Data Model, Object-orientation in Query Languages, Logical Query Languages, Centralized and Client-Server Architectures, Server System Architectures, Parallel Databases, Distributed Databases, Deductive Databases
Data Warehousing, Data Mining and Information Retrieval: DSS, OLAP, Data Warehousing, Data Mining, ID3 Algorithm, Classification, Association Rules, Clustering, IR
Spatial Data Management: Time in Databases, Spatial and Geographic Data, Multimedia Databases, Mobility and Personal Databases
Misc. Topics: Advanced Application Development
[Figure: DBMS components, including buffers, the transaction processor, and permanent storage]
Database Systems
The big commercial database vendors:
Oracle
IBM (with DB2), bought Informix recently
Microsoft (SQL Server)
Sybase
Section 8.6 (Garcia-Molina, Ullman, Widom book): Transactions in SQL (Review)
Transactions
Serializability example
Atomicity example
Read-only Transactions
Dirty Reads
Isolation Levels
Transactions
A transaction is a sequence of statements, i.e. a collection of one or more operations on the database, that either all succeed or all fail.
Transactions have the ACID properties:
A = atomicity
C = consistency
I = isolation
D = durability
ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
Consistency. A transaction moves the database from one state where integrity holds to another state where integrity holds, i.e. the relationships among values are maintained.
Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executing transactions.
Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Transactions
Problem: An application must perform several writes and reads to the database, as a unit.
Example: Two people attempt to book the last seat on a flight.
Solution: Multiple actions of the application are bundled into one unit called a transaction.
Transactions guarantee certain properties to hold that prevent such problems.
Serializability
In applications like Banking/Airline Reservations, hundreds of operations per second may be performed on a single database. It is entirely possible that we could have 2 operations affecting the same account or flight, and for those operations to overlap in time. Consider the following two examples:
Serializability example (Example 8.26)
Atomicity example (Example 8.27)
Example 8.26
Consider a relation:
Flights(fltNum, fltDate, fltSeat, occupied)
Write a function chooseSeat() in PL/SQL that reads the relation Flights for a given flight number and its available seats, finds out whether a particular seat is available, and marks it occupied if so.
Serializability Example
Suppose 2 agents are trying to book the same seat for the same flight and date at approximately the same time. If both executions check the seat before either one marks it occupied, an error results:
Each execution of chooseSeat() tells its customer that the seat belongs to them.
Both customers believe they have been granted the seat in question.
Serial Transaction
An execution of functions operating on the same database is serial if one function executes completely before any other function begins. The execution is serializable if the functions behave as if they were run serially, even though their executions may overlap in time. Clearly, if 2 invocations of chooseSeat() are run serially (one after another) or serializably, then the error we saw cannot occur.
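The difference between the interleaved and the serial execution can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions, not the PL/SQL function the example asks for: the flight number, date, seat label, agent names and helper function names are all hypothetical, and the interleaving is simulated by hand on a single connection rather than with real concurrent sessions.

```python
import sqlite3

# Hypothetical sample key: flight 123 on 2024-01-01, seat 22A.
KEY = (123, "2024-01-01", "22A")
WHERE = "fltNum = ? AND fltDate = ? AND fltSeat = ?"

def make_db():
    """Create an in-memory Flights relation with one free seat."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Flights (fltNum INTEGER, fltDate TEXT, "
               "fltSeat TEXT, occupied INTEGER)")
    db.execute("INSERT INTO Flights VALUES (?, ?, ?, 0)", KEY)
    db.commit()
    return db

def seat_is_free(db):
    """Read step of chooseSeat(): is the seat currently unoccupied?"""
    row = db.execute("SELECT occupied FROM Flights WHERE " + WHERE, KEY).fetchone()
    return row is not None and row[0] == 0

def mark_occupied(db):
    """Write step of chooseSeat(): mark the seat as occupied."""
    db.execute("UPDATE Flights SET occupied = 1 WHERE " + WHERE, KEY)
    db.commit()

def interleaved_booking(db):
    """Both agents read before either writes, so both are granted the seat."""
    free_1 = seat_is_free(db)   # agent 1 reads
    free_2 = seat_is_free(db)   # agent 2 reads before agent 1 has written
    granted = []
    if free_1:
        mark_occupied(db)
        granted.append("agent1")
    if free_2:
        mark_occupied(db)
        granted.append("agent2")
    return granted

def choose_seat(db):
    """Check and update as one unit; True if this caller got the seat."""
    with db:  # commits on success, rolls back on an exception
        cur = db.execute("UPDATE Flights SET occupied = 1 WHERE " + WHERE
                         + " AND occupied = 0", KEY)
        return cur.rowcount == 1

def serial_booking(db):
    """Run the two bookings one after another."""
    return [agent for agent in ("agent1", "agent2") if choose_seat(db)]
```

Calling `interleaved_booking(make_db())` grants the seat to both agents, while `serial_booking(make_db())` grants it only to the first one.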
Atomicity example
Apart from the non-serializable behavior that can occur when 2 or more database operations are performed at about the same time, it is possible for a single operation to put the database in an unacceptable state if there is a hardware or software crash while the operation is executing. Example 8.27:
Consider a relation Accounts(acctNo, balance). Write a function transfer() that takes 2 accounts and an amount of money as input, checks that the first account has at least that much money, and if so moves the money from the first account to the second.
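A minimal sketch of an atomic transfer(), again using Python's standard sqlite3 module as a stand-in for the SQL environment of the example. The account numbers, starting balances and the `crash_midway` flag are assumptions introduced only to simulate a failure between the two updates; a real crash would of course not be an exception raised by the program itself.

```python
import sqlite3

def make_accounts():
    """Create an in-memory Accounts relation with two hypothetical accounts."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Accounts (acctNo INTEGER PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, 100), (2, 200)])
    db.commit()
    return db

def balance(db, acct):
    """Return the current balance of one account."""
    return db.execute("SELECT balance FROM Accounts WHERE acctNo = ?",
                      (acct,)).fetchone()[0]

def transfer(db, src, dst, amount, crash_midway=False):
    """Move amount from src to dst as one transaction; True on success."""
    try:
        with db:  # COMMIT if the block finishes, ROLLBACK if it raises
            if balance(db, src) < amount:
                return False  # not enough money, nothing was changed
            db.execute("UPDATE Accounts SET balance = balance + ? WHERE acctNo = ?",
                       (amount, dst))
            if crash_midway:
                # Simulated failure after the first update only.
                raise RuntimeError("simulated crash between the two updates")
            db.execute("UPDATE Accounts SET balance = balance - ? WHERE acctNo = ?",
                       (amount, src))
        return True
    except RuntimeError:
        return False  # the rollback has already undone the partial update
```

With the simulated crash, the credit to the second account is rolled back, so the database never shows money that was added to one account without being removed from the other.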
Transactions in SQL
Each SQL statement is normally a transaction by itself. By default, transactions in SQL are executed in a serializable manner. The START TRANSACTION command is used to start a transaction, and COMMIT or ROLLBACK is used to end it. In program interfaces, a transaction begins whenever the database is accessed and ends when either a COMMIT or a ROLLBACK statement is executed.
Read-only Transactions
Any transaction that reads and then writes some data into the database is prone to serialization problems. When a transaction only reads data and does not write data, we have more freedom to let the transaction execute in parallel with other transactions. For example, suppose we wrote a function that read data to determine whether a certain seat was available; we could execute many invocations of this function at once, without risk of permanent harm to the database.
Dirty Read
Dirty data is a common term for data written by a transaction that has not yet committed. A Dirty read is a read of dirty data. The risk in reading dirty data is that the transaction that wrote it may eventually abort.
Dirty Read
Sometimes the dirty read matters and sometimes it doesn't. When it doesn't, it makes sense to risk an occasional dirty read and thus avoid:
Time-consuming work by the DBMS that is needed to prevent dirty reads
Loss of parallelism that results from waiting until there is no possibility of a dirty read
Example 8.30
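Reconsider the transfer of Example 8.27, implemented by a program P that executes these steps:
1. Add the money to account 2.
2. Test whether account 1 has enough money.
2a. If there is not enough money, remove the money from account 2 and end.
2b. If there is enough money, subtract the money from account 1 and end.
If program P is executed serially, it doesn't matter that we have put money temporarily in account 2. Now suppose dirty reads are possible. Imagine there are 3 accounts, A1 (bal = $100), A2 (bal = $200) and A3 (bal = $300), and suppose 2 transactions T1 and T2 execute program P at roughly the same time:
T1: transfers $150 from A1 to A2
T2: transfers $250 from A2 to A3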
Example 8.30
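Here is a possible sequence of events:
1. T2 executes step 1 (A3 now has $550)
2. T1 executes step 1 (A2 now has $350)
3. T2 executes the test of step 2 (A2 appears to have enough money, $350, which is a dirty read)
4. T1 executes the test of step 2 (A1 does not have enough money, $100)
5. T2 executes step 2b (A2 now has $100)
6. T1 executes step 2a (A2 now has -$50)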
Example 8.30
Here the dirty read is a problem because it drove an account (A2) to a negative balance, although the total amount of money has not changed, i.e. it is still $600 among the 3 accounts.
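The arithmetic of this interleaving can be replayed with a few lines of Python. The sketch assumes, as the step numbering suggests, that program P first credits the destination account (step 1), then tests the source account (step 2), undoing the credit if the test fails (step 2a) and debiting the source otherwise (step 2b). The function name and the dictionary of balances are illustrative only.

```python
def run_interleaving():
    """Replay the six events with dirty reads allowed; return final balances."""
    bal = {"A1": 100, "A2": 200, "A3": 300}

    # T1 transfers $150 from A1 to A2; T2 transfers $250 from A2 to A3.
    bal["A3"] += 250                  # 1. T2 step 1: credit A3
    bal["A2"] += 150                  # 2. T1 step 1: credit A2 (uncommitted)
    t2_enough = bal["A2"] >= 250      # 3. T2 test: sees T1's dirty $350
    t1_enough = bal["A1"] >= 150      # 4. T1 test: A1 has only $100

    if t2_enough:
        bal["A2"] -= 250              # 5. T2 step 2b: debit the source A2
    else:
        bal["A3"] -= 250              #    (step 2a would undo T2's credit)

    if t1_enough:
        bal["A1"] -= 150              #    (step 2b would debit the source A1)
    else:
        bal["A2"] -= 150              # 6. T1 step 2a: undo the credit to A2

    return bal
```

The result leaves A1 at $100, A2 at -$50 and A3 at $550, which still sums to $600.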
Example 8.31
Consider the relation Flights(fltNum, fltDate, fltSeat, occupied). We want to find out whether a particular seat is available and mark it occupied if so, using the following algorithm:
1. We find an available seat and reserve it by setting occupied to TRUE for that seat. If there is none, abort.
2. We ask the customer for approval of the seat. If the customer approves, we commit. If not, we release the seat by setting occupied to FALSE and repeat step 1 to get another seat.
Example 8.31
If two transactions T1 and T2 are executing this algorithm at about the same time, T1 might reserve a seat S that its customer later rejects. If T2 executes step 1 while seat S is marked occupied, the customer for T2 is not given the option to take seat S. A dirty read has occurred, but here the problem is not too serious. Allowing dirty reads in this method of seat choosing makes sense, because it reduces the average processing time for a booking request.
Example 8.31
SQL allows us to specify that dirty reads are acceptable for a given transaction using the following command:
SET TRANSACTION READ WRITE
ISOLATION LEVEL READ UNCOMMITTED;
The first line declares that the transaction may write data. The second line declares that the transaction may run with isolation level READ UNCOMMITTED, i.e. it is allowed to read dirty data, which suits Example 8.31.
The other isolation levels are specified in the same way, with SET TRANSACTION ISOLATION LEVEL X, where:
X = SERIALIZABLE: the transaction must execute as if at a single point in time, with every other transaction occurring either completely before or completely after it.
X = REPEATABLE READ: if the transaction reads data twice, then whatever it saw the first time it will also see the second time (it may see more the second time).
Moreover, all data read at any time must be committed; i.e., REPEATABLE READ is a strictly stronger condition than READ COMMITTED.
Transaction Management
Start of DBMS Internals, Chapter 17 (Garcia-Molina, Ullman, Widom book)
Transaction Manager
It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions. To ensure that transactions are executed correctly and atomically, the transaction manager interacts with:
Log and Recovery Manager
Buffer Manager
Concurrency Control Manager (Scheduler)
Query Processor
Transaction Management
Deadlock Resolution
As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something that another holds. The transaction manager has the responsibility to intervene and cancel one or more transactions to let the others proceed.
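One common way to make this concrete is a waits-for graph, in which an edge from T1 to T2 means that T1 is waiting for a lock held by T2, and a cycle means deadlock. The slide does not say which method the transaction manager uses, so the sketch below is an assumption throughout: the function names, the dictionary representation of the graph and the victim policy (cancel the transaction with the largest identifier in the cycle) are all hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of transactions that form a cycle, or None if there is none.

    waits_for maps each transaction to the set of transactions it waits on.
    """
    visiting, done = set(), set()

    def visit(txn, path):
        if txn in done:
            return None
        if txn in visiting:
            # We came back to a transaction on the current path: a cycle.
            return path[path.index(txn):]
        visiting.add(txn)
        path.append(txn)
        for other in waits_for.get(txn, ()):
            cycle = visit(other, path)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(txn)
        done.add(txn)
        return None

    for txn in waits_for:
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None

def resolve_deadlocks(waits_for):
    """Cancel transactions until no cycle remains; return the cancelled ones."""
    graph = {txn: set(others) for txn, others in waits_for.items()}
    cancelled = []
    while (cycle := find_deadlock(graph)):
        victim = max(cycle)          # hypothetical policy, chosen for simplicity
        cancelled.append(victim)
        graph.pop(victim, None)      # the victim no longer waits for anything
        for others in graph.values():
            others.discard(victim)   # and nobody waits for it any more
    return cancelled
```

For example, if T1 waits for T2, T2 waits for T1 and T3 waits for T1, cancelling T2 is enough to let T1 and then T3 proceed.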
Query Processor
The portion of the DBMS that most affects the performance that the user sees is the query processor. It has two components: the Query Compiler and the Execution Engine.
Query Compiler: translates the query into an internal form called a query plan, whose steps are implementations of relational algebra operations. It has 3 major units:
A query Parser (builds a parse tree from the textual query)
A query Preprocessor (performs semantic checks)
A query Optimizer (finds the best available query plan)
Execution Engine: executes each of the steps in the chosen query plan.
Queries
Find all courses that Mary takes
SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
What happens behind the scenes? The query processor figures out how to answer the query efficiently.
[Figure: query plan tree for the query above. Leaves: Students, Takes, Courses. Operators: selection name=Mary, join sid=sid, join cid=cid.]
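One plan for this query can be executed by hand in plain Python: select the students named Mary, join the result with Takes, join that with Courses, and project the course name. This is only a sketch of what an execution engine might do. The sample rows, the prefixed attribute names such as `S.ssn`, and the helper functions are all invented for illustration, and the join is done with a simple in-memory hash table.

```python
# Hypothetical sample data; attribute names carry the table alias as a prefix.
STUDENTS = [{"S.ssn": 1, "S.name": "Mary"}, {"S.ssn": 2, "S.name": "John"}]
TAKES = [{"T.ssn": 1, "T.cid": 10}, {"T.ssn": 1, "T.cid": 30},
         {"T.ssn": 2, "T.cid": 20}]
COURSES = [{"C.cid": 10, "C.name": "Databases"},
           {"C.cid": 20, "C.name": "Networks"},
           {"C.cid": 30, "C.name": "Compilers"}]

def select(rows, attr, value):
    """Selection: keep the rows whose attribute equals the given value."""
    return [row for row in rows if row[attr] == value]

def join(left, right, left_attr, right_attr):
    """Equijoin: build a hash table on the right input, probe with the left."""
    table = {}
    for row in right:
        table.setdefault(row[right_attr], []).append(row)
    return [{**l, **r} for l in left for r in table.get(l[left_attr], [])]

def project(rows, attr):
    """Projection onto a single attribute."""
    return [row[attr] for row in rows]

def courses_taken_by(name):
    """Run the plan: selection first, then the two joins, then projection."""
    matching = select(STUDENTS, "S.name", name)
    enrolled = join(matching, TAKES, "S.ssn", "T.ssn")
    with_courses = join(enrolled, COURSES, "T.cid", "C.cid")
    return project(with_courses, "C.name")
```

Applying the selection before the joins keeps the intermediate results small, which is the kind of choice a query optimizer makes when it picks a plan.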
Query Compiler
The query compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest. For example, the existence of an index (a specialized data structure that facilitates access to data, given values for one or more components of that data) can make one plan much faster than another.
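The effect of an index on the chosen plan can be observed in SQLite through Python's standard sqlite3 module, using SQLite's EXPLAIN QUERY PLAN statement. This is a sketch on one particular system, not a description of how every DBMS reports its plans. The table definition, the index name `idx_students_name` and the helper functions are assumptions, and the exact wording of the plan text differs between SQLite versions, so the code only checks whether the index name appears in it.

```python
import sqlite3

def plan_text(db, sql):
    """Return the human-readable plan SQLite reports for a query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the description of that plan step.
    return " ".join(row[-1] for row in rows)

def compare_plans():
    """Return the plan text before and after creating an index on name."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Students (ssn INTEGER PRIMARY KEY, name TEXT)")
    query = "SELECT ssn FROM Students WHERE name = 'Mary'"

    before = plan_text(db, query)   # no index on name is available yet
    db.execute("CREATE INDEX idx_students_name ON Students(name)")
    after = plan_text(db, query)    # the optimizer can now use the index
    return before, after
```

Printing the two strings returned by `compare_plans()` shows the plan changing from a scan of the whole table to a search that uses the index.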
Execution Engine
It executes each of the steps in the chosen query plan. The execution engine interacts with most of the other components of the DBMS, either directly or through the buffers. It must get the data from the database (stored on disk) into buffers in order to manipulate the data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.