
ADBMS

MCA 4.5 Jan 10, 2012

Textbook(s)
Main textbook, available at the bookstore:
Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom

Almost identical, and also available at the bookstore:
A First Course in Database Systems, Jeff Ullman and Jennifer Widom
Database System Implementation, Hector Garcia-Molina, Jeff Ullman and Jennifer Widom

Other Texts
Database Management Systems, Ramakrishnan
very comprehensive

Database System Concepts, A. Silberschatz, H. F. Korth and S. Sudarshan, 6th Ed., McGraw-Hill, ISBN 978-007-132522, International Edition, 2011

Fundamentals of Database Systems, Elmasri, Navathe
very widely used

Data on the Web, Abiteboul, Buneman, Suciu
XML and other new/advanced stuff

Course Focus
The main focus of this course is Database System Implementation, i.e. how one builds a DBMS. This subject can in turn be divided into 3 parts:
1. Storage Management: how secondary storage is used efficiently to hold data and allow it to be accessed quickly.
2. Query Processing: how queries expressed in a very high-level language such as SQL can be executed efficiently.
3. Transaction Management: how to support transactions with the ACID properties.

Course Overview
Physical data storage
Blocks on disks, records in blocks, fields in records

Indexing & Hashing
B-Trees, hashing, ...

Query Processing
Methods to execute SQL queries efficiently

Crash Recovery
Failures, stable storage, logging policies, ...

Concurrency Control
Correctness, locks, ...

Transaction Processing
Logs, deadlocks, ...

Security & Integrity
Authorization, encryption, ...

Distributed Databases
Interoperation, distributed recovery, ...
Syllabus Chapter-wise
Ch. 11  Hardware
Ch. 12  File and System Structure
Ch. 13  Indexing and Hashing
Ch. 14  Indexing and Hashing
Ch. 15  Query Processing
Ch. 16  Query Optimization
Ch. 17  Crash Recovery
Ch. 18  Concurrency Control
Ch. 19  Transaction Processing
Ch. 20  Information Integration
Review

Unit 1

Syllabus
Data storage and File Structure: The Memory Hierarchy, Disks, Using Secondary Storage effectively, Accelerating access to Secondary Storage, Disk Failures, Recovery from Disk Crashes, RAID, Representing Data Elements
Indexing and Hashing: Indexes on Sequential Files, Secondary Indexes, B-Trees, Hash Tables, Multidimensional and Bitmap Indexes
Query Execution: Introduction to Physical-Query-Plan Operators, One-Pass Algorithms for Database Operations, Nested Loop Joins, Two-Pass Algorithms Based on Sorting and Hashing, Index-based Algorithms, Buffer Management, Algorithms Using more than 2 Passes, Parallel Algorithms for Relational Operations
Query Optimization: Parsing, Estimating the cost of operations, query optimization
Advanced Transaction Management: Transactions in SQL
Coping with System Failures: Models for Resilient Operation, Undo Logging, Redo Logging, Undo/Redo Logging, Protecting Against Media Failures
Concurrency Control: Serial and Serializable Schedules, Conflict-Serializability, Enforcing Serializability by Locks, Locking Systems, Tree Protocol, Concurrency Control by Timestamps and Validation
Advanced Transaction Processing: Resolving Deadlocks, Distributed Databases, Commit and Locking, Long-duration Transactions
Database System Architectures: Data Models: Review of Relational Data Model and Object-based Model, Semistructured Data, XML and its Data Model, Object-orientation in Query Languages, Logical Query Languages, Centralized and Client-Server Architectures, Server System Architectures, Parallel Databases, Distributed Databases, Deductive Databases
Data Warehousing, Data Mining and Information Retrieval: DSS, OLAP, Data Warehousing, Data Mining, ID3 Algorithm, Classification, Association Rules, Clustering, IR
Spatial Data Management: Time in databases, Spatial and Geographic Data, Multimedia Databases, Mobility and Personal Databases
Misc. Topics: Advanced Application Development


Simplified DBMS structure


[Figure: simplified DBMS structure. Components shown: User/Application, Query processor, Transaction processor, Storage manager, Buffers, and Permanent storage holding Indexes, User Data and System Data]

Why study DBMS implementation techniques?


Computer scientists' core knowledge
Techniques applicable in implementing DBMS-like systems
Understanding of DBMS internals is necessary for database administrators

Note: This course is not about designing DB-based applications or about using some specific database system.

Database Systems
The big commercial database vendors:
Oracle
IBM (with DB2), bought Informix recently
Microsoft (SQL Server)
Sybase

Some free database systems (Unix):
Postgres
MySQL
Predator

Section 8.6 (Garcia-Molina, Ullman, Widom book): Transactions in SQL (Review)
Transactions
Serializability example
Atomicity example
Read-only Transactions
Dirty Reads
Isolation Levels

Transactions
A transaction is a sequence of statements, or a collection of one or more operations on the database, that either all succeed or all fail.
Transactions have the ACID properties:
A = atomicity
C = consistency
I = isolation
D = durability

ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data, the database system must ensure:

Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
Consistency. A transaction moves the database from a state where integrity holds to another state where integrity holds, i.e. the relationships among values are maintained.
Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Transactions
Problem: An application must perform several writes and reads to the database as a unit.
Example: Two people attempt to book the last seat on a flight.
Solution: Multiple actions of the application are bundled into one unit called a transaction.
Transactions guarantee certain properties that prevent such problems.

Serializability
In applications like banking or airline reservations, hundreds of operations per second may be performed on a single database. It is entirely possible for 2 operations to affect the same account or flight and to overlap in time. Consider the following two examples:
Serializability example (Example 8.26)
Atomicity example (Example 8.27)

Example 8.26
Consider a relation:
Flights(fltNum, fltDate, fltSeat, occupied)

Write a function chooseSeat() in PL/SQL that reads the relation Flights for a given flight number, date and seat, finds whether that particular seat is available, and marks it occupied if so.
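The slide asks for PL/SQL; as an illustration only, here is a minimal sketch of the same check-then-update logic in Python with the standard sqlite3 module. The function names, the sample rows and the use of 0/1 for the occupied attribute are assumptions made for this sketch, not part of the textbook example.

```python
import sqlite3


def make_flights_db():
    """Build a tiny in-memory Flights relation (the sample rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Flights "
        "(fltNum INTEGER, fltDate TEXT, fltSeat TEXT, occupied INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Flights VALUES (?, ?, ?, ?)",
        [(123, "2012-01-10", "22A", 0), (123, "2012-01-10", "22B", 1)],
    )
    conn.commit()
    return conn


def choose_seat(conn, flt_num, flt_date, seat):
    """Return True and mark the seat occupied if it was free, else False."""
    row = conn.execute(
        "SELECT occupied FROM Flights "
        "WHERE fltNum = ? AND fltDate = ? AND fltSeat = ?",
        (flt_num, flt_date, seat),
    ).fetchone()
    if row is None or row[0]:
        return False  # unknown seat, or seat already taken
    conn.execute(
        "UPDATE Flights SET occupied = 1 "
        "WHERE fltNum = ? AND fltDate = ? AND fltSeat = ?",
        (flt_num, flt_date, seat),
    )
    conn.commit()
    return True
```

Note that the read and the write are two separate statements; the gap between them is what matters once two callers run at about the same time.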

Serializability Example
Suppose 2 agents are trying to book the same seat on the same flight and date at approximately the same time. If both executions of chooseSeat() check the seat before either one marks it occupied, an error results:
Each execution of chooseSeat() tells its customer that the seat belongs to them.
Both customers believe they have been granted the seat in question.
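The bad interleaving can be replayed by hand. The sketch below is a plain-Python simulation (no database involved, all names hypothetical) that runs the four steps in the problematic order and, for contrast, runs the two bookings one after the other.

```python
def run_interleaved():
    """Replay the bad schedule: both agents read before either writes."""
    seat = {"occupied": False}

    # Step 1: agent A checks the seat and sees it free.
    a_sees_free = not seat["occupied"]
    # Step 2: agent B checks the same seat before A has written anything.
    b_sees_free = not seat["occupied"]

    # Step 3: agent A marks the seat occupied and confirms to its customer.
    a_granted = False
    if a_sees_free:
        seat["occupied"] = True
        a_granted = True

    # Step 4: agent B acts on its stale read and confirms as well.
    b_granted = False
    if b_sees_free:
        seat["occupied"] = True
        b_granted = True

    return a_granted, b_granted


def run_serial():
    """Run the two bookings one completely after the other."""
    seat = {"occupied": False}
    results = []
    for _agent in ("A", "B"):
        if not seat["occupied"]:
            seat["occupied"] = True
            results.append(True)
        else:
            results.append(False)
    return tuple(results)
```

In the interleaved run both agents are told the seat is theirs; in the serial run only the first one is.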

Serial Transaction
An execution of functions operating on the same database is serial if one function executes completely before any other function begins. The execution is serializable if the functions behave as if they were run serially, even though their executions may overlap in time. Clearly, if 2 invocations of chooseSeat() are run serially (one after another) or serializably, then the error we saw cannot occur.

Assuring Serializable Behavior


In practice it is often impossible to require that operations run one after another: there are just too many of them, and some parallelism is required. As a remedy, DBMSs adopt a mechanism for assuring serializable behavior: even if the execution is not serial, the result looks to users as if the operations were executed serially.

Assuring Serializable Behavior


One common approach is for the DBMS to lock elements of the database so that 2 functions cannot access them at the same time. If the function chooseSeat() were written to lock other operations out of the Flights relation, then operations that did not access Flights could still run in parallel with it, but no other invocation of chooseSeat() could.
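As a rough illustration of the locking idea, the sketch below guards an in-memory seat map with a single threading.Lock, so the check and the update form one critical section. The SeatMap class and the helper are hypothetical, and a single coarse lock stands in for whatever locking scheme a real DBMS uses.

```python
import threading


class SeatMap:
    """In-memory stand-in for the Flights relation, guarded by one lock."""

    def __init__(self, seats):
        self._occupied = {seat: False for seat in seats}
        self._lock = threading.Lock()  # one lock for the whole "relation"

    def choose_seat(self, seat):
        # The check and the update happen while holding the lock, so no
        # other caller can slip in between them.
        with self._lock:
            if self._occupied.get(seat, True):
                return False  # unknown seat, or already taken
            self._occupied[seat] = True
            return True


def book_concurrently(seat_map, seat, n_agents):
    """Let n_agents threads compete for the same seat; collect their answers."""
    results = []

    def agent():
        results.append(seat_map.choose_seat(seat))

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

However the threads are scheduled, exactly one agent should be told the seat is theirs.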

Atomicity example
Apart from the nonserializable behavior that can occur when 2 or more database operations are performed at about the same time, a single operation can put the database in an unacceptable state if there is a hardware or software crash while the operation is executing. Example 8.27:
Consider a relation Accounts(acctNo, balance). Write a function transfer() that takes 2 accounts and an amount of money, checks that the first account has at least that much money, and if so moves the money from the first account to the second.
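A minimal sketch of transfer() in Python with sqlite3 follows, assuming integer balances. The crash_midway flag is a hypothetical test hook: it raises an exception between the two updates, which only imitates a crash, but it shows the point of atomicity, namely that the first update must not survive on its own.

```python
import sqlite3


def make_accounts_db(initial):
    """Create an in-memory Accounts(acctNo, balance) relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Accounts (acctNo INTEGER PRIMARY KEY, balance INTEGER)"
    )
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)", initial.items())
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, crash_midway=False):
    """Move amount from src to dst as one unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            (balance,) = conn.execute(
                "SELECT balance FROM Accounts WHERE acctNo = ?", (src,)
            ).fetchone()
            if balance < amount:
                return False
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE acctNo = ?",
                (amount, dst),
            )
            if crash_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE acctNo = ?",
                (amount, src),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current contents of Accounts as a dict."""
    return dict(conn.execute("SELECT acctNo, balance FROM Accounts"))
```

After a simulated crash the balances are the same as before the call, because the partial update is rolled back.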

Transactions in SQL
Each SQL statement is normally a transaction by itself. By default, transactions in SQL are executed in a serializable manner. The START TRANSACTION command starts a transaction, and COMMIT or ROLLBACK ends it. In program interfaces, a transaction begins whenever the database is accessed and ends when either a COMMIT or a ROLLBACK statement is executed.
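The effect of COMMIT versus ROLLBACK can be tried out with Python's sqlite3 module. This is only a sketch: SQLite spells the start of a transaction BEGIN rather than START TRANSACTION, and the table and amounts are made up.

```python
import sqlite3


def demo_commit_and_rollback():
    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # explicit BEGIN / COMMIT / ROLLBACK statements below control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE Accounts (acctNo INTEGER PRIMARY KEY, balance INTEGER)"
    )
    conn.execute("INSERT INTO Accounts VALUES (1, 100)")

    # A transaction that is rolled back leaves no trace.
    conn.execute("BEGIN")
    conn.execute("UPDATE Accounts SET balance = balance - 40 WHERE acctNo = 1")
    conn.execute("ROLLBACK")
    after_rollback = conn.execute("SELECT balance FROM Accounts").fetchone()[0]

    # The same update, committed this time, becomes permanent.
    conn.execute("BEGIN")
    conn.execute("UPDATE Accounts SET balance = balance - 40 WHERE acctNo = 1")
    conn.execute("COMMIT")
    after_commit = conn.execute("SELECT balance FROM Accounts").fetchone()[0]

    return after_rollback, after_commit
```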


Read-only Transactions
Any transaction that reads and then writes some data into the database is prone to serialization problems. When a transaction only reads data and does not write data, we have more freedom to let the transaction execute in parallel with other transactions. For example, suppose we wrote a function that read data to determine whether a certain seat was available; we could execute many invocations of this function at once, without risk of permanent harm to the database.

Read-only Transactions in SQL


To tell the SQL system that the next transaction is read-only, use the command:
SET TRANSACTION READ ONLY;
just before that transaction begins. We can also inform SQL that the coming transaction may write data with the command:
SET TRANSACTION READ WRITE;
which is the default option and thus is unnecessary.
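SQLite does not implement SET TRANSACTION READ ONLY, but its query_only pragma gives a loosely comparable effect at the connection level, which is enough to see what a read-only declaration buys. The sketch below assumes a made-up Flights table; it is an analogy, not the standard SQL mechanism described above.

```python
import sqlite3


def try_write_in_read_only_mode():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE Flights (fltNum INTEGER, fltSeat TEXT, occupied INTEGER)"
    )
    conn.execute("INSERT INTO Flights VALUES (123, '22A', 0)")

    # From here on, this connection may read but not change the data.
    conn.execute("PRAGMA query_only = ON")

    # Reads still work.
    free_seats = conn.execute(
        "SELECT fltSeat FROM Flights WHERE occupied = 0"
    ).fetchall()

    # A write attempt is expected to be rejected by SQLite.
    try:
        conn.execute("UPDATE Flights SET occupied = 1")
        write_rejected = False
    except sqlite3.Error:
        write_rejected = True

    return free_seats, write_rejected
```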

Dirty Read
Dirty data is a common term for data written by a transaction that has not yet committed. A dirty read is a read of dirty data. The risk in reading dirty data is that the transaction that wrote it may eventually abort.

Dirty Read
Sometimes the dirty read matters and sometimes it doesn't, so it can make sense to risk an occasional dirty read and thus avoid:
The time-consuming work by the DBMS that is needed to prevent dirty reads
The loss of parallelism that results from waiting until there is no possibility of a dirty read

Dirty Read examples


Example 8.30: Consider the relation Accounts(acctNo, balance). Suppose we want to transfer money from one account to another, and transfers are implemented by a program P that executes the following sequence of steps:
1. Add money to account_2.
2. Test if account_1 has enough money.
a) If NO: ROLLBACK.
b) If YES: subtract money from account_1 and end.

Example 8.30
If program P is executed serially, it doesn't matter that we have put money temporarily in account_2. Suppose instead that dirty reads are possible. Imagine there are 3 accounts: A1 (bal=$100), A2 (bal=$200), A3 (bal=$300). Suppose 2 transactions T1 and T2 execute program P at roughly the same time:
T1: transfers $150 from A1 to A2
T2: transfers $250 from A2 to A3

Example 8.30
Here is a possible sequence of events:
1. T2 executes step 1
2. T1 executes step 1
3. T2 executes the test of step 2
4. T1 executes the test of step 2
5. T2 executes step 2b
6. T1 executes step 2a

Example 8.30
Here the dirty read is a problem: T2 read the $150 that T1 had added to A2 and later rolled back, so A2 ends with a negative balance, even though the total amount of money has not changed, i.e. there is still $600 among the 3 accounts.
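The six events can be replayed in a few lines of plain Python to check the arithmetic. In this sketch the ROLLBACK of T1 is modeled as taking back the $150 that T1 added, which is an assumption chosen to match the statement that the total stays at $600; the function name is hypothetical.

```python
def run_example_8_30():
    """Replay the six-step schedule of Example 8.30 with dirty reads allowed."""
    balances = {"A1": 100, "A2": 200, "A3": 300}

    # 1. T2 executes step 1: add $250 to A3.
    balances["A3"] += 250
    # 2. T1 executes step 1: add $150 to A2 (uncommitted, i.e. dirty).
    balances["A2"] += 150
    # 3. T2 tests A2; it reads the dirty value, so the test passes.
    t2_has_enough = balances["A2"] >= 250
    # 4. T1 tests A1; $100 is less than $150, so the test fails.
    t1_has_enough = balances["A1"] >= 150
    # 5. T2 executes step 2b: subtract $250 from A2 and end.
    if t2_has_enough:
        balances["A2"] -= 250
    # 6. T1 executes step 2a: ROLLBACK, modeled as undoing its own +150 on A2.
    if not t1_has_enough:
        balances["A2"] -= 150

    return balances
```

Under these assumptions the run ends with A1 = $100, A2 = -$50 and A3 = $550, which still sums to $600.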

Example 8.31
Consider the relation Flights(fltNum, fltDate, fltSeat, occupied), find if a particular seat is available, and make it occupied if so. Use the following algorithm:
1. We find an available seat and reserve it by setting occupied to TRUE for that seat; if there is none, abort.
2. We ask the customer for approval of the seat. If approved, we commit. If not, we release the seat by setting occupied to FALSE and repeat step 1 to get another seat.

Example 8.31
If two transactions T1 and T2 are executing this algorithm at about the same time, T1 might reserve a seat S, which is later rejected by its customer. If T2 executes step 1 at a time when seat S is marked occupied, the customer for that transaction is not given the option to take seat S. A dirty read has occurred, but here the problem is not too serious. Choosing seats this way, with dirty reads allowed, makes sense in order to speed up the average processing time for booking requests.

Example 8.31
SQL allows us to specify that dirty reads are acceptable for a given transaction using the following command:
SET TRANSACTION READ WRITE
ISOLATION LEVEL READ UNCOMMITTED;

The first line declares that the transaction may write data. The second line declares that the transaction may run with the isolation level read uncommitted, i.e. it is allowed to read dirty data, which suits the algorithm of Example 8.31.

SQL Isolation Levels


Isolation levels determine what a transaction is allowed to see. The declaration, valid for one transaction, is:
SET TRANSACTION ISOLATION LEVEL X;

where:
X = SERIALIZABLE: this transaction must execute as if at a single point in time, with all other transactions occurring either completely before or completely after it.

SQL Isolation Levels


X = READ COMMITTED: this transaction can read only committed data.
Example: if transactions are as above, Sally could see the original Sells for statement 1 and the completely changed Sells for statement 2.

X = REPEATABLE READ: if a transaction reads data twice, then what it saw the first time, it will see the second time (it may see more the second time).
Moreover, all data read at any time must be committed; i.e., REPEATABLE READ is a strictly stronger condition than READ COMMITTED.
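The difference between the two levels can be seen in a toy model. The classes below are hypothetical and deliberately naive: the store holds only committed values, a READ COMMITTED reader looks at the latest committed value on every read, and a REPEATABLE READ reader remembers what it saw the first time. Real systems achieve these guarantees with locks or multiple versions, not with this shortcut.

```python
class ToyStore:
    """Holds committed values only; writers call commit() to publish a change."""

    def __init__(self, data):
        self.committed = dict(data)

    def commit(self, key, value):
        self.committed[key] = value


class ReadCommittedReader:
    """Sees the latest committed value on every read."""

    def __init__(self, store):
        self.store = store

    def read(self, key):
        return self.store.committed[key]


class RepeatableReadReader:
    """Sees, on a repeated read, the value it saw the first time."""

    def __init__(self, store):
        self.store = store
        self.seen = {}

    def read(self, key):
        if key not in self.seen:
            self.seen[key] = self.store.committed[key]
        return self.seen[key]
```

If another transaction commits a new value between two reads of the same item, the first reader sees the change and the second does not.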

Transaction Management
Start of DBMS Internals, Chapter 17 (Garcia-Molina, Ullman, Widom book)


Transaction Manager
It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions. In order to assure that transactions are executed correctly and atomically, the transaction manager interacts with:
Log and Recovery Manager
Buffer Manager
Concurrency-Control Manager (Scheduler)
Query Processor

Transaction Management

Log and Recovery Manager


In order to assure durability, every change is logged separately on disk. The log manager follows one of several policies designed to assure that no matter when a system failure or crash occurs, the recovery manager will be able to examine the log of changes and restore the database to some consistent state. The log manager initially writes the log in buffers and negotiates with the buffer manager to make sure that buffers are written to disk (where data can survive a crash) at appropriate times.
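To make the log-then-recover idea concrete, here is a small sketch of one such policy, undo logging, which the syllabus lists. Everything lives in memory and the class and record layout are assumptions for illustration: each write first appends the old value to the log, and recovery walks the log backwards undoing the writes of transactions that never committed.

```python
class LoggedStore:
    """Toy key-value store with an undo log (all in memory, for illustration)."""

    def __init__(self, data):
        self.data = dict(data)  # stands in for the database on disk
        self.log = []           # stands in for the log on stable storage

    def begin(self, txn):
        self.log.append(("START", txn))

    def write(self, txn, key, new_value):
        # Log the OLD value before changing the data.
        self.log.append(("UPDATE", txn, key, self.data[key]))
        self.data[key] = new_value

    def commit(self, txn):
        self.log.append(("COMMIT", txn))

    def recover(self):
        """After a crash: undo, newest first, all writes of uncommitted txns."""
        committed = {rec[1] for rec in self.log if rec[0] == "COMMIT"}
        for rec in reversed(self.log):
            if rec[0] == "UPDATE" and rec[1] not in committed:
                _, _txn, key, old_value = rec
                self.data[key] = old_value
```

A usage example: if T1 commits its writes and T2 is interrupted after writing, calling recover() keeps T1's changes and restores the values T2 overwrote.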

Concurrency-Control Manager or Scheduler


The scheduler must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if the transactions had in fact executed in their entirety, one at a time, i.e. serially. A typical scheduler does its work by maintaining locks on certain pieces of the data. These locks prevent 2 transactions from accessing the same piece of data in ways that interact badly. Locks are generally stored in a main-memory lock table. The scheduler prevents the execution engine (part of the query processor) from accessing locked parts of the database.
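A main-memory lock table can be sketched as a dictionary from database element to lock state. The version below assumes two lock modes, shared ("S") and exclusive ("X"), which the slide does not spell out; the class and method names are hypothetical, and a refused request simply returns False instead of queueing the transaction.

```python
class LockTable:
    """Toy lock table: element -> (mode, set of holding transactions)."""

    def __init__(self):
        self._locks = {}

    def request(self, txn, element, mode):
        """Grant the lock and return True, or return False if txn must wait."""
        held = self._locks.get(element)
        if held is None:
            self._locks[element] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:
            # The sole holder may keep its lock or upgrade it.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self._locks[element] = (new_mode, holders)
            return True
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False  # any combination involving X conflicts

    def release_all(self, txn):
        """Drop every lock held by txn, e.g. at commit or abort."""
        for element in list(self._locks):
            _mode, holders = self._locks[element]
            holders.discard(txn)
            if not holders:
                del self._locks[element]
```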

Deadlock Resolution
As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something another has. The transaction manager has the responsibility to intervene and cancel one or more transactions to let the others proceed.
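One common way to detect such a situation is a waits-for graph, where an edge from T1 to T2 means T1 is waiting for a lock T2 holds; a cycle in the graph is a deadlock. The slide does not name this technique, so the sketch below, including its arbitrary choice of victim, is an assumption for illustration.

```python
def find_deadlock(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none.

    waits_for maps each transaction to the set of transactions it waits for.
    """
    visiting, done = set(), set()

    def visit(txn, path):
        if txn in visiting:
            return path[path.index(txn):]  # found a cycle
        if txn in done:
            return None
        visiting.add(txn)
        path.append(txn)
        for other in waits_for.get(txn, ()):
            cycle = visit(other, path)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(txn)
        done.add(txn)
        return None

    for txn in list(waits_for):
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None


def resolve_deadlock(waits_for):
    """Cancel one transaction on a cycle (arbitrary choice) and return it."""
    cycle = find_deadlock(waits_for)
    if cycle is None:
        return None
    victim = cycle[-1]
    waits_for.pop(victim, None)       # the victim no longer waits for anyone
    for waiting in waits_for.values():
        waiting.discard(victim)       # and nobody waits for the victim
    return victim
```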

Query Processor
The portion of the DBMS that most affects the performance that the user sees is the query processor. It has two components: the Query Compiler and the Execution Engine.
Query Compiler: translates the query into an internal form called a query plan, a sequence of operations that are implementations of relational algebra operations. It has 3 major units:
A query Parser (builds a parse tree from the textual query)
A query Preprocessor (performs semantic checks)
A query Optimizer (finds the best available query plan)
Execution Engine: executes each of the steps in the chosen query plan.

Queries
Find all courses that Mary takes

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid

What happens behind the scenes? The query processor figures out how to answer the query efficiently.

Queries, behind the scene


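Declarative SQL query:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid

Imperative query execution plan:
[Figure: plan tree with a projection on sname at the root, joins on cid=cid and sid=sid, a selection name=Mary, and the relations Students, Takes and Courses as leaves]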

The optimizer chooses the best execution plan for a query
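To show what "imperative plan" means, the sketch below executes one possible plan for the Mary query by hand in Python: select the student first, join with Takes, join with Courses, then project the course name. The sample rows, the helper names and the choice of nested-loop joins are assumptions; an optimizer might well pick a different plan.

```python
# Made-up sample relations, each row a dict.
STUDENTS = [{"ssn": 1, "name": "Mary"}, {"ssn": 2, "name": "John"}]
TAKES = [{"ssn": 1, "cid": 10}, {"ssn": 1, "cid": 20}, {"ssn": 2, "cid": 10}]
COURSES = [{"cid": 10, "name": "Databases"}, {"cid": 20, "name": "Compilers"}]


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def join(left, right, left_key, right_key):
    """Nested-loop join: pair up rows whose key values are equal."""
    right = list(right)  # the inner relation is scanned once per outer row
    for l_row in left:
        for r_row in right:
            if l_row[left_key] == r_row[right_key]:
                yield (l_row, r_row)


def courses_of(student_name):
    """One plan: select on name, join Takes, join Courses, project the name."""
    students = select(STUDENTS, lambda s: s["name"] == student_name)
    student_takes = join(students, TAKES, "ssn", "ssn")
    takes_only = (t for _s, t in student_takes)
    with_courses = join(takes_only, COURSES, "cid", "cid")
    return [c["name"] for _t, c in with_courses]
```

Applying the selection before the joins keeps the intermediate results small, which is the kind of decision the optimizer makes.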

Query Compiler
The query compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest. For example, the existence of an index (a specialized data structure that facilitates access to data, given values for one or more components of that data) can make one plan much faster than another.
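The influence of an index on the chosen plan can be observed in SQLite through its EXPLAIN QUERY PLAN statement. The sketch below uses a made-up Students table and a hypothetical index name; the exact wording of the plan text differs between SQLite versions, so it is returned as-is rather than interpreted.

```python
import sqlite3


def plans_before_and_after_index():
    """Return SQLite's plan description for one query, without and with an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (ssn INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO Students VALUES (?, ?)",
        [(i, f"student{i}") for i in range(1000)],
    )
    query = "EXPLAIN QUERY PLAN SELECT ssn FROM Students WHERE name = 'student42'"

    def plan():
        # The last column of each result row is a human-readable description.
        return " | ".join(row[-1] for row in conn.execute(query))

    before = plan()
    conn.execute("CREATE INDEX idx_students_name ON Students(name)")
    after = plan()
    return before, after
```

Before the index exists the description reports a scan of the whole table; afterwards it mentions idx_students_name.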

Execution Engine
It executes each of the steps in the chosen query plan. The execution engine interacts with most of other components of DBMS, either directly or through the buffers. It must get the data from the database (stored on disk) into buffers in order to manipulate the data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.
