
Data Warehousing and OLAP

Dr. Weining Zhang

Data Warehouse

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of the decision-making process
  Modeling and analysis of data for decision makers, not for transaction processing (OLAP vs. OLTP)
  Constructed by integrating multiple heterogeneous data sources
  Keeps data with a historical perspective
  Permanently stores data imported from the operational data sources

W. Zhang

Data Warehousing & OLAP

DW vs. DBMS

Wrapper/mediator vs. DW for information integration
  A mediator is query-driven. Data are stored in heterogeneous resources; the mediator distributes queries and integrates the answers
  A DW is update-driven. Data are loaded into the DW for query processing

DW vs. Operational DBMS
  A DBMS is for OLTP (on-line transaction processing) and daily operation
  A DW is for OLAP (on-line analytical processing) and decision making

OLTP vs. OLAP

                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date,              historical, summarized,
                     detailed, flat relational,        multidimensional,
                     isolated                          integrated, consolidated
usage                repetitive                        ad hoc
access               read/write,                       lots of scans
                     index/hash on primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100 MB to GB                      100 GB to TB
metric               transaction throughput            query throughput, response


Why Separate DW From DBMS?

High performance for both systems
  The DBMS is tuned for OLTP
  The warehouse is tuned for OLAP

Special requirements for decision making
  Query historical data not found in typical databases
  Consolidate (aggregate, summarize) data from heterogeneous sources
  Must reconcile inconsistencies in formats, representations, and codes for data from heterogeneous data sources

Data Cube Model

Fact table and dimension tables

Sales (fact table: Time, Product, and Location are dimensions; Amount is the fact/measure)
Time  Product  Location  Amount
d1    p1       c1        12
d1    p1       c3        50
d1    p2       c1        11
d1    p2       c2         8
d2    p1       c1        44
d2    p1       c2         4

Location (dimension table)
cid  city     state  country
c1   Dallas   TX     USA
c2   Toronto  ON     Canada
c3   Chicago  IL     USA
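To make the role of the two kinds of tables concrete, here is a minimal Python sketch that joins the Sales fact rows to the Location dimension and sums the Amount measure per dimension attribute. The function name `amount_by` is made up for this illustration, and the sketch assumes that Chicago is keyed c3 so that every location id in the fact table resolves to one dimension row.

```python
from collections import defaultdict

# Sales fact table from the slide: (time, product, location id, amount).
SALES = [
    ("d1", "p1", "c1", 12),
    ("d1", "p1", "c3", 50),
    ("d1", "p2", "c1", 11),
    ("d1", "p2", "c2", 8),
    ("d2", "p1", "c1", 44),
    ("d2", "p1", "c2", 4),
]

# Location dimension table. Assumption: Chicago is keyed c3, so that every
# location id used in the fact table matches exactly one dimension row.
LOCATION = {
    "c1": {"city": "Dallas", "state": "TX", "country": "USA"},
    "c2": {"city": "Toronto", "state": "ON", "country": "Canada"},
    "c3": {"city": "Chicago", "state": "IL", "country": "USA"},
}


def amount_by(attribute):
    """Join each fact row to its location row and sum Amount per attribute value."""
    totals = defaultdict(int)
    for _time, _product, cid, amount in SALES:
        totals[LOCATION[cid][attribute]] += amount
    return dict(totals)


by_country = amount_by("country")
by_city = amount_by("city")
print(by_country)
print(by_city)
```

The fact table only stores keys and measures; descriptive attributes such as city or country live in the dimension table and are reached through the join.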

Data Mining: Concepts and Techniques

Data Cube Model

View data as cubes

Sales (the value "all" aggregates over a dimension)
Time  Product  Location  Amount
d1    p1       c1         12
d1    p1       c3         50
d1    p2       c1         11
d1    p2       c2          8
d2    p1       c1         44
d2    p1       c2          4
all   p1       c1         56
all   all      all       129

[Figure: the Sales data drawn as a 3-D cube with axes time (d1, d2, all), product (p1, p2, all), and location (c1, c2, c3, all).]

Define DW Schema

Types of schemas
  Star schema. A fact table with a set of simple dimension tables
  Snowflake schema. Refines a star schema by allowing some dimensions to be modeled by a set of tables
  Fact constellation. Multiple fact tables share dimension tables. Viewed as a collection of stars, hence also called a galaxy schema

Example of Star Schema

Sales Fact Table
  time_key, item_key, branch_key, location_key (keys to the dimensions)
  units_sold, dollars_sold, avg_sales (facts)

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)

Defining a Star Schema

  define cube sales_star [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)
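One way to try the star schema out is with Python's built-in sqlite3 module. The sketch below is a reduced illustration, not the definition language used on the slide: it keeps only two of the four dimensions, names the time table `time_dim`, and loads a few invented rows. The query computes the three measures of `sales_star` with a star join followed by a group-by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced star schema: one fact table and two dimension tables.
# Table name time_dim and all rows below are hypothetical.
conn.executescript(
    """
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item(item_key),
        sales_in_dollars REAL
    );
    """
)
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2024), (2, "Q2", 2024)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(1, "tv", "Acme"), (2, "radio", "Acme"), (3, "lamp", "Brite")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 400.0), (1, 2, 100.0), (1, 3, 50.0),
                  (2, 1, 300.0), (2, 3, 70.0), (2, 3, 30.0)])

# Star join: the fact table is joined to each dimension on its key, then the
# measures are aggregated per (quarter, brand) cell.
rows = conn.execute(
    """
    SELECT t.quarter, i.brand,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales AS s
    JOIN time_dim AS t ON t.time_key = s.time_key
    JOIN item     AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
    """
).fetchall()

for row in rows:
    print(row)
```

Adding the branch and location dimensions would follow the same pattern: one more key column in the fact table and one more join per dimension.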

Example of Snowflake Schema

Sales Fact Table
  time_key, item_key, branch_key, location_key (keys to the dimensions)
  units_sold, dollars_sold, avg_sales (measures)

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
    supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
    city (city_key, city, province_or_state, country)

Example of Fact Constellation

Sales Fact Table
  time_key, item_key, branch_key, location_key (keys to the dimensions)
  units_sold, dollars_sold, avg_sales (measures)

Shipping Fact Table
  time_key, item_key, shipper_key, from_location, to_location (keys to the dimensions)
  dollars_cost, units_shipped (measures)

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)


Types of Aggregate of Measures


Distributive. The result can be obtained by combining partial aggregates computed on partitions of the fact values, and is identical to the result of applying the function to all fact values directly, e.g., count(), sum(), min(), max().
Algebraic. Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g., avg(), min_N(), standard_deviation().
Holistic. There is no constant bound on the storage size needed to describe a subaggregate, e.g., median(), mode(), rank().
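The three categories can be checked on a small example. The Python sketch below splits the six Amount values of the Sales table into two partitions (the split itself is an arbitrary choice made for this illustration) and compares what can be rebuilt from per-partition results.

```python
from statistics import median

# The six Amount values, split into two arbitrary partitions.
partitions = [[12, 50, 11], [8, 44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum of all values.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg needs a bounded number of distributive results per
# partition (here two: a sum and a count).
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: a per-partition median is not enough to recover the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

For the holistic case, the only general way to get the exact answer is to keep the values themselves, which is why no constant-size subaggregate exists.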


A Sample Concept Hierarchy


Levels, from most general to most specific:
  all:      all
  region:   Europe, ..., North_America
  country:  Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
  city:     Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
  office:   L. Chan, ..., M. Wind
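A concept hierarchy can be represented as child-to-parent mappings, one per level, and a measure can then be rolled up level by level. The following Python sketch uses a few members of the hierarchy above; the function name `roll_up` and the sales figures are invented for the illustration.

```python
from collections import defaultdict

# Child-to-parent mappings for two levels of the hierarchy.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {"Germany": "Europe", "Canada": "North_America"}


def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy."""
    rolled = defaultdict(int)
    for member, value in totals.items():
        rolled[parent_of[member]] += value
    return dict(rolled)


# Hypothetical sales per city.
city_sales = {"Frankfurt": 30, "Vancouver": 20, "Toronto": 50}

country_sales = roll_up(city_sales, CITY_TO_COUNTRY)
region_sales = roll_up(country_sales, COUNTRY_TO_REGION)
grand_total = sum(region_sales.values())  # the "all" level

print(country_sales, region_sales, grand_total)
```

Because sum is distributive, each level can be computed from the level below it rather than from the base data.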

Lattice of Cuboids

Each cube is a cuboid
Cuboids form a lattice based on the dimensions in the cubes

  0-D (apex) cuboid:  all
  1-D cuboids:        product; time; location
  2-D cuboids:        product, time; product, location; time, location
  3-D (base) cuboid:  product, time, location

Cube-based OLAP Operations

Roll-up, drill-down, dice, slice, pivot, etc.

[Figure: the Sales cube under roll-up and drill-down: roll up on time, roll up on location, roll up on product, and drill down on product. Rolling up on location leaves totals of 110 for p1, 19 for p2, and 129 overall.]
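The lattice and the operations listed above can be sketched on the six-row Sales table. In this minimal Python illustration (helper names `cuboid`, `slice_`, and `dice` are made up here), every subset of the dimensions gives one cuboid, a roll-up is a move to a cuboid with fewer dimensions, and slice and dice are selections on the base data.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "product", "location")

# Sales fact table: (time, product, location, amount).
SALES = [
    ("d1", "p1", "c1", 12),
    ("d1", "p1", "c3", 50),
    ("d1", "p2", "c1", 11),
    ("d1", "p2", "c2", 8),
    ("d2", "p1", "c1", 44),
    ("d2", "p1", "c2", 4),
]


def cuboid(group_dims):
    """Sum Amount grouped by the given dimensions (a group-by)."""
    positions = [DIMS.index(d) for d in group_dims]
    totals = defaultdict(int)
    for row in SALES:
        totals[tuple(row[p] for p in positions)] += row[3]
    return dict(totals)


# One cuboid per subset of the dimensions: 2**3 = 8 cuboids in total.
lattice = {
    dims: cuboid(dims)
    for k in range(len(DIMS) + 1)
    for dims in combinations(DIMS, k)
}


def slice_(dim, value):
    """Slice: fix one dimension to a single value."""
    position = DIMS.index(dim)
    return [row for row in SALES if row[position] == value]


def dice(**allowed):
    """Dice: keep rows whose value lies in the allowed set for each named dimension."""
    return [
        row for row in SALES
        if all(row[DIMS.index(d)] in values for d, values in allowed.items())
    ]


rolled_up_on_time = lattice[("product", "location")]
p2_slice = slice_("product", "p2")
diced = dice(time={"d1", "d2"}, product={"p1", "p2"}, location={"c1", "c2"})

print(lattice[()])             # apex cuboid
print(lattice[("product",)])   # rolled up on time and location
print(rolled_up_on_time)
print(p2_slice)
print(diced)
```

The apex cuboid holds the grand total 129, and the (p1, c1) cell after rolling up on time holds 56, which agrees with the "all" rows of the Sales table shown earlier.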

Cube-based OLAP Operations

[Figure: the Sales cube under pivot, slice on p2, and dice on d1, d2; p1, p2; and c1, c2.]

Design of a Data Warehouse

A difficult and expensive process
Involves business and technical considerations. Viewed from different perspectives:
  What information is needed
  Models of data sources
  Models of warehouse data
  How to use the data
Needs many experiments and prototypes


Data Warehouse Design Process

Typically
  Choose a business process to model, e.g., orders, invoices, etc.
  Choose the grain (atomic level of data) of the business process
  Choose the dimensions that will apply to each fact table record
  Choose the measure that will populate each fact table record

Data Warehouse Architecture

[Figure: multi-tier architecture. Data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed into data storage (data warehouse, data marts, metadata, monitor & integrator), which serves the OLAP engine (OLAP server), which in turn feeds the front-end tools (analysis, query, reports, data mining).]

Three Data Warehouse Models

Enterprise warehouse
  Collects all of the information about subjects spanning the entire organization
Data mart
  A subset of corporate-wide data that is of value to a specific group of users
  Independent vs. dependent (sourced directly from the warehouse) data marts
Virtual warehouse
  A set of views over operational databases
  Only some of the possible summary views may be materialized

Data Warehouse Development

[Figure: development path from "Define a high-level corporate data model", via model refinement, to data marts, distributed data marts, the enterprise data warehouse, and a multi-tier data warehouse.]

OLAP Server Architectures

Relational OLAP (ROLAP)
  Uses an RDBMS or ORDBMS to store data and OLAP middleware to support missing pieces
  Greater scalability
Multidimensional OLAP (MOLAP)
  Stores cubes in multidimensional arrays
  Fast indexing to summarized data
Hybrid OLAP (HOLAP)
  E.g., low level: relational; high level: array
Specialized SQL servers
  Support SQL queries over star/snowflake schemas

Compute Cube in ROLAP

Extend SQL to have a cube operator
  define cube sales[item, city, year]: sum(sales_in_dollars)
  compute cube sales
Compute each cuboid using a group-by
  Time consuming
Pre-compute cuboids
  For n dimensions, there are at least 2^n cuboids, and many more if the dimensions have hierarchies
  Takes too much space and is costly to maintain
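The growth in the number of cuboids can be made concrete with a short Python sketch. It enumerates the group-by sets for the sales[item, city, year] cube and, for the case with hierarchies, uses the common counting rule that a dimension with L levels contributes L + 1 choices (its L levels plus "all"). The function names and the level counts in the last call are assumptions made for this illustration.

```python
from itertools import combinations
from math import prod


def group_by_sets(dims):
    """All subsets of the dimensions; each subset is one cuboid (one group-by)."""
    return [subset for k in range(len(dims) + 1) for subset in combinations(dims, k)]


def cuboid_count(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] hierarchy levels."""
    return prod(levels + 1 for levels in levels_per_dimension)


sets_for_sales = group_by_sets(("item", "city", "year"))
print(len(sets_for_sales))       # 2**3 group-bys without hierarchies
print(cuboid_count([1, 1, 1]))   # same count: one level per dimension
print(cuboid_count([3, 4, 4]))   # hypothetical hierarchies of 3, 4, and 4 levels
```

With one level per dimension the rule reduces to 2^n, which is why 2^n is the lower bound quoted above.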


Partial Materialization

Pre-compute some cuboids and compute other cuboids when they are needed
  Select cuboids for materialization
    Tradeoff among space limits, maintenance cost, and usefulness to query processing
  Exploit materialized cuboids to answer queries
    Choose the cuboids to use, use indexes, and translate cube operations onto the chosen cuboids
  View (cuboid) maintenance
    Update the materialized cuboids when the source data change

Compute Cube in MOLAP

Partition a multidimensional array into chunks that fit in memory
Compress sparse arrays to conserve space
How to compute the entire cube by reading each chunk only once?
  Read chunks in some ordering
  Different orderings need different buffer space (chunk memory)
  Multi-way array aggregation tries to compute all K-D cuboids in parallel
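The MOLAP idea of reading each chunk once while feeding several aggregates at the same time can be sketched in plain Python. The array size, chunk size, and cell values below are invented, and the sketch keeps whole result planes in dictionaries for simplicity, so it illustrates the single pass over chunks rather than a memory-minimal implementation.

```python
from collections import defaultdict

SIZE, CHUNK = 4, 2              # hypothetical: a 4x4x4 array cut into 2x2x2 chunks
CHUNKS_PER_DIM = SIZE // CHUNK

# Sparse array: only non-empty cells are stored, keyed by (a, b, c).
cells = {
    (a, b, c): 1 + a + 2 * b + 3 * c
    for a in range(SIZE) for b in range(SIZE) for c in range(SIZE)
    if (a + b + c) % 3 == 0
}


def partition(cells):
    """Group the stored cells by the chunk they fall into."""
    chunks = defaultdict(dict)
    for (a, b, c), value in cells.items():
        chunks[(a // CHUNK, b // CHUNK, c // CHUNK)][(a, b, c)] = value
    return dict(chunks)


def multiway_aggregate(chunks):
    """Read every chunk once and update the AB, AC, and BC planes together."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    chunks_read = 0
    for kc in range(CHUNKS_PER_DIM):
        for kb in range(CHUNKS_PER_DIM):
            for ka in range(CHUNKS_PER_DIM):    # A varies fastest
                chunk = chunks.get((ka, kb, kc), {})
                chunks_read += 1
                for (a, b, c), value in chunk.items():
                    ab[(a, b)] += value         # aggregate out C
                    ac[(a, c)] += value         # aggregate out B
                    bc[(b, c)] += value         # aggregate out A
    return dict(ab), dict(ac), dict(bc), chunks_read


ab, ac, bc, chunks_read = multiway_aggregate(partition(cells))
print(chunks_read, sum(ab.values()), sum(ac.values()), sum(bc.values()))
```

Each chunk is visited exactly once, yet all three 2-D cuboids are complete at the end of the scan.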

Multi-way Array Aggregation

[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64, together with its aggregated AB, AC, and BC planes.]

Assume reading chunks 1 to 64 in order
  For the BC plane, compute 1 chunk at a time. Need to keep the partial result for each cell in the chunk
  For the AC plane, compute 4 chunks in parallel. Must keep partial results for each cell in the 4 chunks
  For the AB plane, compute all 16 chunks in parallel. Must keep partial results for all cells in that plane
  Buffer size needed = AB plane + 1 row of AC plane + 1 chunk of BC plane
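The buffer-size rule above can be turned into a small calculation. The Python sketch below generalizes it to any scan order; the function name `buffer_cells` is made up, and the dimension and chunk sizes are assumed values chosen only so that each dimension splits into 4 chunks, as in the 64-chunk figure.

```python
def buffer_cells(fastest, middle, slowest):
    """Buffer cells needed to build all three 2-D planes in one scan.

    Each argument is a (dimension_size, chunk_size) pair, given in the order
    in which the chunk scan advances through the dimensions.
    """
    (size_1, _chunk_1), (size_2, chunk_2), (_size_3, chunk_3) = fastest, middle, slowest
    full_plane = size_1 * size_2    # whole plane over the two fastest dimensions
    one_row = size_1 * chunk_3      # one row of chunks of the (fastest, slowest) plane
    one_chunk = chunk_2 * chunk_3   # one chunk of the (middle, slowest) plane
    return full_plane + one_row + one_chunk


# Assumed sizes (not from the slides): (dimension size, chunk size).
A, B, C = (40, 10), (400, 100), (4000, 1000)

a_fastest = buffer_cells(A, B, C)   # scan order of the slide: A, then B, then C
c_fastest = buffer_cells(C, B, A)   # reversed scan order

print(a_fastest, c_fastest)
```

With these sizes, scanning the smallest dimension fastest keeps the fully buffered plane small, which is the reason the chunk ordering matters.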

Indexing Cube Using Bitmaps

Each value in the column has a bit vector
  Each tuple has a bit in the bit vector
  The i-th bit is set if the i-th tuple of the base table has the value for the indexed column
Use bit operations to search for tuples

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1

Indexing Cube Using Join Index

Join index: JI(R-id, S-id), where R(R-id, ...) joins with S(S-id, ...) on a join condition
Relates the values of the dimensions of a star schema to rows in the fact table
  E.g., fact table Sales and two dimensions, location and product
  A join index on location maintains, for each distinct city, a list of the Sales tuples recording the sales in that city
Join indices can span multiple dimensions
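Both index types can be sketched in a few lines of Python. The bitmap part uses the base table above and stores each bit vector as an integer, so that searching becomes bitwise AND/OR. The join-index part uses hypothetical (row id, city) pairs standing in for a fact table already joined to its location dimension. All function names are made up for this illustration.

```python
BASE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Map each distinct column value to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_customers(rows, bits):
    """Return the Cust value of every row whose bit is set."""
    return [row[0] for i, row in enumerate(rows) if (bits >> i) & 1]


region_index = build_bitmap_index(BASE, 1)
type_index = build_bitmap_index(BASE, 2)

# Region = Europe AND Type = Dealer is a single bitwise AND.
europe_dealers = matching_customers(BASE, region_index["Europe"] & type_index["Dealer"])
# Region = Asia OR Type = Retail is a single bitwise OR.
asia_or_retail = matching_customers(BASE, region_index["Asia"] | type_index["Retail"])

# Join index: for each city, the ids of the fact rows sold in that city.
# Hypothetical (fact row id, city) pairs.
SALES_CITY = [(0, "Dallas"), (1, "Chicago"), (2, "Dallas"), (3, "Toronto")]


def build_join_index(pairs):
    index = {}
    for row_id, city in pairs:
        index.setdefault(city, []).append(row_id)
    return index


join_index = build_join_index(SALES_CITY)

print(europe_dealers, asia_or_retail, join_index)
```

The bit vectors for Asia, Europe, and America correspond column by column to the Index on Region table, with row 1 stored in the lowest bit.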


Processing OLAP Queries

Determine the cube operations to perform on the available cuboids
  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied
Explore indexing structures and compressed vs. dense array structures in MOLAP

Data Warehouse Tools & Utilities

Data extraction
  Get data from heterogeneous external sources
Data cleaning
  Detect and rectify errors in the data
Data transformation
  Convert data from host format to DW format
Load
  Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
  Propagate source updates to the warehouse

DW and Data Mining

DW prepares data for mining
  Integrated, consistent, cleaned
DW provides infrastructure for data collection and analysis
OLAP is a simple form of mining and data exploration
Mining should be a part of DW operations
  Provides more powerful tools to analyze data and extract knowledge

Summary

Data warehouse
  A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
A multidimensional model of a data warehouse
  Star schema, snowflake schema, fact constellations
  A data cube consists of dimensions and measures

Summary

OLAP operations: drilling, rolling, slicing, dicing, and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
  Partial vs. full vs. no materialization
  Multi-way array aggregation
  Bitmap index and join index implementations
Further development of data cube technology
  Discovery-driven and multi-feature cubes
  From OLAP to OLAM (on-line analytical mining)
