A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
Modeling and analysis of data for decision makers, not for transaction processing (OLAP vs. OLTP)
Constructed by integrating multiple heterogeneous data sources
Keep data with a historical perspective
Permanently store data imported from the operational data sources
W. Zhang
DW vs. DBMS
[Figure: Sales table. Columns shown: time, cid, state. Columns are labeled as either dimensions or facts/measures.]
Data Warehousing & OLAP
Define DW Schema
Types of schemas
Star schema. A fact table with a set of simple dimension tables
Snowflake schema. Refine a star schema by allowing some dimensions to be modeled by a set of tables
Fact constellation. Multiple fact tables share dimension tables. Viewed as a collection of stars, thus called galaxy schema or fact constellation
[Figure: Sales data cube with axis values d1, d2, p1, p2, c1, c2, c3, plus "all" aggregates on each axis]
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
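The cube definition above maps naturally onto relational tables. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for the warehouse; the sample rows are made up for illustration and are not from the slides. It creates one table per dimension plus a sales fact table, then computes the three measures grouped by quarter and country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "time"   (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw measure.
CREATE TABLE sales    (time_key INTEGER REFERENCES "time",
                       item_key INTEGER REFERENCES item,
                       branch_key INTEGER REFERENCES branch,
                       location_key INTEGER REFERENCES location,
                       sales_in_dollars REAL);

-- Made-up sample rows, for illustration only.
INSERT INTO "time" VALUES (1, 1, 'Mon', 1, 'Q1', 2024), (2, 2, 'Tue', 4, 'Q2', 2024);
INSERT INTO item VALUES (1, 'p1', 'b1', 't1', 's1'), (2, 'p2', 'b2', 't2', 's2');
INSERT INTO branch VALUES (1, 'main', 'store');
INSERT INTO location VALUES (1, '1 Main St', 'Toronto', 'ON', 'Canada'),
                            (2, '2 Lake St', 'Chicago', 'IL', 'USA');
INSERT INTO sales VALUES (1, 1, 1, 1, 100.0), (1, 2, 1, 1, 50.0),
                         (2, 1, 1, 2, 200.0), (2, 2, 1, 2, 25.0),
                         (2, 2, 1, 1, 75.0);
""")

# The three measures of sales_star, grouped by two dimension attributes.
report = conn.execute("""
SELECT t.quarter, l.country,
       SUM(s.sales_in_dollars) AS dollars_sold,
       AVG(s.sales_in_dollars) AS avg_sales,
       COUNT(*)                AS units_sold
FROM sales AS s
JOIN "time"   AS t ON t.time_key = s.time_key
JOIN location AS l ON l.location_key = s.location_key
GROUP BY t.quarter, l.country
ORDER BY t.quarter, l.country
""").fetchall()

for row in report:
    print(row)
```

Each dimension is reached from the fact table with a single join, which is the property that distinguishes a star schema from a snowflake schema.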
[Figure: snowflake schema for sales. Dimension tables: time, item, supplier, branch, location, city]
Sales Fact Table: time_key, item_key, branch_key
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
city: city_key, city, province_or_state, country
[Figure: concept hierarchy for location. Levels shown include country, city, and office. Example values: North_America, Germany, Spain, Canada, Mexico, Frankfurt, Toronto, M. Wind, ...]
Lattice of Cuboids
Each cube is a cuboid
Cuboids form a lattice based on dimensions in the cubes
[Figure: lattice of cuboids. 0-D (apex) cuboid: all. 1-D cuboids: product, time, location. 2-D cuboids shown: (product, time), (product, location).]
[Figure: roll up on location]
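To make the lattice concrete, here is a minimal sketch that enumerates every cuboid over the three dimensions named in the figure (product, time, location) and aggregates a tiny fact list for any of them. The fact rows and the helper name cuboid are made up for illustration; sum is assumed as the measure.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("product", "time", "location")

# Hypothetical base facts: (product, time, location, sales).
facts = [
    ("p1", "d1", "c1", 12),
    ("p2", "d1", "c1", 11),
    ("p1", "d1", "c2", 50),
    ("p2", "d2", "c2", 8),
]

# Level k of the lattice holds every k-dimensional group-by.
lattice = {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

def cuboid(group_by):
    """Sum sales over every dimension that is not listed in group_by."""
    positions = [dimensions.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)

for level, group_bys in lattice.items():
    print(f"{level}-D cuboids:", group_bys)

print(cuboid(()))            # apex cuboid: grand total
print(cuboid(("product",)))  # 1-D cuboid on product
```

The empty group-by is the apex cuboid and the full group-by on all three dimensions is the base cuboid.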
Data warehouse design is a difficult and expensive process
It involves business and technical considerations
Viewed from different perspectives:
  What information is needed
  Models of data sources
  Models of warehouse data
  How to use the data
A typical design process:
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measures that will populate each fact table record
Data Storage
Enterprise warehouse
collects all of the information about subjects spanning the entire organization
Data mart
a subset of corporate-wide data that is of value to a specific group of users
Independent vs. dependent (sourced directly from the warehouse) data marts
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may
Model refinement
Pre-compute cuboids
For n dimensions, there are at least 2^n cuboids, and many more if dimensions have hierarchies
Takes too much space and is costly to maintain
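The growth in the number of cuboids can be checked with a short calculation. This sketch assumes the standard counting rule: a dimension with L hierarchy levels can appear in a group-by at any of its L levels or be aggregated away entirely, giving L + 1 choices, so the total is the product of (L_i + 1) over all dimensions. The function name cuboid_count and the level counts in the example are illustrative assumptions.

```python
from math import prod

def cuboid_count(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] hierarchy levels."""
    return prod(levels + 1 for levels in levels_per_dimension)

# No hierarchies: every dimension has a single level, so the count is 2^n.
print(cuboid_count([1, 1, 1]))

# Hypothetical hierarchies: two dimensions with 4 levels each, one flat dimension.
print(cuboid_count([4, 4, 1]))
```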
Partial Materialization
Pre-compute some cuboids and compute other cuboids when they are needed
Select cuboids for materialization
  Tradeoff among space limits, maintenance cost, & usefulness to query processing
Exploit materialized cuboids to answer queries
  Choose the cuboids to use, use indexes, and translate cube operations onto the chosen cuboids
View (cuboid) maintenance
  Update the materialized cuboids when the source data is changed
Read chunks in some ordering
Different orderings need different amounts of buffer space (chunk memory)
A multi-way array aggregation tries to compute all K-D cuboids in parallel
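The idea of a single chunk-by-chunk scan can be sketched as follows. This is a simplified illustration, not the full algorithm: it assumes a hypothetical 4 x 4 x 4 array with dimensions A, B, C split into 2 x 2 x 2 chunks, reads each chunk exactly once, and updates all three 2-D cuboids (AB, AC, BC) during that one pass. For simplicity it keeps whole result planes in memory, so it does not model the buffer-space savings that a good chunk ordering provides.

```python
from itertools import product

SIZE, CHUNK = 4, 2

# Hypothetical dense 3-D array indexed as cube[a][b][c].
cube = [[[a * 16 + b * 4 + c for c in range(SIZE)] for b in range(SIZE)]
        for a in range(SIZE)]

def chunk_origins():
    """Yield chunk origins with A varying fastest, then B, then C."""
    for c0, b0, a0 in product(range(0, SIZE, CHUNK), repeat=3):
        yield a0, b0, c0

ab = [[0] * SIZE for _ in range(SIZE)]  # aggregated over C
ac = [[0] * SIZE for _ in range(SIZE)]  # aggregated over B
bc = [[0] * SIZE for _ in range(SIZE)]  # aggregated over A

for a0, b0, c0 in chunk_origins():
    # One read of the chunk contributes to all three 2-D cuboids at once.
    for a, b, c in product(range(a0, a0 + CHUNK),
                           range(b0, b0 + CHUNK),
                           range(c0, c0 + CHUNK)):
        value = cube[a][b][c]
        ab[a][b] += value
        ac[a][c] += value
        bc[b][c] += value

print(ab[0][0], ac[0][0], bc[0][0])
```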
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
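The bitmap index over this base table can be sketched with plain integers used as bit vectors. In this minimal sketch, bit i of a bitmap corresponds to the row at position i (0-based, so one less than the RecID), and the helper names build_bitmap_index and matching_customers are made up for illustration. A conjunctive selection becomes a single bitwise AND.

```python
# Base table rows: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

def matching_customers(bitmap):
    """Return the customers whose row position is set in the bitmap."""
    return [row[0] for i, row in enumerate(base_table) if (bitmap >> i) & 1]

# Region = Asia AND Type = Dealer, evaluated without scanning the base table twice.
asia_dealers = matching_customers(region_index["Asia"] & type_index["Dealer"])
print(asia_dealers)
```

Bitmap indexes work best on low-cardinality columns such as these, since each distinct value needs its own bit vector.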
E.g., fact table Sales and two dimensions, location and product
A join index on location maintains, for each distinct city, a list of the Sales tuples recording the sales in that city
Join indices can span multiple dimensions
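A join index of this kind can be sketched as a dictionary from dimension values to fact-row positions. The dimension contents, the sales rows, and the helper name build_join_index below are hypothetical. The first index covers location alone; the second spans both location and product.

```python
from collections import defaultdict

# Hypothetical dimension tables: key -> attribute value.
location = {1: "Toronto", 2: "Chicago"}
product = {10: "p1", 11: "p2"}

# Hypothetical fact table rows: (sale id, location_key, product_key, dollars).
sales = [
    ("T1", 1, 10, 12.0),
    ("T2", 2, 10, 50.0),
    ("T3", 1, 11, 11.0),
    ("T4", 2, 11, 8.0),
]

def build_join_index(key_of):
    """Precompute, for each dimension value, the positions of the matching sales rows."""
    index = defaultdict(list)
    for position, sale in enumerate(sales):
        index[key_of(sale)].append(position)
    return dict(index)

# Join index on location: city -> sales rows in that city.
city_index = build_join_index(lambda s: location[s[1]])

# Join index spanning two dimensions: (city, product) -> sales rows.
city_product_index = build_join_index(lambda s: (location[s[1]], product[s[2]]))

toronto_total = sum(sales[pos][3] for pos in city_index["Toronto"])
print(city_index, toronto_total)
```

Because the join is precomputed, a query on one city touches only the listed fact rows instead of joining the whole fact table with the dimension.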
Determine to which materialized cuboid(s) the relevant operations should be applied.
Explore indexing structures and compressed vs. dense array structures in MOLAP.
Data extraction:
get data from heterogeneous external sources
Data cleaning:
detect & rectify errors in the data
Data transformation:
convert data from host format to DW format
Load:
sort, summarize, consolidate, compute views, check integrity, & build indices and partitions
Refresh:
propagate source updates to the warehouse
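The extraction, cleaning, transformation, load, and refresh steps can be strung together in a small pipeline. This is a minimal sketch under stated assumptions: the two source extracts, the sales table layout, and all function names are hypothetical, SQLite stands in for the warehouse, and cleaning is reduced to dropping rows with missing values rather than repairing them.

```python
import sqlite3

# Hypothetical extracts from two heterogeneous sources with different layouts.
source_a = [{"cust": "c1", "state": "tx", "amount": "12"},
            {"cust": "c2", "state": "ON", "amount": "50"}]
source_b = [("c2", "IL", 11.0), ("c3", None, 8.0)]

def extract():
    """Data extraction: pull rows from every source into one common tuple shape."""
    for record in source_a:
        yield record["cust"], record["state"], record["amount"]
    yield from source_b

def clean(rows):
    """Data cleaning (simplified): drop rows that have a missing field."""
    return (row for row in rows if all(value is not None for value in row))

def transform(rows):
    """Data transformation: convert host formats to the warehouse format."""
    return [(cust, state.upper(), float(amount)) for cust, state, amount in rows]

def load(conn, rows):
    """Load: insert rows and build an index on the state column."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (cid TEXT, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS sales_state ON sales (state)")
    conn.commit()

def refresh(conn, new_rows):
    """Refresh: propagate later source updates through the same steps."""
    load(conn, transform(clean(new_rows)))

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract())))
refresh(warehouse, [("c1", "tx", "4")])

totals = dict(warehouse.execute("SELECT state, SUM(amount) FROM sales GROUP BY state"))
print(totals)
```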
Summary
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
DW provides infrastructure for data collection and analysis
OLAP is a simple form of mining & data exploration
Mining should be a part of DW operations
Provide more powerful tools to analyze data and extract knowledge
Summary
OLAP operations: drilling, rolling, slicing, dicing, and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
  Partial vs. full vs. no materialization
  Multi-way array aggregation
  Bitmap index and join index implementations