
Data Mining: Concepts and Techniques
— Chapter 4 —

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

©2008 Jiawei Han. All rights reserved.

Chapter 4: Data Cube Technology
- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
- Summary

Efficient Computation of Data Cubes
- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP

Roadmap for Efficient Computation
- Preliminary cube computation tricks (Agarwal et al. '96)
- Computing full/iceberg cubes: 3 methodologies
  - Top-Down: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD'97)
  - Bottom-Up:
    - Bottom-up computation: BUC (Beyer & Ramakrishnan, SIGMOD'99)
    - H-cubing technique (Han, Pei, Dong & Wang, SIGMOD'01)
  - Integrating Top-Down and Bottom-Up:
    - Star-cubing algorithm (Xin, Han, Li & Wah, VLDB'03)
    - High-dimensional OLAP: A Minimal Cubing Approach (Li et al., VLDB'04)
- Computing alternative kinds of cubes

Cube: A Lattice of Cuboids

(Figure: the lattice of cuboids for dimensions time, item, location, supplier.
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: e.g., (time, item), (time, supplier), (item, supplier), ...
- 3-D cuboids: e.g., (time, item, location), (time, location, supplier), (time, item, supplier), (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier).)
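The lattice above can be enumerated mechanically: an n-dimensional cube has 2^n cuboids, one per subset of dimensions. A minimal Python sketch (the dimension names are taken from the slide):

```python
from itertools import combinations

def cuboid_lattice(dims):
    """Enumerate every cuboid in the lattice, from the 0-D apex
    cuboid () up to the n-D base cuboid: 2^n subsets in total."""
    return [c for k in range(len(dims) + 1)
            for c in combinations(dims, k)]

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
print(len(lattice))   # -> 16
print(lattice[0])     # -> ()
print(lattice[-1])    # -> ('time', 'item', 'location', 'supplier')
```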

Preliminary Tricks (Agarwal et al., VLDB'96)
- Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
- Aggregates may be computed from previously computed aggregates, rather than from the base fact table
  - Smallest-child: computing a cuboid from the smallest previously computed cuboid
  - Cache-results: caching the results of a cuboid from which other cuboids are computed, to reduce disk I/O
  - Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
  - Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
  - Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used


Multi-Way Array Aggregation
- Array-based "bottom-up" algorithm
- Uses multi-dimensional chunks
- No direct tuple comparisons
- Simultaneous aggregation on multiple dimensions
- Intermediate aggregate values are re-used for computing ancestor cuboids
- Cannot do Apriori pruning: no iceberg optimization

(Figure: the lattice all; A, B, C; AB, AC, BC; ABC.)

Multi-way Array Aggregation for Cube Computation (MOLAP)
- Partition the array into chunks (small subcubes that fit in memory)
- Compressed sparse array addressing: (chunk_id, offset)
- Compute aggregates in "multiway" by visiting cube cells in the order that minimizes the number of times each cell must be visited, reducing memory access and storage cost

(Figure: a 4 × 4 × 4 arrangement of 64 numbered chunks over dimensions A (a0–a3), B (b0–b3), and C (c0–c3). What is the best traversing order to do multi-way aggregation?)


Multi-Way Array Aggregation for Cube Computation (Cont.)
- Method: the planes should be sorted and computed according to their size in ascending order
  - Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
- Limitation of the method: it computes well only for a small number of dimensions
  - If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored
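The "simultaneous aggregation on multiple dimensions" idea can be illustrated without chunking: a single scan of a 3-D array can feed all three 2-D planes at once. A toy sketch, with plain nested lists standing in for the chunked MOLAP array:

```python
def multiway_planes(cube):
    """One scan of a 3-D array cube[a][b][c], simultaneously
    aggregating into the AB, AC, and BC planes; no cell is visited
    twice (the essence of multi-way aggregation, minus chunking)."""
    A, B, C = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * B for _ in range(A)]
    ac = [[0] * C for _ in range(A)]
    bc = [[0] * C for _ in range(B)]
    for a in range(A):
        for b in range(B):
            for c in range(C):
                v = cube[a][b][c]
                ab[a][b] += v    # aggregate C away
                ac[a][c] += v    # aggregate B away
                bc[b][c] += v    # aggregate A away
    return ab, ac, bc

cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # 2 x 2 x 2 toy array
ab, ac, bc = multiway_planes(cube)
print(ab)  # -> [[3, 7], [11, 15]]
```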


Bottom-Up Computation (BUC)
- BUC (Beyer & Ramakrishnan, SIGMOD'99)
- Divides dimensions into partitions and facilitates iceberg pruning
  - If a partition does not satisfy min_sup, its descendants can be pruned
  - If minsup = 1 ⇒ compute the full CUBE!
- No simultaneous aggregation

(Figure: BUC's processing order on the cuboid lattice: 1 all; 2 A, 10 B, 14 C, 16 D; 3 AB, 7 AC, 9 AD, 11 BC, 13 BD, 15 CD; 4 ABC, 6 ABD, 8 ACD, 12 BCD; 5 ABCD.)

BUC: Partitioning
- Usually, the entire data set can't fit in main memory
- Sort distinct values, partition into blocks that fit
- Continue processing
- Optimizations
  - Partitioning: external sorting, hashing, counting
  - Sort ordering of dimensions to encourage pruning: cardinality, skew, correlation
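BUC's partition-and-prune recursion can be sketched in a few lines. This is an illustrative count-only version, not the full algorithm (no external partitioning or dimension-ordering optimizations):

```python
def buc(tuples, dims, min_sup, prefix=(), out=None):
    """BUC-style bottom-up iceberg cube (count measure): group on one
    dimension at a time; any partition smaller than min_sup is
    discarded together with all of its descendant group-bys."""
    if out is None:
        out = {}
    out[prefix] = len(tuples)                 # emit this cell's count
    for i, d in enumerate(dims):
        parts = {}
        for t in tuples:                      # partition on dimension d
            parts.setdefault(t[d], []).append(t)
        for val, part in parts.items():
            if len(part) >= min_sup:          # iceberg (Apriori) pruning
                buc(part, dims[i + 1:], min_sup,
                    prefix + ((d, val),), out)
    return out

rows = [{"A": "a1", "B": "b1"}, {"A": "a1", "B": "b1"},
        {"A": "a1", "B": "b2"}, {"A": "a2", "B": "b1"}]
cube = buc(rows, ["A", "B"], min_sup=2)
print(cube[()])                               # -> 4
print(cube[(("A", "a1"), ("B", "b1"))])       # -> 2
```

Note how the cell (A=a2) never appears in the output: its partition has only one tuple, so the whole a2 subtree is pruned.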


H-Cubing: Using H-Tree Structure
- Bottom-up computation
- Exploring an H-tree structure
- If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
- No simultaneous aggregation

(Figure: the lattice all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD.)

H-tree: A Prefix Hyper-tree

Base table:

Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Jan | Tor | Hhd | TV | 800 | 1200
Jan | Tor | Edu | Camera | 1160 | 1280
Feb | Mon | Bus | Laptop | 1500 | 2500
Mar | Van | Edu | HD | 540 | 520

(Figure: the H-tree built from this table. A header table lists each attribute value (Edu, Hhd, Bus; Jan, Feb, ...; Tor, Van, Mon, ...) with its quant-info (e.g., Sum: 2285 for Edu) and a side-link chaining all tree nodes carrying that value. The tree runs root → cust_grp (edu, hhd, bus) → month (Jan, Mar, Jan, Feb) → city (Tor, Van, Tor, Mon); leaf nodes carry quant-info such as Sum: 1765, Cnt: 2, and bins.)
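The prefix-sharing idea behind the H-tree can be sketched as a plain prefix tree with a header table of side-links; quant-info and bins are omitted, and the `grp`/`month`/`city` field names are placeholders for the slide's columns:

```python
class HNode:
    def __init__(self, value):
        self.value, self.count, self.children = value, 0, {}

def build_htree(rows, attr_order):
    """Build a prefix tree over the tuples: tuples sharing a prefix of
    attribute values share a path, with a count at every node. The
    header table keeps a side-link list of all nodes per attribute
    value (quant-info and bins of the real H-tree are omitted)."""
    root = HNode("root")
    header = {}
    for row in rows:
        node = root
        node.count += 1
        for attr in attr_order:
            v = row[attr]
            if v not in node.children:
                node.children[v] = HNode(v)
                header.setdefault(v, []).append(node.children[v])
            node = node.children[v]
            node.count += 1
    return root, header

# The five tuples from the slide's base table
rows = [
    {"grp": "Edu", "month": "Jan", "city": "Tor"},
    {"grp": "Hhd", "month": "Jan", "city": "Tor"},
    {"grp": "Edu", "month": "Jan", "city": "Tor"},
    {"grp": "Bus", "month": "Feb", "city": "Mon"},
    {"grp": "Edu", "month": "Mar", "city": "Van"},
]
root, header = build_htree(rows, ["grp", "month", "city"])
print(root.count, len(header["Tor"]))  # 5 tuples; 2 distinct Tor nodes
```

The side-link lists in `header` are what make "computing cells involving City" possible without rescanning the tree.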

Computing Cells Involving "City"

(Figure: using the header table and side-links to compute cells that involve City. A header table H_Tor collects the quant-info found along the Tor side-link; rolling up moves from (*, *, Tor) to (*, Jan, Tor). The tree and header tables are the ones on the previous slide.)

Computing Cells Involving Month But No City
- Compute cells involving month but no city

(Figure: the city level of the tree is rolled up, leaving root → cust_grp → month, with Q.I. entries at the month nodes.)

- Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!

Computing Cells Involving Only Cust_grp
- Compute directly at the root using its header table

(Figure: the root with its header table (Quant-Info Sum: 2285, side-links) and the month/city subtrees below.)


Star-Cubing: An Integrating Method
- D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03
- Explore shared dimensions
  - E.g., dimension A is the shared dimension of ACD and AD
  - ABD/AB means cuboid ABD has shared dimensions AB
- Allows for shared computations
  - E.g., cuboid AB is computed simultaneously as ABD
- Aggregate in a top-down manner, but with a bottom-up sub-layer underneath that allows Apriori pruning
- Shared dimensions grow in bottom-up fashion

(Figure: the cuboid lattice annotated with shared dimensions, e.g., C/C, D; AC/AC, AD/A, BC/BC, BD/B, CD; ACD/A; ABC/ABC, ABD/AB, BCD; ABCD/all.)

Iceberg Pruning in Shared Dimensions
- If the measure is anti-monotonic, and the aggregate value on a shared dimension does not satisfy the iceberg condition, then none of the cells extended from this shared dimension can satisfy the condition either
- Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning
- Problem: how to prune while still aggregating simultaneously on multiple dimensions?

Cell Trees
- Use a tree structure similar to the H-tree to represent cuboids
- Collapses common prefixes to save memory
- Keeps a count at each node
- Traverse the tree to retrieve a particular tuple

Star Attributes and Star Nodes
- Intuition: if a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish such values during the iceberg computation (e.g., d2, d3 below)
- Solution: replace such attributes by a *. These attributes are star attributes, and the corresponding nodes are star nodes.

A | B | C | D | Count
a1 | b1 | c1 | d1 | 1
a1 | b1 | c4 | d3 | 1
a1 | b2 | c2 | d2 | 1
a2 | b4 | c3 | d4 | 1

Example: Star Reduction
- Suppose minsup = 2
- Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *, and collapse tuples whose attributes have all been replaced with the star attribute:

A | B | C | D | Count
a1 | b1 | * | * | 1
a1 | b1 | * | * | 1
a1 | * | * | * | 1
a2 | * | c3 | d4 | 1
a2 | * | c3 | d4 | 1

becomes

A | B | C | D | Count
a1 | b1 | * | * | 2
a1 | * | * | * | 1
a2 | * | c3 | d4 | 2

- With regard to the iceberg computation, this new table is a lossless compression of the original table
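The star reduction above is easy to mechanize: compute 1-D counts, star out infrequent values, then collapse identical rows. A sketch, using a 5-tuple base table chosen to be consistent with the reduced table on the slide (the slide's own base table is only partially legible, so the b3/b4 values are assumptions):

```python
from collections import Counter

def star_reduce(table, min_sup):
    """Star reduction: compute each column's 1-D value counts,
    replace values with count < min_sup by '*', then collapse
    identical starred rows, summing their counts."""
    ncols = len(table[0])
    counts = [Counter(row[i] for row in table) for i in range(ncols)]
    reduced = Counter()
    for row in table:
        starred = tuple(v if counts[i][v] >= min_sup else "*"
                        for i, v in enumerate(row))
        reduced[starred] += 1
    return reduced

# Assumed base table (b3/b4 are placeholders), min_sup = 2
rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
print(star_reduce(rows, 2))
```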

Star Tree
- Given the new compressed table, construct the corresponding cell tree, called a star tree
- Keep a star table at the side for easy lookup of star attributes
- The star tree is a lossless compression of the original cell tree

A | B | C | D | Count
a1 | b1 | * | * | 2
a1 | * | * | * | 1
a2 | * | c3 | d4 | 2

Star-Cubing Algorithm—DFS on Lattice Tree

(Figure: the lattice tree all → A/A, B/B, C/C, D/D → AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD → ABC/ABC, ABD/AB, ACD/A, BCD → ABCD, traversed depth-first; shown alongside the base star tree (root: 5; a1: 3, a2: 2; b*: 1, b1: 2, b*: 2; c*: 1, c3: 2, c*: 2; d*: 1, d4: 2, d*: 2) and the BCD: 5 tree. The multi-way aggregation order is BCD, ACD/A, ABD/AB, ABC/ABC, ABCD.)

Star-Cubing Algorithm—DFS on Star-Tree

Multi-Way Star-Tree Aggregation
- Perform a depth-first search (DFS) on the base star tree
- At each new node in the DFS, create the corresponding star trees that are descendants of the current tree according to the integrated traversal ordering
  - E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created
  - When DFS reaches b*, the ABD/AB tree is created

Multi-Way Aggregation (2)
- The DFS then backtracks
- On every backtracking branch, the counts in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed
- Example
  - When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed
  - When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed
  - When at b*, jump to b1 and repeat a similar process


The Curse of Dimensionality
- Cube size grows explosively with high dimensionality!
- Example: a database of 600k tuples; each dimension has cardinality 100 and Zipf skew of 2.

Motivation of High-D OLAP
- X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
- Challenge to current cubing methods:
  - The "curse of dimensionality" problem
  - Full materialization: still significant overhead in high dimensions
- High-D OLAP is needed in applications
  - Science and engineering analysis

Fast High-D OLAP with Minimal Cubing
- Observation: OLAP occurs only on a small number of dimensions at a time
- Semi-Online Computational Model
  1. Partition the set of dimensions into shell fragments
  2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices
  3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online

Properties of the Proposed Method
- Reduces a high-dimensional cube into a set of lower-dimensional cubes
- Online re-construction of the original high-dimensional space
- Lossless reduction
- Offers tradeoffs between the amount of pre-processing and the speed of online computation

Example Computation
- Let the cube aggregation function be count

tid | A | B | C | D | E
1 | a1 | b1 | c1 | d1 | e1
2 | a1 | b2 | c1 | d2 | e1
3 | a1 | b2 | c1 | d1 | e2
4 | a2 | b1 | c1 | d1 | e2
5 | a2 | b1 | c1 | d1 | e3

- Divide the dimensions into shell fragments: (A, B, C) and (D, E)

1-D Inverted Indices
- Build a traditional inverted index or RID list

Attribute Value | TID List | List Size
a1 | 1 2 3 | 3
a2 | 4 5 | 2
b1 | 1 4 5 | 3
b2 | 2 3 | 2
c1 | 1 2 3 4 5 | 5
d1 | 1 3 4 5 | 4
d2 | 2 | 1
e1 | 1 2 | 2
e2 | 3 4 | 2
e3 | 5 | 1
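Building these RID lists takes one scan of the base table. A minimal sketch over the 5-tuple example table:

```python
def inverted_indices(table, dims):
    """One scan of the base table builds every 1-D inverted (RID)
    list: (dimension, attribute value) -> tid list in ascending order."""
    index = {}
    for tid, row in enumerate(table, start=1):
        for d in dims:
            index.setdefault((d, row[d]), []).append(tid)
    return index

table = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d2", "E": "e1"},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d1", "E": "e2"},
    {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e2"},
    {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e3"},
]
idx = inverted_indices(table, ["A", "B", "C", "D", "E"])
print(idx[("A", "a1")])  # -> [1, 2, 3]
print(idx[("D", "d1")])  # -> [1, 3, 4, 5]
```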

Shell Fragment Cubes: Ideas
- Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense
- Compute all cuboids for data cubes ABC and DE while retaining the inverted indices
- For example, shell fragment cube ABC contains 7 cuboids: A, B, C; AB, AC, BC; ABC
- These cuboids are materialized during the offline computation stage, e.g., for AB:

Cell | Intersection | TID List | List Size
a1 b1 | {1, 2, 3} ∩ {1, 4, 5} | 1 | 1
a1 b2 | {1, 2, 3} ∩ {2, 3} | 2 3 | 2
a2 b1 | {4, 5} ∩ {1, 4, 5} | 4 5 | 2
a2 b2 | {4, 5} ∩ {2, 3} | ∅ | 0
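A 2-D fragment cuboid is exactly the pairwise intersection of the 1-D TID lists, dropping empty cells. A sketch using the slide's A and B lists:

```python
def fragment_cuboid(index, d1, d2):
    """Materialize the cells of a 2-D cuboid (e.g., AB) inside a
    shell fragment by intersecting the 1-D TID lists; cells with an
    empty intersection (like a2 b2 on the slide) are dropped."""
    cells = {}
    for (da, va), ta in index.items():
        if da != d1:
            continue
        for (db, vb), tb in index.items():
            if db != d2:
                continue
            inter = sorted(set(ta) & set(tb))
            if inter:
                cells[(va, vb)] = inter
    return cells

index = {("A", "a1"): [1, 2, 3], ("A", "a2"): [4, 5],
         ("B", "b1"): [1, 4, 5], ("B", "b2"): [2, 3]}
print(fragment_cuboid(index, "A", "B"))
# -> {('a1', 'b1'): [1], ('a1', 'b2'): [2, 3], ('a2', 'b1'): [4, 5]}
```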

Shell Fragment Cubes: Size and Design
- Given a database of T tuples, D dimensions, and shell fragment size F, the fragment cubes' space requirement is O(T * ceil(D/F) * (2^F − 1))
- For F < 5, the growth is sub-linear
- Fragment groupings can be arbitrary to allow for maximum online performance
  - Known common combinations (e.g., <city, state>) should be grouped together
- Shell fragment sizes can be adjusted for an optimal balance between offline and online computation

ID_Measure Table
- If measures other than count are present, store them in an ID_measure table separate from the shell fragments

tid | count | sum
1 | 5 | 70
2 | 3 | 10
3 | 8 | 20
4 | 5 | 40
5 | 2 | 30

The Frag-Shells Algorithm
1. Partition the set of dimensions into a set of fragments (P1, …, Pk)
2. Scan the base table once, building the inverted indices (and the ID_measure table if needed)
3. For each fragment Pi, compute its fragment cube by intersecting TID lists in a bottom-up fashion

Frag-Shells (2)

(Figure: the dimensions A B C D E F … are split into fragment cubes, e.g., an ABC cube and a DEF cube; within a fragment, each cuboid (the D cuboid, the EF cuboid, the DE cuboid, …) stores cells with tuple-ID lists:)

Cell | Tuple-ID List
d1 e1 | {1, 3, 8, 9}
d1 e2 | {2, 4, 6, 7}
d2 e1 | {5, 10}
… | …

Online Query Computation: Query
- A query has the general form ⟨a1, a2, …, an : M⟩
- Each ai has 3 possible values:
  1. Instantiated value
  2. Aggregate (*) function
  3. Inquire (?) function
- E.g., a query with two inquired (?) dimensions returns a 2-D data cube

Online Query Computation: Method
- Given the fragment cubes, process a query as follows:
  1. Divide the query into fragments, the same as the shell
  2. Fetch the corresponding TID list for each fragment from the fragment cube
  3. Intersect the TID lists from each fragment to construct the instantiated base table
  4. Compute the data cube using the base table with any cubing algorithm
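Steps 2 and 3 amount to set intersection over the stored TID lists. A sketch for a fully instantiated count query, using the slide's 1-D lists as the index:

```python
def answer_count_query(index, query):
    """Answer a fully instantiated query: fetch the TID list of each
    instantiated dimension value and intersect across dimensions.
    The surviving TIDs form the instantiated base table; their number
    is the count measure."""
    tids = None
    for dim, val in query.items():
        t = set(index.get((dim, val), []))
        tids = t if tids is None else tids & t
    return sorted(tids) if tids else []

# The 1-D TID lists from the running 5-tuple example
index = {
    ("A", "a1"): [1, 2, 3], ("A", "a2"): [4, 5],
    ("B", "b1"): [1, 4, 5], ("B", "b2"): [2, 3],
    ("C", "c1"): [1, 2, 3, 4, 5],
    ("D", "d1"): [1, 3, 4, 5], ("D", "d2"): [2],
    ("E", "e1"): [1, 2], ("E", "e2"): [3, 4], ("E", "e3"): [5],
}
print(answer_count_query(index, {"B": "b1", "D": "d1"}))  # -> [1, 4, 5]
```

With inquired (?) dimensions, the same intersection yields the instantiated base table, and any cubing algorithm finishes the job.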

Online Query Computation: Sketch

(Figure: dimensions A B C D E F G H I J K L M N …; the relevant fragment TID lists are intersected to form an instantiated base table, from which the online cube is computed.)

Experiment: Size vs. Dimensionality (50 and 100 cardinality)

(Figure. Setting (100-C): 10^6 tuples, skew 2, cardinality 100, fragment size 2.)

Experiment: Size vs. Shell-Fragment Size

(Figure. Setting (100-D): 10^6 tuples, 100 dimensions, skew 2, cardinality 25.)

Experiment: Run-time vs. Shell-Fragment Size

(Figure. Setting: fragment size 3, 3 instantiated dimensions.)

Experiments on Real-World Data
- 54 dimensions, 581K tuples
  - Shell fragments of size 2 took 33 seconds and 325 MB to compute
  - 3-D subquery with 1 instantiated dimension: 85 ms to 1.4 sec
- Longitudinal Study of Vocational Rehab. Data: 24 dimensions, 8818 tuples
  - Shell fragments of size 3 took 0.9 seconds and 60 MB to compute
  - 5-D query with 0 instantiated dimensions: 227 ms to 2.6 sec

High-D OLAP: Further Implementation Considerations
- Incremental update:
  - Append new TIDs to the inverted lists
  - Add <tid: measure> entries to the ID_measure table
- Incrementally adding new dimensions:
  - Form new inverted lists and add new fragments
- Bitmap indexing:
  - May further improve space usage and speed
- Inverted index compression:
  - Store TID lists as d-gaps
  - Explore more IR compression methods
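The d-gap trick stores each TID as its difference from the previous one, so long, dense lists become streams of small integers that byte-level codes compress well. A sketch:

```python
def to_dgaps(tids):
    """Encode a sorted TID list as d-gaps: the first TID, then each
    difference from the previous TID."""
    if not tids:
        return []
    return [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]

def from_dgaps(gaps):
    """Decode d-gaps back into the original TID list by running sum."""
    tids, cur = [], 0
    for g in gaps:
        cur += g
        tids.append(cur)
    return tids

print(to_dgaps([1, 3, 4, 5]))    # -> [1, 2, 1, 1]
print(from_dgaps([1, 2, 1, 1]))  # -> [1, 3, 4, 5]
```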

Chapter 4: Data Cube Technology
- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
  - Discovery-Driven Exploration of Data Cubes
  - Sampling Cube
  - Prediction Cube
  - Regression Cube
- Summary

Discovery-Driven Exploration of Data Cubes
- Hypothesis-driven: exploration by the user; huge search space
- Discovery-driven (Sarawagi et al. '98)
  - Effective navigation of large OLAP data cubes
  - Pre-compute measures indicating exceptions, to guide the user in the data analysis at all levels of aggregation
  - Exception: a value significantly different from the value anticipated, based on a statistical model
  - Visual cues such as background color are used to reflect the degree of exception of each cell

Kinds of Exceptions and Their Computation
- Parameters:
  - SelfExp: surprise of a cell relative to other cells at the same level of aggregation
  - InExp: surprise beneath the cell
  - PathExp: surprise beneath the cell for each drill-down path
- Computation of the exception indicators (model fitting and computing the SelfExp, InExp, and PathExp values) can be overlapped with cube construction
- Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates

Examples: Discovery-Driven Data Cubes

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
- Multi-feature cubes (Ross et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
- Ex.: grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum-price tuples

  select item, region, month, max(price), sum(R.sales)
  from purchases
  where year = 1997
  cube by item, region, month: R
  such that R.price = max(price)

- Continuing the last example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set

Chapter 4: Data Cube Technology
- Exploration and Discovery in Multidimensional Databases
  - Discovery-Driven Exploration of Data Cubes
  - Sampling Cube
    - X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, "Sampling Cube: A Framework for Statistical OLAP over Sampling Data", SIGMOD'08
  - Prediction Cube

Statistical Surveys and OLAP
- Statistical survey: a popular tool to collect information about a population based on a sample
  - Ex.: TV ratings, US Census, election polls
- An efficient way of collecting information (data collection is expensive)
- Many statistical tools are available to determine validity
  - Confidence intervals
  - Hypothesis tests

Surveys: Sample vs. Whole Population
- Data is only a sample of the population

(Table: sample data for ages 18, 19, 20.)

Problems for Drilling in Multidim. Space
- Data is only a sample of the population, but samples can be small when drilling down to certain multidimensional subspaces

(Table: Age (18, 19, 20) × Education (High-school, College, Graduate).)

Traditional OLAP Data Analysis Model

(Figure: the full data (Age, Education, Income) flows into a data warehouse and then into a data cube.)

- Query semantics is population-based, e.g., "What is the average income of 19-year-old college students?"

OLAP on Survey (i.e., Sampling) Data
- Semantics of the query is unchanged
- Input data has changed

(Table: Age (18, 19, 20) × Education.)

OLAP with Sampled Data
- Where is the missing link?
  - OLAP is run over sampling data, but our analysis target is the population
- Idea: integrate sampling and statistical knowledge with traditional OLAP tools

Data | Target | Tool
Population | Population | Traditional OLAP
Sample | Population | Not Available

Challenges for OLAP on Sampling Data
- Computing confidence intervals in an OLAP context
- No data?
  - Not exactly: no data in some subspaces of the cube
  - Sparse data
  - Causes include sampling bias and query selection bias
- Curse of dimensionality
  - Survey data can be high dimensional
  - Over 600 dimensions in a real-world example

Example 1: Confidence Interval
- What is the average income of 19-year-old high-school students?
- Return not only the query result but also a confidence interval

(Table: Age (18, 19, 20) × Education (High-school, College, Graduate).)

Confidence Interval
- Confidence interval for the mean: x̄ ± t_c * ŝ_x̄
  - x is a sample of the data set; x̄ is the mean of the sample
  - t_c is the critical t-value, calculated by a look-up
  - ŝ_x̄ is the estimated standard error of the mean
- Example: $50,000 ± $3,000 with 95% confidence
- Treat the points in a cube cell as samples
  - Compute the confidence interval as for a traditional sample set
  - Return the answer in the form of a confidence interval

Efficiently Computing Confidence Interval Measures
- Both the mean and the confidence interval are algebraic
- Why is the confidence interval measure algebraic?
  - ŝ_x̄ = s / sqrt(l) is algebraic, where both s and l (count) are algebraic
- Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
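Because the count, the sum, and the sum of squares are all distributive, a parent cell's mean and confidence interval can be assembled from its children's partial aggregates. A sketch; the critical value t_c is passed in rather than looked up (4.303 below is the two-sided 95% t value for 2 degrees of freedom):

```python
import math

def merge(p, q):
    """Merge two partial aggregates (l, sum, sum_sq). All three are
    distributive, so a parent cuboid cell is computed from its
    children without revisiting the base cuboid."""
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2])

def confidence_interval(agg, t_c):
    """Return (mean, half_width) from a partial aggregate (l, sum,
    sum_sq); half_width = t_c * estimated standard error of the mean.
    t_c is the critical t-value (normally a table look-up)."""
    l, s1, s2 = agg
    mean = s1 / l
    var = (s2 - l * mean * mean) / (l - 1)   # sample variance s^2
    se = math.sqrt(var / l)                  # estimated std error of mean
    return mean, t_c * se

# Two child cells: incomes [10, 20] and [30], as (l, sum, sum_sq)
cell = merge((2, 30, 500), (1, 30, 900))     # -> (3, 60, 1400)
mean, half = confidence_interval(cell, 4.303)
print(mean, half)
```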

Example 2: Query Expansion
- What is the average income of 19-year-old college students?

(Table: Age (18, 19, 20) × Education.)

Boosting Confidence by Query Expansion
- From the example: the queried cell "19-year-old college students" contains only 2 samples
- The confidence interval is large (i.e., low confidence). Why?
  - Small sample size
  - High standard deviation within the samples
- Small sample sizes can occur even at relatively low-dimensional selections
- Collect more data? Expensive!
- Use data in other cells? Maybe, but one has to be careful

Intra-Cuboid Expansion: Choice 1
- Expand the query to include 18- and 20-year-olds?

(Table: Age (18, 19, 20) × Education.)

Intra-Cuboid Expansion: Choice 2
- Expand the query to include high-school and graduate students?

(Table: Age (18, 19, 20) × Education.)

Intra-Cuboid Expansion
- If other cells in the same cuboid satisfy both of the following:
  1. Similar semantic meaning
  2. Similar cube value
- then their data can be combined into the queried cell's own to "boost" confidence
- Only use expansion if necessary
  - A bigger sample size will narrow the confidence interval

Intra-Cuboid Expansion (2)
- Cell segment similarity
  - Some dimensions are clear: Age
  - Some are fuzzy: Occupation
- How to determine whether two cells' samples come from the same population?
  - Two-sample t-test (confidence-based)
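The two-sample test can be computed from the two cells' samples directly; the sketch below uses Welch's unequal-variance t statistic, which is one common variant (the slides do not specify which form is used):

```python
import math

def welch_t(xs, ys):
    """Welch's two-sample t statistic and degrees of freedom: used to
    judge whether two cells' samples plausibly come from the same
    population before merging them. Assumes each sample has at least
    two points and nonzero variance."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

t, df = welch_t([48, 50, 52, 49], [49, 51, 50, 52])
print(t, df)   # |t| is small here: the two cells look mergeable
```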

Inter-Cuboid Expansion
- If a query dimension is:
  - not correlated with the cube value,
  - but is causing a small sample size by drilling down too much,
- then remove the dimension (i.e., generalize it to *) and move to a more general cuboid
- Can use the two-sample t-test to determine the similarity between two cells across cuboids
- Can also use a different method, shown later

Query Expansion

(Figure.)

Query Expansion Experiments
- Real-world sample data: 600 dimensions and 750,000 tuples
- 0.05% of the data is used to simulate the "sample" (allows error checking)

(Figures: experimental results.)

Sampling Cube Shell: Handling the "Curse of Dimensionality"
- Real-world data may have > 600 dimensions
- Materializing the full sampling cube is unrealistic
- Solution: only compute a "shell" around the full sampling cube
- Method: selectively compute the best cuboids to include in the shell
  - Chosen cuboids will be low dimensional
  - Size and depth of the shell depend on user preference (space vs. accuracy tradeoff)
- Use the cuboids in the shell to answer queries

Sampling Cube Shell Construction
- Top-down, iterative, greedy algorithm
  1. Top-down: start at the apex cuboid and slowly expand to higher dimensions, following the cuboid lattice structure
  2. Iterative: add one cuboid per iteration
  3. Greedy: at each iteration, choose the best cuboid to add to the shell
- Stop when either the size limit is reached or it is no longer beneficial to add another cuboid

Sampling Cube Shell Construction (2)
- How to measure the quality of a cuboid?
- Cuboid Standard Deviation (CSD): a weighted average of the standard deviations of the samples in the cells c_i of the cuboid
  - n(c_i): # of samples in c_i; n(B): the total # of samples in B
  - small(c_i) returns 1 if n(c_i) ≥ min_sup and 0 otherwise
- Measures the amount of variance in the cells of a cuboid
- Low CSD indicates high correlation with the cube value

Sampling Cube Shell Construction (3)
- Goal: Cuboid Standard Deviation Reduction (CSDR)
  - Reduce the CSD to increase query information
  - Find the cuboids with the least amount of CSD
- Overall algorithm
  - Start with the apex cuboid and compute its CSD
  - Choose the next cuboid to build as the one that reduces the CSD the most from the apex
  - Iteratively choose more cuboids that reduce the CSD of their parent cuboids the most (greedy selection: at each iteration, choose the cuboid with the largest CSDR to add to the shell)
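The greedy loop can be sketched as: score each candidate cuboid by CSD and add the best until the budget runs out. The CSD weighting below (each cell's share of the samples times its standard deviation, skipping cells under min_sup) follows the slide's description; the exact formula in the paper may differ:

```python
import math

def csd(cells, n_total, min_sup):
    """Cuboid Standard Deviation sketch: weight each cell's sample
    standard deviation by its share of the samples, skipping cells
    with fewer than min_sup samples (small(ci) = 0)."""
    total = 0.0
    for samples in cells.values():
        n = len(samples)
        if n < max(min_sup, 2):      # need >= 2 points for a std dev
            continue
        mean = sum(samples) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
        total += (n / n_total) * sd
    return total

def greedy_shell(candidates, n_total, min_sup, budget):
    """Greedy shell selection: repeatedly take the candidate cuboid
    with the lowest CSD (the largest reduction) until the size
    budget is used up."""
    ranked = sorted(candidates,
                    key=lambda c: csd(candidates[c], n_total, min_sup))
    return ranked[:budget]

candidates = {
    "Age":        {"18": [10, 10, 10], "19": [20, 20, 20]},  # uniform cells
    "Occupation": {"x": [10, 20, 10], "y": [20, 10, 20]},    # noisy cells
}
print(greedy_shell(candidates, 6, 2, 1))   # -> ['Age']
```

`Age` wins because its cells have zero internal variance, i.e., it is highly correlated with the cube value.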

Computing the Sampling Cube Shell

(Figure.)

Query Processing
- If the query matches a cuboid in the shell, use it
- If the query does not match:
  1. A more specific cuboid exists: aggregate it to the query dimensions' level and answer the query
  2. A more general cuboid exists: use the value in its cell
  3. Multiple more general cuboids exist: use the most confident value

Query Accuracy

(Figure.)

Sampling Cube: Sampling-Aware OLAP
1. Confidence intervals in query processing
   - Integration with OLAP queries
   - Efficient algebraic query processing
2. Expand queries to increase confidence
   - Solves the sparse data problem
   - Inter-/intra-cuboid query expansion
3. Cube materialization with limited space
   - Sampling cube shell

Chapter Summary
- Efficient algorithms for computing data cubes
  - Multiway array aggregation
  - BUC
  - H-cubing
  - Star-cubing
  - High-dimensional OLAP technology
- Discovery-driven cube exploration

References (I)
- S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96
- D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97
- R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97
- K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. SIGMOD'99
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. VLDB'02
- G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining multi-dimensional constrained gradients in data cubes. VLDB'01
- J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute-oriented approach. VLDB'92
- J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD'01
- L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient cube: How to summarize the semantics of a data cube. VLDB'02

References (II)
- X. Li, J. Han, and H. Gonzalez. High-dimensional OLAP: A minimal cubing approach. VLDB'04
- X. Li, J. Han, Z. Yin, J.-G. Lee, and Y. Sun. Sampling cube: A framework for statistical OLAP over sampling data. SIGMOD'08
- K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97
- K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98
- S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98
- G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional OLAP data. VLDB'01
- D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing iceberg cubes by top-down and bottom-up integration. VLDB'03
- D. Xin, J. Han, Z. Shao, and H. Liu. C-Cubing: Efficient computation of closed cubes by aggregation-based checking. ICDE'06
- W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed cube: An effective approach to reducing data cube size. ICDE'02
- Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97

Explanation of Multi-Way Array Aggregation

(Figure sequence: chunks are scanned in the order a0b0, a0b1, a0b2, a0b3, a1b0, …, a3b3, each chunk spanning c0–c3.
- Scanning the a0b0 chunk accumulates partial aggregates (marked x) into the a0 row of plane AC and the b0 row of plane BC.
- Scanning a0b1 adds its aggregates (y); the AB cell a0b0 is now complete ("done with a0b0").
- Scanning a0b2 (z) completes a0b1; scanning a0b3 (u) completes a0b2.
- Scanning a1b0 completes a0b3, and with it the entire a0 row of plane AC ("done with a0c*").
- After the last chunk, a3b3, all of plane BC is also complete ("done with b*c*"), and the computation finishes.)

Memory Used
- A: 40 distinct values; B: 400 distinct values; C: 4000 distinct values
- Plane AB: needs 1 chunk (10 * 100 * 1)
- Plane AC: needs 4 chunks (10 * 1000 * 4)
- Plane BC: needs 16 chunks (100 * 1000 * 16)

Memory Used (alternative traversal order)
- A: 40 distinct values; B: 400 distinct values; C: 4000 distinct values
- Plane CB: needs 1 chunk (1000 * 100 * 1)
- Plane CA: needs 4 chunks (1000 * 10 * 4)
- Plane BA: needs 16 chunks (100 * 10 * 16)

H-Cubing Example

(Figure sequence: an H-tree over 4 tuples (root: 4; a1: 3, a2: 1; under a1: b1: 2, b2: 1; under a2: b1: 1; leaves c1: 1, c2: 1, c3: 1, c1: 1) is processed together with its H-table. For clarity the slides print the resulting virtual tree, not the real tree. The conditions and outputs are:
- Tree 1, condition ???: output (*,*,*): 4
- Project on C = c1 → Tree 1.1, condition ??c1: output (*,*,c1): 2
- Project ?b1c1 → Tree 1.1.1: output (*,b1,c1): 2
- Aggregate ?*c1 → Tree 1.1.2
- Aggregate ??* → Tree 1.2
- Project ?b1* → Tree 1.2.1: output (*,b1,*): 3 and (a1,b1,*): 2
- Aggregate ?** → Tree 1.2.2: output (a1,*,*): 3
- Finish.)

3. Explanation of Star Cubing

(Base cuboid ABCD over 5 tuples; iceberg condition: count >= 2. Values that cannot appear in any iceberg cell are collapsed into star nodes b*, c*, d*. Each figure showed the ABCD star-tree on the left and the cuboid child trees being built on the right; the animation is summarized below.)

Step 1: Begin a depth-first traversal of the ABCD-tree (root: 5, children a1: 3 and a2: 2) and create the first child tree, BCD: 5.

Step 2: Visit node a1: 3. Output the cell (a1,*,*): 3 and create child tree a1CD/a1: 3.

Step 3: Visit b*: 1, the first child of a1. Create child tree a1b*D/a1b*: 1.

Step 4: Visit c*: 1. Create child tree a1b*c*/a1b*c*: 1.

Step 5: Visit the leaf d*: 1 and propagate the measure into BCD, a1CD, and a1b*D.

Backtracking from c*: 1, mine subtree a1b*c*/a1b*c* (cuboid ABC/ABC): all its nodes are stars, so there is nothing to do; remove it. Likewise, mining a1b*D/a1b* (cuboid ABD/AB) outputs nothing, since all interior nodes are stars; remove it.

Step 6: Visit b1: 2, the second child of a1. Output (a1,b1,*): 2 and create child tree a1b1D/a1b1: 2.

Step 7: Visit c*: 2 (BCD's c* count aggregates to 3). Create child tree a1b1c*/a1b1c*: 2.

Step 8: Visit the leaf d*: 2 (the d* counts aggregate to 3 in BCD, a1CD, and a1b1D).

Backtracking out of a1's subtree, mine and remove a1b1c*/a1b1c*, a1b1D/a1b1, and a1CD/a1: in each case all interior nodes are stars, so there is nothing to do.

Step 9: Visit a2: 2. Output (a2,*,*): 2 and create child tree a2CD/a2: 2.

Step 10: Visit b*: 2 (BCD's b* count aggregates to 3). Create child tree a2b*D/a2b*: 2.

Step 11: Visit c3: 2 and then the leaf d4: 2, creating child tree a2b*c3/a2b*c3: 2 and propagating the counts (BCD gains c3: 2 and d4: 2; a2CD gains c3: 2 -> d4: 2).

Backtracking, mine a2b*c3/a2b*c3 and a2b*D/a2b*: nothing to do (star interior nodes); remove them.

Mine subtree ACD/A, i.e., a2CD/a2: 2 (c3: 2 -> d4: 2), recursively:
Recursive mining, step 1: create child tree a2D/a2: 2.
Recursive mining, step 2: visit c3: 2; output (a2,*,c3): 2 and create child tree a2c3/a2c3: 2.
Recursive mining, backtrack: visit the leaf d4: 2; output (a2,c3,d4): 2. As we backtrack, recursively mine the child trees the same way as before.

Finally, mine subtree BCD: 5 (b*: 3 with star children and c3: 2 -> d4: 2; b1: 2 -> c*: 2 -> d*: 2), which recursively produces the BC/BC, BD/B, and CD trees. You may do it as an exercise.

Finish
Output: (a1,*,*): 3, (a1,b1,*): 2, (a2,*,*): 2, (a2,*,c3): 2, (a2,c3,d4): 2, plus the BCD-tree patterns.
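The slides do not list the five base tuples. The sketch below uses a hypothetical relation chosen to be consistent with the star-tree (the values with support 1 are exactly those collapsed to b*, c*, d*) and checks, by a brute-force group-by rather than Star-Cubing itself, that the cells announced in the walkthrough have the stated counts:

```python
from itertools import product
from collections import Counter

# Hypothetical base tuples (not given in the slides), consistent with
# the star-tree: b2/b3/b4, c1/c2/c4, d1/d2/d3 each occur once and would
# be collapsed to b*, c*, d*.
rows = [("a1", "b2", "c4", "d1"),
        ("a1", "b1", "c1", "d2"),
        ("a1", "b1", "c2", "d3"),
        ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]

def iceberg_cells(rows, min_sup):
    """Brute force: count every generalization of every tuple
    (replace any subset of values by '*') and keep cells meeting min_sup."""
    counts = Counter()
    for row in rows:
        for mask in product([False, True], repeat=len(row)):
            counts[tuple("*" if m else v for v, m in zip(row, mask))] += 1
    return {c: n for c, n in counts.items() if n >= min_sup}

cells = iceberg_cells(rows, min_sup=2)
# The cells announced during the walkthrough:
for cell in [("a1", "*", "*", "*"), ("a1", "b1", "*", "*"),
             ("a2", "*", "*", "*"), ("a2", "*", "c3", "*"),
             ("a2", "*", "c3", "d4")]:
    print(cell, cells[cell])
```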

Efficient Computation of Data Cubes

General heuristics

Multi-way array aggregation

BUC

H-cubing

Star-Cubing

High-Dimensional OLAP

Computing non-monotonic measures

Compressed and closed cubes


Computing Cubes with Non-Antimonotonic Iceberg Conditions

J. Han, J. Pei, G. Dong, and K. Wang, "Efficient Computation of Iceberg Cubes With Complex Measures", SIGMOD'01.

Most cubing algorithms cannot efficiently compute cubes with non-antimonotonic iceberg conditions.

Example:

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50

How can such a constraint be pushed into iceberg cube computation?


Non-Anti-Monotonic Iceberg Condition

Anti-monotonic: if a cell fails the condition, every descendant cell fails it as well, so processing can be pruned.

The cubing query with avg is non-anti-monotonic: (Mar, *, *, 600, 1800) fails the HAVING clause, yet some of its descendants may still satisfy it.

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hld       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
...    ...   ...       ...      ...   ...

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50
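The failure of anti-monotonicity is easy to check numerically. A tiny sketch with hypothetical prices (not taken from the table above): the cell's average fails the threshold, yet a descendant obtained by drilling down passes it.

```python
# A cell's prices: the average fails AVG(price) >= 800 ...
cell = [100, 100, 1100, 1100]
assert sum(cell) / len(cell) < 800            # avg = 600, fails

# ... yet a descendant (a subset of the cell's records) passes,
# so failing cells cannot simply be pruned.
descendant = [1100, 1100]
assert sum(descendant) / len(descendant) >= 800   # avg = 1100, passes
```

By contrast, COUNT(*) >= 50 is anti-monotonic: a descendant never has more records than its parent.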


From Average to Top-k Average

Let (*, Van, *) cover 1,000 records; avg(price) is the average price of those 1,000 sales.

avg50(price) is the average price of the top-50 sales (the 50 sales with the highest prices).

Top-k average is anti-monotonic: if the top 50 sales in Van. have avg(price) <= 800, then the top 50 sales in Van. during Feb. must also have avg(price) <= 800.
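The anti-monotonicity of the top-k average can be sketched directly: a subcell's records are a subset of the cell's, so its top-k average can only be lower or equal (the data below is hypothetical and the function name illustrative; the property needs the subcell to hold at least k records, which the COUNT(*) >= 50 condition guarantees in the query).

```python
import random

def avg_top_k(prices, k):
    """Average of the k highest values (top-k average).
    If fewer than k values exist, average whatever is there."""
    top = sorted(prices, reverse=True)[:k]
    return sum(top) / len(top)

random.seed(0)
van = [random.randint(100, 1500) for _ in range(1000)]  # all Van sales
van_feb = van[:80]                                      # subcell: Feb only

# Anti-monotone: the subcell's top-50 average never exceeds the cell's.
assert avg_top_k(van_feb, 50) <= avg_top_k(van, 50)
```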

Binning for Top-k Average

Binning idea, for testing avg50(c) >= 800:

Large value collapsing: use a single sum and count to summarize all records with measure >= 800. If that count >= 50, there is no need to check the "small" records.

Small value binning: a group of bins; one bin covers a range, e.g., 600~800, 400~600, etc. Register a sum and a count for each bin.


Computing Approximate Top-k Average

Bin        Sum     Count
over 800   28000   20     } top 50
600~800    10600   15     }
400~600    15200   30
...        ...     ...

Approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952

The 15 records still missing from the top 50 are estimated at 600, the upper bound of the 400~600 bin. Since 952 >= 800, the cell may pass the HAVING clause.
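The estimate above can be sketched as a function over (sum, count) bins: take whole bins from the top until k records are covered, and charge the missing records at the next bin's upper bound (the function name and bin encoding are illustrative; the bin contents match the table):

```python
def approx_avg_top_k(bins, k):
    """Upper-bound estimate of the top-k average from binned data.
    `bins` is a list of (upper_bound, total_sum, count), sorted from the
    highest-value bin downward. Whole bins are consumed first; records
    still missing are estimated at the current bin's upper bound."""
    total, taken = 0.0, 0
    for upper, s, c in bins:
        if taken + c <= k:
            total += s                      # take the whole bin
            taken += c
        else:
            total += (k - taken) * upper    # estimate the remainder
            taken = k
        if taken == k:
            break
    return total / k

bins = [(float("inf"), 28000, 20),  # measure >= 800: one sum and count
        (800, 10600, 15),           # 600~800 bin
        (600, 15200, 30)]           # 400~600 bin
print(approx_avg_top_k(bins, 50))   # 952.0, as computed on the slide
```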

Weakened Conditions Facilitate Pushing

Accumulate quant-info per cell to compute average iceberg cubes efficiently:

Three pieces: sum, count, top-k bins
Use the top-k bins to estimate/prune descendants
Use the sum and count to consolidate the current cell

From weakest to strongest condition:
Approximate avg50(): anti-monotonic, and can be computed efficiently (weakest)
Real avg50(): anti-monotonic, but computationally costly
avg(): not anti-monotonic (strongest)


Computing Iceberg Cubes with Other Complex Measures

Key point: find a function which is weaker but ensures certain anti-monotonicity.

Examples:
Avg() ≤ v: use avg_k(c) ≤ v (bottom-k average)
Avg() ≥ v only (no count condition): use max(price) ≥ v
Sum(profit), where profit can be negative: p_sum(c) ≥ v if p_count(c) ≥ k; otherwise, sum_k(c) ≥ v
Others: conjunctions of multiple conditions


Efficient Computation of Data Cubes

General heuristics

Multi-way array aggregation

BUC

H-cubing

Star-Cubing

High-Dimensional OLAP

Computing non-monotonic measures

Compressed and closed cubes


Compressed Cubes: Condensed or Closed Cubes

W. Wang, H. Lu, J. Feng, and J. X. Yu, "Condensed Cube: An Effective Approach to Reducing Data Cube Size", ICDE'02.

Cube size is challenging even for an iceberg cube. Suppose 100 dimensions and only 1 base cell with count = 10: how many aggregate cells are there if count >= 10? All 2^100 - 1 of them, each with count 10.

Condensed cube: only need to store one cell, (a1, a2, ..., a100, 10), which represents all the corresponding aggregate cells. The key problem is efficient computation of the minimal condensed cube.

Closed cube: Dong Xin, Jiawei Han, Zheng Shao, and Hongyan Liu, "C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking", ICDE'06.


Exploration and Discovery in Multidimensional Databases: Cube-Gradient Analysis

Cube-Gradient (Cubegrade): analysis of changes of sophisticated measures in multi-dimensional spaces.

Query: changes of the average house price in Vancouver in '00 compared against '99
Answer: Apts in West went down 20%, houses in Metrotown went up 10%

Cubegrade problem, posed by Imielinski et al.:
Changes in dimensions -> changes in measures
Drill-down, roll-up, and mutation

From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes

Significantly more expressive than association rules: capture trends in user-specified measures.

Serious challenges:
Many trivial cells in a cube -> a "significance constraint" to prune trivial cells
Too many pairs of cells to enumerate -> a "probe constraint" to select a subset of cells to examine
Only interesting changes wanted -> a "gradient constraint" to capture significant changes


MD Constrained Gradient Mining

Significance constraint Csig: (cnt ≥ 100)
Probe constraint Cprb: (city = "Van", cust_grp = "busi", prod_grp = "*")
Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp) ≥ 1.3)

Probe cell: a cell satisfying Cprb; here c2. The pair (c3, c2) satisfies Cgrad: 2350/1800 ≈ 1.31 ≥ 1.3. (The pair (c4, c2) does not, since 2250/1800 = 1.25.)

       Dimensions                        Measures
cid  Yr  City  Cst_grp  Prd_grp   Cnt     Avg_price
c1   00  Van   Busi     PC        300     2100       (base cell)
c2   *   Van   Busi     PC        2800    1800       (aggregated cell)
c3   *   Tor   Busi     PC        7900    2350       (sibling of c2)
c4   *   *     busi     PC        58600   2250       (ancestor of c2)
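A quick check of the constraints against the cells of the table (the cell data is copied from the table; the helper names are illustrative, and this is a brute-force scan over all candidate pairs rather than the paper's pruned algorithm):

```python
cells = {
    "c1": {"Yr": "00", "City": "Van", "Cst_grp": "Busi", "Cnt": 300,   "Avg_price": 2100},
    "c2": {"Yr": "*",  "City": "Van", "Cst_grp": "Busi", "Cnt": 2800,  "Avg_price": 1800},
    "c3": {"Yr": "*",  "City": "Tor", "Cst_grp": "Busi", "Cnt": 7900,  "Avg_price": 2350},
    "c4": {"Yr": "*",  "City": "*",   "Cst_grp": "busi", "Cnt": 58600, "Avg_price": 2250},
}

def c_sig(cell):                    # significance constraint: cnt >= 100
    return cell["Cnt"] >= 100

def c_grad(cg, cp, ratio=1.3):      # gradient constraint on avg_price
    return cg["Avg_price"] / cp["Avg_price"] >= ratio

probe = cells["c2"]                 # the probe cell from the example
hits = [name for name, c in cells.items()
        if name != "c2" and c_sig(c) and c_grad(c, probe)]
print(hits)                         # only c3: 2350/1800 ≈ 1.31 >= 1.3
```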


Efficient Computation of Cube-gradients

The set of probe cells P is often very small. Use the probe cells P and the constraints to find gradients:
Pushing selection deeply
Set-oriented processing for probe cells
Iceberg growing from low to high dimensionalities
Dynamic pruning of probe cells during growth
Incorporating an efficient iceberg cubing method
