
Data Warehousing and Mining

Roadmap

What is a Warehouse?


Warehouse Architecture

Figure: clients at the top run query and analysis tools against the warehouse and its metadata; an integration layer below the warehouse loads it from multiple sources.

Why a Warehouse?


Query-Driven Approach

Figure: clients send queries to a mediator; the mediator reaches each source through a wrapper.

Advantages of Warehousing

Advantages of Query-Driven

OLTP vs. OLAP


OLTP: On-Line Transaction Processing. Describes processing at operational sites.
OLAP: On-Line Analytical Processing. Describes processing at the warehouse.

OLTP vs. OLAP


OLTP

OLAP

Data Marts

ROLAP vs. MOLAP


ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing

ROLAP

MOLAP

Implementing a Warehouse

Monitoring

Integrating

Processing

Managing

Design Issues

Tools required for:

Development: design & edit schemas, views, scripts, rules, queries, reports

Planning & Analysis: what-if scenarios (schema changes, refresh rates), capacity planning

Warehouse Management: performance monitoring, usage patterns, exception reporting

System & Network Management: measure traffic (sources, warehouse, clients)

Workflow Management: reliable scripts for cleaning & analyzing data

Data Mining

Data Mining is:

The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets

The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Examples of Large Datasets

WALMART: 20M transactions per day

MOBIL: 100 TB geological databases

AT&T: 300M calls per day

NASA, EOS project: 50 GB per hour

Examples of Data Mining Applications


Fraud detection: credit cards, phone cards

Marketing: customer targeting

Data Warehousing: Walmart

Astronomy

Molecular biology

How Data Mining is used

Identify the problem
Use data mining techniques to transform the data into information
Act on the information
Measure the results

The Data Mining Process


1. Understand the domain

2. Create a dataset:
Select the interesting attributes
Data cleaning and preprocessing

3. Choose the data mining task and the specific algorithm

4. Interpret the results, and possibly return to 2

Data Mining Tasks


Classification

Regression

Clustering

Dependencies and associations

Summarization

Data Mining Methods


1. Decision Tree Classifiers: Used for modeling, classification
2. Association Rules: Used to find associations between sets of attributes
3. Sequential patterns: Used to find temporal associations in time series
4. Hierarchical clustering: Used to group customers, web users, etc.

Are All the Discovered Patterns Interesting?

Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
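
The objective measures named above, support and confidence, can be sketched in a few lines. This is a minimal illustration, not a full association-rule miner: the function names and the market-basket transactions are hypothetical, and the definitions used are the common ones (support = fraction of transactions containing the itemset; confidence = support of the whole rule divided by support of its left-hand side).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical market-basket data, made up for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

# Rule: bread -> milk
rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread transactions
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst; those thresholds are not given in the slides.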

Why Data Preprocessing?

Why can Data be Incomplete?

Why can Data be Noisy/Inconsistent?

Data Cleaning

Major Tasks in Data Preprocessing


Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration: Integration of multiple databases or files

Data transformation: Normalization and aggregation

Data reduction: Obtains reduced representation in volume but produces the same or similar analytical results

Data discretization: Part of data reduction but with particular importance, especially for numerical data
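
Two of the tasks above, filling in missing values and normalization, can be sketched as follows. The slides do not say which strategy to use, so this example assumes mean imputation and min-max scaling to [0, 1]; both are common choices, not the only ones. The function names and the sample ages are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values (an assumed strategy)."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing entry.
ages = [20, None, 40, 60]
cleaned = fill_missing_with_mean(ages)     # the None becomes the mean of 20, 40, 60
normalized = min_max_normalize(cleaned)
```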

How to Handle Missing Data?

How to Handle Noisy Data?

Smoothing techniques

Simple Discretization Methods: Binning


Example: customer ages (the figure plots the number of values in each bin)

Equi-width binning:

0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning:

0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
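
The difference between the two binning schemes can be sketched as below: equal-width bins split the value range into intervals of the same size, while equal-depth bins hold (roughly) the same number of values each. The function names and the list of ages are hypothetical and chosen only to make the contrast visible.

```python
def equi_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1) if width else 0
        bins[idx].append(v)
    return bins


def equi_depth_bins(values, k):
    """Split the sorted values into k bins holding about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical customer ages.
ages = [5, 18, 21, 22, 25, 30, 31, 35, 38, 44, 52, 80]
width_bins = equi_width_bins(ages, 3)   # intervals of width 25; bin sizes differ
depth_bins = equi_depth_bins(ages, 3)   # four values per bin; interval widths differ
```

With skewed data, equal-width bins can leave some bins nearly empty, which is one reason to consider equal-depth bins.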

Cluster Analysis
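
Figure: scatter plot of salary against age; nearby points form clusters, and points that fall outside every cluster are outliers.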

Regression
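
Example of linear regression: y = x + 1

Figure: fitted line with x (age) on the horizontal axis and y (salary) on the vertical axis; a point X1 is marked together with its value Y1.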
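
Smoothing by regression can be sketched with an ordinary least-squares line fit: fit a line to the points, then replace each observed y with the value on the line. This is a minimal sketch; the function name is hypothetical, and the sample points are constructed to lie exactly on y = x + 1 so the fit reproduces the line in the example above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]

slope, intercept = fit_line(xs, ys)
# Smoothed values: each y is replaced by the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
```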

Data Integration

Data Transformation

Normalization: Why normalization?

Data Reduction Strategies

Data Compression

Figure: original data is reduced to compressed data; lossless compression recovers the original data exactly, otherwise only an approximation of the original data is recovered.

Histograms

Figure: histogram with counts from 0 to 40 on the vertical axis and buckets at 10000, 30000, 50000, 70000 and 90000 on the horizontal axis.
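
As a data-reduction technique, a histogram stores one count per bucket instead of every value. A minimal sketch with equal-width buckets follows; the function name, the bucket width, and the salary values are hypothetical and are not taken from the figure.

```python
def histogram(values, bucket_width):
    """Count values per equal-width bucket, keyed by each bucket's lower bound."""
    counts = {}
    for v in values:
        start = (v // bucket_width) * bucket_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical salaries.
salaries = [12000, 18000, 31000, 35000, 39000, 52000, 90000]
buckets = histogram(salaries, 20000)
# Seven raw values are summarized by four (bucket, count) pairs.
```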

Clustering

Sampling

Figure: raw data and a cluster/stratified sample drawn from it.

The number of samples drawn from each cluster/stratum is proportional to its size. Thus the samples represent the data better and outliers are avoided.
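
Proportional stratified sampling can be sketched as below. This is one possible reading of the slide, under stated assumptions: the strata are already known, the same sampling fraction is applied to each, and every non-empty stratum contributes at least one record. The function name, the strata, and the fixed seed are hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=0):
    """Draw from each stratum a number of records proportional to its size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for name, members in strata.items():
        # Keep at least one record per non-empty stratum (an assumption of this sketch).
        k = max(1, round(len(members) * fraction)) if members else 0
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical strata of record ids, grouped by age band.
strata = {
    "young": list(range(100)),
    "middle": list(range(100, 150)),
    "senior": list(range(150, 160)),
}
picked = stratified_sample(strata, 0.1)
sizes = {name: len(ids) for name, ids in picked.items()}
```

A simple random sample of the same total size could miss the small "senior" stratum entirely; sampling per stratum guarantees it is represented.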


Example: Benefits for Healthcare Industry


Evidence-based medicine

Policymaking in public health

More value for money and cost saving

Early detection and/or prevention of disease

Prevention of hospital errors

Management of pandemic diseases

Non-invasive diagnosis and decision support

Adverse drug event

Example: Usage in Digital Media Industry


Ad Targeting

Yield Optimization

Ad Sales Analysis

Bid Price Optimization

Website Optimization

Attribution Analysis

Click Fraud Analysis

Network Usage Analysis
