Data Warehousing and Mining

Roadmap

What is a Warehouse?
Warehouse Architecture
[Figure: layered diagram with Clients, Query & Analysis, Warehouse (with Metadata), Integration, and three Sources]
Why a Warehouse?
[Figure: two separate Sources and a question mark]
Query-Driven Approach
[Figure: two Clients, a Mediator, three Wrappers, and three Sources]
Advantages of Warehousing
Advantages of Query-Driven
OLTP vs. OLAP

OLTP: On-Line Transaction Processing. Describes processing at operational sites.
OLAP: On-Line Analytical Processing. Describes processing at the warehouse.
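
The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the helper names are hypothetical, and a single in-memory database stands in for what would normally be separate operational and warehouse systems.

```python
import sqlite3

# One in-memory database stands in for both systems (an assumption made
# for brevity; a real warehouse is usually a separate store).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, product TEXT, amount REAL)"
)

def record_sale(store, product, amount):
    """OLTP-style work: a short transaction that writes a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (store, product, amount) VALUES (?, ?, ?)",
            (store, product, amount),
        )

def total_by_store():
    """OLAP-style work: a read-only aggregate that scans many rows."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    return dict(rows)

record_sale("north", "tea", 4.0)
record_sale("north", "coffee", 6.0)
record_sale("south", "tea", 5.0)
print(total_by_store())
```

The write path touches one record at a time, while the analytical query summarizes the whole table.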
Data Marts

ROLAP vs. MOLAP

ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing
ROLAP
MOLAP
Implementing a Warehouse
Monitoring
Integrating
Processing
Managing
Design Issues
Tools required for:

Development: design & edit schemas, views, scripts, rules, queries, reports
Planning & Analysis: what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management: performance monitoring, usage patterns, exception reporting
System & Network Management: measure traffic (sources, warehouse, clients)
Workflow Management: reliable scripts for cleaning & analyzing data
Data Mining
Data Mining is:
The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
Examples of Large Datasets
WALMART: 20M transactions per day
MOBIL: 100 TB geological databases
AT&T: 300M calls per day
NASA, EOS project: 50 GB per hour
Examples of Data Mining Applications

Fraud detection: credit cards, phone cards
Marketing: customer targeting
Data Warehousing: Walmart
Astronomy
Molecular biology
How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results
The Data Mining Process

1. Understand the domain
2. Create a dataset: select the interesting attributes; data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2
Data Mining Tasks

Classification
Regression
Clustering
Dependencies and associations
Summarization
Data Mining Methods

1. Decision Tree Classifiers: used for modeling, classification
2. Association Rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.
Are All the Discovered Patterns Interesting?
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
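
As an illustration of the objective measures, the sketch below computes support and confidence for an association rule over a small set of transactions. The basket data and the function names are made up for this example; the definitions are the standard ones (support is the fraction of transactions containing an itemset, and confidence of A → B is support(A ∪ B) / support(A)).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0  # rule never fires; treat confidence as 0 by convention
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante

# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 2 of 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule is usually reported only when both values exceed user-chosen thresholds.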
Why Data Preprocessing?
Why can Data be Incomplete?
Why can Data be Noisy/Inconsistent?
Data Cleaning
Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases or files
Data transformation: normalization and aggregation
Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization: part of data reduction but with particular importance, especially for numerical data
How to Handle Missing Data?
How to Handle Noisy Data? Smoothing techniques
Simple Discretization Methods: Binning

Example: customer ages
[Figure: histograms of customer ages; y-axis: number of values]
Equi-width binning:
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Equi-depth binning:
0-22 22-31 32-38 38-44 44-48 48-55 55-62 62-80
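
The two binning schemes can be sketched as follows: equal-width bins split the value range into intervals of the same size, while equal-depth bins hold roughly the same number of values each. The `ages` list and the function names are illustrative assumptions, not data from the slides.

```python
def equi_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index one past the end, so clamp to k - 1.
        idx = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equi_depth_bins(values, k):
    """Split the sorted values into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical customer ages.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

print([len(b) for b in equi_width_bins(ages, 3)])  # counts differ per bin
print([len(b) for b in equi_depth_bins(ages, 3)])  # counts are equal
```

Equal-width bins are simple but can be skewed by outliers; equal-depth bins adapt their boundaries to the data distribution.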
Cluster Analysis
[Figure: scatter plot of salary against age, showing clusters and an outlier]
Regression
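Example of linear regression: y = x + 1
[Figure: fitted line with x (age) on the horizontal axis and y (salary) on the vertical axis; a point (X1, Y1) is marked]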
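
A minimal sketch of fitting such a line by ordinary least squares is given below. The data points are invented so that they lie exactly on y = x + 1; real age and salary data would be noisy, and the fitted line would then smooth the noise rather than pass through every point.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative points that satisfy y = x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [2, 3, 4, 5]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Replacing each observed y with the value predicted by the fitted line is one way to smooth noisy numeric data.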
Data Integration
Data Transformation
Normalization: Why normalization?
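
The slide does not name a specific normalization method. As one common choice, assumed here purely for illustration, min-max scaling maps each attribute onto a shared range such as [0, 1], so that attributes with large magnitudes (for example salary) do not dominate attributes with small ones (for example age) in distance-based methods. The function name and sample values are hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]

# Hypothetical salaries and ages on very different scales.
salaries = [10000, 30000, 50000, 90000]
ages = [20, 35, 50, 80]

print(min_max(salaries))
print(min_max(ages))
```

After scaling, both attributes lie in the same range and contribute comparably to a distance computation.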
Data Reduction Strategies
Data Compression
[Figure: Original Data is reduced to Compressed Data; lossless compression restores the Original Data exactly, otherwise only an approximation of the Original Data is recovered]
Histograms
[Figure: histogram; x-axis from 10000 to 90000, y-axis counts from 0 to 40]
Clustering
Sampling
[Figure: Raw Data alongside its Cluster/Stratified Sample]
The number of samples drawn from each cluster/stratum is proportional to its size. Thus, the sample represents the data better and outliers are avoided.
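
Proportional allocation can be sketched as below: each stratum contributes the same fraction of its rows, so larger strata supply more samples. The strata names, sizes, and the rule of taking at least one row from every non-empty stratum are assumptions made for this example.

```python
import random

def stratified_sample(strata, fraction, seed=0):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for name, rows in strata.items():
        # Assumption: keep at least one row from each non-empty stratum.
        k = max(1, round(len(rows) * fraction)) if rows else 0
        sample[name] = rng.sample(rows, k)
    return sample

# Hypothetical strata of different sizes.
strata = {
    "young": list(range(60)),
    "middle": list(range(30)),
    "senior": list(range(10)),
}

picked = stratified_sample(strata, 0.1)
sizes = {name: len(rows) for name, rows in picked.items()}
print(sizes)
```

Because every stratum is represented in proportion to its size, small groups are neither dropped nor over-weighted, which a simple random sample cannot guarantee.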
Example: Benefits for Healthcare Industry

Evidence-based medicine
Policymaking in public health
More value for money and cost saving
Early detection and/or prevention of disease
Prevention of hospital errors
Management of pandemic diseases
Non-invasive diagnosis and decision support
Adverse drug event
Example: Usage in Digital Media Industry

Ad Targeting
Yield Optimization
Ad Sales Analysis
Bid Price Optimization
Website Optimization
Attribution Analysis
Click Fraud Analysis
Network Usage Analysis
