
Data Integration

Techniques for Extracting, Transforming and Loading Data

Agenda
Module 1 - Source Data Analysis & Modeling
Module 2 - Data Capture Analysis & Design
Module 3 - Data Transformation Analysis & Design
Module 4 - Data Transport & Load Design
Module 5 - Implementation Guidelines

Module 1 - Source Data Analysis and Modeling

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Data Acquisition Concepts

What is Data Acquisition?
Data acquisition is the process of moving data to and within the warehousing environment
Goal oriented - not an isolated activity; like data modeling, it is driven by the goals and purpose of the data warehouse
Source Driven / Target Driven
Source Driven - the activities necessary for getting data into the warehousing environment from various sources
Target Driven - the activities for acquisition within the warehousing environment

Data Acquisition activities
Identifying the set of data needed
Choosing the extract approach
Extracting data from the source
Applying transformations
Loading data

[Diagram: external and operational data sources feed Data Intake (ETL) into Staging; Data Distribution (ETL) moves data into the Data warehouse; Information Delivery (ETL) populates Data marts for access.]

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Source Data Analysis - Scope of Data Sources

Data acquisition issues

Identifying and understanding data sources


Mapping source data to target data
Deciding which data to capture
Effectively and efficiently capturing the data, and
Determining how and when to transform the data

Acquisition process design must pay attention to:


Ensuring data quality
Efficiently loading warehouse databases
Using tools effectively
Planning for error recovery

[Diagram: scoping questions for data sources - which subjects (product, customer, finance, process, HR, organization)? which entities (customer, order, product)? which business events (receive order, ship order, cancel order)? which enterprise events (merger with.., acquisition of.., termination of..)? what kinds of data? how much history (2003, 2004, 2005)?]

Source Data Analysis - Identification of Sources

Types of sources
Operational Systems
Secondary Systems
Backups, Logs, and Archives
Shadow Systems
DSS/EIS Systems, and
External Data
On-Going versus Single Load Sources


Source Data Analysis - Evaluation of Sources

Qualifying criteria and assessment questions:

Availability

How available and accessible is the data?


Are there technical obstacles to access?
Or ownership and access authority issues?

Understandability

How easily understood is the data?


Is it well documented?
Does someone in the organization have depth of knowledge?
Who works regularly with this data?

Stability

How frequently do data structures change?


What is the history of change for the data?
What is the expected life span of the potential data source?

Accuracy

How reliable is the data?


Do the business people who work with the data trust it?

Timeliness

When and how often is the data updated?


How current is the data?
How much history is available?
How available is it for extraction?

Completeness

Does the scope of data correspond to the scope of the data warehouse?
Is any data missing?

Granularity

Is the source the lowest available grain (most detailed level) for this data?

Evaluation of Sources - Origin of Data

Original Point of Entry
The best practice technique is to evaluate the original point of entry: is this the very first place that the data is recorded anywhere within the business?
If yes, then you have found the original point of entry. If no, then the source may not be the original point of entry. Ask the follow-up question: can the element be updated in this file/table?
If not, then this is not the original point of entry. If yes, then the data element may be useful as a data warehousing source.

[Diagram: the CUSTOMER-NUMBER, CUSTOMER-NAME, and GENDER elements appear across the CLAIM, CUSTOMER, POLICY, and DRIVER files - which file is the point of origin?]

Evaluation of Sources - Origin of Data

Original Point of Entry - this practice has many benefits:
Data timeliness and accuracy are improved
Simplifies the set of extracts from the source system

Business System of Record
To what system do the business people go when they are validating results?
If the business identifies a system as the System of Record, it must be considered a probable warehousing data source

Data Stewardship
In organizations that have a data stewardship program, involve the data stewards

Evaluation of Sources - An Example

[Figure: a Source Data Store Matrix and a Source Data Element Matrix, illustrated with the CLAIM-NUMBER field.]

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Source Data Modeling - Overview of Warehouse Data Modeling

[Diagram: warehouse data modeling progresses through six model states, each with source-side and target-side artifacts; triage links the two sides.]

Contextual Models: business goals and drivers; information needs
Conceptual Models: source composition and source subjects (source side); warehousing subjects, business questions, facts and qualifiers, and targets configuration (target side)
Logical Models: integrated source data model (ERM); staging, warehouse, and mart ER models; data mart dimensional models
Structural Models: source data structure model; staging area structure, warehouse structure, relational and dimensional mart structures
Physical Models: source data file descriptions; staging, warehouse, and data mart physical designs
Functional Databases: source data files; implemented warehousing databases

Source Data Modeling


Source Data Modeling Objectives
A single logical model representing a design view of all source data within scope
An entity relationship model in 3rd normal form (a business model without implementation redundancies)
Traceability from logical entities to the specific data sources that implement those
entities
Traceability from logical relationships to the specific data sources that implement those
relationships
Verification that each logical attribute is identifiable in implemented data sources
Source Data Modeling Challenges
Many data sources do not have data models

Where data models exist, they are probably outdated and almost certainly not integrated

Many source structures are only documented in code (e.g. COBOL definitions of
VSAM files)

Sometimes multiple and conflicting file descriptions exist for a single data structure


Source Data Modeling - The Activities

[Flowchart: Contextual (scope) - business goals, business drivers, and information needs determine what kinds of data stores are in scope. Conceptual (analyze) - build the source composition model, while warehousing data modeling proceeds in parallel on the target side. For each source, ask whether a source model exists: if yes, validate the existing data model; if no, choose a modeling approach - top-down via a source subject model, or bottom-up from existing file descriptions. Logical (design) - integrate the results into the source logical model (ERM). Structural (specify) - document the structure of each data store as a matrix. Physical (optimize) - locate the existing file descriptions. Functional (implement) - locate and extract from the existing data stores.]

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Understand the Scope of Data Sources

[Flowchart detail: in the contextual and conceptual activities, business goals, drivers, and information needs scope the kinds of data stores; analyzing the source data yields the source composition model; for each source, subjects are identified, named, and associated, and either an existing data model is validated or a top-down/bottom-up modeling approach is chosen.]

The source composition model uses set notation to develop a subject area model
Classifies each source by the business subjects that it supports
Helps to understand:
which subjects have a robust set of sources
which sources address a broad range of business subjects
Helpful to plan, size, sequence and schedule development of the DW increments

Composition Subject Model - Example

[Diagram: a set-notation subject model. Subjects: CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, and MARKETPLACE. Sources placed within them: MIS customer table, APMS premium file, LIS policy file, RPS policy file, APMS policy master, MIS product table, CPS claim master, LIS claim file, CPS claim action file, CPS party file, CPS claim detail file, MIS auto marketplace table, and MIS residential marketplace table.]

Composition Subject Matrix - Example

[Figure: a matrix mapping each data source to the business subjects it supports.]

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Understanding Source Content - An Integrated View of Non-integrated Sources

Source Data Modeling
It is not within the charter of the warehousing program to redesign data sources
Understand the existing source data designs
Merge all of the designs into one representative model
The source logical data modeling process is not one of design but one of integration
To develop a logical source data model, you will need to integrate design information from multiple inputs, including:
Merging and integrating existing data models
Extending the subject model that represents any source
Extracting design structures from source data stores with reverse engineering techniques

Understanding Source Content - Using Existing Models

[Flowchart detail: when a source model exists, each existing data model is checked for currency and accuracy, validated, and the models are combined into a single model that is integrated into the source logical model (ERM).]

This modeling activity begins with collection of existing data models
Models must be validated to ensure accuracy and currency
Existing models jump-start the process
Merging models means identifying and resolving redundancy and conflict across the source models

Understanding Source Content - Working Top-Down

[Flowchart detail: when no source model exists and the top-down approach is chosen, modeling starts from the source subject model.]

Top-down steps:
Identify, name & describe entities
Identify, name & describe relationships
Identify, name & describe attributes
Map to data stores

Understanding Source Content - Working Bottom-Up

Derive the data model from the file descriptions
The source data element matrix serves as the tool to perform source data modeling
Source modeling and source assessment work well together and share the same set of documentation techniques

File         | Field           | Attribute (what fact?)    | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium | POLICY-NUMBER   | Unique Policy ID          | Yes       | POLICY                 |
APMS Premium | NAME            | Policy Holder Name        |           | CUSTOMER               |
APMS Premium | ADDRESS         | Policy Holder Address     |           | CUSTOMER               |
APMS Premium | PREMIUM-AMOUNT  | Cost of Policy Premium    |           | POLICY                 |
APMS Premium | POLICY-TERM     | Coverage Duration         |           | POLICY                 |
APMS Premium | BEGIN-DATE      | Start date of coverage    |           | POLICY                 |
APMS Premium | END-DATE        | End date of coverage      |           | POLICY                 |
APMS Premium | DISCOUNT-CD     | Identify kind of discount | Partial   | DISCOUNT               |
APMS Premium | SCHEDULE        | Basis of discount amt     |           | DISCOUNT               |
APMS Policy  | POLICY-NUMBER   | Unique Policy ID          | Yes       | POLICY                 |
APMS Policy  | CUSTOMER-NUMBER | Unique customer ID        | Yes       | CUSTOMER               | POLICY->CUSTOMER
APMS Policy  | VIN             | Vehicle ID number         | Yes       | VEHICLE                | POLICY->VEHICLE
APMS Policy  | MAKE            | Vehicle Manufacturer      |           | VEHICLE                |

Integrating Multiple Views - Resolving Redundancy & Conflict

[Process: resolve redundancy and conflict -> normalize -> verify model, across the model states.]

Examine sets of entities and relationships:
"customer places order" and "person places order" - customer and person are redundant
"customer places order" and "customer sends order" - places is redundant with sends
Examine sets of entities and attributes:
when differently named entities have a high degree of similarity in their sets of attributes

Understanding Source Content - Data Profiling: Looking at the Data

Look at the data to:
discover patterns
know how it is used
understand data quality
identify all data values

3 types of profiling (to classify and organize what you find):
Column profiling
Dependency profiling
Redundancy profiling
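
As an illustration, a minimal column-profiling sketch in Python follows; the file name and the use of pandas are assumptions for demonstration, not part of the original material.

```python
# A minimal column-profiling sketch (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("apms_policy_extract.csv", dtype=str)

for col in df.columns:
    values = df[col]
    print(f"--- {col} ---")
    print(f"null rate      : {values.isna().mean():.1%}")
    print(f"distinct values: {values.nunique()}")
    # The most frequent values reveal codes, defaults, and junk entries.
    print(values.value_counts().head(5))
```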


Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Logical to Physical Mapping

Tracing business data to physical implementation
Two-way connection:
What attribute is implemented by this field/column?
Which fields/columns implement this attribute?
Documenting the mapping
Extend the source data element matrix to include all data sources
Provides comprehensive documentation of source data, and detailed mapping from the business view to implemented data elements

Module 1 - Source Data Analysis & Modeling

Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data Sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills

Understanding the Source Systems - A Team Effort
Understanding the source systems feeding the warehousing environment is a critical success factor
All members of the warehousing team have a role in this effort
The acquisition (ETL) team is generally responsible to
document the source layouts
perform reverse engineering as needed
determine point of entry
identify system of record
and look at actual data values
The data modelers are likely to
create the single source logical model (using the inputs gathered by the acquisition team)
Business Analysts/representatives are involved to
look at the data and help understand the values
help to identify point of entry
and help to determine system of record


Module 2 - Data Capture Analysis and Design

Module 2 - Data Capture Analysis & Design

Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture

Data Capture Concepts - An Overview

Data capture: the activities involved in getting data out of sources
A synonym for data extraction

Data capture analysis - what to extract?
Performed to understand requirements for data capture:
Which data entities and elements are required by target data stores?
Which data sources are needed to meet target requirements?
What data elements need to be extracted from each data source?

Data capture design - when and how to extract?
Performed to understand and specify methods of data capture:
Timing of extracts from each data source
Kinds of data to be extracted
Occurrences of data (all or changes only) to be captured
Change detection methods
Extract technique (snapshot or audit trail)
Data capture method (push from source or pull from warehouse)

[Diagram: source -> Extract -> Transform -> Load -> target.]

Module 2 - Data Capture Analysis & Design

Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture

Source/Target Mapping - Mapping Objectives

The primary technique used to perform data capture analysis
Mapping source data to target data
Three levels of detail:
Entities
Data stores
Data elements

The terms source and target describe roles that a data store may play, not fixed characteristics of a data store

[Diagram: data availability flows forward from the source and data requirements flow back from the target; source/target mapping reconciles the two across Extract, Transform, and Load.]

Source and Target as Roles

The terms source and target describe roles that a data store may play, not fixed characteristics of a data store:
Data Intake - source: operational/external data; target: staging data
Data Distribution - source: staging data; target: data warehouse
Information Delivery - source: data warehouse; target: data marts

[Diagram: external and operational data -> Data Intake (ETL) -> Staging -> Data Distribution (ETL) -> Data warehouse -> Information Delivery (ETL) -> Data marts.]

Mapping Techniques

Logical data models: map source entities to target entities (e.g. customer, product, service)
Structural & physical models: map source data stores to target data stores; map source data elements to target data elements; design transformations

Source/Target Mapping: Examples

[Figure: examples of entity mapping, data store mapping, and data element mapping.]

Source/Target Mapping: Full Set of Data Elements

[Diagram: the full set of mapped data elements builds up across the work products - business questions, the data element matrix, file/table descriptions, the logical model, and the physical design all feed the source/target map. Elements are added by business questions, by triage, and by transform logic as transform design proceeds.]

Module 2 - Data Capture Analysis & Design

Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture

Source Data Triage


What to extract - opportunities
Source/target mapping analyzes need, triage analyzes opportunity
Triage is about extracting all data with potential value

What is Triage?
Source data structures are analyzed to determine the appropriate data elements for inclusion

Why Triage?
Ensure that a complete set of attributes is captured in the warehousing environment.
Rework is minimized

Triage and Acquisition


Performing triage is a joint effort between the acquisition team and the warehousing data modelers


The Triage Technique

Start from the source systems needed for the increment
Select the needed files
Identify elements addressing known business questions
Eliminate operational and redundant elements
Take all other business elements
The result is a first draft of the element mapping for the staging area or atomic DW

Module 2 - Data Capture Analysis & Design

Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture

Kinds of Data

Event data
Reference data (e.g. customer and membership records keyed by member number)
Source system metadata
Source system keys

Data Capture Methods

             | PUSH TO WAREHOUSE                        | PULL FROM SOURCE
ALL DATA     | Replicate source files/tables            | Extract source files/tables
CHANGED DATA | Replicate source changes or transactions | Extract source changes or transactions

Detecting Data Changes

How do we know which data has changed?

Detecting changes at the source:
source date/time stamps
source transaction files/logs
replicate source data changes
DBMS logs
compare back-up files

Detecting changes after extract (see the sketch below):
compare warehouse extract generations
compare warehouse extract to the source system

Module 2 - Data Capture Analysis & Design

Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture

Timing Issues

[Diagram: OLTP sources -> data extraction -> work tables -> data transformation -> intake layer -> warehouse loading -> data mart. Three timing decisions apply: the frequency of acquisition from the sources, the latency of load into the warehouse, and the periodicity of the data marts.]

Source System Considerations

[Diagram: OLTP sources -> data extraction -> work tables.]

How will I know when source systems fail?
How long will the data remain in a steady state?
When is the data ready in each source system? How will I know when it's ready?
How will I respond to source system failures?
How will I recover from a failure?

Handling Time Variance - Techniques and Methods

SNAPSHOT
Periodically posts records as of a specific point in time
Records all data of interest without regard to changes
Acquisition techniques to create snapshots:
DBMS replication
Full file unload or copy

AUDIT TRAIL
Records details of each change to data of interest
Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.
Acquisition techniques:
DBMS triggers
DBMS replication
Incremental selection
Full file unload/copy

The important distinction between snapshot and audit trail: with audit trail techniques, only changed data is extracted and loaded; with snapshot, all data is extracted and loaded, whether changed or not.
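
To make the audit trail concrete, here is a sketch of what a change record might carry; the field names are illustrative assumptions derived from the details listed above:

```python
# A hypothetical audit-trail change record.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ChangeRecord:
    business_key: str            # e.g. a policy or member number
    changed_at: datetime         # date and time of the change
    change_type: str             # "INSERT", "UPDATE", or "DELETE"
    detected_by: str             # e.g. "DBMS trigger", "log compare"
    before_value: Optional[str]  # data value before the change
    after_value: Optional[str]   # data value after the change
```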

Module 3 - Data Transformation Analysis & Design

Module 3 - Data Transformation Analysis & Design

Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes

Transformation Concepts - An Overview


Data Transformation
Changes that occur to the data after it is extracted
Transformation processing removes

Complexities of operational environments


Conflicts and redundancies of multiple databases
Details of daily operations
Obscurity of highly encoded data

Transformation Analysis
Integrate disparate data
Change granularity of data
Assure data quality

Transformation Design
Specifies the processing needed to meet the requirements that are determined by
transformation analysis
Determining kinds of transformations

Selection
Filtering
Conversion
Translation
Derivation
Summarization
Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools
and technology

Module 3 - Data Transformation Analysis & Design

Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes

Data Integration Requirements


Integration
Create a single view of the data
Integration & Staging Data
organize data by business subjects
ensure integrated identity through use of common, shared business keys
Integration & Warehouse data
implement data standards, including derivation of conformed facts and structuring
of conformed dimensions
ensure integration of internal identifiers - where staging integrates real-world keys, the warehouse needs to do the same for surrogate keys
Integration & Data marts
Intended to satisfy business-specific/department-specific requirements

Data Granularity Requirements


Granularity
Each change of data grain, from atomic data to progressively higher levels of summary, is achieved through transformation
Granularity & Staging Data
Staging data kept at atomic level
Granularity & warehouse data
In a 3 tier environment, warehouse should contain all common and standard
summaries
Granularity & Data marts
Derivation of summaries specific to individual needs


Data Quality Requirements


Data Cleansing
the process by which data quality needs are met
ranges from filtering bad data to replacing data values with some alternative default or derived values
Cleansing & Staging Data
the earlier the data is cleansed, the better the result
sometimes it is important for staging data to reflect what was contained in the source systems
Delay data cleansing transformation until data is moved from staging to warehouse
Keep both cleansed and un-cleansed data in the staging area
Cleansing & Warehouse data
data not cleansed in staging is cleansed before being loaded into the warehouse
Cleansing & Data marts
cleansing at data marts is not necessarily desirable; however, as a practical matter it may be necessary

Module 3 - Data Transformation Analysis & Design

Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes

Transformation Design - Approach

[Flow: transformation requirements -> identify transformation rules & logic -> determine transformation sequences -> specify transformation processes -> transformation specifications.]

Module 3 - Data Transformation Analysis & Design

Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes

Kinds of Transformations

This transformation type   | is used to
Selection                  | choose one source to be used among multiple possibilities
Filtering                  | choose a subset of rows from a source data table, or a subset of records from a source data file
Conversion and Translation | change the format of data elements
Derivation                 | create new data values, which can be inferred from the values of existing data elements
Summarization              | change data granularity, creating new values by aggregating existing data

Selection

Choose among alternative sources based upon selection rules.

[Diagram: extracted source #1 and extracted source #2 feed a Select step - sometimes data comes from source 1, sometimes from source 2 - producing the transformed target data.]

Example rule: if membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table.
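
A minimal sketch of this selection rule; the record layouts (dictionary keys) are assumptions for illustration:

```python
# Choose the member name from one of two possible sources.
def select_member_name(membership: dict, business_contact: dict) -> str:
    if membership["membership_type"] == "individual":
        return membership["member_name"]      # membership master file
    return business_contact["member_name"]    # business contact table
```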


Filtering

Eliminate some data from the target set of data based on filtering rules.

[Diagram: extracted source data passes through a Filter step; some rows or values are discarded before reaching the transformed target data.]

Example rule: if the last 2 digits of the policy number are 04, 27, 46, or 89, extract the data for the data mart; otherwise exclude the policy and all associated data.
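
A minimal sketch of this filtering rule; the sample policy numbers are hypothetical:

```python
# Keep only policies whose number ends in one of the sampled suffixes.
SAMPLE_SUFFIXES = {"04", "27", "46", "89"}

def keep_policy(policy_number: str) -> bool:
    return policy_number[-2:] in SAMPLE_SUFFIXES

extracted = ["100104", "233411", "558946", "770089"]
filtered = [p for p in extracted if keep_policy(p)]
print(filtered)   # ['100104', '558946', '770089']
```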


Conversion

Change data content and/or format based on conversion rules.

[Diagram: extracted source data passes through a Convert step - the value/format going in differs from the value/format coming out.]

Example rule: for policy history prior to 1994, reformat from Julian date to YYYYMMDD format; default the century to 19.
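
A minimal sketch of this conversion rule, assuming the Julian dates arrive as YYDDD strings:

```python
# Convert a Julian YYDDD date to YYYYMMDD, defaulting the century to 19.
from datetime import datetime, timedelta

def julian_to_yyyymmdd(julian: str) -> str:
    yy, ddd = int(julian[:2]), int(julian[2:])
    year = 1900 + yy                                  # century defaulted to 19
    date = datetime(year, 1, 1) + timedelta(days=ddd - 1)
    return date.strftime("%Y%m%d")

print(julian_to_yyyymmdd("93032"))   # day 32 of 1993 -> 19930201
```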


Translation

Decode data whose values are encoded, based on rules for translation.

[Diagram: extracted source data with encoded values passes through a Translate step; both the encoded and the decoded value come out to the transformed target data.]

Example rule: if membership-type-code is C, translate to Business; if membership-type-code is P, blank, or null, translate to Individual; otherwise translate to Unknown.
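
A minimal sketch of this translation rule:

```python
# Decode the membership type code into a readable value.
def translate_membership_type(code):
    """C -> Business; P, blank, or null -> Individual; otherwise Unknown."""
    if code == "C":
        return "Business"
    if code is None or code.strip() in ("P", ""):
        return "Individual"
    return "Unknown"

print(translate_membership_type("C"))    # Business
print(translate_membership_type(None))   # Individual
print(translate_membership_type("X"))    # Unknown
```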


Derivation

Use existing data values to create new data based on derivation rules.

[Diagram: extracted source data passes through a Derive step; new data values are created - more values come out than went in.]

Example rule: total premium cost = base-premium-amount + (sum of all additional coverage amounts) - (sum of all discount amounts).
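
A minimal sketch of this derivation rule; the record layout and sample values are assumptions:

```python
# Derive total premium cost from existing premium, coverage, and discounts.
def total_premium_cost(policy: dict) -> float:
    return (policy["base_premium_amount"]
            + sum(policy["additional_coverage_amounts"])
            - sum(policy["discount_amounts"]))

policy = {"base_premium_amount": 480.00,
          "additional_coverage_amounts": [35.00, 12.50],
          "discount_amounts": [25.00]}
print(total_premium_cost(policy))   # 502.5
```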


Summarization

Change data granularity based on rules of summarization.

[Diagram: atomic or base data goes into a Summarize step; summary data comes out to the transformed target data.]

Example rules: for each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions))); then for each week (sum the daily transaction count, sum the daily dollar total).
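
A minimal sketch of these summarization rules using pandas; the file and column names are assumptions. Note that the weekly summary is built from the daily summary, which illustrates the grain dependency discussed later in this module:

```python
# Summarize atomic transactions to daily, then weekly, grain.
import pandas as pd

sales = pd.read_csv("store_transactions.csv", parse_dates=["txn_date"])

# Daily grain: per store, per product line, per day.
daily = (sales.groupby(["store_id", "product_line", "txn_date"])
              .agg(txn_count=("txn_id", "count"),
                   dollar_total=("amount", "sum"))
              .reset_index())

# Weekly grain is derived from the daily summary, not from atomic data.
weekly = (daily.groupby(["store_id", "product_line",
                         pd.Grouper(key="txn_date", freq="W")])
               .agg(txn_count=("txn_count", "sum"),
                    dollar_total=("dollar_total", "sum"))
               .reset_index())
```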


Identifying Transformation Rules

For any source-to-target data element association, what needs exist for:
selection?
filtering?
conversion?
translation?
derivation?
summarization?

Specifying Transformation Rules

In the source/target map, cells expand to identify transformations by type and name - e.g. cleansing DTR027 (default value), derivation DTR008 (derive name). The transformation logic is documented separately:

DTR027 (Default Membership Type)
If membership-type is null or invalid, assume family membership

DTR008 (Derive Name)
If membership-type is family, separate the name using the comma:
insert the characters prior to the comma in customer-last-name
insert the characters after the comma in customer-first-name
else move name to customer-biz-name
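
A minimal sketch of the two rules above implemented as separately documented transformation modules; the record layout and the set of valid types are assumptions:

```python
# DTR027 and DTR008 as small, separately testable modules.
VALID_TYPES = {"family", "individual", "business"}

def dtr027_default_membership_type(rec: dict) -> dict:
    """DTR027: if membership-type is null or invalid, assume family."""
    if rec.get("membership_type") not in VALID_TYPES:
        rec["membership_type"] = "family"
    return rec

def dtr008_derive_name(rec: dict) -> dict:
    """DTR008: split family names at the comma; others become biz names."""
    if rec["membership_type"] == "family":
        last, _, first = rec["name"].partition(",")
        rec["customer_last_name"] = last.strip()
        rec["customer_first_name"] = first.strip()
    else:
        rec["customer_biz_name"] = rec["name"]
    return rec

# Rule dependency: DTR008 relies on DTR027 having run first.
record = dtr008_derive_name(dtr027_default_membership_type(
    {"membership_type": None, "name": "Smith, Jane"}))
print(record["customer_first_name"], record["customer_last_name"])
```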

Module 3 - Data Transformation Analysis & Design

Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes

Dependencies and Sequences


Time Dependency - when one transformation rule must execute before another
example: summarization of derived data cannot occur before the derivation
Rule Dependency - when execution of a transformation rule is based upon the result of another rule
example: different translations occur depending on the source chosen by a selection rule
Grain Dependency - when developing one level of summary is based on the results of a previous summarization
example: quarters can't be summarized annually before months are summarized on a quarterly basis

Dependencies and Sequences

[Diagram: specify selection -> specify filtering -> specify conversion & translation -> specify derivation -> specify summarization.]

1. Identify the transformation rules
2. Understand rule dependency - package as modules
3. Understand time dependency - package as processes
4. Validate and define the test plan

Modules and Programs

[Diagram: the transformation rules (e.g. DTR027 Default Membership Type, DTR008 Derive Name) and the dependencies among the rules determine the structures of the modules, programs, scripts, etc.]

Job Streams & Manual Procedures - Completing the ETL Design

[Diagram: the transformation rules and their implementation, together with the extract & load dependencies, are organized into job streams; scheduling, execution, restarts, verification, and communication are handled by automated & manual procedures.]

Module 4 - Data Transportation & Loading Design

Module 4 - Data Transport & Load Design

Data Transport and Load Concepts
Data Transport Design
Database Load Design

Overview

[Diagram: source data -> Extract -> Transform -> Load -> target data. Data transport moves the data from one platform to another; the database load writes it into the target database.]

Module 4 - Data Transport & Load Design

Data Transport and Load Concepts
Data Transport Design
Database Load Design

Data Transport Issues

which platforms?
data volumes?
transport frequency?
network capacity?
ASCII vs EBCDIC?
data security?
transport methods?

[Diagram: data transport sits between Extract and the downstream Transform and Load steps.]

Data Transport Techniques

Open FTP
Secure FTP
Alternatives to FTP
Data compression
Data encryption
ETL tools

[Diagram: data transport sits between Extract and the downstream Transform and Load steps.]
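
A minimal sketch combining two of these techniques - compression and secure FTP - using the Python standard library; the host, credentials, and file names are placeholders:

```python
# Compress an extract file, then push it over FTPS.
import gzip
import shutil
from ftplib import FTP_TLS

# Compression cuts transport volume on constrained networks.
with open("claims_extract.dat", "rb") as src, \
     gzip.open("claims_extract.dat.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Secure FTP (FTPS) encrypts both the credentials and the data in flight.
ftps = FTP_TLS("warehouse.example.com")
ftps.login("etl_user", "secret")
ftps.prot_p()                        # switch the data channel to TLS
with open("claims_extract.dat.gz", "rb") as f:
    ftps.storbinary("STOR claims_extract.dat.gz", f)
ftps.quit()
```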


Module 4 - Data Transport & Load Design

Data Transport and Load Concepts
Data Transport Design
Database Load Design

Database Load Issues

which DBMS?
relational vs dimensional?
tables & indices?
load frequency?
load timing?
data volumes?
exception handling?
restart & recovery?
load methods?
referential integrity?

[Diagram: the database load is the final step of Extract -> Transform -> Load into the target data.]

Populating Tables
Drop and rebuild the tables
Insert (only) rows into a table
Delete old rows and insert changed rows


Indexing

[Diagram: the load step maintains the indices on the tables.]

update indices at load?
drop & rebuild?
index segmentation?

Updating

Allow updating of rows in tables?

Isn't the warehouse read-only?
Updating business data
Updating row-level metadata

Referential Integrity
RI is the condition where every reference to another table has a foreign key/primary key match.
Three common options for RI:
DBMS checking
Test load files before load using a tool/custom application
Test database(s) after load using a tool/custom application
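
A minimal sketch of the second option - testing a load file before the load - as an anti-join against the parent table's keys; the table, column, and file names are assumptions:

```python
# Find load-file rows whose foreign key has no matching parent row.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
known_customers = {row[0] for row in
                   conn.execute("SELECT customer_nbr FROM customer_dim")}

orphans = []
with open("policy_load.csv", newline="") as f:
    for rec in csv.DictReader(f):
        if rec["customer_nbr"] not in known_customers:
            orphans.append(rec)   # suspend rather than load

print(f"{len(orphans)} rows reference a missing customer")
```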


Timing Considerations
User Expectations
Data Readiness
Database synchronization


Exception Processing

[Diagram: records flowing from Transform to Load are checked; rows that are ok load into the target data, while exceptions are suspended, reported, logged, or discarded.]

Integrating with ETL Processes

[Diagram: the extract, transform, and load steps are tied together by scheduling, dependencies (including restart/recovery and communication dependencies), execution verification, restarts, process metadata, parallel processing, and tool capabilities.]

Loading as part of a single transform & load job stream
Loads triggered by completion of the transform job stream
Loads triggered by verification of transforms
Parallel ETL processing
Loading partitions
Updating summary tables

Module 5 - Implementation Guidelines

Module 5 - Implementation Guidelines

Data Acquisition Technology
ETL Summary

Technology in Data Acquisition

ETL Technology:
Data mapping
Data transformation
Data conversion
Storage management
Metadata management
Data movement
Data cleansing

[Diagram: ETL technology spans from the source systems through data access, database loading, and database management.]

ETL - Critical Success Factors

1. Design for the future, not for the present
2. Capture and store only changed data
3. Fully understand source systems and data
4. Allow enough time to do the job right
5. Use the right sources, not the easy ones
6. Pay attention to data quality
7. Capture comprehensive ETL metadata
8. Test thoroughly and according to a test plan
9. Distinguish between one-time and ongoing loads
10. Use the right technology for the right reasons
11. Triage source attributes
12. Capture atomic level detail
13. Strive for subject orientation and integration
14. Capture history of changes in audit trail form
15. Modularize ETL processing
16. Ensure that business data is non-volatile
17. Use bulk loads and/or insert-only processing
18. Complete subject orientation and integration
19. Use the right data structures (relational vs. dimensional)
20. Use shared transformation rules and logic
21. Design for distribution first, then for access
22. Fully understand each unique access need
23. Use DBMS update capabilities
24. Design for access before other purposes
25. Design for access tool capabilities
26. Capture quality metadata and report data quality

[Matrix: each success factor is checked against the data store roles (intake, distribution, information delivery) and the data transformation roles (integration, granularity, cleansing) to which it applies.]

Exercises
Exercise 1: Source Data Options
Exercise 2: Source Data Modeling
Exercise 3: Data Capture
Exercise 4: Data Transformation
Exercise 5: Data Acquisition Decision

