Agenda
Module 1 - Source Data Analysis & Modeling
Module 2 - Data Capture Analysis & Design
Module 3 - Data Transformation Analysis & Design
Module 4 - Data Transport & Load Design
Module 5 - Implementation Guidelines
Module 1
Source Data Analysis and
Modeling
[Diagram: ETL processes feed a set of data marts; an Access layer provides delivery from the data marts]
Entities
  Customer, Order, Product..
Subjects?
  Product, Customer, Finance, Process, HR, Organization
Business Events
  Receive order, Ship order, Cancel order
Kinds of data?
History?
Enterprise Events (timeline: 2003, 2004, 2005)
  Merger with.., Acquisition of.., Termination of..
Assessment Questions
Availability
Understandability
Stability
Accuracy
Timeliness
Completeness
Does the scope of data correspond to the scope of the data warehouse?
Is any data missing?
Granularity
Is the source the lowest available grain (most detailed level) for this data?
December 14, 2010
[Source Data Store Matrix example: fields such as CUSTOMER-NUMBER, CUSTOMER-NAME and GENDER mapped against the stores that hold them (CLAIM, CUSTOMER, POLICY, DRIVER)]
Point of Origin?
[Source Data Element Matrix example: FIELD column beginning with CLAIM-NUMBER]
Models
Warehousing subjects
Business Questions
Facts and Qualifiers
Targets Configuration
Source composition
Source subjects
Integrated Source
Data Model (ERM)
Triage
Conceptual
Models
Implemented warehousing
databases
Logical
Models
Structural
Models
Physical
Models
Functional
Databases
Where data models exist, they are probably outdated and almost certainly not integrated
Many source structures are only documented in code (e.g. COBOL definitions of
VSAM files)
Sometimes multiple and conflicting file descriptions exist for a single data structure
Contextual: business drivers and information needs (scope) determine what kinds of data stores are involved
Conceptual: analyze source data to produce the source composition model; business goals drive warehousing data to target modeling
For each source, ask: does a source model exist?
  yes: validate the existing data model
  no: choose a modeling approach (top-down or bottom-up) and develop a source subject model
Logical: integrate (design) the models into the Source Logical Model (ERM)
Structural: specify the structure of each data store (matrix)
Physical: optimize using existing file descriptions; locate and extract
Functional: implement
Associate
Subjects
Source composition model uses set notation to develop a subject area model
Classifies each source by the business subjects that it supports
Helps to understand
which subjects have a robust set of sources
which sources address a broad range of business subjects
Helpful to plan, size, sequence and schedule development of the DW increments
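The source-by-subject classification above can be sketched in code. This is an illustrative example, not part of the course material: the source names and subject sets are hypothetical, and the coverage report shows which subjects have more than one supporting source.

```python
# Source composition sketch: classify each source by the business subjects
# it supports, then invert the map to see subject coverage.
SOURCE_SUBJECTS = {
    "APMS premium file": {"REVENUE", "POLICY"},
    "CPS claim master": {"CLAIM", "EXPENSE"},
    "CPS party file": {"PARTY", "ORGANIZATION"},
    "MIS product table": {"POLICY", "MARKETPLACE"},
}

def subject_coverage(source_subjects):
    """Invert the source -> subjects map into subject -> sources."""
    coverage = {}
    for source, subjects in source_subjects.items():
        for subject in subjects:
            coverage.setdefault(subject, set()).add(source)
    return coverage

coverage = subject_coverage(SOURCE_SUBJECTS)
# Subjects with a robust set of sources (more than one source):
robust = {s for s, srcs in coverage.items() if len(srcs) > 1}
```

A report like this helps sequence DW increments: subjects with many sources need more integration work, while subjects with a single source are quicker wins.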
REVENUE
APMS premium file
POLICY
CLAIM
MIS product table
EXPENSE
INCIDENT
CPS claim master
LIS claim file
CPS claim
action file
CPS party file
ORGANIZATION
PARTY
MARKETPLACE
MIS auto
Marketplace table
MIS residential
Marketplace table
Conceptual: source composition model
For each source, ask: does a source model exist?
  yes: validate the existing data model
  no: choose a modeling approach (top-down or bottom-up) and develop a source subject model
Logical: combine the existing data models into a single model; integrate (design) to produce the Source Logical Model (ERM)
Source Data Element Matrix (example)

Source file   | Field           | Attribute (what fact?)    | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium  | POLICY-NUMBER   | Unique Policy ID          | Yes       | POLICY                 |
APMS Premium  | NAME            | Policy Holder Name        |           | CUSTOMER               |
APMS Premium  | ADDRESS         | Policy Holder Address     |           | CUSTOMER               |
APMS Premium  | PREMIUM-AMOUNT  | Cost of Policy Premium    |           | POLICY                 |
APMS Premium  | POLICY-TERM     | Coverage Duration         |           | POLICY                 |
APMS Premium  | BEGIN-DATE      | Start date of coverage    |           | POLICY                 |
APMS Premium  | END-DATE        | End date of coverage      |           | POLICY                 |
APMS Premium  | DISCOUNT-CD     | Identify kind of discount | Partial   | DISCOUNT               |
APMS Premium  | SCHEDULE        | Basis of discount amt     |           | DISCOUNT               |
APMS Policy   | POLICY-NUMBER   | Unique Policy ID          | Yes       | POLICY                 |
APMS Policy   | CUSTOMER-NUMBER | Unique customer ID        | Yes       | CUSTOMER               | POLICY -> CUSTOMER
APMS Policy   | VIN             | Vehicle ID number         | Yes       | VEHICLE                | POLICY -> VEHICLE
APMS Policy   | MAKE            | Vehicle Manufacturer      |           | VEHICLE                |
Model States
Normalize
Verify Model
3 types of profiling:
Column profiling
Dependency profiling
Redundancy profiling
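Column profiling, the first of the three, can be sketched as follows. This is an illustrative stdlib-only example (the field names are hypothetical): for each column it counts rows, nulls, and distinct values, the basic inputs for assessing completeness and spotting candidate keys.

```python
# Minimal column profiling: per-column row count, null count, distinct count.
def profile_columns(rows):
    """rows: list of dicts sharing the same keys (one dict per record)."""
    profile = {}
    for row in rows:
        for col, val in row.items():
            stats = profile.setdefault(col, {"count": 0, "nulls": 0, "values": set()})
            stats["count"] += 1
            if val is None:
                stats["nulls"] += 1
            else:
                stats["values"].add(val)
    return {col: {"count": s["count"], "nulls": s["nulls"],
                  "distinct": len(s["values"])}
            for col, s in profile.items()}

sample = [
    {"POLICY-NUMBER": "P1", "GENDER": "F"},
    {"POLICY-NUMBER": "P2", "GENDER": None},
    {"POLICY-NUMBER": "P3", "GENDER": "F"},
]
stats = profile_columns(sample)
# A column whose distinct count equals its row count is a candidate key.
```

Dependency and redundancy profiling build on the same idea, comparing value sets across columns and across tables respectively.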
Module 2
Data Capture Analysis and
Design
source → Extract → Transform → Load → target
Data availability (from the source) and data requirements (from the target) meet in the source/target mapping.
ETL → Staging
Data Distribution: ETL → Data warehouse
Information Delivery: ETL → Data marts
Mapping Techniques
customer
service
Logical data
models
design transformations
physical models
DATA STORE MAPPING
DATA ELEMENT MAPPING
  Inputs: business questions; data element matrix; file/table descriptions; logical models
  Source/target mapping; physical design: source/target map
  Elements added by triage (triage); elements added by transform logic (transform design)
What is Triage?
Source data structures are analyzed to determine the appropriate data elements for inclusion
Why Triage?
Ensures that a complete set of attributes is captured in the warehousing environment
Minimizes rework
Kinds of Data
Event Data
Reference Data
Source system metadata
PUSH TO WAREHOUSE
  All data: replicate source files/tables
  Changed data: replicate source changes or transactions
PULL FROM SOURCE
  All data: extract source files/tables
  Changed data: extract source changes or transactions
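When the source offers neither triggers nor a transaction log, pull-style change capture often falls back to snapshot comparison: diff the current full extract against the previous one. A minimal sketch, with hypothetical keys and row values:

```python
# Snapshot-diff change capture: classify rows as inserts, updates, or deletes
# by comparing two full extracts keyed by primary key.
def diff_snapshots(old, new):
    """old, new: dicts mapping primary key -> row contents."""
    inserts = [k for k in new if k not in old]
    deletes = [k for k in old if k not in new]
    updates = [k for k in new if k in old and new[k] != old[k]]
    return inserts, updates, deletes

previous = {"P1": ("Smith", 100), "P2": ("Jones", 200)}
current = {"P1": ("Smith", 150), "P3": ("Brown", 300)}
ins, upd, dele = diff_snapshots(previous, current)
```

This trades extract volume (two full copies) for source simplicity, which is why the push/replication options above are preferred when the source supports them.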
Timing issues
OLTP
Frequency of
Acquisition
Sources
Data
Extraction
Data
Transformation
Work
Tables
Warehouse
Loading
Latency of
Load
Intake
layer
Periodicity
Of
Data Marts
Data
Mart
Sources
Data
Extraction
Work
Tables
AUDIT TRAIL
Records details of each change to data of interest
Details may include date and time of change, how the change was detected, reason for
change, before and after data values, etc.
Acquisition techniques
DBMS triggers
DBMS replication
Incremental selection
Full file unload/copy
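Incremental selection is typically driven by a "high-water mark" persisted between runs. A sketch using an in-memory SQLite table (the table and column names are hypothetical):

```python
# Incremental selection: pull only rows changed since the last extract,
# tracked by a high-water-mark timestamp.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (id TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO policy VALUES (?, ?)",
                 [("P1", "2010-12-01"), ("P2", "2010-12-10")])

def incremental_extract(conn, high_water_mark):
    """Return rows changed after the mark, plus the new mark to persist."""
    cur = conn.execute(
        "SELECT id, updated_at FROM policy WHERE updated_at > ? ORDER BY updated_at",
        (high_water_mark,))
    rows = cur.fetchall()
    new_mark = rows[-1][1] if rows else high_water_mark
    return rows, new_mark

rows, mark = incremental_extract(conn, "2010-12-05")
```

Note this only sees inserts and updates that stamp `updated_at` reliably; deletes and untimestamped changes need one of the other techniques (triggers, replication, or full unload/compare).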
Module 3
Data Transformation Analysis &
Design
Transformation Analysis
Integrate disparate data
Change granularity of data
Assure data quality
Transformation Design
Specifies the processing needed to meet the requirements that are determined by
transformation analysis
Determining kinds of transformations
Selection
Filtering
Conversion
Translation
Derivation
Summarization
Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools
and technology
Determine transformation
sequences
Specify transformation
process
transformation specifications
Kinds of transformations

This transformation type | is used to
Selection     | choose among alternative sources based upon selection rules (Extracted Source #1 / Extracted Source #2, sometimes from source 1 → Select → Transformed target data)
Filtering     | eliminate some data from the target set of data based on filtering rules (Extracted source data → Filter → Transformed target data)
Conversion    | change data content and/or format based on conversion rules (Extracted source data → Convert → Transformed target data)
Translation   | encode values based on translation rules (Extracted source data → Translate → Transformed target data)
Derivation    | use existing data values to create new data based on derivation rules (Extracted source data → Derive → Transformed target data)
Summarization | change the granularity of atomic or base data based on rules of summarization (Extracted source data → Summarize → Transformed target data)
for each store (for each product line (for each day (count the
number of transactions, accumulate the total dollar value of the
transactions)))
for each week (sum daily transaction count, sum daily dollar
total)
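The nested summarization above can be sketched in Python. The transaction tuples are illustrative; the point is the two levels of granularity, daily aggregates per store and product line, then a weekly rollup over the daily figures:

```python
# Two-level summarization: daily counts and dollar totals per (store,
# product line, day), then a weekly rollup of the daily aggregates.
from collections import defaultdict

transactions = [  # (store, product_line, day, dollars) -- sample data
    ("S1", "auto", "2010-12-13", 100.0),
    ("S1", "auto", "2010-12-13", 50.0),
    ("S1", "auto", "2010-12-14", 25.0),
]

daily = defaultdict(lambda: [0, 0.0])  # key -> [count, dollar total]
for store, line, day, dollars in transactions:
    daily[(store, line, day)][0] += 1
    daily[(store, line, day)][1] += dollars

# Weekly rollup: sum the daily transaction counts and dollar totals
# (all sample days fall within one week here).
week_count = sum(c for c, _ in daily.values())
week_total = sum(t for _, t in daily.values())
```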
If membership-type is family
    separate name using comma
    insert characters prior to comma in customer-last-name
    insert characters after comma in customer-first-name
else
    move name to customer-biz-name
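The name-splitting rule above can be sketched as a Python transformation. The field names (`customer-last-name`, `customer-first-name`, `customer-biz-name`) follow the rule as stated but are otherwise assumed:

```python
# Derivation sketch: split "Last, First" names for family memberships;
# route anything else to the business-name field.
def transform_name(membership_type, name):
    record = {}
    if membership_type == "family" and "," in name:
        last, first = name.split(",", 1)  # separate name using the comma
        record["customer-last-name"] = last.strip()
        record["customer-first-name"] = first.strip()
    else:
        record["customer-biz-name"] = name
    return record
```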
Specify selection
Specify filtering
Transformation Rules
Dependencies among rules
Structures of Modules,
Programs, Scripts, etc.
scheduling
execution
restarts
verification
communication
Module 4
Data Transportation & Loading
Design
Overview
Source Data → Extract → data transport → Transform → Load → database load → Target Data
Extract
which platforms?
data volumes?
transport frequency?
network capacity?
ASCII vs EBCDIC
data security?
transport methods?
Load
data transport
Transform
Target Data
Extract
Open FTP
Secure FTP
Alternatives to FTP
Data compression
Data Encryption
ETL tools
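Data compression before transport can be sketched with the standard library. This illustrative example gzips a flat extract file before the FTP (or other) transfer; the field layout is hypothetical:

```python
# Compress a flat-file extract before transport; repetitive delimited
# extracts typically compress very well, cutting network transfer time.
import gzip
import os
import tempfile

src = tempfile.NamedTemporaryFile(delete=False, suffix=".dat")
src.write(b"P1|Smith|100\nP2|Jones|200\n" * 1000)  # sample extract rows
src.close()

with open(src.name, "rb") as f_in, gzip.open(src.name + ".gz", "wb") as f_out:
    f_out.writelines(f_in)

original = os.path.getsize(src.name)
compressed = os.path.getsize(src.name + ".gz")
```

The receiving platform decompresses before transform/load; remember that compression reduces transfer time but adds CPU cost at both ends.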
Load
data transport
Transform
Target Data
which DBMS?
relational vs dimensional?
tables & indices?
load frequency?
load timing?
data volumes?
exception handling?
restart & recovery?
load methods?
referential integrity?
Extract → Transform → Load → database load → Target Data
Populating Tables
Drop and rebuild the tables
Insert (only) rows into a table
Delete old rows and insert changed rows
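The third option, delete old rows and insert changed rows, can be sketched against SQLite. The table and key names are illustrative; the pattern is the point:

```python
# "Delete old rows and insert changed rows": remove each incoming key's
# existing row, then insert the fresh version from this load cycle.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id TEXT PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [("P1", 100.0), ("P2", 200.0)])  # rows from a prior load

changed = [("P1", 150.0), ("P3", 300.0)]  # changed + new rows this cycle
conn.executemany("DELETE FROM target WHERE id = ?", [(k,) for k, _ in changed])
conn.executemany("INSERT INTO target VALUES (?, ?)", changed)

rows = dict(conn.execute("SELECT id, amount FROM target"))
```

Untouched rows (P2 here) survive, which is what distinguishes this option from drop-and-rebuild.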
Indexing
Load
Indices
Tables
update at load?
drop & rebuild?
index segmentation?
Updating
allow
updating of
rows in
tables?
Load
Indices
Tables
Referential Integrity
RI is the condition where every reference to another table has a foreign key/primary key
match.
Three common options for RI
DBMS checking
Test load files before load using a tool/custom application
Test data base(s) after load using a tool/custom application
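The second option, testing load files before the load, can be sketched as a foreign-key check across flat extracts. The field name `customer_id` is assumed for illustration:

```python
# Pre-load RI check: every foreign-key value in the fact file must have a
# matching primary key in the dimension file; report the orphans.
def check_referential_integrity(fact_rows, dim_keys):
    """Return sorted foreign-key values with no matching primary key."""
    return sorted({row["customer_id"] for row in fact_rows} - set(dim_keys))

facts = [{"customer_id": "C1"}, {"customer_id": "C2"}, {"customer_id": "C9"}]
orphans = check_referential_integrity(facts, ["C1", "C2", "C3"])
```

Rows with orphaned keys can then be suspended or discarded by the exception-processing flow rather than rejected by the DBMS mid-load.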
Timing Considerations
User Expectations
Data Readiness
Database synchronization
Exception Processing
Transform
Load
Suspend
exceptions
ok
Reports
Target
data
Log
Discard
EXTRACT: restart/recovery, dependencies, scheduling, communication, execution, verification, restarts
TRANSFORM FOR LOAD: dependencies, scheduling, process metadata, execution, verification, parallel processing, restarts
Loading as a part of single transform
& load job stream
Loads triggered by completion of
transform job stream
Loads triggered by verification of
transforms
Parallel ETL processing
Loading Partitions
scheduling
execution
verification
restarts
tool
capabilities
Module 5 Implementation
Guidelines
Data Transformation
Data Conversion
Storage Management
Metadata Management
Database Management
Data Movement
Data Access
Source Systems
Data Cleansing
[Matrix: Data Store Roles (Intake, Distribution, Information Delivery) × Data Transformation Roles (Cleansing, Granularity, Integration)]
Exercises
Exercise 1: Source Data Options
Exercise 2: Source Data Modeling
Exercise 3: Data Capture
Exercise 4: Data Transformation
Exercise 5: Data Acquisition Decision