A data warehouse (DWH) is a collection of transactional and historical data that is maintained for analysis purposes.
There are three types of tools used in any data warehousing project:
1. ETL Tools
2. OLAP Tools (or) Reporting Tools
3. Modeling Tool
ETL TOOLS:
ETL stands for Extraction, Transformation, and Loading. An ETL developer (someone with expertise in DWH) extracts data from heterogeneous databases or flat files, transforms the data from source to target (DWH) by applying transformation rules, and finally loads the data into the DWH.
There are several ETL tools available in the market, including:
1. DataStage
2. Informatica PowerCenter
3. Ab Initio
4. Oracle Warehouse Builder
5. BODI (Business Objects Data Integrator)
6. SSIS (SQL Server Integration Services)
7. Pentaho Kettle
8. Talend
9. Inaplex Inaport
OLAP:
OLAP stands for Online Analytical Processing; these tools are also called reporting tools.
An OLAP developer analyses the data warehouse and generates reports based on selection criteria.
There are several OLAP tools available:
1. Business Objects
2. Cognos
3. Report Net
4. SAS
5. Micro Strategy
6. Hyperion
7. MSAS (Microsoft Analysis Services)
MODELING TOOL:
Those who work with the ERwin tool are called data modelers. A data modeler designs the database of the DWH with the help of such tools.
An ETL developer extracts data from source databases or flat files (.txt, .csv, .xls, etc.) and populates it into the DWH. While populating data into the DWH, staging areas can be maintained between source and target; these are called staging area 1 and staging area 2.
STAGING AREA:
A staging area is a temporary place used for cleansing unnecessary, unwanted, or inconsistent data.
Note: A data modeler can design a DWH in two ways:
1. ER Modeling
2. Dimensional Modeling
ER Modeling:
ER modeling stands for entity-relationship modeling. In this model the tables are called entities, and they may be in second normal form, third normal form, or somewhere in between.
Dimensional Modeling:
In this model the tables are called dimensions or fact tables. It can be subdivided into three schemas:
1. Star Schema
2. Snow Flake Schema
3. Multi Star Schema (or) Hybrid (or) Galaxy
Star Schema:
A fact table surrounded by dimensions is called a star schema; it looks like a star.
If a star schema has only one fact table, it is called a simple star schema.
If a star schema has more than one fact table, it is called a complex star schema.
Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_startdate
Product_expdate
Product_maxprice
Product_wholeprice
Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name
Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_year
DIMENSION TABLE:
If a table contains a primary key and provides detail (master) information, it is called a dimension table.
FACT TABLE:
If a table contains mostly foreign keys and holds transactions, providing summarized information, it is called a fact table.
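As a rough illustration of the star schema described above, the SQL sketch below creates the three dimension tables (only a few of the listed columns are shown) and a hypothetical Sales_fact table that references them. The Sales_fact table, its measure columns, and the name Time_dim are assumptions for the example and are not part of the original design.

CREATE TABLE Product (
    Product_id     NUMBER PRIMARY KEY,          -- dimension primary key
    Product_name   VARCHAR2(50),
    Product_type   VARCHAR2(30)
);
CREATE TABLE Customer (
    Cust_id        NUMBER PRIMARY KEY,
    Cust_name      VARCHAR2(50),
    Cust_type      VARCHAR2(30)
);
CREATE TABLE Time_dim (                         -- the Time dimension above
    Time_id        NUMBER PRIMARY KEY,
    Week_day       VARCHAR2(10),
    Month_day      NUMBER
);
-- Fact table: mostly foreign keys plus summarizable measures
CREATE TABLE Sales_fact (
    Product_id     NUMBER REFERENCES Product(Product_id),
    Cust_id        NUMBER REFERENCES Customer(Cust_id),
    Time_id        NUMBER REFERENCES Time_dim(Time_id),
    Quantity_sold  NUMBER,
    Amount_sold    NUMBER(12,2)
);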
DIMENSION TYPES:
There are several dimension types available.
CONFORMED DIMENSION:
If a dimension table is shared with more than one fact table (that is, it has a foreign key in more than one fact table), it is called a conformed dimension.
DEGENERATED DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (that is, it maintains a foreign key in another fact table), it is called a degenerated dimension.
JUNK DIMENSION:
A junk dimension contains text values, gender flags (male/female), and flag values (true/false) that are not useful for generating reports.
DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in non-key attributes, the table is called a dirty dimension.
FACT TABLE TYPES:
There are three types of facts available in a fact table:
1. Additive facts
2. Semi additive facts
3. Non additive facts
ADDITIVE FACTS:
If a fact can be summed (added up) across all the dimensions of the fact table, it is called an additive fact.
SEMI ADDITIVE FACT:
If a fact can be summed only across some of the dimensions (up to some extent), it is called a semi additive fact.
NON ADDITIVE FACT:
If a fact cannot meaningfully be summed across any dimension of the fact table, it is called a non additive fact.
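A hedged SQL sketch of the difference, reusing the Sales_fact table from the earlier sketch and assuming a hypothetical Account_balance table whose balance is only semi additive (it can be summed across accounts on one day, but not across days):

-- Additive fact: Amount_sold can be summed across every dimension
SELECT Product_id, SUM(Amount_sold) AS total_sales
FROM   Sales_fact
GROUP BY Product_id;

-- Semi additive fact: a balance can be summed across accounts for a single day,
-- but summing the same balance across days would be meaningless
SELECT Snapshot_date, SUM(Balance) AS total_balance_on_day
FROM   Account_balance
GROUP BY Snapshot_date;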
SNOW FLAKE SCHEMA:
A snow flake schema maintains normalized data in the dimension tables. In this schema some dimension tables do not maintain a direct relationship with the fact table; instead they are related through another dimension.
In the event that databases are altered or new databases need to be integrated, a lot of
hand-coded work needs to be completely redone.
1. Data stage:
DataStage is a comprehensive ETL tool; in other words, it is a data integration and transformation tool that enables collection and consolidation of data from several sources, its transformation, and its delivery into one or multiple target systems.
The history begins in 1997, when the first version of DataStage was released by VMark, a US-based company.
Lee Scheffler is regarded as the father of DataStage.
In those days DataStage was called Data Integrator.
In 1997 Data Integrator was acquired by a company called Torrent.
In 1999 Informix acquired Data Integrator from Torrent.
In 2000 Ascential acquired Data Integrator, after which it was released as Ascential DataStage Server Edition.
Versions 6.0 to 7.5.1 supported only UNIX environments, because the server could be configured only on UNIX platforms.
In 2004, version 7.5.x2 was released, which supported server configuration on Windows platforms as well.
In December 2004, version 7.5.x2 shipped with the Ascential suite components:
ProfileStage,
QualityStage,
AuditStage,
MetaStage,
DataStage PX,
DataStage TX,
DataStage MVS.
These were all individual tools.
In February 2005, IBM acquired all of the Ascential suite components, and IBM released IBM DS EE, i.e., the Enterprise Edition.
In 2006, IBM made some changes to IBM DS EE; among them, the profiling stage and the audit stage were integrated into one, along with the quality stage, the Meta stage, and others.
Pentaho Kettle:
Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
The company started around 2001.
It has a strong community of 13,500 registered users.
It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files.
Talend:
Talend is an open-source data integration tool. It uses a code-generating approach and a GUI (implemented in Eclipse RCP).
It started around October 2006.
It has a much smaller community than Pentaho, but is backed by two finance companies.
It generates Java or Perl code which can later be run on a server.
Inaplex:
Inaplex is a small UK company.
InaPlex is a producer of customer data integration products for mid-market CRM solutions.
Inaplex mainly focuses on providing simple solutions for its customers to integrate their data into CRM and accounting software such as Sage and GoldMine.
Any to Any:
DataStage can extract data from any source and load data into any target.
Platform Independent:
A job that can run on any processor is called platform independent.
DataStage jobs can run on three types of processors:
1. Uni Processor
2. Symmetric Multi Processing (SMP)
3. Massively Parallel Processing (MPP)
Node Configuration:
A node is a logical CPU, i.e., an instance of a physical CPU.
The process of creating these virtual CPUs is called node configuration.
Example:
An ETL job needs to process 1000 records.
On a uni processor it takes 10 minutes to process the 1000 records,
but on an SMP processor the same job takes 2.5 minutes.
Configuration File:
What is a configuration file, and what is it used for in DataStage?
It is a plain text file that describes the processing and storage resources available for use during parallel job execution.
The default configuration file contains entries such as:
Node: a logical processing unit which performs all ETL operations.
Pools: a collection of nodes.
Fastname: the server (host) name; the ETL jobs execute using this name.
Resource disk: a permanent storage area which stores the repository (data set) components.
Resource scratchdisk: a temporary storage area where the staging operations are performed.
Configuration file example:
{
  node "node1"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}
Note:
In a configuration file, no two nodes may have the same name.
At least one node must belong to the default node pool, whose name is "" (a zero-length string).
Pipeline parallelism:
Pipe:
A pipe is a channel through which data moves from one stage to another stage.
Partition Parallelism:
Partitioning:
Partitioning is a technique of dividing the data into chunks.
DataStage supports 8 types of partitioning.
Partitioning plays an important role in DataStage.
Every stage in DataStage is associated with a default partitioning technique.
The default partitioning technique is Auto.
Note:
Selection of a partitioning technique is based on:
1. Data (volume, type)
2. Stage
3. Number of key columns
4. Key column data type
Partitioning techniques are grouped into two categories:
1. Key based
2. Key less
Key based partitioning techniques:
1. Hash
2. Modulus
3. Range
4. DB2
Key less partitioning techniques:
1. Random
2. Round robin
3. Entire
4. Same
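As a rough illustration only (DataStage applies these techniques internally across the engine nodes), the SQL below shows how modulus partitioning would assign each row of the EMP table to one of 4 nodes; ORA_HASH is an Oracle function used here just to suggest how a key-based hash technique derives a partition number from a key column. The 4-node count is an assumption for the example.

-- Modulus partitioning on a numeric key, 4 nodes (partitions 0..3)
SELECT empno, MOD(empno, 4)      AS modulus_partition
FROM   emp;

-- Hash partitioning idea: hash the key value into one of 4 buckets
SELECT empno, ORA_HASH(empno, 3) AS hash_partition   -- buckets 0..3
FROM   emp;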
Client Components:
The client components are classified into:
DataStage Administrator
DataStage Manager
DataStage Director
DataStage Designer
DataStage Administrator:
The DS Administrator can create and delete projects,
give permissions to the users,
and define global parameters.
DataStage Manager:
The DataStage Manager can import and export jobs,
create routines,
and configure the configuration file.
DataStage Director:
The DataStage Director is used to run, schedule, and monitor jobs and to view the job logs.
Server Components:
There are 3 server components:
1. PX Engine: executes DataStage jobs and automatically selects the partitioning technique.
2. Repository: contains the repository components.
3. Package Installer: contains packs and plug-ins.
Stage availability (version 7.5.x compared with version 8.x):
Processing stages: SCD (Slowly Changing Dimension), FTP (File Transfer Protocol), WTX (WebSphere TX), Surrogate Key, Lookup (Normal Lookup, Sparse Lookup)
Database stages: iWay, Classic Federation, ODBC Connector (SQL builder), Netezza
Several of these stages are not available in 7.5.x and are available (or enhanced) in 8.x.
Note: Enhancements have been made to the database stages and the processing stages.
Datastage Designer Window:
It has a title bar: IBM InfoSphere DataStage and QualityStage Designer.
File Stages:
Sequential file stage:
The sequential file stage is a file stage which is used to read data sequentially or in parallel.
If there is 1 file, it reads the data sequentially.
If there are N files, it reads the data in parallel.
A sequential file supports 1 input link, 1 output link, and 1 reject link.
To read the data we have read methods:
a) Specific file(s)
b) File patterns
Specific file is for a particular file,
and file pattern is used for wildcards.
In the Reject Mode (error handling) option we have:
Continue,
Fail, and
Output.
If you select Continue: on any data type mismatch, the rest of the data is still sent to the target.
If you select Fail: the job aborts on any data type mismatch.
Output: the mismatched data is sent to a rejected data file.
The error data we get is:
data type mismatches,
format mismatches, and
condition mismatches.
We also have the option
Missing File Mode, with three sub-options:
Depends,
Error,
OK
(that is, how to handle the case where a file is missing).
Different options usage in the Sequential file stage:
Read Method = Specific file: the stage then executes in sequential mode.
RowNumberColumn is an option; if we select it, we have to create one extra column in the output column properties, and the output file will then contain the input row number in that extra column.
Sequential file options:
Filter
FileNameColumn
RowNumberColumn
Read First Rows
NullFieldValue
1. Filter option
Sed command:
sed is a stream editor for filtering and transforming text from standard input to standard output.
sed 5q              -> displays the first 5 lines
sed 2p              -> displays all lines, but the 2nd line is displayed twice
sed 1d              -> displays all records except the first record
sed '1d;2d'         -> displays all lines except the first and second records
sed -n '2,4p'       -> prints only records 2 to 4
sed -n -e 2p -e 3p  -> displays only the 2nd and 3rd lines
sed '$d'            -> deletes the trailer (last) record
Grep commands:
Syntax:
grep <string> <file>
Ex: grep bhaskar
1) grep -v string   Ex: grep -v bhaskar  -> displays every line except those containing bhaskar
2) grep -i string   Ex: grep -i bhaskar  -> ignores case while matching
Job:
Tick the check box "First line is column names" and click on Define.
Now click OK and click on the Close tab; the file emp1.txt will now show in the table definition list.
Now click OK and again click OK. This is the procedure for importing a table definition.
Emp2.txt
28
Job:
Job:
32
Output data:
Job:
Input Properties:
36
Columns:
Output Properties:
Outputdata:
37
InputProperties:
38
Columns:
Input Data:
39
Output Properties:
OutputData:
40
41
Job:
Input properties:
42
Input Columns:
Output properties:
43
Output Data:
Job:
44
Columns:
Job:
48
Columns:
49
Output Data:
50
Source filedata:
Job:
Columns:
Job parameters:
1. $APT_CLOBBER_OUTPUT   Prompt: automatically overwrite   Type: List       Default value: FALSE
2. Inputpath             Prompt: Inputfilepath             Type: Pathname   Default value: C:\Sourcefiles\Bhaskar\
3. Inputfilename         Prompt: InputfileName             Type: String     Default value: Empdata1.txt
File Stages:
Data set:
The Dataset is a parallel-processing file stage which is used for staging the data when we design dependent jobs.
By default the Dataset is a parallel processing stage.
A Dataset is stored in binary format.
If we use Datasets in our jobs, the data is stored inside DataStage, that is, inside the repository.
The Dataset overcomes the limitations of the sequential file.
The limitations of sequential files are:
1) Memory limitation (it can store only up to 2 GB of data in the file format)
2) Sequential (by default it executes sequentially)
3) Conversion overhead (every time the job runs, the data has to be converted from one format to another)
4) Stores the data outside DataStage (whereas a Dataset stores the data inside DataStage)
There are 2 types of Datasets:
1) Virtual Dataset
2) Persistent Dataset
A Virtual Dataset is a temporary dataset which is formed while data is passing along a link.
A Persistent Dataset is a permanent dataset which is formed when data is loaded into the target.
Help text for $APT_CLOBBER_OUTPUT: allows files or data sets to be overwritten if they already exist.
JOB:
you can see the data here by click on data viewer option:
58
Note: By using Data Set Management we can open a dataset, show its schema window and its data window, copy a dataset, and delete a dataset.
DIFFERENCES BETWEEN SEQUENTIAL FILE AND DATA SET:
Sequential File stage:
It executes in sequential mode by default.
Partitioning techniques cannot be applied.
It supports general formats such as .txt, .csv, .xls.
Data Set stage:
It executes in parallel by default.
Partitioning is supported and preserved.
It stores the data in DataStage's internal binary format, inside the repository.
Click on Columns.
Now double-click on row number 1; it will show the screen below.
For the empno field select Type = Cycle, Initial value = 1000, Increment = 1, and then click Next.
It will then show the screen for the Name field: set Algorithm = Cycle and give the values RafelNadal, JamesBlake, Andderadick.
Similarly, click Next to show the window for the hire date field and set Type = Random.
JOB:
seqEmpData properties:
Output columns:
68
Output:
Output seqEmpdata:
JOB:
InputseqEmpData Properties:
71
Case-1:
Head stage Properties:
AllRows=False
Number of rows=5
All Partitions=TRUE
72
Target Output_Sequentialdata:
73
OutputSequential data:
Job:
74
Input SeqEmpData:
75
Target OutputSequentialData:
Target Output_Sequentail_data:
76
Tail Stage:
1.The Tail Stage is a Development/Debug stage
2. It can have a single input link and a single output link
3. The Tail Stage selects the last N records from each partition of an input data set and
copies the selected records to an output data set. You determine which records are copied
by setting properties which allow you to specify:
The number of records to copy
The partition from which the records are copied
4.This stage is helpful in testing and debugging applications with large data sets. For
example, the Partition property lets you see data from a single partition to determine if
the data is being partitioned as you want it to be. The Skip property lets you access a
certain portion of a data set
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage
Tail stage properties:
1. Rows
2. Partitions
Rows:
Number of rows (per partition) = 10 (the default is 10; if we need more or fewer than 10 we change this number).
This is the number of rows to copy from input to output per partition.
Partitions:
All Partitions = True/False
When set to True, rows are copied from all partitions. When set to False, rows are copied only from the specified partition numbers, which must be supplied.
77
JOB:
78
Tailstage Properties:
Output Columns:
79
Target Output_Sequentialdata:
Sample Stage:
1. The Sample stage is a Development/Debug stage.
2. It can have a single input link and any number of output links when operating in percent mode,
3. and a single input and a single output link when operating in period mode.
4. The Sample stage samples an input data set. It operates in two modes. In Percent mode, it extracts rows, selecting them by means of a random number generator, and writes a given percentage of them to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by repeating the same number of outputs, the percentage, and the seed value.
5. In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case all rows are output to a single data set, so the stage used in this mode can have only a single output link.
6. For both modes you can specify the maximum number of rows that you want to sample from each partition.
The stage editor has three pages:
80
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data set being Sampled.
Output Page. This is where you specify details about the Sampled data being output
from the stage.
EXAMPLE JOB FOR SAMPLE STAGE:
Note: Sample stage we can Operate in Two Modes one is Period and Another one is
Percentage Mode
Input data:
JOB:
81
Output Columns:
82
Output data:
83
Note: In the output we get only 3 records because we set the option Period (per partition) = 3, so the stage takes every 3rd record from the input file data.
Job:
Peek Stage:
1.The Peek stage is a Development/Debug stage.
2. It can have a single input link and any number of output links.
3.The Peek stage lets you print record column values either to the job log or to a separate
output link as the stage copies records from its input data set to one or more output data
sets.
4.Like the Head stage and the Tail stage (Sample stage), the Peek stage can be helpful for
monitoring the progress of your application or to diagnose a bug in your application.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage.
JOB:
91
92
Option Output mode = Job log:
Job:
Here we set the option Peek output mode = Job log, so we can see the data only in the job log.
Procedure for viewing the data in the logs:
Go to Tools -> Run Director, then click on View Log; it will show a screen like the one below.
In the screen above, click on the 8th row from the bottom to show the log details.
96
JOB:
peekoutput2 mappings:
peekoutput3 columns:
101
Peekoutput3 mappings:
peekout1 properties:
102
peekoutput1 data:
Peekoutput2 properties:
103
Peekoutput2 data:
Peekoutput2 properties:
104
Peekoutput3 data:
Peekoutput3 properties:
105
(
Field1 datatype(size),
Field2 datatype(size),
Field3 datatype(size),
Field4 datatype(size),
Field5 datatype(size),
Field6 datatype(size),
Field7 datatype(size),
Field8 datatype(size),
Field9 datatype(size),
Field10 datatype(size)
)
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE "Schemaname.Tablename"
TO Group "Groupname";
106
Switch:
1. A condition can be put on a single column.
2. It has 1 input link, up to 128 output links, and 1 default/reject link.
External Filter:
1. Here we can use all UNIX filter commands.
2. It has 1 input link, 1 output link, and no reject link.
Importing an ODBC table definition (Excel workbook):
Columns -> Load -> Import ODBC Table Definition.
Select the DSN (here select the workbook) and give the User ID and Password.
The DSN itself is created in the operating system: add a new ODBC data source using the MS Excel driver and give it a name (the DSN), e.g. Name = EXE.
Click On output
Properties for Outputname=Emp
Columns:
109
Selection:
SQL:
110
View Data:
Properties for Outputname=Emp
111
Columns:
112
SQL:
113
View Data:
Emp_Dataset Properties:
View data:
114
Dept_Data_set:Properties:
View data:
ENCODE STAGE:
1. It is a processing stage that encodes the records into a single encoded format using a command-line command.
2. It supports 1 input and 1 output.
Properties:
Stage
Input
Output
DECODE STAGE:
1. It is a processing stage that decodes the encoded data.
2. It supports 1 input and 1 output.
Properties:
Stage
Options: command line = (uncompress/gunzip)
Output
Load the metadata of the source file.
115
Filter Stage:
1.The Filter stage is a processing stage.
2. It can have a single input link, any number of output links and, optionally, a single reject link.
3.The Filter stage transfers, unmodified, the records of the input data set which satisfy the
specified requirements and filters out all other records.
4.You can specify different requirements to route rows down different output links. The
filtered out records can be routed to a reject link, if required.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the input link carrying the data to
be filtered.
Output Page. This is where you specify details about the filtered data being output
from the stage down the various output links.
Filter stage Properties tab options:
Predicates
  Where Clause
Options
  Output rejects = False/True
  Set to True to output rejected rows to the reject link.
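The Where Clause property behaves like a SQL WHERE clause. A minimal sketch of the kind of routing used in the example job below, with column names borrowed from the EMP table; the exact salary range is an assumption for illustration:

-- Output link 1: rows satisfying the first where clause
SELECT * FROM emp WHERE deptno = 10;

-- Output link 2: rows satisfying a second where clause, routed to another link
SELECT * FROM emp WHERE sal > 1000 AND sal < 3000;

-- Reject link (Output rejects = True): rows matching no where clause
SELECT * FROM emp
WHERE  NOT (deptno = 10)
  AND  NOT (sal > 1000 AND sal < 3000);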
116
117
Copy1 output :
Outputname=Emp_Copy
Output name=Emp_Copy_all
118
Out_Emp_Copyall_Dataset Properties:
outputdata:
Filter_3properties:
119
OutputName=OutputSalg1andsall3:
120
Output_Deptno10 PROPERTIES:
OUTPUT DATA:
Data_SET5 properties:
121
OUTPUT DATA:
Filter_10 PROPERTIES:
Output Mappings:Dslink15
122
Output columns:
output name=dslink13:
123
Dataset_14 PROPERTIES:
output
Datbase_12 Properties:
Output:
124
Switch stage:
1.The Switch stage is a processing stage.
2.It can have a single input link, up to 128 output links and a single rejects link.
The Switch stage takes a single data set as input and assigns each input row to an output
data set based on the value of a selector field.
3.The Switch stage performs an operation analogous to a C switch statement, which
causes the flow of control in a C program to branch to one of several cases based on the
value of a selector variable.
4.Rows that satisfy none of the cases are output on the rejects link.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting rows.
Output Page. This is where you specify details about the processed data being output
from the stage.
Switch stage properties options
1.Input
2.User defined Mapping
3.Options
1.Input
Selector=Column Name
1.Auto can be used when there is as many distinct selector values as output links.
2.Hash means that rows are hashed on the selector column modulo the number of output
links and assigned to an output link accordingly. In this case, the selector column must be
of a type that is convertible to Unsigned Integer and may not be nullable.
3.User-defined Mapping means that the onus is on the user to provide explicit mapping
for values to outputs
2.User Defined Mapping
Case=?
Specifies user-defined mapping between actual values of the selector column and an
output link. Mapping is a string of the form: <Selector Value>[=<Output Link Label
125
Number>], The Link Label Number is not needed if the value is intended for the same
output link as specified by the previous mapping that specified a number. You must
specify an individual mapping for each value of the selector column you want to direct to
one of the output links, thus this property will be repeated as many times as necessary to
specify the complete mapping.
3. Options
If Not Found = Fail, Drop, or Output.
Fail means that an invalid selector value causes the job to fail; Drop drops the offending row; Output sends it to the reject link.
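The selector mapping is analogous to the SQL CASE expression below; deptno is assumed as the selector column purely for illustration, and the link labels are hypothetical:

SELECT e.*,
       CASE deptno                  -- selector column
            WHEN 10 THEN 'link_0'   -- Case = 10=0
            WHEN 20 THEN 'link_1'   -- Case = 20=1
            WHEN 30 THEN 'link_2'   -- Case = 30=2
            ELSE 'reject'           -- If Not Found = Output (reject link)
       END AS output_link
FROM   emp e;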
EXAMPLE JOB FOR SWITCH STAGE:
JOB:
126
Oracle_Enterprise_0 properties:
127
Outputmappings:
Outputname=T1;
Output_Dataset_2 properties::
128
Output
Outputname=T2;
129
output:
Outputname=T3;
130
Output:
Arguments = ?
Type: String
Any command-line arguments required.
Filter Command = ?
Type: Pathname
The program or command to execute, which must be configured to accept input from stdin and write its results to stdout.
Example filter commands:
sed '1d;2d'
grep bhaskar
sed -n -e 2p -e 3p
EXAMPLE JOB FOR EXTERNAL FILTER:
Input data:
Job:
132
Sequential_File_7 Properties:
Columns:
133
Output columns:
Here we need to give the column names manually at output columns
134
OUTPUT:
135
JOB:
136
138
output:
Output Columns:
Dataset_8 Properties:
140
OUTPUT:
JOIN Queries:
A join is a query which combines data from multiple tables.
Types of joins:
1. Cartesian join
2. Equi join
3. Non-equi join
4. Self join / inner join
5. Outer join
   Left outer join
   Right outer join
EMP:
EMPNO  ENAME      JOB       MGR  DEPTNO
       bhaskar    analyst   444  10
222    prabhakar  clerk     333  20
333    pradeep    manager   111  10
444    srujana    engineer  222  40

DEPT:
DEPTNO  DNAME      LOC
10      marketing  hyderabad
20      finance    banglore
30      hr         bombay
Examples:
Cartesian join:
If we combine data from multiple tables without applying any condition, then each record in the first table joins with every record in the second table.
SQL> select * from emp, dept;
SQL> select empno, ename, job, dname, loc from emp e, dept d;
Equi join:
If we combine data from multiple tables by applying equality conditions between the tables, then each record in the first table joins with the matching row(s) in the second table.
This kind of join is called an equi join.
SQL> select e.empno, e.ename, d.dname, d.loc from emp e, dept d where e.deptno = d.deptno;
Inner join:
This displays all the records that have a match.
Ex:
SQL> select empno, ename, job, dname, loc from emp inner join dept using (deptno);
142
Join Stage:
These topics describe Join stages, which are used to join data from two input tables and
produce one output table. You can use the Join stage to perform inner joins, outer joins, or
full joins.
1.An inner join returns only those rows that have matching column values in both
input tables. The unmatched rows are discarded.
143
2.An outer join returns all rows from the outer table even if there are no matches. You
define which of the two tables is the outer table.
3.A full join returns all rows that match the join condition, plus the unmatched rows
from both input tables.
Unmatched rows returned in outer joins or full joins have NULL values in the columns of
the other link
1.Join stages have two input links and one output link.
2. The two input links must come from source stages. The joined data can be output to
another processing stage or a passive stage
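In SQL terms the three join types behave roughly like the queries below against the EMP and DEPT tables used earlier; this is only a sketch, and the FULL OUTER JOIN syntax assumes an Oracle-style database:

-- Inner join: only matching rows
SELECT e.empno, e.ename, d.dname
FROM   emp e JOIN dept d ON e.deptno = d.deptno;

-- Left outer join: all rows from the outer (left) table, NULLs where no match
SELECT e.empno, e.ename, d.dname
FROM   emp e LEFT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Full join: matched rows plus unmatched rows from both tables
SELECT e.empno, e.ename, d.dname
FROM   emp e FULL OUTER JOIN dept d ON e.deptno = d.deptno;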
144
145
Output:
LOOKUP JOB :
146
Output:
147
Output:
148
LOOKUP STAGE:
1.The Lookup stage is a processing stage.
2.It is used to perform lookup operations on a data set read into memory from any other
Parallel job stage that can output data
3. The most common use for a lookup is to map short codes in the input data set onto
expanded information from a lookup table which is then joined to the incoming data and
output. For example, you could have an input data set carrying names and addresses of
your U.S. customers. The data as presented identifies state as a two letter U. S. state
postal code, but you want the data to carry the full name of the state. You could define a
lookup table that carries a list of codes matched to states, defining the code as the key
column. As the Lookup stage reads each line, it uses the key to look up the state in the
lookup table. It adds the state to a new column defined for the output link, and so the full
state name is added to each address. If any state codes have been incorrectly entered in
the data set, the code will not be found in the lookup table, and so that record will be
rejected.
4. Lookups can also be used for validation of a row. If there is no corresponding entry in a lookup table for the key's values, the row is rejected.
5.The Lookup stage is one of three stages that join tables based on the values of key
columns. The other two are:
Join stage - Join stage
Merge stage - Merge Stage
6.The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for data being input
7. The Lookup stage can have a reference link, a single input link, a single output link,
and a single rejects link
Input Data:
149
ReferenceData:
LOOK UPJOB:
Lookup Failure: Drop
If Lookup Failure = Drop, then an inner join is performed.
Output:
150
2.LOOK UP JOB
Lookup Failure = Continue
If Lookup Failure = Continue, then a left outer join is performed.
Output:
151
3.LOOKUP JOB:
Lookup Failure = Reject
If Lookup Failure = Reject, then the records which do not match the reference data are sent to the reject output link.
JOB:
Output:
Input Rejected :
LOOKUP JOB:
Lookup Failure: Fail
If Lookup Failure = Fail, then the job fails if any input record is not found in the reference file.
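In SQL terms the failure modes correspond roughly to the queries below, using the EMP/DEPT tables from earlier as input and reference; this is only an analogy, since the reject case simply collects the unmatched input rows on a separate link:

-- Lookup Failure = Drop  ~ inner join: unmatched input rows are dropped
SELECT i.*, r.dname
FROM   emp i JOIN dept r ON i.deptno = r.deptno;

-- Lookup Failure = Continue ~ left outer join: unmatched rows pass through with NULLs
SELECT i.*, r.dname
FROM   emp i LEFT OUTER JOIN dept r ON i.deptno = r.deptno;

-- Lookup Failure = Reject: rows with no reference match go to the reject link
SELECT i.*
FROM   emp i
WHERE  NOT EXISTS (SELECT 1 FROM dept r WHERE r.deptno = i.deptno);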
152
MERGE STAGE:
Merge Stage Properties:
1.Merge keys
2.Options
1.Merge Keys
Key=?
Sort order=Ascending
Sort in Either ascending or descending order
2.Options:
Unmatched Master mode=Drop/keep
Warn on reject updates=True
Warn on unmatched master=True
Masterdata:
UpdateData:
153
JOB1:
Unmatched Master mode=Drop
Warn on reject updates=True
Warn on unmatched master=True
Type: List
Keep means that unmatched rows (those without any updates) from the master link are
output; Drop means that unmatched rows are dropped
Output:
Master_Rejects:
154
JOB2:
Unmatched Master mode=Keep
Warn on reject updates=True
Warn on unmatched master=True
Output:
Master_Records:
Note: If the options "Warn on Reject Updates = True" and "Warn on Unmatched Masters = True" are set, then the log file shows warnings for rejected updates and for unmatched master data.
Note: If the options "Warn on Reject Updates = False" and "Warn on Unmatched Masters = False" are set, then the log file does not show warnings for rejected updates or for unmatched master data.
155
Job:
156
Output:
157
Modify Job 2:
Null handling:
Inputdata:
Job:
CUSTDOB=date_from_timestamp(CUSTDOB)
ZIP=Handle_Null('ZIP','999999')
158
Output:
3.Modify Job
Drop Columns
Job:
159
ZIP=Handle_Null('ZIP','999999')
Specification=Drop CUSTID
Modify stage Input columna
Output:
160
4. Modify Job
Keep Columns
Inputdata:
Job:
161
KEEP CUSTID
Modify stage input columns
162
Output:
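The Modify specifications used in the jobs above map loosely onto the SQL below; this is only a sketch, COALESCE stands in for Handle_Null, dropping or keeping columns is just a matter of the SELECT list, and the customer_src table name is assumed for illustration:

-- ZIP = Handle_Null('ZIP','999999')  ~ replace NULL with a default value
SELECT CUSTID,
       COALESCE(ZIP, '999999') AS ZIP
FROM   customer_src;

-- DROP CUSTID  ~ leave the column out of the SELECT list
SELECT ZIP, CUSTDOB FROM customer_src;

-- KEEP CUSTID  ~ select only the column(s) to keep
SELECT CUSTID FROM customer_src;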
Copy Stage:
1.The Copy stage is a processing stage.
2.It can have a single input link and any number of output links.
3. The Copy stage copies a single input data set to a number of output data sets.
4. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns (to copy with more modification, for example changing column data types, use the Modify stage).
5. Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor to True. This prevents InfoSphere DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
163
Input Page. This is where you specify details about the input link carrying the data to
be copied.
Output Page. This is where you specify details about the copied data being output
from the stage
Copy stage Properties tab Options:
1.Force=True/False
True to specify that DataStage should not try to optimize the job by removing the Copy
operation.
Input:
Output:
Job:
Aggregator Stage:
1.The Aggregator stage is a processing stage.
2.It classifies data rows from a single input link into groups and computes totals or other
aggregate functions for each group. The summed totals for each group are output from
the stage via an output link.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data being grouped or
aggregated.
Output Page. This is where you specify details about the groups being output from
the stage.
Aggregator stage general tab options:
1. Grouping Keys
2. Aggregations
3. Options
1. Grouping Keys:
Group = specifies an input column you are using as a grouping key.
Grouping Keys
  Group
  Case Sensitive = True/False
2. Aggregations:
Aggregation type:
Whether to perform calculation(s) on column(s), re-calculate previously created summary columns, or count rows.
  Calculation
  Count of Rows
  Re-Calculation
Aggregation type = Calculation
Column for calculation = column name
If you give the column name, the stage then asks for the output column options listed below.
Name of column to hold the standard error of data in the aggregate column.
11.sum of weights output column
->Decimal output=?
Name of column to hold the sum Of weights of data in the aggregate column. (See
Weighting Column.)
12.sum output column
->Decimal output=?
Name of column to hold the sum of data in the aggregate column.
13.Summary output column
->Decimal output=?
Name of sub record column to which to write the results of the reduce or rereduce
operation.
14.un corrected sum of squares ouput column
->Decimal output=?
Name of column to hold the uncorrected sum of squares for data in the aggregate column.
15.Variance output column
->Decimal output=?
->Variance Devisor=?
Name of column to hold the variance of data in the aggregate column.
16.weighting column
Increment the count for the group by the contents of the weight field for each record in
the group, instead of by 1. (Applies to: Percent Coefficient of Variation, Mean Value,
Sum, Sum of Weights, Uncorrected Sum of Squares.)
2.Options:
Allow null output=True/False
True means that NULL is a valid output value when calculating minimum value,
maximum value, mean value, standard deviation, standard error, sum, sum of weights,
and variance. False means 0 is output when all input values for calculation column
are NULL.
Method=Hash/Sort
Use hash mode for a relatively small number of groups; generally, fewer than about
1000 groups per megabyte of memory. Sort mode requires the input data set to have
been partition sorted with all of the grouping keys specified as hashing and sorting
keys.
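The Aggregator stage's grouping keys and calculation columns correspond closely to a SQL GROUP BY; a sketch over the EMP table, assuming deptno as the grouping key and sal as the calculation column:

SELECT deptno,                 -- grouping key
       COUNT(*) AS row_count,  -- Aggregation type = Count of Rows
       SUM(sal) AS sum_sal,    -- sum output column
       MAX(sal) AS max_sal,    -- maximum value output column
       AVG(sal) AS mean_sal    -- mean value output column
FROM   emp
GROUP BY deptno;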
Input data:
Requirement:
Output file11 data:
database properties;
output tab:
Aggregator 2 properties:
172
output tab:
173
output tab:
174
Aggregator 4 properties:
Output tab:
175
176
Requirement:
Output file11 data:
JOB:
database properties;
177
Aggregator 2 properties:
178
Output tab:
Aggregator 4 properties:
output tab:
181
Sorting:
Sorting can be done in different ways.
1. If the source is a database, we can use an ORDER BY clause to sort the data on the column names.
2. If the source is a database, we can use a query like this:
Select Distinct column(s)
From tabname
Order by Column(s)
The same task can be performed using a link-level sort:
Step 1: open the target sequential file stage, select Partitioning, and select the check box Perform sort.
  Stable
  Unique
Here the data display order is ascending, case sensitive, and with nulls first.
182
Here in the above screenshot, if we observe carefully, 3 check boxes have to be selected.
183
Output:
184
Click On output
Properties for Outputname=Emp
185
Columns:
Selection:
186
SQL:
View Data:
187
Columns:
188
SQL:
189
View Data:
Emp_Dataset Properties:
190
View data:
Dept_Data_set:Properties:
191
View data:
192
Parameter Set :
Procedure to create Parameter Set:
1. Choose File > New to open the New dialog box.
2. Open the Other folder, select the Parameter Set icon, and click OK.
3. The Parameter Set dialog box appears.
4. Enter the required details on each of the pages as detailed in the following
sections.
Parameter Set General tab
Use the General page to specify a name for your parameter set and to provide
descriptions of it.
Parameter Set Parameters tab:
Use this page to specify the actual parameters that are being stored in the parameter
Set
Parameter Set Value tab:
Use this page to optionally specify sets of values to be used for the parameters in this
parameter set when a job using it is designed or run.
1. Parameter Set General tab
Use the General page to specify a name for your parameter set and to provide descriptions of it.
   Parameter set name: Ps_StagingDB
   Short description: Parameter set created for connecting to StagingDB

2. Parameter Set Parameters tab
   #  Parameter name  Prompt    Type       Default value  Help text
   1  HostServer      Server    String
   2  UserName        UserId    String
   3  Password        Password  Encrypted

3. Parameter Set Values tab
   Value name  HostServer  UserName  Password
   DevdDB      DevdDB      abreddy   ******
   ProdDB      ProdDB      abreddy   ******
   TestDB      TestDB      abreddy   ******
   ABC                     abreddy   ******
InputData2:
Output:
194
Job:
Output:
195
Pivot Stage:
1. The Pivot stage is an active stage.
2. The Pivot stage is a processing stage.
3. It maps sets of columns in an input table to a single column in an output table. This type of mapping is called pivoting.
4. The Pivot stage converts columns into rows.
Scenario:
E.g., Marks1, Marks2 and Marks3 are three columns.
Task: convert all the columns into one column, i.e. Marks.
Method: in the derivation field of the output column, map the input columns onto the single output column.
E.g., column name "Marks".
Derivation: Marks1, Marks2, Marks3
Note: the column "Marks" is derived from the input columns Marks1, Marks2 and Marks3.
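In SQL the same columns-to-rows pivot can be sketched with UNION ALL; the student table and the student_id column are assumptions for the example, only Marks1, Marks2 and Marks3 come from the scenario above:

SELECT student_id, Marks1 AS Marks FROM student
UNION ALL
SELECT student_id, Marks2 AS Marks FROM student
UNION ALL
SELECT student_id, Marks3 AS Marks FROM student;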
OutputData:
196
Job:
-----
197
OutputData:
Output:
Job:
199
TRANSFORMER STAGE:
The Transformer stage plays a major role in DataStage. It is used to modify the data and apply functions while populating data from source to target.
It takes one input link and gives one or more output links.
It has 3 components:
1. Stage variables
2. Constraints
3. Derivations (or) expressions
1. The Transformer stage can work as a copy stage and as a filter stage.
2. The Transformer stage requires a C++ compiler; it converts the high-level logic into machine language.
200
Double-click on the Transformer stage -> drag and drop the required target columns -> click OK.
The order of execution in the Transformer stage is:
1. Stage variables
2. Constraints
3. Derivations
Example:
How the Transformer works as a filter stage (or how to apply constraints in the Transformer stage):
Double-click on the Transformer stage -> double-click on Constraints -> double-click on the particular link; in this window the information is provided automatically, and the constraint for the reject link is viewed by clicking Otherwise.
Example Derivation:
If Sale_Id < 300 Then Amount_Sold = Amount_Sold + 300
Else If Sale_Id > 300 And Sale_Id < 600 Then Amount_Sold = Amount_Sold + 600
Else If Sale_Id > 600 And Sale_Id < 1000 Then Amount_Sold = Amount_Sold + 1000
Else Amount_Sold = Amount_Sold + 100
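The same derivation expressed as a SQL CASE expression, as a sketch only; the sales table name is an assumption, the column names follow the derivation above:

SELECT Sale_Id,
       CASE
            WHEN Sale_Id < 300                      THEN Amount_Sold + 300
            WHEN Sale_Id > 300 AND Sale_Id < 600    THEN Amount_Sold + 600
            WHEN Sale_Id > 600 AND Sale_Id < 1000   THEN Amount_Sold + 1000
            ELSE Amount_Sold + 100
       END AS Amount_Sold
FROM   sales;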
201
202
9. Utility
EXAMPLE JOBS OF TRANSFORMER STAGE:
1)EXAMPLE JOB FOR TRANSFORMER
JOB:1
Inputfile:
Output requirement
JOB:
203
Sequential file:
204
INPUTCOLUMNS:
OUTPUTLINK:
TARGET FILE:
205
Output requirement:
206
JOB:
207
Output2:
JOB:
INPUT:
208
Transformer1:
209
Left(INPUT.REC,1)
Transformer2:
Constrains logic:
210
OUTPUTINVC:
OUTPUTPRODID:
211
212
Input file:seqfile0:
Job:
213
Join properties:
214
215
2. REMOVE DUPLICATES:
Inputdata:
Output:
216
JOB:
217
<EMPINFO>
  <EMPDETAILS>
    <EMPID>1</EMPID>
    <NAME>BHASKAR</NAME>
    <GENDER>MALE</GENDER>
    <COMPANY>IBM</COMPANY>
    <CITY>HYDERABAD</CITY>
  </EMPDETAILS>
  <EMPDETAILS>
    <EMPID>2</EMPID>
    <NAME>PRADEEP</NAME>
    <GENDER>MALE</GENDER>
    <COMPANY>WIPRO</COMPANY>
    <CITY>BANGLORE</CITY>
  </EMPDETAILS>
  <EMPDETAILS>
    <EMPID>3</EMPID>
    <NAME>SRUJANA</NAME>
    <GENDER>FEMALE</GENDER>
    <COMPANY>INFOSYS</COMPANY>
    <CITY>HYDERABAD</CITY>
  </EMPDETAILS>
  <EMPDETAILS>
    <EMPID>4</EMPID>
    <NAME>KRISHNAVENI</NAME>
    <GENDER>FEMALE</GENDER>
    <COMPANY>TCS</COMPANY>
    <CITY>PUNE</CITY>
  </EMPDETAILS>
  <EMPDETAILS>
    <EMPID>5</EMPID>
    <NAME>SRIKARAN</NAME>
    <GENDER>MALE</GENDER>
    <COMPANY>COGNIZANT</COMPANY>
    <CITY>CHENNAI</CITY>
  </EMPDETAILS>
</EMPINFO>
218
JOB:
219
1. Validation settings
2. Transformation settings
3. Options
Options -> Input -> Columns
221
222
Job:
FTP STAGE:
File transfer from one DataStage file server to another file server:
Job:
229
230
231
Containers:
Containers are used to minimize the complexity of a job, for better understanding, and for reusability.
There are two types of containers available in DataStage:
1. Local container
2. Shared container
Local container: it is used to minimize the complexity of a job for better understanding only.
It is never used for reusability, and its scope is within a job.
Shared container:
Shared containers are used for both purposes, to minimize the complexity of a job and for reusability. Their scope is within a project.
Differences between local containers and shared containers
Local Container:
1. It is used only to minimize the complexity of a job for better understanding
2. It is never used for reusability
3. Its scope is within a job
4. It can be deconstructed directly
Shared container:
1. It is used for both purposes: to minimize the complexity of a job and for reusability
2. Its scope is within a project
3. It occupies some memory
4. It cannot be deconstructed directly; it must first be converted into a local container and then deconstructed
How to construct a container:
Go to DataStage Designer --> open a specific job --> select the required stages in the job --> click on Edit --> click on Construct Container, then choose Local or Shared container.
Note: if we want to deconstruct, right-click on the container -> click Deconstruct.
233
JOB SEQUENCE:
It is used to run all jobs in a sequence (or in an order) considering their dependencies. It has many activities.
How to create a job sequence:
Select Job Sequence -> drag and drop the required jobs from Jobs in the Repository -> give the connections -> save it -> compile it; now these jobs will run sequentially.
234
TERMINATOR ACTIVITY:
It is used to send a stop request to all running jobs.
WAIT FOR FILE ACTIVITY:
Double-click on the Wait For File activity -> go to File -> Filename: select the file and set the timing (24-hour time only).
235
SEQUENCER:
It is used to connect one activity to another. It takes multiple input links and gives one output link.
236
ROUTINE ACTIVITY:
It is used to execute a routine between two jobs.
Double-click on the Routine activity -> choose the routine name -> if parameters are required, supply the parameters.
237
Source table:
no   name      sal
100  Bhaskar   1500
101  Mohan     2000
103  Sanjeev   2000

Target table (before load):
no   name      sal
100  Bhaskar   1000
101  Mohan     1500
102  Srikanth  2000

Target table (after SCD Type-I load):
no   name      sal
100  Bhaskar   1500
101  Mohan     2000
102  Srikanth  2000
103  Sanjeev   2000
Type-I:
In SCD Type-I, if a record exists in the source table and not in the target table, then the record is simply inserted into the target table (the 103 record). If a record exists in both the source and the target tables, then the source record simply overwrites (updates) the target record (100, 101).
Type-II:
While implementing SCD Type-II, two extra columns are maintained in the target, called Effective Start Date and Effective End Date. The Effective Start Date is also part of the primary key.
If a record exists in the source and not in the target table, it is simply inserted into the target table; while inserting, the Effective Start Date is set to the current date and the Effective End Date is set to null.
If a record exists in both the source and target tables, the source record is still inserted into the target table, but before inserting it the existing record in the target table is updated with Effective End Date = Current Date - 1.
Then the source record is inserted into the target table with Effective Start Date = Current Date and Effective End Date = Null.
Target table (after SCD Type-II load):
no   name      sal   Effective_Start_Date  Effective_End_Date
100  Bhaskar   1000  2011-01-31            2012-04-05
101  Mohan     1500  2011-01-31            2012-04-05
102  Srikanth  2000  2011-01-31            2012-04-05
100  Bhaskar   1500  2011-02-01            Null
101  Mohan     2000  2011-02-01            Null
103  Sanjeev   2000  2011-02-01            Null
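A minimal SQL sketch of the Type-II logic described above, assuming a source table emp_src(no, name, sal) and a target emp_dim with the two extra date columns; these table names are assumptions for illustration, and in a real job the change detection would be handled by the SCD or Change Capture stage:

-- Step 1: close the current version of rows that also arrive from the source
UPDATE emp_dim t
SET    t.effective_end_date = CURRENT_DATE - 1
WHERE  t.effective_end_date IS NULL
  AND  EXISTS (SELECT 1 FROM emp_src s WHERE s.no = t.no);

-- Step 2: insert every incoming source row as the new current version
INSERT INTO emp_dim (no, name, sal, effective_start_date, effective_end_date)
SELECT s.no, s.name, s.sal, CURRENT_DATE, NULL
FROM   emp_src s;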
Type-III:
If a record exists in the source and not in the target table, then it is simply inserted into the target table. While inserting, the Effective Start Date is set to the current date and the Effective End Date is set to null.
If a record exists in both the source and target tables, then check the target table count grouped by the primary key. If the count = 1, update Effective End Date = Current Date - 1 and then simply insert the source record into the target.
If the count is greater than one, delete the record from the target table (grouped by primary key) where the Effective End Date is not null, then update the remaining target record with Effective End Date = Current Date - 1, and then simply insert the source record into the target.
SNOW FLAKE SCHEMA:
A snow flake schema maintains normalized data in the dimension tables. In this schema some dimension tables do not maintain a direct relationship with the fact table; instead they are related through another dimension.
Star schema vs. Snow flake schema:
A star schema maintains denormalized data in the dimension tables, so performance is better when joining the fact table to the dimension tables (fewer joins are needed), whereas the snow flake schema keeps the dimensions normalized at the cost of extra joins.
Data Profiling
Data Profiling:
1. Data profiling is the process of examining the data available in an existing data source.
2. A data source is usually a database or a file.
3. By doing data profiling we can collect statistics and information about the data.
Data governance:
Data governance is a quality control discipline for assessing, managing, using, improving, monitoring, maintaining, and protecting organizational information.
What is a Domain
A simple example of a Domain is the list of United States state abbreviations. The Domain could be implemented as a CHAR(2) and would contain the following valid value set: AL, AK, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, IN, IA, KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, NJ, NM, NY, NC, ND, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VT, VA, WA, WV, WI, WY.
Many Columns can share the same Domain. Columns which share the same Domain may be Synonym candidates.
A Domain is defined as the set of all valid values for a Column or set of Columns. Domains contain target data type information, a user-defined list of valid values, and a list of valid patterns. Each Schema has its own set of Domains.
Normalization:
Normalization is the process of decomposing a relation into smaller, well-structured relations without anomalies.
246
If you ran a Dependency profile for this table you would find the following
dependencies, among others
247
The list in the previous slide represents true dependencies. Now let's take a look at the dependencies below.
The first one is not a true dependency because First Name does not positively determine Last Name, in that BHASKAR could be REDDY or RAO.
Similarly, FirstName + LastName doesn't uniquely determine PAN.
If you add the first list to the Dependency Model, you would get two keys:
EmployeeID
PAN
However, only one of them can be a primary key, the other key is called an alternate
key
3. Cross-Table Profiling:
Cross-table profiling compares columns across tables in a schema and determines which ones contain similar values. This profile can determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables.
Cross Table profiling can find the following types of redundancies:
Redundant data to eliminate by creating Synonyms.
Redundant data that is intentionally redundant to improve database
performance. You may still want to synonym these Column pairs to allow the
normalizer to create a true third normal form (3NF).
Data that looks redundant but actually represents different business facts
(homonyms).
Synonyms:
Two or more Columns that have the same business meaning are called Synonyms
248
Both relations contain employee data, but they are defined separately to segregate public
and private information. The EmpID and EmployeeID Columns have the same business
meaning and can be meaningfully combined into a single Column. In contrast, look at
how the MgrID column is used in the Employee Table. Even though MgrID uses similar
values to EmployeeID, it represents a different role in the database. Therefore you would
not define MgrID and EmployeeID as Synonyms.
Normalization has the following impacts on Synonyms:
If two or more Columns made Synonyms represent the identical construct, they
will collapse into one Column in the normalized model.
If two Columns made Synonyms represent a parent-child relationship, they will
result in two Columns in two Tables, with one Column participating in a primary
Key and the other in the corresponding Foreign Key
249
1. Home
2. Overview
3. Investigate
4. Develop
5. Operate
1. Home: contains system administration, security, configuration, and metadata tasks
My Home
Reports
Metadata Management
  Data stores
  Data schemas
  Tables or Files
  Data Fields
  Data Classes
  User Classes
  Contacts
  Policies
  Global Logical Variables
4. Create the IA project to create IA project have to be login with IA admin privileges
5. Import metadata from staging tables to IA environment
6. To Import Meta data have to be login with IA admin privileges
7. Add a data sources to created project
8. Adding the necessary users to the created project
9 .And also can Add the Groups to the created project
10. Assigning a project roles to the user or groups
11. The following are the four different project roles we have in IA:
Information Analyzer Business Analyst
Information Analyzer Data Operator
Information Analyzer Data Steward
Information Analyzer Drill Down User
12.Running CA job for single or multiple columns
13. View Analysis results
14.Capture the analysis results where ever the data validation rules given in Data
profiling requirement template
15. Fill the Data profiling requirement sheet with all the columns where ever the Data
validation rules given
16.Generate the project required reports
17.Deliver the analysis results template and reports to the Focals
2. Enter the user name, password, and host name of the services tier.
3. After logging in to the Information Analyzer main home screen, click on the File menu.
4. Select Open Project; it will display the list of created projects.
5. Select the project whose table you want to run column analysis on and click Open.
6. Now, in the Information Analyzer workspace navigator menu, select the Investigate tab.
7. Select Column Analysis; it will display the column analysis window shown below.
8. Now select the table and go to Tasks; under Tasks you will find the Run Column Analysis option. Click Run Column Analysis; it will take several minutes to complete the column analysis. Once the column analysis completes, it will show the status as Analyzed along with the analysis run date.
9. In the screen shot below, the CA status is now Analyzed.
1. Now select the EMPID column from the above list, go to Tasks, and under Tasks select Open Column Analysis.
Next click on View Details; it will open the Column Analysis results window.
2. The column analysis results are presented in different tabs:
1. Overview
2. Frequency Distribution
3. Data Class
4. Properties Analysis:
Properties has the following information:
1. Data Type
2. Data Length
3. Precision
4. Scale
5. Nullability
6. Cardinality Type
Note: If a format is invalid, we have to change its status to Violation and then change the domain value status to Mark as Invalid; the values associated with that invalid format are then automatically marked as Invalid in the Domain and Completeness tab.
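The idea of an invalid format marking its associated values as invalid can be imitated with a small check. This sketch (the column values and accepted format are assumed for illustration) derives a simple format mask per value and flags values whose mask is not in the accepted list.

Example (Python):

# Sketch: derive a simple format mask (9 = digit, A = letter) per value and
# flag the values whose mask is not an accepted format.
# The EMPID values and the accepted format set are hypothetical.
def mask(value):
    return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in str(value))

empid_values = ["1001", "1002", "10A3", "1004"]
accepted_formats = {"9999"}

for v in empid_values:
    status = "valid" if mask(v) in accepted_formats else "invalid"
    print(v, mask(v), status)        # 10A3 -> 99A9 -> invalid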
You can create a report that summarizes the results of that job and then compare those results with the results from your baseline data.
Baseline analysis:
When you want to know if your data has changed, you can use baseline analysis to
compare the column analysis results for two versions of the same data source. The
content and structure of data changes over time when it is accessed by multiple users.
When the structure of data changes, the system processes that use the data are affected.
To compare your data, you choose the analysis results that you want to set as the baseline
version. You use the baseline version to compare all subsequent analysis results of the
same data source. For example, if you ran a column analysis job on data source A on
Tuesday, you could then set the column analysis results of source A as the baseline and
save the baseline in the repository. On Wednesday, when you run a column analysis job
on data source A again, you can then compare the current analysis results of data source A
with the baseline results of data source A.
To identify changes in your data, a baseline analysis job evaluates the content and
structure of the data for differences between the baseline results and the current results.
The content and structure of your data consists of elements such as data classes, data
properties, primary keys, and data values. If the content of your data has changed, there
will be differences between the elements of each version of the data. If you are
monitoring changes in the structure and content of your data on a regular basis, you might
want to specify a checkpoint at regular intervals to compare to the baseline. You set a
checkpoint to save the analysis results of the table for comparison. You can then choose
to compare the baseline to the checkpoint or to the most recent analysis results.
If you know that your data has changed and that the changes are acceptable, you can
create a new baseline at any time
Comparing analysis results
To identify changes in table structure, column structure, or column content from the
baseline version to the current version, you can compare analysis results.
Before you begin
You must have InfoSphere Information Analyzer Business Analyst privileges and have
completed the following task.
Setting an analysis baseline
Over time, the data in your table might change. You can import metadata for the table
again, analyze that table, and then compare the analysis results to a prior version to
identify changes in structure and content. You can use baseline analysis to compare the
current analysis results to a previously set baseline.
Procedure
You must have InfoSphere Information Analyzer Business Analyst privileges and an
Information Analyzer Data Operator must have completed the following task.
Running a column analysis job
After you run a column analysis job and verify the results of that analysis, you can set the
analysis results as an analysis baseline. You set an analysis baseline to create a basis for
comparison. You can then compare all of the subsequent analyses of this table to the
baseline analysis to identify changes in the content and structure of the data.
Procedure
What to do next
You can now compare the analysis baseline to a subsequent analysis result of the table.
If you are monitoring changes in the structure and content of your data on a regular basis,
you might want to specify a checkpoint at regular intervals to compare to the baseline.
You set a checkpoint to save the analysis results of the table for comparison. You can then
choose to compare the baseline to the checkpoint or to the most recent analysis results.
A checkpoint can also save results at a point in time for analysis publication.
Procedure
To determine whether the content and structure of your data has changed over time, you
can use baseline analysis to compare a saved analysis summary of your table to a current
analysis result of the same table.
About this task
You can use baseline analysis to identify an analysis result that you want to set as the
baseline for all comparisons. Over time, or as your data changes, you can import
metadata for the table into the metadata repository again, run a column analysis job on
that table, and then compare the analysis results from that job to the baseline analysis.
You can continue to review and compare changes to the initial baseline as often as needed
or change the baseline if necessary.
To compare analysis results, you complete the following tasks:
1.Setting an analysis baseline
To establish which version of analysis results will be used as the baseline for
comparison, you must set an analysis baseline.
2.Comparing analysis results
To identify changes in table structure, column structure, or column content from the
baseline version to the current version, you can compare analysis results.
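Outside the tool, this comparison boils down to diffing two analysis summaries of the same source. In the sketch below, both summaries and their properties are hypothetical; the real baseline and current results come from the repository.

Example (Python):

# Sketch: compare a baseline column-analysis summary to a current one and
# report the properties that changed. Both summaries are made up.
baseline = {
    "EMPID": {"data_type": "INTEGER", "nullable": False, "cardinality": 1000},
    "ENAME": {"data_type": "VARCHAR", "nullable": False, "cardinality": 940},
}
current = {
    "EMPID": {"data_type": "INTEGER", "nullable": False, "cardinality": 1020},
    "ENAME": {"data_type": "VARCHAR", "nullable": True,  "cardinality": 951},
}

for column in sorted(set(baseline) | set(current)):
    base, curr = baseline.get(column, {}), current.get(column, {})
    for prop in sorted(set(base) | set(curr)):
        if base.get(prop) != curr.get(prop):
            print(f"{column}.{prop}: baseline={base.get(prop)} current={curr.get(prop)}")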
2. Select Data Quality, go to Tasks, and under Tasks find New Data Rule Definition.
3. Click on New Data Rule Definition; the window shown below will pop up.
4. Click on Overview and provide the data rule name in the Name text field; the short description and long description are optional.
5. Go to Rule Logic; there we have to write the logic.
1. Condition:
IF
THEN
ELSE
AND
OR
NOT
2. (
((
(((
Example:
Once we have written the logic, we have to validate whether it is syntactically correct; if the logic is correct, click on Save and Exit.
3. Source Data
Here the source data is a field (column) name.
4. Condition
NOT
5. Type of check
=
>
<
>=
<=
<>
Contains --> string containment
Exists --> null value test
Matches_Format --> e.g., if country = India then zip code format = 999999
Matches_Regex
occurs
occurs>
occurs>=
occurs<
occurs<=
In_Reference_column
In_reference_List
Unique
Is_numeric
Is_Date
6. Reference Data:
Here in the reference data we have to give the reference column name.
7.)
))
)))
Rule Builder:
1. Business Problem:
Identify whether a column contains data.
Type of check: exists
2. Business Problem:
Type of check: matches_regex
Identify whether the EMPID column contains a numeric value, the length of EMPID is 4, and the format of EMPID is 9999.
Here the source values may be in either upper or lower case; those values then have to be compared with the reference data.
3. Business Problem:
Type of check: matches_format
If country code = India, then check whether the zip code format is 999999 or not.
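To give a rough feel for these checks, the sketch below imitates the three business problems above (exists, matches_regex on EMPID, matches_format on the zip code) against a single hypothetical record. IA evaluates such rules internally; this is only an illustration of the logic.

Example (Python):

import re

# Sketch of the three rule-builder examples, against a made-up record.
record = {"EMPNAME": "BHASKAR", "EMPID": "1023", "COUNTRY": "India", "ZIP": "500081"}

def exists(value):                            # "exists": the column contains data
    return value is not None and str(value).strip() != ""

def empid_matches_regex(value):               # EMPID numeric, length 4, format 9999
    return re.fullmatch(r"\d{4}", str(value)) is not None

def zip_matches_format(country, zip_code):    # if country = India then zip format 999999
    if country != "India":
        return True
    return re.fullmatch(r"\d{6}", str(zip_code)) is not None

print("EMPNAME exists:", exists(record["EMPNAME"]))
print("EMPID matches_regex:", empid_matches_regex(record["EMPID"]))
print("ZIP matches_format:", zip_matches_format(record["COUNTRY"], record["ZIP"]))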
Quality Stage:
1. Why investigate:
Discover trends and potential anomalies in the data
Identify invalid and default values in the data
Verify the reliability of the data in the fields to be used as matching criteria
Gain a complete understanding of the data in context
Investigate:
Verify the domain:
Review each field and verify that the data matches the metadata
Identify the data formats and the missing and default values
Identify the data anomalies:
Format
Structure
Content
Features of investigate:
Analyze free-form and single-domain columns
Provide a frequency distribution of distinct values and patterns
Investigate methods:
Character discrete
Character concatenate
Word investigate
INVESTIGATE STAGE:
Click on Change Mask and select the C mask for all the fields.
At the output it gives 5 columns:
1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent
The output looks like the screen shot below.
The screen shot above is only for one field; it gives the same kind of output for the other fields, whichever you selected.
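The frequency distribution in the pattern report can be approximated with a few lines of Python. The sample phone values below are hypothetical, and the mask rule used here (letters to "a", digits to "n", other characters kept) is an assumption made for illustration; the exact C/T/X mask semantics are as defined by the Investigate stage.

Example (Python):

from collections import Counter

# Sketch: character investigate producing a pattern frequency distribution.
# 'a' replaces letters, 'n' replaces digits; punctuation and spaces are kept.
def char_mask(value):
    return "".join("n" if ch.isdigit() else "a" if ch.isalpha() else ch for ch in value)

phone_numbers = ["040-2355 1234", "040-2355 9876", "9948047694", "N/A"]   # made-up samples

patterns = Counter(char_mask(v) for v in phone_numbers)
total = len(phone_numbers)
for pattern, count in patterns.most_common():
    # roughly corresponds to qsInvPattern, qsInvCount, qsInvPercent in the report
    print(pattern, count, f"{100 * count / total:.0f}%")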
The token report output gives the following columns:
1. qsInvCount
2. qsInvWord
3. qsInvClassCode
2. STANDARDIZE STAGE:
Example job:
Open the Standardize stage and select the rule set text field; select the Standardize Rules folder, within that folder select the OTHER folder, then select the COUNTRY folder and select COUNTRY.
Next, in the literal text field, enter ZQUSZQ.
Add the columns you want to select from the available data columns.
Select the column names <literal>, AddressLine1, AddressLine2, City, State, and Zip in the selected columns list.
You will get the screen below after entering everything.
Next click on OK; you will get the screen below, then click on OK again.
Output:
At the output it gives additional columns, i.e. ISOCOUNTRYCODE and identifierFlag_COUNTRY.
Quality Stage:
Investigate: 3 methods:
1. Character discrete -> C, T, X masks
2. Character concatenate -> C, T, X masks
3. Word investigate
Investigate default column names for the pattern report:
1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent
Investigate default column names for the token report:
1. qsInvCount
2. qsInvWord
3. qsInvClassCode
Lab:
Character discrete: C mask (select one or many columns)
Character concatenate: C mask (select two or more columns to concatenate)
Word investigate: FullName
Token report
Pattern report
Word investigate: Address (pass AddressLine1, AddressLine2)
Token report
Pattern report
Word investigate: Area (City, State, Zip)
Token report
Pattern report
2. Standardize stage:
1. Country identifier:
---> select the rule set COUNTRY from the OTHER folder
---> pass the literal ZQUSZQ and add the columns AddressLine1, AddressLine2, City, State, Zip
---> filter the records wherever we have the flag Y; those are the US records
---> split the US and non-US records into separate targets
2. Apply the USPREP rule set to filter name components from the address fields and area components from the address fields.
Columns:
NameDomain_USPREP
AddressDomain_USPREP
AreaDomain_USPREP
Example pattern report output from the word investigation:
InputPattern   InputName
+FI            DAMORA WILLIAM H
+,+
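To give a feel for how such patterns arise, here is a toy sketch. The token classifications used here (F for a known first name, I for a single-letter initial, + for an unclassified word, the comma kept as-is) and the sample names are assumptions for illustration only; the real classifications come from the rule set's classification tables.

Example (Python):

# Toy sketch of word-investigate pattern generation (assumed classifications).
FIRST_NAMES = {"WILLIAM", "JOHN", "MARY"}

def classify(token):
    if token == ",":
        return ","
    if token in FIRST_NAMES:
        return "F"
    if len(token) == 1 and token.isalpha():
        return "I"
    return "+"

def input_pattern(name):
    tokens = name.replace(",", " , ").split()
    return "".join(classify(tok) for tok in tokens)

print(input_pattern("DAMORA WILLIAM H"))   # -> +FI
print(input_pattern("GARCIA, LOPEZ"))      # hypothetical record -> +,+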
Development Projects.
Enhancement Projects
Migration Projects
Production support Projects.
-> The following are the different phases involved in an ETL project development life cycle:
1) Business Requirement Collection ( BRD )
2) System Requirement Collection ( SRD )
3) Design Phase
a) High Level Design Document (HLD)
b) Low level Design Document ( LLD )
c) Mapping Design
4) Code Review
5) Peer Review
6) Testing
a) Unit Testing
b) System Integration Testing.
c) User Acceptance Testing (UAT)
7) Pre - Production
8) Production (Go-Live)
Business Requirement Collection:
-> The business requirement gathering is started by the business analyst, onsite technical lead, and client business users.
-> In this phase, a Business Analyst prepares the Business Requirement Document (BRD) (or) Business Requirement Specifications (BRS).
-> BR collection takes place at the client location.
-> The outputs from BR analysis are:
-> BRS: the Business Analyst will gather the business requirements and document them in the BRS.
-> SRS: senior technical people (or) the ETL architect will prepare the SRS, which contains the s/w and h/w requirements.
The SRS will include:
a) O/S to be used (Windows or UNIX)
b) RDBMS required to build the database (Oracle, Teradata, etc.)
c) ETL tools required (Informatica, DataStage)
d) OLAP tools required (Cognos, BO)
The SRS is also called the Technical Requirement Specifications (TRS).
Designing and Planning the Solution:
-> The outputs from the design and planning phase are:
a) HLD (High Level Design) Document
b) LLD (Low Level Design) Document
HLD (High Level Design) Document:
An ETL Architect and a DWH Architect participate in designing a solution to build a DWH.
An HLD document is prepared based on the business requirements.
LLD (Low Level Design) Document:
Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
The LLD contains more technical details of the ETL system.
An LLD contains the data flow diagram (DFD) and details of the sources and targets of each mapping.
An LLD also contains information about full and incremental loads.
After the LLD, the Development Phase will start.
Development Phase (Coding):
-> Based on the LLD, the ETL team will create the mappings (ETL code).
-> After designing the mappings, the code (mappings) will be reviewed by developers.
Code Review:
Peer Review:
-> The code will be reviewed by your team member (a third-party developer).
Testing:
-> The following types of testing are carried out in the testing environment:
1) Unit Testing
2) Development Integration Testing
3) System Integration Testing
4) User Acceptance Testing
Unit Testing:
-> A unit test for the DWH is white-box testing; it should check the ETL procedures and mappings.
-> The following test cases can be executed by an ETL developer:
1) Verify that there is no data loss
2) Number of records in the source and target (see the reconciliation sketch after this list)
3) Data load / insert
4) Data load / update
5) Incremental load
6) Data accuracy
7) Verify naming standards
8) Verify column mapping
-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
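As one concrete instance of test case 2 (record counts in source and target), here is a minimal sketch. It uses in-memory SQLite so it runs standalone; the table names are placeholders, and in practice any Python DB-API driver for the actual source and target databases would be used instead.

Example (Python):

import sqlite3

# Minimal sketch of a source-vs-target record-count reconciliation test.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_txn (txn_id INTEGER)")
target.execute("CREATE TABLE dwh_txn (txn_id INTEGER)")
source.executemany("INSERT INTO src_txn VALUES (?)", [(i,) for i in range(100)])
target.executemany("INSERT INTO dwh_txn VALUES (?)", [(i,) for i in range(100)])

src_count = source.execute("SELECT COUNT(*) FROM src_txn").fetchone()[0]
tgt_count = target.execute("SELECT COUNT(*) FROM dwh_txn").fetchone()[0]

assert src_count == tgt_count, f"data loss: source={src_count}, target={tgt_count}"
print("record counts reconcile:", src_count)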
Development Integration Testing:
User Acceptance Testing (UAT):
-> This test is carried out in the presence of client-side technical users to verify the data migration from source to destination.
Production Environment:
-> Migrate the code into the Go-Live environment from the test environment (QA environment).
EXPLANATION:
We have to start with this: our projects are mainly onsite and offshore model projects. In this project we have one staging area in between the source and target databases. In some projects they won't use staging areas. A staging area simplifies the process.
Architecture:
Analysis & Requirement Gathering -> Design -> Development -> Testing -> Production
Analysis and Requirement Gathering: Output: Analysis Doc, Subject Areas
100% onsite: Business Analyst, Project Manager.
Gather the useful information for the DSS, identify the subject areas, identify the schema objects, and so on.
Design: Output: Technical Design Docs, HLD, UTP. ETL Lead, BA, and Data Architect.
80% onsite: schema design in Erwin, implementation in the database, and preparation of the technical design document for ETL.
20% offshore: HLD & UTP.
Based on the technical specs, developers have to create the HLD (high level design); it will have the Informatica flow chart and the transformations required for each mapping.
In some companies they won't have an HLD; directly from the technical specs they will create the mappings. The HLD will cover only 75% of the requirement.
UTP - Unit Test Plan: write the test cases based on the requirement, both positive and negative test cases.
Development: Output: bug-free code, UTR, Integration Test Plan.
ETL Team and offshore BA.
100% offshore.
Based on the HLD, you have to create the mappings. After that, a code review and code-standards review will be done by another team member. Based on the review comments you have to update the mappings. Unit testing is done based on the UTP: you have to fill in the UTP, enter the expected values, and name it the UTR (Unit Test Results). Two rounds of code review and two rounds of unit testing will be conducted in this phase. After migrating to the testing repository, the integration test plan has to be prepared by the senior people.
Testing: Output: ITR, UAT, Deployment Doc, and User Guide.
Testing Team, Business Analyst, and Client.
80% offshore.
Based on the integration test plan, the testing team tests the application and gives the bug list to the developers. The developers fix the bugs in the development repository and the code is again migrated to the testing repository. Testing starts again until the code is bug free.
20% onsite.
UAT - User Acceptance Testing: the client will do the UAT; this is the last phase of the ETL project. If the client is satisfied with the product, the next step is deployment in the production environment.
Production: 50% offshore, 50% onsite.
Work will be distributed between offshore and onsite based on the run time of the application. Mapping bugs need to be fixed by the development team. The development team will support a warranty period of 90 days or as per the agreed number of days.
In ETL projects there are three repositories. For each repository the access permissions and location will be different.
Development: E1
Testing: E2
Production: E3
1. Project Explanation:
I'm giving a generic explanation of the project. Any project, whether banking, sales, or insurance, can use this explanation.
First you have to start with:
1) You have to first explain the objective of the project and what the client's expectations are.
2) You have to explain where your involvement is, the responsibilities of your job, and the limitations of the job.
Add some points from the Project Architecture post reply, like the offshore and onsite model, team structure, etc.
The main objective of this project is that we are providing a system with all the information regarding sales / transactions (sales if it is a sales domain; transactions if it is a banking or insurance domain) of the entire organization all over the country, US / UK (based on the client location). We get the daily transaction data from all branches at the end of the day. We have to validate the transactions and implement the business logic based on the transaction type or transaction code. We have to load all the historical data into the DWH, and once the historical data is finished, we have to load the delta loads. A delta load means the last 24 hours of transactions captured from the source system; in other words, you can call it Change Data Capture (CDC). These delta loads are scheduled on a daily basis. Pick some points from the "What is Target Staging Area" post (source-to-staging mappings, staging-to-warehouse mappings) based on your comfort level.
Each transaction contains a transaction code; based on the transaction code you can identify whether that transaction belongs to sales or purchase / car insurance or health insurance / deposit, loan, or payment (you have to change the words based on the project), etc. Based on that code the business logic will change: we validate, calculate the measures, and load them into the database.
In the Informatica mapping, we first look up all the transaction codes against the code master table to identify the transaction type, implement the correct logic, and filter out the unnecessary transactions, because in an organization there are a lot of transactions but you have to consider only the transactions required for your project. Only the transaction codes that exist in the code master table are considered; other transactions are loaded into one table called the wrap table, and invalid records (transaction code missing, null, or spaces) go to an error table. For each dimension table we create a surrogate key and load into the DWH tables.
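A stripped-down picture of that routing logic, with made-up codes and records, might look like the sketch below; the real implementation is an Informatica lookup and router, not Python.

Example (Python):

# Sketch of the routing described above: valid codes go to the warehouse load,
# unknown codes to the wrap table, missing/blank codes to the error table.
# Codes and records are made up for illustration.
code_master = {"SAL": "sale", "PUR": "purchase", "DEP": "deposit"}

transactions = [
    {"txn_id": 1, "txn_code": "SAL", "amount": 250.0},
    {"txn_id": 2, "txn_code": "XYZ", "amount": 10.0},    # code not in code master
    {"txn_id": 3, "txn_code": None,  "amount": 99.0},    # missing code
]

dwh_load, wrap_table, error_table = [], [], []
for txn in transactions:
    code = (txn["txn_code"] or "").strip()
    if not code:
        error_table.append(txn)                 # invalid: missing, null, or spaces
    elif code in code_master:
        txn["txn_type"] = code_master[code]     # valid: enrich and load to DWH
        dwh_load.append(txn)
    else:
        wrap_table.append(txn)                  # unknown code: wrap table

print(len(dwh_load), len(wrap_table), len(error_table))   # 1 1 1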
SCD2 Mapping:
We are implementing an SCD2 mapping for the customer dimension or account dimension to keep the history of the accounts or customers. We are using the SCD2 date method. Before explaining this you should know the SCD2 method clearly; be careful about it. A minimal sketch of the date method is shown below.
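The gist of the date method is: when a tracked attribute changes, close the current dimension row by setting its end date and insert a new row with a fresh effective date. The column names (eff_date, end_date) and sample values in this sketch are assumed purely to illustrate the idea; the actual mapping is built in Informatica.

Example (Python):

from datetime import date

# Sketch of the SCD2 "date method": end-date the current row and insert a new one.
customer_dim = [
    {"cust_key": 1, "cust_id": "C100", "city": "Hyderabad",
     "eff_date": date(2007, 1, 1), "end_date": None},
]

def apply_scd2(dim, cust_id, new_city, load_date):
    current = next(r for r in dim if r["cust_id"] == cust_id and r["end_date"] is None)
    if current["city"] == new_city:
        return                                   # no change, nothing to do
    current["end_date"] = load_date              # close out the old version
    dim.append({"cust_key": max(r["cust_key"] for r in dim) + 1,
                "cust_id": cust_id, "city": new_city,
                "eff_date": load_date, "end_date": None})

apply_scd2(customer_dim, "C100", "Bangalore", date(2008, 6, 1))
for row in customer_dim:
    print(row)    # old row end-dated, new current row appended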
Responsibilities: pick from the Project Architecture post and describe them according to your comfort level. We are responsible only for development and testing; for scheduling we are using third-party tools (Control-M, AutoSys, Job Tracker, Tivoli, etc.). We simply give the dependencies between each mapping and the run time; based on that information the scheduling-tool team will schedule the mappings. We won't schedule in Informatica. That's it, finished.
Please let me know if you require more explanation regarding any point; reply.
I have done my B.Sc. in Computers from Osmania University, AP, in 2007. After that I had an opportunity to work for Wipro Technologies from Oct 2006 to Aug 2008, where I started off my career as an ETL developer. I was with Wipro for almost 2 years, and then I shifted to Ness Technologies in Aug 2008. Presently I am working with Ness.
In total I have 3.5 years of experience in DWH using the DataStage tool, on development and enhancement projects. Primarily I worked in the healthcare and manufacturing domains.
In my current project my roles & responsibilities are basically:
I am working in an onsite-offshore model, so we get the tasks from my onsite team.
I am involved in the preparation of the source-to-target mapping sheet (tech specs), which tells us what the source and target are, which source column we need to map to which target column, and what the business logic is. This document gives a clear picture for the development.
Preparation of unit test cases is also one of my responsibilities, as per the business requirement.
I am also involved in unit testing for the mappings developed by myself.
I do source code reviews for the DataStage jobs developed by my team members.
I am also involved in the preparation of the deployment plan, which contains the list of DataStage jobs that need to be migrated; based on this the deployment team can migrate the code.
Once the code is rolled out to production we also work with the production support team for 2 weeks, during which we give the KT in parallel. So we also prepare the KT document for the production team.
We reload the staging tables for each session run. Before loading the staging tables we drop the indexes; after loading the bulk data we recreate the indexes using stored procedures.
Then we extract all this data from the stage and load it into the dimensions & facts. On top of the dims and facts we have created some materialized views as per the report requirements. Finally, the reports pull the data directly from the MVs. The performance of these reports/dashboards is always good because we are not doing any calculation at report run time.
How many RFQs were created, how many RFQs were approved, and how many RFQs got responses from the supply channels?
What is the budget?
How much of the budget is approved?
Who is the approval manager, with whom is it pending, and what is the past feedback from the supply channels, etc.?
They don't have a BI design, so they are using a manual process to achieve the above by exporting Excel sheets; with the BI solution we can do drill up and drill down and get all the detailed reports with charts.
Prepared by: A. Bhaskar Reddy
Email: abreddy2003@gmail.com
Phone: 91-9948047694