In the Descriptor File, we can see the schema details and the address of the data.
In the Data File, we can see the data in native format.
And the Control and Header files reside in the operating system.
1) The Descriptor File contains the schema details and the address of the data.
It stores the data in C:/Data/file.ds
In Director, we can
View the jobs
View the logs
Batch jobs
Unlock jobs
Schedule jobs
Monitor the jobs
Message handling
In Manager, we can
Import & export the jobs
Node configuration
soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10
002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20
soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300
soft_com_3
loc_id,add_1,add_2
100,melbourne,victoria
200,brisbane,queensland
Table1
e_id,e_name,e_loc
100,andi,chicago
200,borny,Indiana
300,Tommy,NewYork
Table2
Bizno,Job
20,clerk
30,salesman
Normal Lookup: In a normal lookup, all the reference records are copied into memory and the primary records are cross-verified with the reference records there.
Sparse Lookup: In a sparse lookup, each primary record is sent to the source and cross-verified with the reference records at the source.
We go for a sparse lookup when the reference data is too large to fit in memory and the number of primary records is relatively small compared with the reference data.
Range Lookup: A range lookup performs range checking on selected columns.
For example, if we want to check the range of salary in order to find the grade of an employee, then we can use a range lookup.
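To make the difference concrete, here is a rough sketch of the first two lookup types in Python (this is only an illustration of the logic, not DataStage code; the table and column names are borrowed from the samples below, and the database cursor is a hypothetical DB-API cursor):

# Normal lookup: the whole reference table is copied into memory first.
reference = {"10": "developer", "20": "tester"}  # dept_no -> d_name, held in memory

def normal_lookup(primary_rows):
    for row in primary_rows:
        # every primary row is cross-verified against the in-memory reference
        row["d_name"] = reference.get(row["dept_no"])
        yield row

# Sparse lookup: each primary row is sent to the source as a query instead.
def sparse_lookup(primary_rows, cursor):
    for row in primary_rows:
        cursor.execute("SELECT d_name FROM soft_com_2 WHERE dept_no = ?",
                       (row["dept_no"],))
        hit = cursor.fetchone()
        row["d_name"] = hit[0] if hit else None
        yield row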
table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo
We need to get the records whose dno repeats multiple times into one target, and the records whose dno occurs only once into another target.
Take the job design as below (a sketch of the routing logic follows).
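As a sketch of this routing logic in Python (not DataStage code; the rows are taken from table_a above): count how many times each dno occurs, then send repeated keys to one target and single-occurrence keys to the other.

from collections import Counter

rows = [("10", "siva"), ("10", "ram"), ("10", "sam"), ("20", "tom"),
        ("30", "emy"), ("20", "tiny"), ("40", "remo")]  # (dno, name) from table_a

counts = Counter(dno for dno, _ in rows)
target_multi = [r for r in rows if counts[r[0]] > 1]    # dno repeats -> target 1
target_single = [r for r in rows if counts[r[0]] == 1]  # dno occurs once -> target 2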
We can see the customers' information and their mobile plans (for example).
If we would like to find the lowest plan taken by each customer, take the job design as
Seq.File--------Sort------Tx-----------------D.s
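The Sort + Transformer logic can be sketched in Python as below (only an illustration; the customer and plan column names and values are assumptions, since the sample table is not shown here):

# Sort by customer, then by plan; the first row per customer is its lowest plan.
rows = [("andi", 49), ("andi", 29), ("borny", 99), ("borny", 59)]

lowest = {}
for customer, plan in sorted(rows):
    lowest.setdefault(customer, plan)  # keep only the first (lowest) plan seen
print(lowest)  # {'andi': 29, 'borny': 59}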
xyz_comp
e_id,e_name,e_add
100,jam,chicago
200,sam,newyork
300,tom,washington
400,jam,indiana
500,sam,sanfransico
600,jam,dellas
700,tom,dellas
Seq.File----Sort-----Tx-----R.d-----D.s
Load the table definitions.
(This file should be the same one you have given in the properties.)
Now, in the target Dataset, give the file name.
Now for the sorting process:
In the target, open the Dataset properties and go to Partitioning. Select the partitioning type as Hash.
In Available Columns, select the key column (E_Id, for example) to be sorted.
Click Perform Sort.
Click OK.
Compile and run.
The data will be sorted in the target.
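What Hash partitioning with Perform Sort does can be sketched in Python as below (an illustration only, not DataStage code, assuming 2 nodes and e_id as the key column):

NODES = 2
partitions = [[] for _ in range(NODES)]

rows = [("300", "tom"), ("100", "jam"), ("200", "sam"), ("100", "jim")]
for row in rows:
    partitions[hash(row[0]) % NODES].append(row)  # same key -> same partition

for p in partitions:
    p.sort(key=lambda r: r[0])  # Perform Sort: each partition sorts on the key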
mult_add
e_id,e_name,e_add
10,john,melbourne
20,smith,canberra
10,john,sydney
30,rockey,perth
10,john,perth
20,smith,towand
If you would like to get all the multiple addresses of a customer into one single row from multiple rows, you can perform this using the Sort stage, Transformer stage and Remove Duplicates stage.
Take the job design as below:
SeqFile----Sort-----Tx----R.D----D.S
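The logic of these three stages can be sketched in Python as below (only an illustration of the idea, not DataStage code; in the real job the concatenation is done with a stage variable in the Transformer, and the Remove Duplicates stage keeps the last row per key):

rows = [("10", "john", "melbourne"), ("10", "john", "sydney"),
        ("10", "john", "perth"), ("20", "smith", "canberra"),
        ("20", "smith", "towand"), ("30", "rockey", "perth")]  # sorted on e_id

combined = {}
for e_id, e_name, e_add in rows:
    if e_id in combined:
        combined[e_id] = (e_name, combined[e_id][1] + "," + e_add)  # append address
    else:
        combined[e_id] = (e_name, e_add)
# combined["10"] -> ('john', 'melbourne,sydney,perth')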
e_id,e_name
10,em y
20, j ul y
30,re v o l
40,w a go n
Click OK, then compile and run the job.
You will get the data with all the spaces removed: the spaces between the characters as well as the leading and trailing spaces.
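In a Transformer this derivation is typically done with Convert(" ", "", column); in Python terms the logic is simply:

def strip_all_spaces(value: str) -> str:
    return value.replace(" ", "")  # removes embedded, leading and trailing spaces

assert strip_all_spaces("em y") == "emy"
assert strip_all_spaces(" j ul y") == "july"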
That's it; after compiling and running the job, you will get the data in 3 different columns in the output, as required.
Customers
1 bhaskar 2000
2 ramesh 2300
3 naresh 2100
4 kiran 1900
5 sunitha 2000
They are single-column rows; the values just have spaces in between.
Our requirement is to get the data into three different columns from this single column.
Here Customers is the column name we are getting, and we have only a single column.
Now take the job design as below:
Seq.File------------Tx-------------Ds
That's it.
Give a name for the file in the target.
Now compile and run the job.
You will get the output as required.
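The Transformer derivations behave like the Python sketch below (an illustration only): each output column takes one space-delimited field, as the Field() function does in DataStage.

line = "1 bhaskar 2000"
c_id, c_name, c_sal = line.split(" ")  # Field(col, " ", 1) .. Field(col, " ", 3)
print(c_id, c_name, c_sal)             # 1 bhaskar 2000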
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40
Click OK.
Give the file name at the target file, and compile and run the job to get the output.
Here '@' is the pad character you want to appear after the data, and 5 is the pad length. Now click OK.
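As a sketch of the padding in Python (an illustration only, assuming the pad length means the number of pad characters appended; if it means the total width instead, use value.ljust(pad_length, pad_char)):

def pad(value: str, pad_char: str = "@", pad_length: int = 5) -> str:
    return value + pad_char * pad_length  # append pad_length pad characters

assert pad("sam") == "sam@@@@@"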
Change_after_data
e_id,e_name,e_add
11,kim,syd
22,jim,canb
33,pim,syd
44,lim,canb
55,pom,perth
Change_before_data
e_id,e_name,e_add
11,kim,syd
22,jim,mel
33,pim,perth
44,lim,canb
55,pom,adeliade
66,shila,bris
Take Job Design as below
Compile and Run the Job now to get the required Output.
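The comparison these tables imply (a change capture on key e_id) can be sketched in Python as below; the change codes follow the stage's usual convention (1 = insert, 2 = delete, 3 = edit), and the values come from the before/after tables above.

before = {"11": "syd", "22": "mel", "33": "perth",
          "44": "canb", "55": "adeliade", "66": "bris"}
after = {"11": "syd", "22": "canb", "33": "syd",
         "44": "canb", "55": "perth"}

changes = []
for e_id, e_add in after.items():
    if e_id not in before:
        changes.append((e_id, e_add, 1))   # insert
    elif before[e_id] != e_add:
        changes.append((e_id, e_add, 3))   # edit
for e_id, e_add in before.items():
    if e_id not in after:
        changes.append((e_id, e_add, 2))   # delete (66 exists only in before)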
26. MERGE STAGE EXAMPLES:
The Merge stage is a processing stage which is used to perform horizontal combining. It is one of the stages that perform this operation, like the Join stage and the Lookup stage. The only differences between these stages are their memory usage and their input requirements.
Example for Merge Stage
Sample Tables
MergeStage_Master
cars,ac,tv,music_system
BMW,avlb,avlb,Adv
Benz,avlb,avlb,Adv
Camray,avlb,avlb,basic
Honda,avlb,avlb,medium
Toyota,avlb,avlb,medium
Mergestage_update1
cars,cooling_glass,CC
BMW,avlb,1050
Benz,avlb,1010
Camray,avlb,900
Honda,avlb,1000
Toyota,avlb,950
MergeStage_update2
cars,model,colour
BMW,2008,black
Benz,2010,red
Camray,2009,grey
Honda,2008,white
Toyota,2010,skyblue
In the Merge stage, take cars as the key column. In the output columns, drag and drop all the columns to the output file.
Give a file name to the target/output file and, if you want, you can add reject links (n-1 of them, one per update link).
Compile and run the job to get the required output.
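The merge itself behaves like the Python sketch below (an illustration only, with just two rows per table): the master row is combined with the matching update rows on the key cars, and update rows with no master match would go to the reject links.

master = {"BMW": ("avlb", "avlb", "Adv"), "Benz": ("avlb", "avlb", "Adv")}
update1 = {"BMW": ("avlb", "1050"), "Benz": ("avlb", "1010")}
update2 = {"BMW": ("2008", "black"), "Benz": ("2010", "red")}

merged = []
for car, m in master.items():
    merged.append((car,) + m + update1.get(car, ()) + update2.get(car, ()))

rejects1 = [(c,) + u for c, u in update1.items() if c not in master]  # reject link 1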
d) Reject Mode:
If we would like to control how the job handles error data, the Reject Mode property gives three different options:
1) Continue
2) Fail
3) Output
1) Continue: it leaves out the error data and loads the rest of the data into the target. This is the default option.
Memory limit means a sequential file can hold only a limited amount of data. If the data exceeds 2 GB, the file is full and we need to save the data in two files.
If we would like to save the data in a single file, we can go with a Dataset file instead.
A sequential file is processed sequentially by default, so it has a conversion problem when the data is transferred from stage to stage.
xyzbank2
e_id,e_name,e_loc
555,flower,perth
666,paul,goldenbeach
777,raun,Aucland
888,ten,kiwi
33. SURROGATE KEY STAGE:
A surrogate key is a unique identification key. It is an alternative to the natural key. A natural key may be an alphanumeric composite key, but a surrogate key is always a single numeric key.
The Surrogate Key stage is used to generate key columns, for which the characteristics can be specified. It generates sequential, incremental and unique integers from a provided start point. It can have a single input link and a single output link.
Type-3 SCD: The Type-3 SCD maintains partial historical information.
2. CD:
The cd command is used to change the directory.
Syntax is cd [dir]
Ex: cd tech
3. FTP:
The ftp command is used to transfer files to and from a remote server.
Syntax is ftp [options] [hostname]
The options are:
d - debugging is enabled
g - filename globbing is disabled
v - displays all responses from the remote server
4. GREP:
The grep command is used to search one or more files for lines that contain a pattern.
Syntax is grep [options] [pattern] [files]
Some of the options are as below:
b - displays the block number at the beginning of each line
h - displays the matched lines without displaying the file names
c - displays only a count of the matching lines
i - is used to ignore case
s - is for silent mode (error messages are suppressed)
v - is used to display all lines that do not match
w - is to match whole words only
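For example, to search a (hypothetical) file called log.txt for the word "error", ignoring case:
Ex: grep -i error log.txt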
5. KILL:
The kill command is used to kill one process id or multiple ids.
Syntax is kill [options] ids
Options are:
l - lists the signal names
signal - the signal number or name to send
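For example, to send signal 9 (SIGKILL) to a hypothetical process id 1234:
Ex: kill -9 1234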
6. LS:
The ls command is used to list all the entries in a directory.
Syntax is ls [options] [names]
Seq.File--------------Col.Gen------------------Ds
xyzbank
e_id,e_name,e_loc
555,flower,perth
666,paul,goldencopy
777,james,aucland
888,cheffler,kiwi
And in the Datastage 8.0.1 Version, there are 5 components. They are
a) Datastage Designer
b) Datastage Director
c) Datastage Admin
d) Web Console
e) Information Analyzer
Here the Datastage Manager is integrated into the Datastage Designer.
2) Datastage 7.X.2 Version is OS dependent. That is, the OS users are the Datastage users.
And in 8.0.1, it is OS independent. That is, users can be created within Datastage, but it is OS dependent one time (at installation).
3) Datastage 7.X.2 version has a file-based repository (folders).
3) Datastage 8.0.1 Version has a database repository.
7) In 7.X.2 the server is IIS.
7) In 8.0.1 the server is WebSphere.
1) Any to Any
That means Datastage can extract the data from any source and can load the data into any target.
2) Platform Independent
A job developed on one platform can run on any other platform. That means if we design a job for uniprocessor-level processing, it can run on an SMP machine.
3) Node Configuration
Node configuration is a technique to create logical CPUs.
A node is a logical CPU.
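For example, a minimal configuration file defining one logical node could look like the sketch below (the fastname and resource paths are assumptions, not values from this document):

{
    node "node1"
    {
        fastname "server1"
        pools ""
        resource disk "/opt/datastage/data" {pools ""}
        resource scratchdisk "/opt/datastage/scratch" {pools ""}
    }
}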
4) Partition Parallelism
Partition parallelism is a technique of distributing the data across the nodes based on partition techniques. The partition techniques are
a) Key based
b) Key less
5) Pipeline Parallelism
Pipeline parallelism is the process in which the extraction, transformation and loading occur simultaneously.
Re-partitioning: the redistribution of already distributed data is re-partitioning.
Reverse partitioning is called collecting. The collecting methods are
Ordered
Round Robin
Sort Merge
Auto
a) Same: this technique is used in order not to alter the existing partitioning from the previous stage.
b) Entire: each partition gets the entire dataset; that is, the rows are duplicated.
c) Round Robin: in the round robin technique, rows are evenly distributed among the partitions.
d) Random: the partition a row is assigned to is random.
We use the Hash partitioning technique when the key column's data type is text.
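The difference between a key-less technique (round robin) and a key-based one (hash) can be sketched in Python as below (an illustration only, not DataStage code, assuming 3 nodes):

NODES = 3
round_robin = [[] for _ in range(NODES)]
hashed = [[] for _ in range(NODES)]

rows = [("10", "siva"), ("20", "tom"), ("10", "ram"), ("30", "emy")]
for i, row in enumerate(rows):
    round_robin[i % NODES].append(row)        # rows spread evenly, key ignored
    hashed[hash(row[0]) % NODES].append(row)  # same key always lands together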
1) Data Profiling
2) Data Quality
3) Data Transformation
4) Metadata Management
Data Profiling:
Data profiling is performed in 5 steps. It analyzes whether the source data is good or dirty.
These 5 steps are
a) Column Analysis
b) Primary Key Analysis
c) Foreign Key Analysis
d) Cross-Domain Analysis
e) Baseline Analysis
After completing the analysis, if the data is good, there is no problem. If the data is dirty, it is sent for cleansing. This is done in the second phase.
Data Quality:
In the Data Quality phase, after receiving the dirty data, the data is cleaned in 5 different ways. They are
a) Parsing
b) Correcting
c) Standardizing
d) Matching
e) Consolidating
Data Transformation:
After completing the second phase, it gives the golden copy.
The golden copy is nothing but the single version of truth.
That means the data is good now.
Repository:
The repository is an environment where we create jobs, design, compile, run, etc.
Some of the components it contains are
JOBS, TABLE DEFINITIONS, SHARED CONTAINERS, ROUTINES, ETC.
Server (engine): the engine runs executable jobs that extract, transform and load data into a data warehouse.
Before continuing with the Datastage installation, you need to check your system requirements.
Your system should have
a) A minimum of 2 GB of RAM
b) Windows Server 2003 (you can also install on Windows XP, but it is better to have Windows Server 2003)
c) Oracle 9i/10g installed before the installation of Datastage
d) The firewall turned off
Open your CD to install.
1) Click on Install.exe
On the next screen you can see
IBM Information Server
-----Client
-----Engine
-----Domain
-----Metadata Repository
-----Select All
Next
7) In Websphere Server Administrator Information
Username ------- admin
password ------- admin
confirm password--admin
Administrator
Click on Start
In services
a) ASB Agent Started Automatic
b) Datastage Engine Started Automatic
c) Datastage Telnet Started Automatic
d) DB2- DB2 Copy Started Automatic
e) IBM Websphere App Server v6 Started Automatic
History of Datastage:
Datastage was developed in 1997 by VMark, a UK-based company (a top-100 company). Mr. Lee Scheffler is the father of Datastage.
At that time "Datastage" was called Data Integrator.
This product has been acquired by many companies: it first went to Torrent and then into the hands of Informix.
Informix had its popular Database product, and now they had the Data Integrator as well.
In later years, I.B.M acquired the Informix Database product (in 2001).
The remaining company then changed its name to Ascential, and they changed the name of the product to Datastage Server Jobs.
Later, in 2002, Ascential Datastage was integrated with Orchestrate (PX, UNIX). Orchestrate is another tool.
Datastage got its parallel capabilities from 2002 by integrating with Orchestrate.
At that time, this software worked only in a Unix environment.
In December 2004, with version 7.5X2, Ascential Datastage was integrated with the MKS Toolkit. This is used to run the software in a Windows environment: the toolkit creates a partial Unix environment inside Windows to run the Datastage software.
And the Ascential suite components are
ProfileStage
QualityStage
AuditStage
MetaStage
Datastage PX
Datastage TX
In 2005 I.B.M acquired Ascential outright (with Datastage) and named it I.B.M Datastage:
I.B.M Datastage 7.5X2
Job prefixes are optional, but they help to quickly identify the type of job and can make job navigation and job reporting easier:
Parallel jobs - par
Server jobs - ser
Sequence jobs - seq
Batch jobs - bat
Mainframe jobs - mfe
STAGE NAMES
The stage type prefix is used on all stage names so that it appears on metadata reports that do not include a diagram of the stage or a description of the stage type. The name alone can then be used to indicate the stage type.
Source and target stage names identify the name of the entity, such as a table name or a sequential file name. The stage name strips out any dynamic part of the name, such as a timestamp, and any file extensions.
The prefix identifies the source type; the rest of the name indicates how to
find that source outside of DataStage or how to refer to that source in
another DataStage job.
Transformation stages
LINK NAMES
The link name describes what data is travelling down the link. Link names
turn up in process metadata via the link count statistics so it is very
important to use names that make process reporting user friendly.
Only some links in a job are important to project administrators. The link naming convention therefore has two types of link names: links of importance have a five-letter prefix followed by a double underscore followed by the link details, while intermediate links have a link name without a double underscore.
Any project can add new links of importance, such as the output count of a
remove duplicates or aggregation stage.
You can then produce a pivot report against the link row count statistics to show the row counts for a particular job, using the five-letter prefix for each type of row count.
Documented By
Bhaskar Reddy.A
Mail: abreddy2003@gmail.com
Contact: 91-9916355577