1. What are the source systems and target systems in your project?
-> Source: flat files, Oracle (any database).
-> Target: DB2 (8.1 PX), Oracle, SQL Server.
2. What are the ETL stages you have used in your project?
-> Transformer, Sort, Join, Dataset, Lookup, Change Capture, Aggregator.
3. What are the differences between the Oracle and ODBC Enterprise stages?
-> SEQ FILE: stores data as a plain flat file in the file system.
-> DATASET: stores data in DataStage's internal parallel format, preserving partitioning across nodes.
6. What are the different processing stages you have used in your project?
-> Transformer, Modify, Change Apply, Merge, Sort, Surrogate Key Generator.
7. What is a shared container? What is the purpose of using this stage?
-> It creates a reusable object that many jobs within the project can include.
-> With a parallel shared container, the logic can be reused across many jobs.
-> This stage is used to combine multiple input datasets (with the same metadata) into a single output dataset.
1. CONTINUOUS FUNNEL: combines records in no particular order (loads without ordering).
2. SORT FUNNEL: loads records in a particular order (ascending or descending) based on key columns.
3. SEQUENCE FUNNEL: reads all records from the first input link, then the second, and so on (see the SQL analogy below).
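-> For intuition only, the first two funnel types behave like SQL set operations; the table and column names below are illustrative:

    -- Continuous funnel ~ UNION ALL: same metadata, no guaranteed order
    SELECT cust_id, cust_name FROM src_file_a
    UNION ALL
    SELECT cust_id, cust_name FROM src_file_b;

    -- Sort funnel ~ UNION ALL plus ORDER BY on the key columns
    SELECT cust_id, cust_name FROM src_file_a
    UNION ALL
    SELECT cust_id, cust_name FROM src_file_b
    ORDER BY cust_id;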
11. Suppose I have source data records and the lookup data is about 50,000 records. Which stage is preferred for this requirement?
-> Lookup.
12. Suppose I have 50,000 records in the source and 100 records in the lookup, then what is the stage preferred for this requirement?
-> Lookup.
-> A Basic Transformer does not run on multiple nodes, whereas a Normal Transformer can run on multiple nodes, giving better performance.
-> A Basic Transformer takes less time to compile than the Normal Transformer.
-> The Basic Transformer stage can only be used on SMP (Symmetric Multiprocessing) systems, not on MPP (Massively Parallel Processing) or cluster systems.
USAGE:
-> A Basic Transformer should be used in server jobs.
-> A Normal Transformer should be used in parallel jobs, as it runs on multiple nodes there, giving better performance.
15. What is the stage you have used to load the data into a Teradata table?
-> It explains how to capture the changes in the target over a period of time.
-> It explains change data capture.
-> Type I: Replace the old record with a new record containing the updated data, thereby losing the history. A data warehouse has a responsibility to track history effectively, and this is where a Type I implementation fails.
-> Type II: Create an additional dimension-table record with the new value; this way we keep the history. We can determine which dimension row is current by adding a current-record flag or a timestamp to the dimension row (see the SQL sketch below).
-> Type III: In this type of implementation we create a new field in the dimension table which stores the old value of the attribute. When an attribute of the dimension changes, we push the updated value to the current field and the old value to the old field.
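-> A minimal SQL sketch of a Type II change, assuming an illustrative dim_customer table with a current-record flag and effective dates (all names and values are hypothetical):

    -- Expire the current version of the changed row
    UPDATE dim_customer
       SET current_flag = 'N',
           end_date     = CURRENT_DATE
     WHERE customer_id  = 101
       AND current_flag = 'Y';

    -- Insert the new version, keeping the old row as history
    INSERT INTO dim_customer
        (customer_key, customer_id, city, current_flag, start_date, end_date)
    VALUES
        (9001, 101, 'Hyderabad', 'Y', CURRENT_DATE, NULL);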
-> Normalized data is held in a very simple structure: the data is stored in tables.
-> Each table has a primary key and should contain data relating to one entity, so a normalized customer table contains only data about customers.
-> We need to make logical connections between the entities (for example, this customer placed these orders).
-> To do this we use a foreign key in the orders table to point to the primary key in the customer table, as in the sketch below.
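-> A minimal sketch of that customer/orders relationship (column sizes are illustrative):

    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100)
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        order_date  DATE,
        -- foreign key pointing to the primary key of the customer table
        customer_id INTEGER REFERENCES customers(customer_id)
    );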
-> A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references.
-> The fact table contains raw numeric items that represent relevant business facts.
-> A star schema is a way of organizing the tables so that we can retrieve results from the database easily and quickly in a warehouse environment. Usually a star schema consists of one or more dimension tables around a fact table, which looks like a star; that is how it got its name. A typical query against it is sketched below.
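-> A typical star-schema query, with illustrative fact and dimension names: the fact table is joined to each dimension through its foreign keys, and the numeric facts are aggregated.

    SELECT d.year, p.product_name, SUM(f.sales_amount) AS total_sales
      FROM fact_sales f
      JOIN dim_date    d ON f.date_key    = d.date_key
      JOIN dim_product p ON f.product_key = p.product_key
     GROUP BY d.year, p.product_name;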
-> An ODS can be described as a snapshot of the OLTP system. It acts as a source for the EDW (Enterprise Data Warehouse).
-> The ODS is more normalized than the EDW. Also, the ODS doesn't store any history. Normally the dimension tables remain at the ODS (SCD types can be applied in the ODS), whereas the facts flow on to the EDW.
-> The EDW maintains historical data.
-> It is very useful for decision making in the enterprise.
24. WHAT IS A DATAMART?
-> A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.
28. What is a normal view and a materialized view? When do you use which view?
VIEW
-> A view takes the output of a query and makes it appear like a virtual table. You can
use a view in most places where a table can be used.
-> All operations performed on a view will affect data in the base table and so are
subject to the integrity constraints and triggers of the base table.
-> A View can be used to simplify SQL statements for the user or to isolate an
application from any future change to the base table definition.
-> A View can also be used to improve security by restricting access to a predetermined
set of rows or columns.
MATERIALIZED VIEW
-> Materialized views are schema objects that can be used to summarize, precompute,
replicate, and distribute data. E.g. to construct a data warehouse.
-> A materialized view provides indirect access to table data by storing the results of a
query in a separate schema object.
-> Unlike an ordinary view, which does not take up any storage space or contain any data, a materialized view physically stores its result set.
-> The existence of a materialized view is transparent to SQL, but when used for query rewrites it will improve the performance of SQL execution.
-> An updatable materialized view lets you insert, update, and delete.
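-> A short sketch of both objects in Oracle syntax, reusing the emp table from the examples later in this document (the refresh options are illustrative):

    -- Ordinary view: stores only the query definition, no data
    CREATE VIEW dept_salaries AS
        SELECT deptno, SUM(sal) AS total_sal FROM emp GROUP BY deptno;

    -- Materialized view: physically stores the query results
    CREATE MATERIALIZED VIEW dept_salaries_mv
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
        AS SELECT deptno, SUM(sal) AS total_sal FROM emp GROUP BY deptno;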
29. What are buildops?
-> Buildops are good if users need custom coding but do not need dynamic (runtime-based) input and output interfaces.
-> A buildop provides a simple means of extending beyond the functionality provided by PX, but does not use an existing executable (unlike the wrapper).
-> Transformer.
34. Why is the Transformer stage so costly in development? How is this stage different from other stages?
-> It adds columns to the incoming data and generates mock data for those columns for each data row processed.
37. Instead of using the Column Generator stage, can we generate a column in any stage?
-> No.
38. What is RCP?
-> RCP (Runtime Column Propagation) lets a job propagate columns that are not explicitly defined in the stage metadata through to the output at run time.
-> The Funnel stage copies multiple input datasets (same metadata) to a single output dataset.
-> The operation is useful for combining separate datasets into a single large dataset.
-> It is a processing stage that combines data from multiple input links into a single output link.
42. Which one gives better performance, the ODBC or the Oracle Enterprise stage?
-> Basically, an environment variable is a predefined variable that we can use while creating a DS job.
-> We can set it either at the project level or the job level.
-> Once we set a specific variable, that variable will be available in the project/job.
49. Have you used sequencer jobs? What are the different triggers you have used?
50. Explain about yourself: what is your work experience and how do you rate yourself on data warehousing?
52. Suppose we have two flat files and we want to build a scenario with one fact table and two dimensions; how do you build this and what transformations do you use?
53. Why do you use a shared container to read a UNIX flat file; why not some other stage?
54. What is that other stage?
56. What are your targets and sources?
-> Target: DB2
-> Source: flat files
59. Suppose we take three jobs and we want to run those jobs using a sequencer; how do you use that? Explain.
-> Arrange them sequentially with links.
61. What are the different stages involved in a sequencer job? Explain them.
64. Suppose you have built summarized tables and need to improve performance; what do you do?
POLARIS-INTERVIEW
68. What is the size of the database in your project?
-> 110 GB
69. What is the size of data extracted in the extraction process?
-> Less than 1 GB
70. How many data marts are there in your project?
-> 2
71. How many fact and dimension tables in your project?
-> 4 fact tables, 10 dimension tables.
72. What is the size of fact table in your project?
73. How many dimension tables did you have in your project? Name some dimensions.
-> 10: book, country, counter, subaccount, counterparty, security, legal entity, intercompany, trnstype, project.
74. How many measures you have created?
-> 45
75. How do you connect to Oracle?
JDA
76. What is the difference between local & environment variables?
77. What is a routine? Have you used routines in your project?
78. What are after- and before-routines?
79. What are stage variables? Where do we create stage variables? Why do we use stage variables?
80. How do we access the Oracle database from DataStage?
-> Oracle Enterprise, ODBC
81. Suppose there is no client version on your local machine and Oracle is installed on a UNIX box; how do you get access to the Oracle connection?
83. Suppose there are no DataStage client components installed on your local machine; how do you compile and run the job?
84. Suppose a single job is accessed in three sequencer jobs; can we run the three sequencer jobs at a time?
-> Yes, all of them.
86. How does the Sequential File stage execute at run time in PX: sequential or parallel mode?
-> Sequential.
87. How do we make a Sequential File stage run in parallel mode?
-> Set the "Number of Readers Per Node" option to 2 (or more).
88. Have you built separate data marts for financial and manufacturing aspects?
90. Were you loading data into the DWH or a staging area? If loading the data into the DWH, how do you load the data into the target table?
91. How many fact and dimension tables have you used?
98. Difference between the Data Set stage and the File Set stage.
99. If we are using an SFS as source and hash files as reference, with a Transformer stage, and pulling data into the target without using any constraints, which kind of join simulation are we doing?
104. How many dimensions and fact tables have you developed? Name some of the fact and dimension tables.
110. If the lookup table has more than 10,000 records and the source table has considerably fewer records, which stage will you use: Lookup or Join?
-> Sparse lookup.
111. What is a configuration file?
113. What are the different types of partitions? What is the default?
117. If I have a file containing date as one of the columns and I have 5 years of data, and at run time I have to load the data according to the user requirement, which is 6 months of data, how do I go about it?
-> Using a job parameter (see the sketch below).
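-> A minimal sketch, assuming the date filter sits in user-defined SQL and that job parameters named pStartDate and pEndDate are substituted with the #...# convention (the parameter names and the sales table are illustrative):

    -- Load only the requested window; #pStartDate#/#pEndDate# are replaced
    -- with the values supplied at run time (illustrative names)
    SELECT *
      FROM sales
     WHERE sale_date >= TO_DATE('#pStartDate#', 'YYYY-MM-DD')
       AND sale_date <  TO_DATE('#pEndDate#', 'YYYY-MM-DD');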
DATA STAGE
118. What is a configuration file?
123. In DataStage, where do we store job names, parameters, etc.? What is the name of the place/database where we store them?
127. What is meant by shared containers? What are the different types of shared containers, and what is the difference between them?
129. How to optimize and tune the parallel jobs for better performance
ORACLE
UNIX
134. What is shell scripting
DWH
135. What is the difference between a fact table and a dimension table
144. In a sequence of 5 jobs, if it aborts at the 3rd job and we rerun the sequence, from where does it start processing?
145. Difference between sparse and normal lookup, and where do we use sparse? Can you give one situation?
146. Data set will be stored in two formats. What are they?
147. How do you invoke the UNIX scripts in Data stage jobs?
149. When do you use separate sort stage and inbuilt sort utility in a stage? What is the
difference between these two?
150. Have you run the jobs with odd number of nodes? If not why?
158. How to call the stored procedures in data stage jobs? Can you specify constraints?
160. If a job has 3 stages and nodes = 3, then how many processes do you get?
161. A sequential file has 100 records. How to use first 19 records in the job?
162. How to check the number of nodes while running the job in UNIX environment?
164. A transformer stage is running in parallel. How can you change to sequential?
166. Configuration file is specified at project level and at job level. Which one will
override the other?
167. In UNIX you don't know some commands, but while running the jobs you need those commands; how can you get them?
-> grep command: used for finding any string in a file.
-> In a UNIX environment, you can also use the pd_start script to manually start and stop the server processes.
174. Performance difference between Join and Merge: which takes more memory?
-> Notification Activity: used for sending emails to user-defined recipients from within DataStage.
176. A sequential file has 4 duplicate records. How do you get 3rd record?
179. If the job is aborted while running in UNIX, what error message do you get?
-> The default editor that comes with the UNIX operating system is called vi (visual editor).
-> Editing.
-> Yes.
INTERVIEW QUESTIONS
-> In a subquery, the inner query is executed only once. Depending upon the results of the inner query, the outer query is evaluated.
-> ex: select * from emp where deptno = (select deptno from dept where dname = 'hyd');
-> In a correlated subquery, the inner query is evaluated once for each row processed by the parent statement (the outer query).
-> ex: SELECT empno, ename FROM emp e WHERE sal > (SELECT AVG(sal) FROM emp WHERE deptno = e.deptno);
-> Normalization is a process for reducing the redundancy of the data. There are 5 normal forms; mainly the first 3 are used, plus Boyce-Codd Normal Form (BCNF).
-> First Normal Form (1NF): A relation is said to be in 1NF if it has only single-valued attributes; neither repeating groups nor arrays are permitted.
-> Second Normal Form (2NF): A relation is said to be in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the primary key.
-> Third Normal Form (3NF): A relation is in 3NF if it is in 2NF and has no transitive dependencies.
-> Eliminate columns not dependent on the key: if attributes do not contribute to a description of the key, remove them to a separate table (see the sketch below).
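-> A small sketch of the 3NF rule above: dept_location depends on dept_no, not on emp_no, so it moves to its own table (all names are illustrative):

    CREATE TABLE departments (
        dept_no       INTEGER PRIMARY KEY,
        dept_location VARCHAR(50)
    );

    CREATE TABLE employees (
        emp_no   INTEGER PRIMARY KEY,
        emp_name VARCHAR(50),
        dept_no  INTEGER REFERENCES departments(dept_no)
    );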
190. If I select from two tables, how do I assure which table is searched first?
->
191. What are triggers? What are row-level and statement-level triggers?
-> Triggers are an action that is performed when a condition is met within the code.
-> Row-Level Triggers
-> Statement-Level Triggers
-> Before Triggers
-> After Triggers
-> Schema Triggers
-> Database-Level Triggers
-> Row-level trigger: fires once for each row that is affected by the triggering event.
-> Statement-level trigger: fires once per triggering statement, regardless of the number of rows affected by the triggering event (a sketch follows this list).
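-> A minimal row-level trigger sketch in Oracle PL/SQL (the audit table emp_sal_audit is hypothetical):

    CREATE OR REPLACE TRIGGER trg_emp_sal_audit
    BEFORE UPDATE OF sal ON emp
    FOR EACH ROW            -- row-level: fires once per affected row
    BEGIN
        INSERT INTO emp_sal_audit (empno, old_sal, new_sal, changed_on)
        VALUES (:OLD.empno, :OLD.sal, :NEW.sal, SYSDATE);
    END;
    /

Dropping the FOR EACH ROW clause would make it a statement-level trigger that fires once per UPDATE statement.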
192. Implicit and explicit cursors: do they work for an UPDATE statement?
-> PL/SQL declares a cursor implicitly for all SQL data manipulation statements, including queries that return only one row.
-> Queries that return more than one row require an explicit cursor or a cursor FOR loop.
-> An implicit cursor handles Declare, Open, Fetch, and Close automatically. Explicit cursors are used to process multi-row SELECT statements; an implicit cursor is used to process INSERT, UPDATE, DELETE and single-row SELECT ... INTO statements, as sketched below.
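-> A minimal explicit-cursor sketch (Oracle PL/SQL, reusing the emp table from the earlier examples):

    DECLARE
        CURSOR c_emp IS
            SELECT empno, ename FROM emp WHERE deptno = 10;
        v_empno emp.empno%TYPE;
        v_ename emp.ename%TYPE;
    BEGIN
        OPEN c_emp;
        LOOP
            FETCH c_emp INTO v_empno, v_ename;
            EXIT WHEN c_emp%NOTFOUND;   -- stop when no more rows
            DBMS_OUTPUT.PUT_LINE(v_empno || ' ' || v_ename);
        END LOOP;
        CLOSE c_emp;
    END;
    /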
194. Write a SQL query to fetch the sum of the top 5 salaries, department-wise.
-> (one possible answer is sketched below)
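-> One possible answer using an analytic function (Oracle syntax; ROW_NUMBER could be swapped for RANK if ties should share a slot):

    SELECT deptno, SUM(sal) AS top5_sal_sum
      FROM (SELECT deptno, sal,
                   ROW_NUMBER() OVER (PARTITION BY deptno
                                      ORDER BY sal DESC) AS rn
              FROM emp)
     WHERE rn <= 5
     GROUP BY deptno;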
-> If you have lots of updates, long-running SQL, and too small an UNDO, the ORA-01555 error will appear.
-> The SNAPSHOT_TOO_OLD error can also come up when the query goes into a Cartesian product or an infinite loop.
-> Another case is updating a large number of rows at a time without saving the records. Here you can commit every 500 records to avoid the problem, or ask the DBA to extend the tablespace for this segment.
-> 1. A function must return a value, while a procedure may or may not return any value.
-> 2. A function can return only a single value at a time, while a procedure can return one, many, or none.
-> 3. A function can be called from SQL statements, while procedures can't (see the sketch below).
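-> A short sketch of point 3: a function returning a value and being called from a SQL statement (the function name is illustrative):

    CREATE OR REPLACE FUNCTION annual_sal(p_empno IN NUMBER)
        RETURN NUMBER
    IS
        v_sal emp.sal%TYPE;
    BEGIN
        SELECT sal INTO v_sal FROM emp WHERE empno = p_empno;
        RETURN v_sal * 12;   -- a function must return a value
    END;
    /

    -- Callable directly from SQL, which a procedure cannot be:
    SELECT empno, annual_sal(empno) FROM emp;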
197. What if I give a return parameter in a Function as NULL?
-> x := OSCommand_Run('/home/test/myoscommand.sh')
199. A view has three base tables. One of the columns of a table is removed. What happens to the view?
-> UTL_FILE allows PL/SQL programs to both read and write operating-system files that are accessible from the server on which your database instance is running, as in the sketch below.
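-> A minimal UTL_FILE write sketch; the directory object MY_DIR and file name are hypothetical and must be set up by the DBA:

    DECLARE
        f UTL_FILE.FILE_TYPE;
    BEGIN
        f := UTL_FILE.FOPEN('MY_DIR', 'out.txt', 'w');  -- open for writing
        UTL_FILE.PUT_LINE(f, 'written from PL/SQL');
        UTL_FILE.FCLOSE(f);
    END;
    /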
-> Derivations
-> Constraints
-> Expressions for constraints and derivations can reference:
-> job parameters
-> functions
-> system variables and constants
-> stage variables
-> external routines
236. What is the fast multi node & fast default node?
->
238. What is a shared container? How and where do you use shared containers?
-> The Change Capture stage catches hold of the changes between two different datasets and generates a new column called change_code, which has the values:
0 - copy
1 - insert
2 - delete
3 - edit/update
242. What are the differences between server jobs, parallel jobs, and mainframe jobs?
-> Server jobs: these are available if you have installed DataStage Server. They run on the DataStage server as a single process, connecting to other data sources as necessary. Performance is lower, i.e. speed is less.
-> Parallel jobs: these are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP, or cluster systems, and can also run on a separate z/OS (USS) machine if required. Performance is higher, i.e. speed is high.
243. What are the differences between server shared containers and parallel shared containers?
-> The Repository resides in a specified database. It holds all the metadata, raw-data mapping information, and all the respective mapping information.
-> It contains jobs, routines, table definitions, etc.
-> You can edit the order of the input and output links from the Link Ordering tab in the stages.
-> The main difference lies in parallelism: DataStage uses the parallelism concept through node configuration, whereas Informatica does not.
-> Partitioning: DataStage PX provides many more robust partitioning options than Informatica. You can also re-partition the data whichever way you want.
-> Parallelism: Informatica does not support full pipeline parallelism (although it claims to).
-> File lookup: Informatica supports flat-file lookup, but the caching is horrible. DataStage supports hashed files, lookup filesets, and datasets for much more efficient lookups.
-> Merge/Funnel: DataStage has very rich functionality for merging or funnelling streams. In Informatica the only way is to do a Union, which by the way is always a Union-all.
-> Sort
-> Aggregator
-> Remove Duplicates
-> Surrogate Key Generator
-> Transformer
258. If input & output column datatypes are not matched and a proper conversion is not used, then there is an error like the one below.
259. When checking operator: When binding output schema variable "outRec": When binding output interface field "CPSC_CODE" to field "CPSC_CODE": Converting nullable source to non-nullable result; fatal runtime error could occur (use modify operator to specify value to which null should be converted)
-> Stage variable: an intermediate processing variable that retains its value during a read and doesn't pass the value into a target column.
-> System variable: system variables have a predefined type and structure that cannot be changed. When an expression is stored into a system variable, it is converted to the variable's type, if necessary and possible.
-> Write the routine in C or C++, create the object file, and place the object in the lib directory.
-> Now open Designer, go to Routines, and configure the path and routine names.
-> Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them:
1) Transform functions
2) Before/after job subroutines
3) Job control routines
-> A shared container can be shared between jobs and is great for re-usability.
-> A local container can only be used in the job in which it is created; it is useful for simplifying job design by breaking the job into different sections.
267. Please explain any ETL process that you have developed.
269. If you make any changes in a shared container, will they reflect in all the jobs wherever you used this shared container?
-> Yes.
-> Transactions not only ensure the full completion (or rollback) of the statements that they enclose but also isolate the data modified by the statements.
-> The isolation level describes the degree to which the data being updated is visible to other transactions, as in the sketch below.
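-> A one-line sketch of choosing an isolation level in Oracle, issued before the transaction's first statement:

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- ...DML here sees a transaction-consistent view of the data...
    COMMIT;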
276. What is a before-job subroutine / after-job subroutine? When do you use them?
-> Before-stage subroutine: contains the name (and value) of a subroutine that is executed before the stage starts to process any data.
-> After-stage subroutine: contains the name (and value) of a subroutine that is executed after the stage has processed the data.
279. What is Clear Status File and when do you use it?
281. Can I join a flat file and Oracle and load into Oracle? Is this possible?
-> Yes.
283. While loading some data into the target, suddenly there is a problem and the loading process stops. How can you start loading from the records that were left?
......................
285. Is it possible to roll back a set of jobs after some jobs have executed, if the latest job fails?
-> No.
286. Could DS generate a data dictionary of the source database and target database?
-> Yes.
287. What are the various reports that could be generated using this tool?
288. Could DS show record length and total DB size for source and target, based on existing source data, to arrive at target database sizing?
-> No.
-> Yes.
290. Compare the ETL features of DS with those of its competitors. (A chart is essential.)
292. What other reports could DS provide apart from the Impact Analysis document? Is a customizable reporting feature available?
293. Could DS generate test cases to verify the veracity of mappings at design time, to validate the mapping and transforms?
-> Yes.
294. Is there any mechanism available in the Transformer stage to show the left-out and unmapped fields from the source and target stages?
-> RCP
-> Yes.
296. Is metadata export of a copybook, and of that data in the same form, available in DS?
-> Yes.
peter.zeglis@ascentialsoftware.com
Learn it from ValueCap. They are ex-developers of PX. They can conduct an on-site PX course (with all bells and whistles).
lucy.luzza@ascentialsoftware.com
-> Computer Control Panel -> Admin Tools -> Data Sources -> System DSN -> DSN name, TNS name, password.
302. The output from your ETL is used by others to analyze the data; so what type of data warehouse is it?
303. Your project is a slowly changing dimension; how frequently are you getting your data, i.e. weekly/monthly?
-> Flat files, Oracle.
306. What is the difference between a primary key and a surrogate key?
307. Where do we use the primary key and where do we use the surrogate key?
-> If your goal is to compare two tables, send the output of "select * from my_table order by my_key" to a file for each table and then use a compare utility (like UNIX diff). A SQL-only alternative is sketched below.
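-> A SQL-only alternative (not part of the original answer) is a symmetric MINUS, assuming both tables have identical column lists (Oracle syntax; table names are illustrative):

    -- Rows present in table_a but missing from table_b
    SELECT * FROM table_a
    MINUS
    SELECT * FROM table_b;

    -- And the reverse direction
    SELECT * FROM table_b
    MINUS
    SELECT * FROM table_a;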
-> Passing.
314. If the data is in a text file, do you hard-code the parameters or pass them?
317. If the data is in 10 files to be loaded into a table, which stage will you prefer?
-> SEQ FILE: a single Sequential File stage can read all the files (for example, via a file pattern).
319. How can you see the dataset in UNIX and Windows?
322. You have sequential files in a Funnel stage; the records of the first file have to come first in the target, the second file's records next, and finally the third's. How can you achieve this?
323. What is the difference between the Remove Duplicates & Sort stages?
-> REMOVE DUPLICATES: just removes the duplicate data based on a key.
-> SORT: just sorts the data (ascending or descending) based on a key; removing duplicates is also possible in the Sort stage.